https://github.com/thinhlx1993/vietnamese_asr
I collected data from many different sources for training. The training set contains over 10,000 hours of speech from the sources below.
The tokenizer is a SentencePieceTokenizer with a vocabulary of 128 tokens. The model has 121 M parameters in total:

| # | Name | Type | Params |
|---|------|------|--------|
| 0 | preprocessor | AudioToMelSpectrogramPreprocessor | 0 |
| 1 | encoder | ConformerEncoder | 121 M |
| 2 | decoder | ConvASRDecoder | 66.2 K |
| 3 | loss | CTCLoss | 0 |
| 4 | spec_augmentation | SpectrogramAugmentation | 0 |
| 5 | wer | WER | 0 |

| Decoding | WER (%) | CER (%) |
|---|---|---|
| without n-gram LM | 10.71 | 12.21 |
| with n-gram LM | 9.15 | 10.2 |
https://drive.google.com/drive/folders/1SVNibfeMshfVkmatIU90LYok_Mf0zMD0?usp=sharing
https://github.com/NVIDIA/NeMo
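The module list above matches NeMo's Conformer-CTC (BPE) recipe, and the Google Drive folder above presumably holds the trained checkpoint. Assuming it is a `.nemo` file, the sketch below shows how such a model can be restored for local inference and how "with n-gram LM" decoding is commonly done with an external CTC beam-search decoder (pyctcdecode plus KenLM here). The checkpoint and LM file names are placeholders, the LM weights are illustrative, and `transcribe()` arguments vary between NeMo versions; this is not necessarily the exact setup used for the numbers above.

```python
# Sketch: restore the Conformer-CTC checkpoint with NeMo and decode with an n-gram LM.
# Assumptions: the Drive folder provides a .nemo checkpoint, pyctcdecode and KenLM are
# installed, and "vi_lm.arpa" is your own n-gram LM; none of this is pinned by the repo.
import nemo.collections.asr as nemo_asr
from pyctcdecode import build_ctcdecoder

# Restore the BPE/SentencePiece Conformer-CTC model (placeholder file name)
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from("vietnamese_conformer_ctc.nemo")

# Plain greedy CTC transcription (the "without n-gram LM" row)
print(asr_model.transcribe(["/path/to/your/wav_file.wav"]))

# Beam search with an n-gram LM (the "with n-gram LM" row); in NeMo 1.x,
# logprobs=True makes transcribe() return per-frame CTC log-probabilities.
logits = asr_model.transcribe(["/path/to/your/wav_file.wav"], logprobs=True)[0]
decoder = build_ctcdecoder(
    asr_model.decoder.vocabulary,   # the 128 SentencePiece tokens
    kenlm_model_path="vi_lm.arpa",  # placeholder KenLM n-gram model
    alpha=0.5,                      # LM weight (illustrative)
    beta=1.0,                       # word-insertion bonus (illustrative)
)
print(decoder.decode(logits))
```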
I created a free-to-use API server so you can submit your own audio for inference.
The input file should have a sample rate of 16000 Hz to avoid unexpected issues.
File duration must be shorter than 10 seconds.
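To meet both constraints, you can resample the audio to 16 kHz mono and check its length before uploading. The sketch below uses ffmpeg and the standard-library `wave` module; the tool choice and file names are mine, not requirements of the API.

```python
# Sketch: resample to 16 kHz mono with ffmpeg and verify the < 10 s limit.
# ffmpeg and the file names here are assumptions, not part of the API itself.
import subprocess
import wave

src = "/path/to/your/original.wav"
dst = "/path/to/your/wav_file.wav"

# Re-encode to 16 kHz mono PCM WAV
subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst], check=True)

# Reject files that exceed the 10-second limit
with wave.open(dst, "rb") as wav:
    duration = wav.getnframes() / wav.getframerate()
if duration >= 10:
    raise ValueError(f"Clip is {duration:.2f}s long; the API accepts clips shorter than 10s")
```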
You can then submit the file to the API, for example with curl:

```python
import json
import subprocess

# Upload the WAV file to the API with curl and capture the response
command = [
    "curl", "--location", "https://api.voicesplitter.com/api/v1/uploads",
    "--form", "file=@/path/to/your/wav_file.wav",
]
result = subprocess.run(command, capture_output=True, text=True, check=True)

# Assuming the server returns JSON, json.loads decodes any \uXXXX escapes
# so Vietnamese characters print correctly
print(json.loads(result.stdout))
```
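The same upload can also be done without shelling out to curl, for example with the `requests` library, assuming the endpoint accepts a standard multipart/form-data upload with a `file` field as the curl command implies:

```python
# Equivalent upload using requests instead of curl; assumes the endpoint accepts
# a multipart/form-data request with a "file" field, as the curl command suggests.
import requests

url = "https://api.voicesplitter.com/api/v1/uploads"
with open("/path/to/your/wav_file.wav", "rb") as f:
    response = requests.post(url, files={"file": f})

response.raise_for_status()
print(response.json())  # assuming the server returns JSON
```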
Contact: thinhle.ict@gmail.com | Thinh Le’s LinkedIn Profile