https://github.com/thinhlx1993/vietnamese_asr
I collected data from many different sources for training. The training set contains over 10,000 hours of speech from the sources below.
The tokenizer is a SentencePieceTokenizer with a vocabulary of 128 tokens. The model has 121 M parameters in total:

| # | Name | Type | Params |
|---|------|------|--------|
| 0 | preprocessor | AudioToMelSpectrogramPreprocessor | 0 |
| 1 | encoder | ConformerEncoder | 121 M |
| 2 | decoder | ConvASRDecoder | 66.2 K |
| 3 | loss | CTCLoss | 0 |
| 4 | spec_augmentation | SpectrogramAugmentation | 0 |
| 5 | wer | WER | 0 |

| Decoding | WER (%) | CER (%) |
|---|---|---|
| without n-gram LM | 10.71 | 12.21 |
| with n-gram LM | 9.15 | 10.2 |
https://drive.google.com/drive/folders/1SVNibfeMshfVkmatIU90LYok_Mf0zMD0?usp=sharing
https://github.com/NVIDIA/NeMo
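The module list above matches NeMo's Conformer-CTC (BPE) recipe, and the Google Drive folder above presumably holds the trained checkpoint. Assuming it is a `.nemo` file, the sketch below shows how such a model can be restored for local inference and how "with n-gram LM" decoding is commonly done with an external CTC beam-search decoder (pyctcdecode plus KenLM here). The checkpoint and LM file names are placeholders, the LM weights are illustrative, and `transcribe()` arguments vary between NeMo versions; this is not necessarily the exact setup used for the numbers above.

```python
# Sketch: restore the Conformer-CTC checkpoint with NeMo and decode with an n-gram LM.
# Assumptions: the Drive folder provides a .nemo checkpoint, pyctcdecode and KenLM are
# installed, and "vi_lm.arpa" is your own n-gram LM; none of this is pinned by the repo.
import nemo.collections.asr as nemo_asr
from pyctcdecode import build_ctcdecoder

# Restore the BPE/SentencePiece Conformer-CTC model (placeholder file name)
asr_model = nemo_asr.models.EncDecCTCModelBPE.restore_from("vietnamese_conformer_ctc.nemo")

# Plain greedy CTC transcription (the "without n-gram LM" row)
print(asr_model.transcribe(["/path/to/your/wav_file.wav"]))

# Beam search with an n-gram LM (the "with n-gram LM" row); in NeMo 1.x,
# logprobs=True makes transcribe() return per-frame CTC log-probabilities.
logits = asr_model.transcribe(["/path/to/your/wav_file.wav"], logprobs=True)[0]
decoder = build_ctcdecoder(
    asr_model.decoder.vocabulary,   # the 128 SentencePiece tokens
    kenlm_model_path="vi_lm.arpa",  # placeholder KenLM n-gram model
    alpha=0.5,                      # LM weight (illustrative)
    beta=1.0,                       # word-insertion bonus (illustrative)
)
print(decoder.decode(logits))
```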
I created a free-to-use API server so you can submit your own audio for inference.
The input file should have a sample rate of 16000 Hz to avoid unexpected issues.
File duration must be shorter than 10 seconds.
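To meet both constraints, you can resample the audio to 16 kHz mono and check its length before uploading. The sketch below uses ffmpeg and the standard-library `wave` module; the tool choice and file names are mine, not requirements of the API.

```python
# Sketch: resample to 16 kHz mono with ffmpeg and verify the < 10 s limit.
# ffmpeg and the file names here are assumptions, not part of the API itself.
import subprocess
import wave

src = "/path/to/your/original.wav"
dst = "/path/to/your/wav_file.wav"

# Re-encode to 16 kHz mono PCM WAV
subprocess.run(["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst], check=True)

# Reject files that exceed the 10-second limit
with wave.open(dst, "rb") as wav:
    duration = wav.getnframes() / wav.getframerate()
if duration >= 10:
    raise ValueError(f"Clip is {duration:.2f}s long; the API accepts clips shorter than 10s")
```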
You can then submit the file to the API, for example with curl:

```python
import json
import subprocess

# Upload the WAV file to the API with curl and capture the response
command = [
    "curl", "--location", "https://api.voicesplitter.com/api/v1/uploads",
    "--form", "file=@/path/to/your/wav_file.wav",
]
result = subprocess.run(command, capture_output=True, text=True, check=True)

# Assuming the server returns JSON, json.loads decodes any \uXXXX escapes
# so Vietnamese characters print correctly
print(json.loads(result.stdout))
```
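The same upload can also be done without shelling out to curl, for example with the `requests` library, assuming the endpoint accepts a standard multipart/form-data upload with a `file` field as the curl command implies:

```python
# Equivalent upload using requests instead of curl; assumes the endpoint accepts
# a multipart/form-data request with a "file" field, as the curl command suggests.
import requests

url = "https://api.voicesplitter.com/api/v1/uploads"
with open("/path/to/your/wav_file.wav", "rb") as f:
    response = requests.post(url, files={"file": f})

response.raise_for_status()
print(response.json())  # assuming the server returns JSON
```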
Contact: thinhle.ict@gmail.com | Thinh Le’s LinkedIn Profile