IndicST: Indian Multilingual Translation Corpus for Evaluating Speech Large Language Models
IndicST is a new dataset tailored for training and evaluating Speech LLMs on AST (automatic speech translation) tasks, alongside ASR and TTS. It features meticulously curated synthetic data, verified both automatically and manually, offering 10.8k hours of training data and 1.13k hours of evaluation data.
Training data: We use ASR data from 14 publicly available open-source datasets, totalling 10.8k hours across nine languages; Table I provides details. Each dataset consists of input speech audio along with its transcription.
To synthetically generate translations for the input speech audio and transcription, we used the IndicTrans2 tool. We consider two translation directions: one-to-many, where the English (source) transcription is translated into text in 8 Indian languages (target), represented as en → X, and many-to-one, where transcriptions in 8 Indian languages (source) are translated into English (target), represented as X → en.
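The two direction sets above can be enumerated programmatically; a minimal sketch (the language codes are the table's column labels, not official IndicTrans2 tags):

```python
# The 8 Indian languages used alongside English (en), as labelled in Table I.
INDIC_LANGS = ["hi", "mr", "gu", "bn", "ta", "te", "ml", "kn"]

def translation_directions():
    """Return the 16 (source, target) pairs: en -> X and X -> en."""
    one_to_many = [("en", lang) for lang in INDIC_LANGS]   # en -> X
    many_to_one = [(lang, "en") for lang in INDIC_LANGS]   # X -> en
    return one_to_many + many_to_one

pairs = translation_directions()
```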
Table I: Training datasets and per-language coverage (durations in k hrs)
Datasets | en | hi | mr | gu | bn | ta | te | ml | kn | Duration (k hrs)
---|---|---|---|---|---|---|---|---|---|---
Spring Labs | โ | โ | โ | โ | โ | โ | โ | โ | โ | 2.2 |
Common accent | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.01 |
MUCS | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.22 |
CMU | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.06 |
CommonVoice | โ | โ | โ | โ | โ | โ | โ | โ | โ | 1.6 |
Gramavaani | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.095 |
Vaani | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.074 |
Lahaja | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.011 |
Shrutilipi | โ | โ | โ | โ | โ | โ | โ | โ | โ | 5.319 |
Google Corpus | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.034 |
Google Fleurs | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.087 |
Microsoft Speech Corpus | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.12 |
IISc MILE | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.45 |
IndicVoices | โ | โ | โ | โ | โ | โ | โ | โ | โ | 0.52 |
Duration (k hrs) | 1.4 | 3 | 1.1 | 0.5 | 1.7 | 1.4 | 0.5 | 0.4 | 0.8 | 10.8
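As a quick sanity check, the per-language durations in the last row of Table I sum to the stated 10.8k-hour total:

```python
# Per-language training durations from Table I, in thousands of hours.
durations_khrs = {
    "en": 1.4, "hi": 3.0, "mr": 1.1, "gu": 0.5, "bn": 1.7,
    "ta": 1.4, "te": 0.5, "ml": 0.4, "kn": 0.8,
}

total_khrs = round(sum(durations_khrs.values()), 1)
print(total_khrs)  # 10.8
```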
Test set: For evaluation, we created test sets for two scenarios.
1. Input speech audio is available: Table II lists the per-language duration of this test set.
2. No input speech audio is available: For this case, we used the AI4Bharat Conv text-to-text translation dataset, and speech audio for each source text was generated using a TTS model. The duration of this test set is given in Table III. More details about this dataset can be found in the IndicST paper. For TTS, see the StyleTTS2 GitHub repository.
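The text-to-speech generation step can be sketched as follows. This is only an illustration: `synthesize_speech` is a stand-in for a real TTS call (the dataset used StyleTTS2), and the manifest field names are hypothetical:

```python
def synthesize_speech(text, lang):
    """Placeholder for a TTS call (StyleTTS2 in the actual pipeline);
    here it just returns a deterministic fake waveform path."""
    return f"audio/{lang}/{abs(hash(text)) % 10**8}.wav"

def build_tts_test_entries(text_pairs, src_lang, tgt_lang):
    """Given (source_text, target_text) translation pairs with no recorded
    audio, synthesize speech for the source side and emit manifest entries."""
    entries = []
    for src_text, tgt_text in text_pairs:
        entries.append({
            "audio": synthesize_speech(src_text, src_lang),
            "transcription": src_text,
            "translation": tgt_text,
            "src_lang": src_lang,
            "tgt_lang": tgt_lang,
        })
    return entries

entries = build_tts_test_entries([("Hello", "नमस्ते")], "en", "hi")
```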
Table II: Test-set duration per language (hrs)
Languages | Duration (hrs)
---|---
hi | 137.1 |
mr | 166.5 |
gu | 116.2 |
bn | 104.2 |
ta | 166.3 |
te | 139.2 |
ml | 132.2 |
kn | 149.2 |
Table III: TTS-generated test-set duration per language (mins)
Languages | Duration (mins)
---|---
en | 28.9 |
hi | 36.1 |
mr | 40 |
gu | 36 |
bn | 44.3 |
ta | 39.9 |
te | 45.2 |
ml | 33.1 |
kn | 35.3 |
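Summing both test tables gives the overall evaluation-set size, a rough cross-check against the ~1.13k hours quoted above:

```python
# Durations from Table II (hours) and Table III (minutes).
table_ii_hrs = [137.1, 166.5, 116.2, 104.2, 166.3, 139.2, 132.2, 149.2]
table_iii_mins = [28.9, 36.1, 40, 36, 44.3, 39.9, 45.2, 33.1, 35.3]

total_hrs = sum(table_ii_hrs) + sum(table_iii_mins) / 60
print(round(total_hrs, 1))  # ~1116.5 hours, i.e. roughly 1.1k hrs
```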
We benchmarked the dataset on ASR and AST tasks using an audio LLM (a Whisper encoder coupled with a LLaMA-based LLM). We use whisper-large-v2 as the baseline for both tasks. Results are given in Tables IV and V for ASR and AST, respectively.
Table IV: ASR results per model and test set
Languages | Baseline (Generic-ASR) | Baseline (Svarah) | Baseline (Kathbath) | M1 TP1 (Generic-ASR) | M1 TP1 (Svarah) | M1 TP1 (Kathbath) | M2 TP1 (Generic-ASR) | M2 TP1 (Svarah) | M2 TP1 (Kathbath)
---|---|---|---|---|---|---|---|---|---
en | 23.3 | 25.6 | - | 17.7 | 32 | - | 16.5 | 26.4 | - |
hi | 63.7 | - | 44.5 | 34.3 | - | 14.6 | 27.3 | - | 9.9 |
mr | 99.7 | - | 91 | 29.5 | - | 31.9 | 24.2 | - | 29.7 |
gu | 109.4 | - | 109.9 | 56.3 | - | 34.2 | 41.3 | - | 25.9 |
bn | 116.6 | - | 110.9 | 69.4 | - | 26.8 | 63.2 | - | 26.9 |
ta | 66.6 | - | 59.1 | 37.1 | - | 39.3 | 38 | - | 34.6 |
te | 111.3 | - | 112.7 | 75.4 | - | 51.1 | 68.5 | - | 37.1 |
ml | 111.7 | - | 117.5 | 47.6 | - | 47.2 | 47.4 | - | 46.6 |
kn | 87.7 | - | 82.4 | 56.9 | - | 44.2 | 42.1 | - | 30.4 |
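To read Table IV: for Hindi on Kathbath, for instance, the error rate drops from 44.5 (baseline) to 9.9 (M2 TP1), a relative reduction computed as:

```python
def relative_reduction(baseline, model):
    """Percentage reduction of an error rate relative to the baseline."""
    return 100 * (baseline - model) / baseline

# Hindi on Kathbath: baseline 44.5 vs M2 (TP1) 9.9, values from Table IV.
improvement = relative_reduction(44.5, 9.9)
print(round(improvement, 1))  # 77.8
```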
Table V: AST results, → Indic (en→X) and → English (X→en) directions
Models | Datasets | en→hi | en→mr | en→gu | en→bn | en→ta | en→te | en→ml | en→kn | hi→en | mr→en | gu→en | bn→en | ta→en | te→en | ml→en | kn→en
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
Baseline | Generic-AST | - | - | - | - | - | - | - | - | 16.9 | 13.1 | 10.7 | 7.7 | 11 | 7.7 | 11.9 | 8.1
 | Svarah | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | -
 | Kathbath | - | - | - | - | - | - | - | - | 28.1 | 13.9 | 16.8 | 11.8 | 11.1 | 12.8 | 17.6 | 10.1
 | AI4B | - | - | - | - | - | - | - | - | 28.8 | 17.1 | 19.3 | 19.7 | 14.5 | 17.1 | 15.7 | 12.7
M1 (TP2) | Generic-AST | 30.2 | 19.9 | 25.1 | 24.4 | 18.5 | 19 | 16.7 | 18.8 | 29.2 | 32.4 | 30 | 13 | 24.2 | 14.6 | 29 | 23.8
 | Svarah | 20.9 | 10.6 | 14.9 | 14.5 | 7.9 | 10.2 | 7.4 | 11.5 | - | - | - | - | - | - | - | -
 | Kathbath | - | - | - | - | - | - | - | - | 36.6 | 22.3 | 25.3 | 20.8 | 17.7 | 19 | 22 | 15.9
 | AI4B | 8.8 | 3.8 | 7.2 | 5.3 | 0.9 | 1.9 | 0.6 | 0.8 | 26.2 | 18.9 | 19.5 | 21.4 | 14.7 | 16.3 | 15.9 | 12.1
M2 (TP2) | Generic-AST | 35.6 | 22.1 | 29 | 27.8 | 21.6 | 25 | 20 | 23.9 | 31 | 32 | 30.3 | 14.7 | 24.6 | 15 | 29.6 | 24.2
 | Svarah | 28.9 | 15.1 | 17.7 | 19.2 | 11 | 14.2 | 10.6 | 11 | - | - | - | - | - | - | - | -
 | Kathbath | - | - | - | - | - | - | - | - | 37.2 | 23.9 | 25.1 | 20.6 | 17.2 | 19.1 | 22.4 | 16.8
 | AI4B | 13.4 | 6.9 | 9.5 | 6.3 | 1.6 | 2.1 | 1.2 | 1.2 | 26.7 | 19.2 | 19.4 | 22.1 | 14.7 | 17.4 | 16 | 13
M2 (TP3) | Generic-AST | 37 | 22.6 | 30.8 | 28.6 | 23 | 25.4 | 20.6 | 23.7 | 30.2 | 33 | 32.3 | 15.4 | 24.4 | 16.2 | 30.5 | 26.2
 | Svarah | 23.9 | 14.7 | 19.3 | 18.9 | 11.8 | 14.5 | 10.1 | 15.2 | - | - | - | - | - | - | - | -
 | Kathbath | - | - | - | - | - | - | - | - | 38 | 24.2 | 25.6 | 22.3 | 18.4 | 20.2 | 22.5 | 17.3
 | AI4B | 14.9 | 7.3 | 11.7 | 8.7 | 1.6 | 2.9 | 1.2 | 1.3 | 26.1 | 19.6 | 18.8 | 21.2 | 14 | 17.1 | 16.5 | 12.9
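For a quick summary of Table V, per-direction scores can be averaged; for example, M2 (TP3) on Generic-AST (values copied from the table):

```python
# M2 (TP3) scores on Generic-AST from Table V.
to_indic = [37, 22.6, 30.8, 28.6, 23, 25.4, 20.6, 23.7]      # en -> X
to_english = [30.2, 33, 32.3, 15.4, 24.4, 16.2, 30.5, 26.2]  # X -> en

avg_to_indic = sum(to_indic) / len(to_indic)
avg_to_english = sum(to_english) / len(to_english)
print(round(avg_to_indic, 1), round(avg_to_english, 1))  # 26.5 26.0
```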
To download the dataset, visit the IndicST Hugging Face Repo.
The AST (automatic speech translation) dataset is designed for training and evaluating speech translation models. It consists of audio transcriptions and is split into three sets: development (dev), test, and training (train). Each set contains JSON files with audio file paths and corresponding transcriptions.
configs:
- config_name: ast-data
data_files:
- split: train
path: ast.zip/ast/train.json
- split: dev
path: ast.zip/ast/dev.json
- split: test
path: ast.zip/ast/test.json
- config_name: asr-data
data_files:
- split: train
path: asr.zip/asr/train.json
- split: dev
path: asr.zip/asr/dev.json
- split: test
path: asr.zip/asr/test.json
To use this dataset in your project, load the JSON splits with a custom data loading script, or access the files directly with any library that reads JSON. Example usage in Python:

```python
import json

def load_dataset(file_path):
    """Load one JSON split (audio file paths with corresponding transcriptions)."""
    with open(file_path, 'r', encoding='utf-8') as file:
        return json.load(file)

train_data = load_dataset('path/to/ast/train.json')
```
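Once loaded, a split can be iterated directly. A sketch assuming each entry is a dict with `audio_filepath` and `text` keys; these field names are hypothetical, so check one entry of the downloaded JSON for the actual schema:

```python
# Hypothetical entries mirroring the split format: audio path + transcription.
sample_split = [
    {"audio_filepath": "audio/0001.wav", "text": "hello world"},
    {"audio_filepath": "audio/0002.wav", "text": "namaste"},
]

def audio_text_pairs(split):
    """Yield (audio_path, transcription) tuples from a loaded split."""
    for entry in split:
        yield entry["audio_filepath"], entry["text"]

pairs = list(audio_text_pairs(sample_split))
```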
This dataset is licensed under the Krutrim Community License. Contributions are welcome! Submit a pull request on GitHub.
@inproceedings{sanket2025IndicST,
  title={{IndicST}: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models},
  author={Sanket Shah and Kavya Ranjan Saxena and Kancharana Manideep Bharadwaj and Sharath Adavanne and Nagaraj Adiga},
  booktitle={Proc. ICASSP},
  year={2025}
}