
IndicST-Dataset

Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models

Speech-to-Text and Automatic Speech Translation Dataset
10.8k hrs training, 1.13k hrs evaluation

Description

IndicST is a new dataset tailored for training and evaluating Speech LLMs on automatic speech translation (AST) and automatic speech recognition (ASR) tasks. It features meticulously curated synthetic data, verified both automatically and manually, and offers 10.8k hours of training data and 1.13k hours of evaluation data.

Use Cases

ASR (Speech-to-Text)

  • Transcribing Indic languages
  • Handling accents and noisy environments
  • Supporting low-resource language ASR

Automatic Speech Translation (AST)

  • Speech-to-speech and speech-to-text translation
  • Real-time multilingual communication

Dataset Details

Training data: We use ASR data from 14 publicly available open-source datasets, totalling 10.8k hours across nine languages (see the summary table below). Each dataset consists of input speech audio along with its transcription.

To synthetically generate translations for the input speech audio and transcriptions, we use the IndicTrans2 tool. We consider two translation directions: one-to-many, where the English (source) transcription is translated into text in 8 Indian languages (target), represented as en → X, and many-to-one, where transcription in 8 Indian languages (source) is translated into English (target), represented as X → en. A minimal sketch of this pairing is shown below.
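The following sketch only illustrates the pairing idea described above; it is not the authors' actual pipeline. The translate(text, src_lang, tgt_lang) callable stands in for the IndicTrans2 tool, and the record field names are assumptions.

# Hypothetical sketch of the synthetic AST data generation described above.
# `translate(text, src_lang, tgt_lang)` is a placeholder for the IndicTrans2
# tool; the function name, language codes, and record fields are assumptions.
from typing import Callable, Dict, List

INDIC_LANGS = ["hi", "mr", "gu", "bn", "ta", "te", "ml", "kn"]

def build_ast_pairs(asr_examples: List[Dict],
                    translate: Callable[[str, str, str], str]) -> List[Dict]:
    """Attach synthetic translations to ASR examples in both directions."""
    ast_examples = []
    for ex in asr_examples:  # ex: {"audio": ..., "text": ..., "lang": ...}
        if ex["lang"] == "en":
            # one-to-many: English speech/transcription -> 8 Indian languages (en -> X)
            targets = INDIC_LANGS
        else:
            # many-to-one: Indian-language speech/transcription -> English (X -> en)
            targets = ["en"]
        for tgt in targets:
            ast_examples.append({
                "audio": ex["audio"],
                "transcription": ex["text"],
                "src_lang": ex["lang"],
                "tgt_lang": tgt,
                "translation": translate(ex["text"], ex["lang"], tgt),
            })
    return ast_examples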

Summary of ASR datasets for various Indian Languages

| Datasets | en | hi | mr | gu | bn | ta | te | ml | kn | Duration (k hrs) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Spring Labs | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | 2.2 |
| Common accent | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | 0.01 |
| MUCS | ✘ | ✔ | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | 0.22 |
| CMU | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✔ | 0.06 |
| CommonVoice | ✘ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ | 1.6 |
| Gramavaani | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | 0.095 |
| Vaani | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✔ | 0.074 |
| Lahaja | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | 0.011 |
| Shrutilipi | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 5.319 |
| Google Corpus | ✘ | ✘ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | 0.034 |
| Google Fleurs | ✘ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 0.087 |
| Microsoft Speech Corpus | ✘ | ✘ | ✘ | ✔ | ✘ | ✔ | ✔ | ✘ | ✘ | 0.12 |
| IISc MILE | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ | ✘ | ✔ | 0.45 |
| IndicVoices | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 0.52 |
| Duration | 1.4 | 3 | 1.1 | 0.5 | 1.7 | 1.4 | 0.5 | 0.4 | 0.8 | 10.8 |

Test set: For evaluation, we created a test set covering two scenarios.

1. Input speech audio available: We used the Kathbath ASR dataset for the X → en translation pairs (language-wise durations are given in the Kathbath table below) and the Svarah dataset for the en → X translation pairs. You can find more about them here: Kathbath and Svarah.

2. No input speech audio available: For this case, we used the AI4Bharat Conv text-to-text translation dataset, and the speech audio for the source text was generated with a TTS model. The duration of this test set is given in the AI4Bharat table below. More details about this dataset can be found in the IndicST paper. For TTS, you can check out the StyleTTS2 GitHub repository. A rough sketch of this pipeline follows.
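The sketch below is only a rough illustration of the second scenario, not the authors' exact pipeline: it takes text-to-text translation pairs and synthesizes audio for the source side. The synthesize callable is a placeholder for a TTS model such as StyleTTS2, and all field names are assumptions.

# Hypothetical sketch of building the "no input speech audio" test set:
# pair source/target text and generate source-side audio with TTS.
# `synthesize` stands in for a TTS model (e.g. StyleTTS2); names are assumptions.
import os
import soundfile as sf

def build_tts_test_set(text_pairs, synthesize, out_dir="tts_audio", sample_rate=24000):
    """text_pairs: iterable of dicts with src_text, tgt_text, src_lang, tgt_lang."""
    os.makedirs(out_dir, exist_ok=True)
    examples = []
    for i, pair in enumerate(text_pairs):
        wav = synthesize(pair["src_text"], lang=pair["src_lang"])  # 1-D float waveform
        path = os.path.join(out_dir, f"utt_{i:05d}.wav")
        sf.write(path, wav, sample_rate)
        examples.append({**pair, "audio": path})
    return examples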

Language-wise duration (hrs) of the audio in Kathbath

| Languages | Duration (hrs) |
| --- | --- |
| hi | 137.1 |
| mr | 166.5 |
| gu | 116.2 |
| bn | 104.2 |
| ta | 166.3 |
| te | 139.2 |
| ml | 132.2 |
| kn | 149.2 |

Language-wise duration (mins) of the audio in AI4Bharat

| Languages | Duration (mins) |
| --- | --- |
| en | 28.9 |
| hi | 36.1 |
| mr | 40 |
| gu | 36 |
| bn | 44.3 |
| ta | 39.9 |
| te | 45.2 |
| ml | 33.1 |
| kn | 35.3 |

Evaluation Results

We have benchmarked the dataset on the ASR and AST tasks using an audio LLM (a Whisper encoder coupled with a Llama-based LLM). We use Whisper-large-v2 as the baseline for both tasks. Results for the ASR and AST tasks are given in the two tables below, respectively.
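For reference, here is a minimal scoring sketch. It assumes WER for ASR and BLEU for AST, the usual metrics for these tasks; the tables below do not name the metric explicitly, so treat this as an assumption. It uses the jiwer and sacrebleu packages.

# Minimal scoring sketch. Assumes WER for ASR and BLEU for AST (the standard
# choices for these tasks); the result tables do not state the metric.
import jiwer
import sacrebleu

def score_asr(references, hypotheses):
    """Corpus-level word error rate (in %) over parallel lists of strings."""
    return 100 * jiwer.wer(references, hypotheses)

def score_ast(references, hypotheses):
    """Corpus-level BLEU over parallel lists of strings."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

refs = ["the cat sat on the mat"]
hyps = ["the cat sat on the mat"]
print("WER (%):", score_asr(refs, hyps))  # 0.0 for identical strings
print("BLEU:", score_ast(refs, hyps))     # 100.0 for identical strings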

Performance Metric with TP1 across different models on in-domain generic-ASR and out-of-domain Svarah and Kathbath test sets.

| Languages | Baseline Generic-ASR | Baseline Svarah | Baseline Kathbath | M1 (TP1) Generic-ASR | M1 (TP1) Svarah | M1 (TP1) Kathbath | M2 (TP1) Generic-ASR | M2 (TP1) Svarah | M2 (TP1) Kathbath |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en | 23.3 | 25.6 | - | 17.7 | 32 | - | 16.5 | 26.4 | - |
| hi | 63.7 | - | 44.5 | 34.3 | - | 14.6 | 27.3 | - | 9.9 |
| mr | 99.7 | - | 91 | 29.5 | - | 31.9 | 24.2 | - | 29.7 |
| gu | 109.4 | - | 109.9 | 56.3 | - | 34.2 | 41.3 | - | 25.9 |
| bn | 116.6 | - | 110.9 | 69.4 | - | 26.8 | 63.2 | - | 26.9 |
| ta | 66.6 | - | 59.1 | 37.1 | - | 39.3 | 38 | - | 34.6 |
| te | 111.3 | - | 112.7 | 75.4 | - | 51.1 | 68.5 | - | 37.1 |
| ml | 111.7 | - | 117.5 | 47.6 | - | 47.2 | 47.4 | - | 46.6 |
| kn | 87.7 | - | 82.4 | 56.9 | - | 44.2 | 42.1 | - | 30.4 |

Performance metric with TP2 (AST-only) and TP3 (ASR + AST) across different models on in-domain generic-AST and out-of-domain Svarah, Kathbath and AI4Bharat test sets.

| Models | Datasets | en→hi | en→mr | en→gu | en→bn | en→ta | en→te | en→ml | en→kn | hi→en | mr→en | gu→en | bn→en | ta→en | te→en | ml→en | kn→en |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Generic-AST | - | - | - | - | - | - | - | - | 16.9 | 13.1 | 10.7 | 7.7 | 11 | 7.7 | 11.9 | 8.1 |
| Baseline | Svarah | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Baseline | Kathbath | - | - | - | - | - | - | - | - | 28.1 | 13.9 | 16.8 | 11.8 | 11.1 | 12.8 | 17.6 | 10.1 |
| Baseline | AI4B | - | - | - | - | - | - | - | - | 28.8 | 17.1 | 19.3 | 19.7 | 14.5 | 17.1 | 15.7 | 12.7 |
| M1 (TP2) | Generic-AST | 30.2 | 19.9 | 25.1 | 24.4 | 18.5 | 19 | 16.7 | 18.8 | 29.2 | 32.4 | 30 | 13 | 24.2 | 14.6 | 29 | 23.8 |
| M1 (TP2) | Svarah | 20.9 | 10.6 | 14.9 | 14.5 | 7.9 | 10.2 | 7.4 | 11.5 | - | - | - | - | - | - | - | - |
| M1 (TP2) | Kathbath | - | - | - | - | - | - | - | - | 36.6 | 22.3 | 25.3 | 20.8 | 17.7 | 19 | 22 | 15.9 |
| M1 (TP2) | AI4B | 8.8 | 3.8 | 7.2 | 5.3 | 0.9 | 1.9 | 0.6 | 0.8 | 26.2 | 18.9 | 19.5 | 21.4 | 14.7 | 16.3 | 15.9 | 12.1 |
| M2 (TP2) | Generic-AST | 35.6 | 22.1 | 29 | 27.8 | 21.6 | 25 | 20 | 23.9 | 31 | 32 | 30.3 | 14.7 | 24.6 | 15 | 29.6 | 24.2 |
| M2 (TP2) | Svarah | 28.9 | 15.1 | 17.7 | 19.2 | 11 | 14.2 | 10.6 | 11 | - | - | - | - | - | - | - | - |
| M2 (TP2) | Kathbath | - | - | - | - | - | - | - | - | 37.2 | 23.9 | 25.1 | 20.6 | 17.2 | 19.1 | 22.4 | 16.8 |
| M2 (TP2) | AI4B | 13.4 | 6.9 | 9.5 | 6.3 | 1.6 | 2.1 | 1.2 | 1.2 | 26.7 | 19.2 | 19.4 | 22.1 | 14.7 | 17.4 | 16 | 13 |
| M2 (TP3) | Generic-AST | 37 | 22.6 | 30.8 | 28.6 | 23 | 25.4 | 20.6 | 23.7 | 30.2 | 33 | 32.3 | 15.4 | 24.4 | 16.2 | 30.5 | 26.2 |
| M2 (TP3) | Svarah | 23.9 | 14.7 | 19.3 | 18.9 | 11.8 | 14.5 | 10.1 | 15.2 | - | - | - | - | - | - | - | - |
| M2 (TP3) | Kathbath | - | - | - | - | - | - | - | - | 38 | 24.2 | 25.6 | 22.3 | 18.4 | 20.2 | 22.5 | 17.3 |
| M2 (TP3) | AI4B | 14.9 | 7.3 | 11.7 | 8.7 | 1.6 | 2.9 | 1.2 | 1.3 | 26.1 | 19.6 | 18.8 | 21.2 | 14 | 17.1 | 16.5 | 12.9 |

Dataset Download

To download the dataset, visit the IndicST Hugging Face Repo.
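If you prefer to fetch the files programmatically, the sketch below uses huggingface_hub. The repository id is a placeholder, not the real path; substitute the id from the link above.

# Hypothetical download sketch using huggingface_hub. REPO_ID is a placeholder;
# replace it with the actual IndicST repository id from the link above.
from huggingface_hub import snapshot_download

REPO_ID = "<org>/<IndicST-repo>"  # placeholder

local_dir = snapshot_download(repo_id=REPO_ID, repo_type="dataset")
print("Dataset files downloaded to:", local_dir)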

Data Structure

The dataset provides two configurations, ast-data (Automatic Speech Translation) and asr-data (Automatic Speech Recognition), for training and evaluating speech models. Each configuration is split into three sets: development (dev), test, and training (train), and each set contains JSON files with audio file paths and the corresponding text.

configs:
  - config_name: ast-data
    data_files:
      - split: train
        path: ast.zip/ast/train.json
      - split: dev
        path: ast.zip/ast/dev.json
      - split: test
        path: ast.zip/ast/test.json
  - config_name: asr-data
    data_files:
      - split: train
        path: asr.zip/asr/train.json
      - split: dev
        path: asr.zip/asr/dev.json
      - split: test
        path: asr.zip/asr/test.json

How to Use and Run

To use this dataset in your project, you can load it with a custom data-loading script or access the files directly with any library that supports JSON. Example usage in Python:

import json

def load_dataset(file_path):
    # Read one split (train/dev/test) from its JSON file.
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

train_data = load_dataset('path/to/ast/train.json')
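A short usage sketch follows. The field names ("audio_filepath", "text") and the assumption that each split is a list of records are guesses, since the JSON schema is not documented here; adjust them to match the actual files.

# Usage sketch: inspect a few loaded entries. Assumes the JSON file holds a
# list of records and that each record has "audio_filepath" and "text" fields;
# these names are assumptions, so check the actual schema.
for entry in train_data[:3]:
    print(entry.get("audio_filepath"), "->", entry.get("text"))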

License and Contributions

This dataset is licensed under the Krutrim Community License. Contributions are welcome! Submit a pull request on GitHub.

Citation

@inproceedings{
  sanket2025IndicST,
  title={{IndicST}: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models},
  author={Sanket Shah and Kavya Ranjan Saxena and Kancharana Manideep Bharadwaj and Sharath Adavanne and Nagaraj Adiga},
  booktitle={Proc. ICASSP},
  year={2025}
}