
IndicST-Dataset

Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models

Speech-to-Text and Automatic Speech Translation Dataset
10.8k hrs training, 1.13k hrs evaluation

Description

IndicST is a new dataset tailored for training and evaluating Speech LLMs on automatic speech translation (AST) and automatic speech recognition (ASR) tasks. It features meticulously curated synthetic data, verified both automatically and manually, and offers 10.8k hours of training data and 1.13k hours of evaluation data.

Use Cases

ASR (Speech-to-Text)

  • Transcribing Indic languages
  • Handling accents and noisy environments
  • Supporting low-resource language ASR

Automatic Speech Translation (AST)

  • Speech-to-speech and speech-to-text translation
  • Real-time multilingual communication

Dataset Details

Training data: We use ASR data from 14 publicly available open-source datasets, totalling 10.8k hours across nine languages (see the summary table below). Each dataset consists of input speech audio along with its transcription.

To synthetically generate translations for the input speech audio and transcriptions, we use the IndicTrans2 tool. We consider two translation directions: one-to-many, where the English (source) transcription is translated into text in 8 Indian languages (target), represented as en → X, and many-to-one, where transcription in 8 Indian languages (source) is translated into English (target), represented as X → en. A minimal sketch of this pairing is shown below.
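The following sketch only illustrates the pairing idea described above; it is not the authors' actual pipeline. The translate(text, src_lang, tgt_lang) callable stands in for the IndicTrans2 tool, and the record field names are assumptions.

# Hypothetical sketch of the synthetic AST data generation described above.
# `translate(text, src_lang, tgt_lang)` is a placeholder for the IndicTrans2
# tool; the function name, language codes, and record fields are assumptions.
from typing import Callable, Dict, List

INDIC_LANGS = ["hi", "mr", "gu", "bn", "ta", "te", "ml", "kn"]

def build_ast_pairs(asr_examples: List[Dict],
                    translate: Callable[[str, str, str], str]) -> List[Dict]:
    """Attach synthetic translations to ASR examples in both directions."""
    ast_examples = []
    for ex in asr_examples:  # ex: {"audio": ..., "text": ..., "lang": ...}
        if ex["lang"] == "en":
            # one-to-many: English speech/transcription -> 8 Indian languages (en -> X)
            targets = INDIC_LANGS
        else:
            # many-to-one: Indian-language speech/transcription -> English (X -> en)
            targets = ["en"]
        for tgt in targets:
            ast_examples.append({
                "audio": ex["audio"],
                "transcription": ex["text"],
                "src_lang": ex["lang"],
                "tgt_lang": tgt,
                "translation": translate(ex["text"], ex["lang"], tgt),
            })
    return ast_examples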

Summary of ASR datasets for various Indian Languages

| Datasets | en | hi | mr | gu | bn | ta | te | ml | kn | Duration (k hrs) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Spring Labs | ✔ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | 2.2 |
| Common accent | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | 0.01 |
| MUCS | ✘ | ✔ | ✔ | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | 0.22 |
| CMU | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✔ | 0.06 |
| CommonVoice | ✘ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | ✘ | 1.6 |
| Gramavaani | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | 0.095 |
| Vaani | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✘ | ✔ | 0.074 |
| Lahaja | ✘ | ✔ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ | 0.011 |
| Shrutilipi | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 5.319 |
| Google Corpus | ✘ | ✘ | ✔ | ✔ | ✘ | ✔ | ✔ | ✔ | ✔ | 0.034 |
| Google Fleurs | ✘ | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 0.087 |
| Microsoft Speech Corpus | ✘ | ✘ | ✘ | ✔ | ✘ | ✔ | ✔ | ✘ | ✘ | 0.12 |
| IISc MILE | ✘ | ✘ | ✘ | ✘ | ✘ | ✔ | ✘ | ✘ | ✔ | 0.45 |
| IndicVoices | ✘ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | ✔ | 0.52 |
| Duration | 1.4 | 3 | 1.1 | 0.5 | 1.7 | 1.4 | 0.5 | 0.4 | 0.8 | 10.8 |

Test set: For evaluation, we created a test set covering two scenarios.

1. Input speech audio available: We used the Kathbath ASR dataset for the X → en translation pairs (language-wise durations are given in the Kathbath table below) and the Svarah dataset for the en → X translation pairs. You can find more about them here: Kathbath and Svarah.

2. No input speech audio available: For this case, we used the AI4Bharat Conv text-to-text translation dataset, and the speech audio for the source text was generated with a TTS model. The duration of this test set is given in the AI4Bharat table below. More details about this dataset can be found in the IndicST paper. For TTS, you can check out the StyleTTS2 GitHub repository. A rough sketch of this pipeline follows.
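The sketch below is only a rough illustration of the second scenario, not the authors' exact pipeline: it takes text-to-text translation pairs and synthesizes audio for the source side. The synthesize callable is a placeholder for a TTS model such as StyleTTS2, and all field names are assumptions.

# Hypothetical sketch of building the "no input speech audio" test set:
# pair source/target text and generate source-side audio with TTS.
# `synthesize` stands in for a TTS model (e.g. StyleTTS2); names are assumptions.
import os
import soundfile as sf

def build_tts_test_set(text_pairs, synthesize, out_dir="tts_audio", sample_rate=24000):
    """text_pairs: iterable of dicts with src_text, tgt_text, src_lang, tgt_lang."""
    os.makedirs(out_dir, exist_ok=True)
    examples = []
    for i, pair in enumerate(text_pairs):
        wav = synthesize(pair["src_text"], lang=pair["src_lang"])  # 1-D float waveform
        path = os.path.join(out_dir, f"utt_{i:05d}.wav")
        sf.write(path, wav, sample_rate)
        examples.append({**pair, "audio": path})
    return examples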

Language-wise duration (hrs) of the audio in Kathbath

| Languages | Duration (hrs) |
| --- | --- |
| hi | 137.1 |
| mr | 166.5 |
| gu | 116.2 |
| bn | 104.2 |
| ta | 166.3 |
| te | 139.2 |
| ml | 132.2 |
| kn | 149.2 |

Language-wise duration (mins) of the audio in AI4Bharat

| Languages | Duration (mins) |
| --- | --- |
| en | 28.9 |
| hi | 36.1 |
| mr | 40 |
| gu | 36 |
| bn | 44.3 |
| ta | 39.9 |
| te | 45.2 |
| ml | 33.1 |
| kn | 35.3 |

Evaluation Results

We have benchmarked the dataset on the ASR and AST tasks using an audio LLM (a Whisper encoder coupled with a Llama-based LLM). We use Whisper-large-v2 as the baseline for both tasks. Results for the ASR and AST tasks are given in the two tables below, respectively.
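For reference, here is a minimal scoring sketch. It assumes WER for ASR and BLEU for AST, the usual metrics for these tasks; the tables below do not name the metric explicitly, so treat this as an assumption. It uses the jiwer and sacrebleu packages.

# Minimal scoring sketch. Assumes WER for ASR and BLEU for AST (the standard
# choices for these tasks); the result tables do not state the metric.
import jiwer
import sacrebleu

def score_asr(references, hypotheses):
    """Corpus-level word error rate (in %) over parallel lists of strings."""
    return 100 * jiwer.wer(references, hypotheses)

def score_ast(references, hypotheses):
    """Corpus-level BLEU over parallel lists of strings."""
    return sacrebleu.corpus_bleu(hypotheses, [references]).score

refs = ["the cat sat on the mat"]
hyps = ["the cat sat on the mat"]
print("WER (%):", score_asr(refs, hyps))  # 0.0 for identical strings
print("BLEU:", score_ast(refs, hyps))     # 100.0 for identical strings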

Performance Metric with TP1 across different models on in-domain generic-ASR and out-of-domain Svarah and Kathbath test sets.

| Languages | Baseline Generic-ASR | Baseline Svarah | Baseline Kathbath | M1 (TP1) Generic-ASR | M1 (TP1) Svarah | M1 (TP1) Kathbath | M2 (TP1) Generic-ASR | M2 (TP1) Svarah | M2 (TP1) Kathbath |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| en | 23.3 | 25.6 | - | 17.7 | 32 | - | 16.5 | 26.4 | - |
| hi | 63.7 | - | 44.5 | 34.3 | - | 14.6 | 27.3 | - | 9.9 |
| mr | 99.7 | - | 91 | 29.5 | - | 31.9 | 24.2 | - | 29.7 |
| gu | 109.4 | - | 109.9 | 56.3 | - | 34.2 | 41.3 | - | 25.9 |
| bn | 116.6 | - | 110.9 | 69.4 | - | 26.8 | 63.2 | - | 26.9 |
| ta | 66.6 | - | 59.1 | 37.1 | - | 39.3 | 38 | - | 34.6 |
| te | 111.3 | - | 112.7 | 75.4 | - | 51.1 | 68.5 | - | 37.1 |
| ml | 111.7 | - | 117.5 | 47.6 | - | 47.2 | 47.4 | - | 46.6 |
| kn | 87.7 | - | 82.4 | 56.9 | - | 44.2 | 42.1 | - | 30.4 |

Performance metric with TP2 (AST-only) and TP3 (ASR + AST) across different models on in-domain generic-AST and out-of-domain Svarah, Kathbath and AI4Bharat test sets.

| Models | Datasets | en→hi | en→mr | en→gu | en→bn | en→ta | en→te | en→ml | en→kn | hi→en | mr→en | gu→en | bn→en | ta→en | te→en | ml→en | kn→en |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | Generic-AST | - | - | - | - | - | - | - | - | 16.9 | 13.1 | 10.7 | 7.7 | 11 | 7.7 | 11.9 | 8.1 |
| Baseline | Svarah | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - | - |
| Baseline | Kathbath | - | - | - | - | - | - | - | - | 28.1 | 13.9 | 16.8 | 11.8 | 11.1 | 12.8 | 17.6 | 10.1 |
| Baseline | AI4B | - | - | - | - | - | - | - | - | 28.8 | 17.1 | 19.3 | 19.7 | 14.5 | 17.1 | 15.7 | 12.7 |
| M1 (TP2) | Generic-AST | 30.2 | 19.9 | 25.1 | 24.4 | 18.5 | 19 | 16.7 | 18.8 | 29.2 | 32.4 | 30 | 13 | 24.2 | 14.6 | 29 | 23.8 |
| M1 (TP2) | Svarah | 20.9 | 10.6 | 14.9 | 14.5 | 7.9 | 10.2 | 7.4 | 11.5 | - | - | - | - | - | - | - | - |
| M1 (TP2) | Kathbath | - | - | - | - | - | - | - | - | 36.6 | 22.3 | 25.3 | 20.8 | 17.7 | 19 | 22 | 15.9 |
| M1 (TP2) | AI4B | 8.8 | 3.8 | 7.2 | 5.3 | 0.9 | 1.9 | 0.6 | 0.8 | 26.2 | 18.9 | 19.5 | 21.4 | 14.7 | 16.3 | 15.9 | 12.1 |
| M2 (TP2) | Generic-AST | 35.6 | 22.1 | 29 | 27.8 | 21.6 | 25 | 20 | 23.9 | 31 | 32 | 30.3 | 14.7 | 24.6 | 15 | 29.6 | 24.2 |
| M2 (TP2) | Svarah | 28.9 | 15.1 | 17.7 | 19.2 | 11 | 14.2 | 10.6 | 11 | - | - | - | - | - | - | - | - |
| M2 (TP2) | Kathbath | - | - | - | - | - | - | - | - | 37.2 | 23.9 | 25.1 | 20.6 | 17.2 | 19.1 | 22.4 | 16.8 |
| M2 (TP2) | AI4B | 13.4 | 6.9 | 9.5 | 6.3 | 1.6 | 2.1 | 1.2 | 1.2 | 26.7 | 19.2 | 19.4 | 22.1 | 14.7 | 17.4 | 16 | 13 |
| M2 (TP3) | Generic-AST | 37 | 22.6 | 30.8 | 28.6 | 23 | 25.4 | 20.6 | 23.7 | 30.2 | 33 | 32.3 | 15.4 | 24.4 | 16.2 | 30.5 | 26.2 |
| M2 (TP3) | Svarah | 23.9 | 14.7 | 19.3 | 18.9 | 11.8 | 14.5 | 10.1 | 15.2 | - | - | - | - | - | - | - | - |
| M2 (TP3) | Kathbath | - | - | - | - | - | - | - | - | 38 | 24.2 | 25.6 | 22.3 | 18.4 | 20.2 | 22.5 | 17.3 |
| M2 (TP3) | AI4B | 14.9 | 7.3 | 11.7 | 8.7 | 1.6 | 2.9 | 1.2 | 1.3 | 26.1 | 19.6 | 18.8 | 21.2 | 14 | 17.1 | 16.5 | 12.9 |

Dataset Download

To download the dataset, visit the IndicST Hugging Face Repo.
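If you prefer to fetch the files programmatically, the sketch below uses huggingface_hub. The repository id is a placeholder, not the real path; substitute the id from the link above.

# Hypothetical download sketch using huggingface_hub. REPO_ID is a placeholder;
# replace it with the actual IndicST repository id from the link above.
from huggingface_hub import snapshot_download

REPO_ID = "<org>/<IndicST-repo>"  # placeholder

local_dir = snapshot_download(repo_id=REPO_ID, repo_type="dataset")
print("Dataset files downloaded to:", local_dir)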

Data Structure

The dataset provides two configurations, ast-data (Automatic Speech Translation) and asr-data (Automatic Speech Recognition), for training and evaluating speech models. Each configuration is split into three sets: development (dev), test, and training (train), and each set contains JSON files with audio file paths and the corresponding text.

configs:
  - config_name: ast-data
    data_files:
      - split: train
        path: ast.zip/ast/train.json
      - split: dev
        path: ast.zip/ast/dev.json
      - split: test
        path: ast.zip/ast/test.json
  - config_name: asr-data
    data_files:
      - split: train
        path: asr.zip/asr/train.json
      - split: dev
        path: asr.zip/asr/dev.json
      - split: test
        path: asr.zip/asr/test.json

How to Use and Run

To use this dataset in your project, you can load it with a custom data-loading script or access the files directly with any library that supports JSON. Example usage in Python:

import json

def load_dataset(file_path):
    # Read one split (train/dev/test) from its JSON file.
    with open(file_path, 'r', encoding='utf-8') as file:
        data = json.load(file)
    return data

train_data = load_dataset('path/to/ast/train.json')
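A short usage sketch follows. The field names ("audio_filepath", "text") and the assumption that each split is a list of records are guesses, since the JSON schema is not documented here; adjust them to match the actual files.

# Usage sketch: inspect a few loaded entries. Assumes the JSON file holds a
# list of records and that each record has "audio_filepath" and "text" fields;
# these names are assumptions, so check the actual schema.
for entry in train_data[:3]:
    print(entry.get("audio_filepath"), "->", entry.get("text"))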

License and Contributions

This dataset is licensed under the Krutrim Community License. Contributions are welcome! Submit a pull request on GitHub.

Citation

@inproceedings{
  sanket2025IndicST,
  title={{IndicST}: Indian Multilingual Translation Corpus For Evaluating Speech Large Language Models},
  author={Sanket Shah and Kavya Ranjan Saxena and Kancharana Manideep Bharadwaj and Sharath Adavanne and Nagaraj Adiga},
  booktitle={Proc. ICASSP},
  year={2025}
}