
Dhwani-1

SoTA Speech LLM trained on Krutrim-1

Speech-to-Text Translation Model

Description

Dhwani, India's first end-to-end trained speech LLM, is powered by our Krutrim-1 LLM. This enables the LLM to understand speech directly, without a separate speech-to-text (ASR) model, thereby avoiding ASR errors cascading into the LLM. As part of this release, we are open-sourcing the speech-to-text translation capabilities of our Dhwani model. It supports translation between Indic languages and English. The supported Indic languages are Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, and Telugu.


Use-Cases

  1. Multilingual Communication – Enables seamless conversations between people speaking different Indic languages.
  2. Media & Entertainment – Translates movies, TV shows, and podcasts for Indic audiences.
  3. Education & E-learning – Helps in translating lectures, training materials, and online courses.
  4. Travel & Tourism – Assists travelers with translation of speech.
  5. Customer Support – Enhances multilingual customer service through automated translations.
  6. Healthcare – Facilitates doctor-patient communication in different languages.
  7. Business & Corporate – Enables meetings and document translation for multinational teams.
  8. Legal & Government – Supports courtroom interpretation and official document translation.

Evaluation Results

English → Indic BLEU scores

| Metric | en_hin | en_guj | en_mar | en_ben | en_tam | en_tel | en_mal | en_kan |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Avg    | 57.7   | 44.3   | 43.3   | 49.0   | 47.0   | 40.8   | 39.0   | 47.0   |

Indic → English BLEU scores

| Metric | hin_en | guj_en | mar_en | ben_en | tam_en | tel_en | mal_en | kan_en |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Avg    | 35.7   | 34.6   | 33.2   | 19.2   | 25.4   | 17.4   | 38.9   | 28.0   |
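As a rough illustration of how BLEU scores like those above are computed, here is a minimal sentence-level BLEU in plain Python. This is a sketch of the metric itself, not the evaluation setup used here; published numbers are typically produced with a corpus-level tool such as sacreBLEU.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of 1..max_n n-gram
    precisions, multiplied by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```

A perfect match scores 1.0 (often reported as 100); scores in the tables above are on the 0-100 scale.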

Model Architecture

  • Dual Encoder Structure:
    • Speech Encoder: Utilizes the Whisper model's speech encoder to process speech inputs.
    • Audio Encoder: Employs the BEATs audio encoder for non-speech audio inputs, such as environmental sounds and music.
  • Connection Module:
    • Window-Level Query Transformer (Q-Former): Acts as a bridge between the audio encoders and the Large Language Model (LLM). It segments the variable-length encoder outputs into fixed-size windows and converts each window into a fixed set of tokens that the LLM can process.
  • Large Language Model (LLM):
    • Krutrim LLM: A pre-trained text-based LLM that receives the processed tokens from the Q-Former, enabling it to handle and interpret audio-derived information.
  • Adaptation Mechanism:
    • Low-Rank Adaptation (LoRA): Applied to the Krutrim LLM to fine-tune its parameters, ensuring effective alignment between the audio-derived inputs and the model's output space.
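The connection module above can be sketched at the shape level as follows. This is only an illustration of the windowing idea: the dimensions, window size, and the mean-pool-plus-projection stand-in for query cross-attention are assumptions for the sketch, not the model's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not Dhwani's actual values)
T, d_enc = 103, 512        # variable-length encoder output: T frames, d_enc dims
window, d_llm = 17, 512    # frames per window, LLM embedding size

def window_qformer(enc_out, window, d_llm, rng):
    """Shape-level sketch of a window-level Q-Former: pad the
    variable-length encoder output to whole windows, then emit one
    token per window (mean-pool + linear projection stands in for
    the learned query cross-attention)."""
    T, d_enc = enc_out.shape
    pad = (-T) % window                               # pad to a multiple of `window`
    padded = np.vstack([enc_out, np.zeros((pad, d_enc))])
    windows = padded.reshape(-1, window, d_enc)       # (n_windows, window, d_enc)
    pooled = windows.mean(axis=1)                     # stand-in for cross-attention
    proj = rng.standard_normal((d_enc, d_llm)) / np.sqrt(d_enc)
    return pooled @ proj                              # (n_windows, d_llm) tokens for the LLM

enc_out = rng.standard_normal((T, d_enc))
llm_tokens = window_qformer(enc_out, window, d_llm, rng)
print(llm_tokens.shape)  # (7, 512): ceil(103 / 17) windows, one token each
```

The key property is that the number of tokens handed to the LLM grows with audio length only in fixed-size steps, keeping the LLM's input sequence manageable.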

Pre Training

Pre-training uses the architecture described above: the Whisper speech encoder and the BEATs audio encoder produce representations for speech and non-speech audio, the window-level Q-Former converts them into tokens for the Krutrim LLM, and LoRA adapts the LLM so that audio-derived inputs align with its output space.

Post Training

To adapt the Q-Former and LoRA weights, we used the techniques described in the IndicST paper. Along with the IndicST translation dataset, we also used in-house collected translation data to further improve translation performance.
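The LoRA mechanism used for adaptation can be sketched as follows; the ranks, dimensions, and scaling here are illustrative assumptions, not Dhwani's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16   # illustrative sizes, not the model's config

W = rng.standard_normal((d_in, d_out))  # frozen pre-trained weight
A = rng.standard_normal((d_in, r)) * 0.01  # trainable low-rank factor
B = np.zeros((r, d_out))                # B starts at zero, so the update is zero at init

def lora_forward(x, W, A, B, alpha, r):
    """y = x (W + (alpha/r) A B): the base weight stays frozen;
    only the low-rank factors A and B are trained."""
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.standard_normal((1, d_in))
# At initialization the LoRA path contributes nothing, so the
# adapted model starts out identical to the pre-trained one:
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W)
```

Because only A and B (rank r per layer) are updated, fine-tuning touches a small fraction of the LLM's parameters while the pre-trained weights remain intact.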

How to Access This Model

Hugging Face: Dhwani HF
Krutrim Cloud: Krutrim Cloud

License

This code repository and the model weights are licensed under Krutrim Community License.

Publication

"IndicST: Indian Multilingual Translation Corpus for Evaluating Speech Large Language Models", Sanket Shah, Kavya Ranjan Saxena, Kancharana Manideep Bharadwaj, Sharath Adavanne, Nagaraj Adiga. ICASSP 2025.