
Dhwani-1

SoTA Speech LLM trained on Krutrim-1

Speech-to-Text Translation Model

Description

Dhwani, India's first end-to-end trained speech LLM, is powered by our Krutrim-1 LLM. This enables the LLM to understand speech directly, without a separate speech-to-text (ASR) model, thereby avoiding ASR errors cascading into the LLM. As part of this release, we are open-sourcing the speech-to-text translation capabilities of our Dhwani model. It supports translation between Indic languages and English. The supported Indic languages are Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Tamil, and Telugu.


Use-Cases

  1. Multilingual Communication – Enables seamless conversations between people speaking different Indic languages.
  2. Media & Entertainment – Translates movies, TV shows, and podcasts for Indic audiences.
  3. Education & E-learning – Helps in translating lectures, training materials, and online courses.
  4. Travel & Tourism – Assists travelers with translation of speech.
  5. Customer Support – Enhances multilingual customer service through automated translations.
  6. Healthcare – Facilitates doctor-patient communication in different languages.
  7. Business & Corporate – Enables meetings and document translation for multinational teams.
  8. Legal & Government – Supports courtroom interpretation and official document translation.

Evaluation Results

English → Indic BLEU scores

| Metric | en_hin | en_guj | en_mar | en_ben | en_tam | en_tel | en_mal | en_kan |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Avg    | 57.7   | 44.3   | 43.3   | 49.0   | 47.0   | 40.8   | 39.0   | 47.0   |

Indic → English BLEU scores

| Metric | hin_en | guj_en | mar_en | ben_en | tam_en | tel_en | mal_en | kan_en |
|--------|--------|--------|--------|--------|--------|--------|--------|--------|
| Avg    | 35.7   | 34.6   | 33.2   | 19.2   | 25.4   | 17.4   | 38.9   | 28.0   |
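As a rough illustration of how BLEU scores like those above are computed, here is a minimal sentence-level BLEU in plain Python. This is a sketch of the metric itself, not the evaluation setup used here; published numbers are typically produced with a corpus-level tool such as sacreBLEU.

```python
from collections import Counter
import math

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of 1..max_n n-gram
    precisions, multiplied by a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((cand_ngrams & ref_ngrams).values())  # clipped matches
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * geo_mean

print(round(bleu("the cat sat on the mat", "the cat sat on the mat"), 2))  # 1.0
```

A perfect match scores 1.0 (often reported as 100); scores in the tables above are on the 0-100 scale.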

Model Architecture

  • Dual Encoder Structure:
    • Speech Encoder: Utilizes the Whisper model's speech encoder to process speech inputs.
    • Audio Encoder: Employs the BEATs audio encoder for non-speech audio inputs, such as environmental sounds and music.
  • Connection Module:
    • Window-Level Query Transformer (Q-Former): Acts as a bridge between the audio encoders and the Large Language Model (LLM). It segments the variable-length encoder outputs into fixed-size windows and converts each window into a fixed set of tokens that the LLM can process.
  • Large Language Model (LLM):
    • Krutrim LLM: A pre-trained text-based LLM that receives the processed tokens from the Q-Former, enabling it to handle and interpret audio-derived information.
  • Adaptation Mechanism:
    • Low-Rank Adaptation (LoRA): Applied to the Krutrim LLM to fine-tune its parameters, ensuring effective alignment between the audio-derived inputs and the model's output space.
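The connection module above can be sketched at the shape level as follows. This is only an illustration of the windowing idea: the dimensions, window size, and the mean-pool-plus-projection stand-in for query cross-attention are assumptions for the sketch, not the model's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not Dhwani's actual values)
T, d_enc = 103, 512        # variable-length encoder output: T frames, d_enc dims
window, d_llm = 17, 512    # frames per window, LLM embedding size

def window_qformer(enc_out, window, d_llm, rng):
    """Shape-level sketch of a window-level Q-Former: pad the
    variable-length encoder output to whole windows, then emit one
    token per window (mean-pool + linear projection stands in for
    the learned query cross-attention)."""
    T, d_enc = enc_out.shape
    pad = (-T) % window                               # pad to a multiple of `window`
    padded = np.vstack([enc_out, np.zeros((pad, d_enc))])
    windows = padded.reshape(-1, window, d_enc)       # (n_windows, window, d_enc)
    pooled = windows.mean(axis=1)                     # stand-in for cross-attention
    proj = rng.standard_normal((d_enc, d_llm)) / np.sqrt(d_enc)
    return pooled @ proj                              # (n_windows, d_llm) tokens for the LLM

enc_out = rng.standard_normal((T, d_enc))
llm_tokens = window_qformer(enc_out, window, d_llm, rng)
print(llm_tokens.shape)  # (7, 512): ceil(103 / 17) windows, one token each
```

The key property is that the number of tokens handed to the LLM grows with audio length only in fixed-size steps, keeping the LLM's input sequence manageable.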

Pre Training

Pre-training uses the architecture described above: the Whisper speech encoder and the BEATs audio encoder produce representations for speech and non-speech audio, the window-level Q-Former converts them into tokens for the Krutrim LLM, and LoRA adapts the LLM so that audio-derived inputs align with its output space.

Post Training

To adapt the Q-Former and LoRA weights, we used the techniques described in the IndicST paper. Along with the IndicST translation dataset, we also used in-house collected translation data to further improve translation performance.
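The LoRA mechanism used for adaptation can be sketched as follows; the ranks, dimensions, and scaling here are illustrative assumptions, not Dhwani's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16   # illustrative sizes, not the model's config

W = rng.standard_normal((d_in, d_out))  # frozen pre-trained weight
A = rng.standard_normal((d_in, r)) * 0.01  # trainable low-rank factor
B = np.zeros((r, d_out))                # B starts at zero, so the update is zero at init

def lora_forward(x, W, A, B, alpha, r):
    """y = x (W + (alpha/r) A B): the base weight stays frozen;
    only the low-rank factors A and B are trained."""
    return x @ W + (alpha / r) * (x @ A @ B)

x = rng.standard_normal((1, d_in))
# At initialization the LoRA path contributes nothing, so the
# adapted model starts out identical to the pre-trained one:
assert np.allclose(lora_forward(x, W, A, B, alpha, r), x @ W)
```

Because only A and B (rank r per layer) are updated, fine-tuning touches a small fraction of the LLM's parameters while the pre-trained weights remain intact.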

How to Access This Model

Hugging Face: Dhwani HF
Krutrim Cloud: Krutrim Cloud

License

This code repository and the model weights are licensed under Krutrim Community License.

Publication

"IndicST: Indian Multilingual Translation Corpus for Evaluating Speech Large Language Models", Sanket Shah, Kavya Ranjan Saxena, Kancharana Manideep Bharadwaj, Sharath Adavanne, Nagaraj Adiga. ICASSP 2025.