India is one of the world's most vibrant and culturally diverse societies. Developing a general-purpose artificial intelligence system tailored for the Indian market presents unique challenges: accounting for the nation's cultural nuances, accommodating its linguistic diversity across numerous regional languages, adapting to the prominence of oral traditions, ensuring access to relevant datasets, and achieving the scalability needed to serve a vast population effectively. Navigating these complexities successfully requires careful consideration and innovative approaches.
Existing foundation models for natural language tasks are predominantly trained on English data, limiting their effectiveness for the languages native to India's more than one billion citizens. Thousands of regional languages and dialects, along with widespread language and code mixing, pose representation challenges that are exacerbated by sparse training data: Indic languages make up just 1% of Common Crawl corpora even though India accounts for 18% of the global population. Consequently, the lack of Indic language relevance and contextual representation leads current models to exhibit cultural and linguistic biases oriented towards Western contexts.
We present the Krutrim Large Language Model (LLM), a multilingual foundation model trained on 2 trillion tokens and designed to serve Indian demographic needs through equitable representation of the country's array of native tongues. The training data incorporates the largest known Indic language dataset, mitigating the data scarcity that hampers model parity across these languages. Evaluations demonstrate Krutrim's strong performance on Indic language benchmarks, matching or surpassing state-of-the-art models despite being trained with significantly fewer FLOPs. Krutrim also matches or exceeds the standards set on English benchmarks by models trained with comparable FLOPs (e.g., beating Llama-2 on 10 out of 16 tasks, with an average score of 0.57 vs. 0.55), evidencing flexible multilingual fluency. Through intentional design choices that redress endemic data imbalances, Krutrim represents meaningful progress in the pursuit of ethical, globally representative AI foundation models.
Krutrim-1 is natively multilingual and delivers promising performance on Indic language understanding and generation.
The model has been used as the backbone for our vision-language model, Chitrarth-1, as well as our speech-language model, Dhwani-1.
Cultural Sensitivity: Krutrim respects cultural practices and is unbiased, favouring no single religion or caste.
Translation: Krutrim-1 demonstrates proficiency in translation tasks across Indic languages.
The Krutrim model architecture is based on the standard decoder-only transformer framework. Key model parameters are listed in the table below.
We trained the 7B-parameter model with a context length of 4096 tokens. We use the ALiBi positional encoding method, which helps extend the context length, and grouped-query attention (GQA) for faster inference and a lower KV-cache memory footprint. QKV matrix values are clipped for stable training, and the standard ReLU activation function is used. An illustrative sketch of these attention components follows the table.
# | Hyperparameter | Value |
---|---|---|
1 | Layers | 32 |
2 | Max sequence length | 4096 |
3 | Vocab size | 70K |
4 | Number of KV heads | 8 |
5 | Number of attention heads | 48 |
6 | Hidden dimension | 4608 |
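To make the ALiBi and GQA choices above concrete, below is a minimal, illustrative PyTorch sketch of ALiBi bias construction and grouped-query attention with QKV value clipping. Parameter names such as `n_kv_heads` and `clip_val`, the simplified slope schedule, and the clipping threshold are assumptions for illustration only, not the actual Krutrim implementation.

```python
import math
import torch
import torch.nn.functional as F

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalties added to attention scores (ALiBi)."""
    # Simplified slope schedule: geometric sequence 2^(-8*(i+1)/n_heads) per head.
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]           # (seq, seq), <= 0 for past positions
    return slopes[:, None, None] * distance[None]    # (heads, seq, seq)

def gqa_attention(q, k, v, n_kv_heads: int, clip_val: float = 6.0):
    """Grouped-query attention with clipped Q/K/V, a causal mask, and ALiBi bias.

    q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    """
    b, n_heads, s, d = q.shape
    group = n_heads // n_kv_heads
    # Clip activations for training stability (threshold is an assumed placeholder).
    q, k, v = (t.clamp(-clip_val, clip_val) for t in (q, k, v))
    # Each group of query heads shares one KV head, shrinking the KV cache.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d)
    scores = scores + alibi_bias(n_heads, s).to(scores)
    causal = torch.triu(torch.ones(s, s, dtype=torch.bool, device=scores.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

With the values in the table (48 attention heads, 8 KV heads, hidden dimension 4608), each KV head would be shared by 6 query heads with a head dimension of 96.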
Existing open-source tokenizers do not perform well on Indic languages, leading to a high token-to-word ratio. A sub-optimal tokenizer degrades both training and inference in terms of speed and accuracy. To address this, we trained a tokenizer from scratch, optimized for both English and Indic languages.
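As an illustration of the token-to-word ratio mentioned above, the snippet below trains a small SentencePiece BPE tokenizer and measures the ratio on held-out sentences. The corpus file, training options, and sample sentences are placeholders for illustration and do not reflect Krutrim's actual tokenizer training setup.

```python
import sentencepiece as spm

# Train a BPE tokenizer on a mixed English + Indic corpus ("corpus.txt" is a placeholder).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="indic_bpe",
    vocab_size=70_000,        # matches the 70K vocabulary size in the table above
    model_type="bpe",
    character_coverage=1.0,   # retain full coverage of Indic scripts
)

sp = spm.SentencePieceProcessor(model_file="indic_bpe.model")

def token_to_word_ratio(sentences):
    """Average number of tokens produced per whitespace-delimited word."""
    n_tokens = sum(len(sp.encode(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

held_out = ["यह एक उदाहरण वाक्य है।", "This is an example sentence."]
print(f"token-to-word ratio: {token_to_word_ratio(held_out):.2f}")
```

A lower ratio on Indic text means fewer tokens per word, which translates directly into faster training and inference and a longer effective context.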
We pre-trained the Krutrim-1 LLM on a dataset of 2 trillion tokens. This was followed by supervised fine-tuning for instruction following on a variety of tasks covering translation, summarization, general knowledge, coding, safety, and others.
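As a sketch of the supervised fine-tuning step, the snippet below shows one common way to assemble an instruction-following example so that the loss is computed only on the response tokens. The prompt template and the `tokenizer.encode` interface are illustrative assumptions, not Krutrim's actual SFT recipe.

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def build_sft_example(tokenizer, instruction: str, response: str, eos_id: int):
    """Tokenize an (instruction, response) pair and mask the prompt in the labels."""
    # Hypothetical prompt template; the real template is a design choice of the SFT recipe.
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(response) + [eos_id]

    input_ids = prompt_ids + response_ids
    # Supervise only the response: prompt positions contribute no loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```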
We evaluate the fine-tuned Krutrim-1 model on various English benchmark tasks against Llama-2 7B Chat SFT, and on Indic language benchmarks against GPT-3.5, Airavata, Kan-LLaMA, and Tam-LLaMA. The table below reports the Indic results; column headers are language codes (bn = Bengali, gu = Gujarati, hi = Hindi, kn = Kannada, ml = Malayalam, mr = Marathi, ta = Tamil, te = Telugu). An illustrative sketch of log-likelihood scoring for COPA-style multiple-choice tasks follows the table.
Model | bn | gu | hi | kn | ml | mr | ta | te |
---|---|---|---|---|---|---|---|---|
IndicCOPA | ||||||||
Krutrim-1 | 0.89 | 0.83 | 0.86 | 0.88 | 0.88 | 0.87 | 0.89 | 0.89 |
GPT-3.5 | 0.77 | 0.73 | 0.77 | 0.74 | 0.75 | 0.70 | 0.72 | 0.75 |
Airavata | - | - | 0.74 | - | - | - | - | - |
Kan-LLaMA | - | - | - | 0.74 | - | - | - | - |
Tam-LLaMA | - | - | - | - | - | - | 0.77 | - |
IndicQA | ||||||||
Krutrim-1 | 0.65 | 0.64 | 0.64 | 0.60 | 0.66 | 0.58 | 0.75 | 0.83 |
Airavata | - | - | 0.62 | - | - | - | - | - |
Kan-LLaMA | - | - | - | 0.52 | - | - | - | - |
Tam-LLaMA | - | - | - | - | - | - | 0.35 | - |
IndicSentiment | ||||||||
Krutrim-1 | 0.95 | 0.96 | 0.96 | 0.95 | 0.96 | 0.97 | 0.94 | 0.95 |
GPT-3.5 | 0.50 | 0.81 | 0.96 | 0.60 | 0.75 | 0.88 | 0.51 | 0.53 |
Airavata | - | - | 0.84 | - | - | - | - | - |
Kan-LLaMA | - | - | - | 0.85 | - | - | - | - |
Tam-LLaMA | - | - | - | - | - | - | 0.78 | - |
IndicTranslation | ||||||||
Krutrim-1 | 0.88 | 0.89 | 0.95 | 0.88 | 0.89 | 0.92 | - | 0.88 |
Airavata | - | - | 0.94 | - | - | - | - | - |
Kan-LLaMA | - | - | - | 0.59 | - | - | - | - |
IndicXParaphrase | ||||||||
Krutrim-1 | 0.91 | - | 0.97 | 0.82 | 0.90 | 0.94 | - | 0.61 |
Airavata | - | - | 0.60 | - | - | - | - | - |
Kan-LLaMA | - | - | - | 0.59 | - | - | - | - |
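As referenced above, a common protocol for COPA-style multiple-choice benchmarks such as IndicCOPA is to score each candidate continuation by the log-likelihood the model assigns to it given the premise and pick the highest-scoring one. The sketch below illustrates this; the `model(ids).logits` call assumes a Hugging Face-style causal LM interface, and this is not necessarily the exact protocol behind the scores reported above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def choice_logprob(model, tokenizer, premise: str, choice: str) -> float:
    """Sum of log-probabilities the causal LM assigns to the choice tokens given the premise."""
    premise_ids = tokenizer.encode(premise)
    choice_ids = tokenizer.encode(choice)
    ids = torch.tensor([premise_ids + choice_ids])
    logits = model(ids).logits                        # (1, seq, vocab)
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t + 1
    start = len(premise_ids) - 1                      # first row that predicts a choice token
    targets = torch.tensor(choice_ids)
    return logprobs[start : start + len(choice_ids)].gather(1, targets[:, None]).sum().item()

def predict_copa(model, tokenizer, premise: str, choices: list[str]) -> int:
    """Return the index of the candidate continuation the model finds most likely."""
    scores = [choice_logprob(model, tokenizer, premise, c) for c in choices]
    return max(range(len(scores)), key=scores.__getitem__)
```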