India is one of the world's most vibrant and culturally diverse societies. Developing a general-purpose artificial intelligence system tailored for the Indian market presents unique challenges: accounting for the nation's cultural nuances, accommodating its linguistic diversity across numerous regional languages, adapting to the prominence of oral traditions, ensuring access to relevant datasets, and achieving the scalability needed to serve a vast population effectively. Navigating these complexities successfully requires careful consideration and innovative approaches.
Existing foundation models for natural language tasks are predominantly trained on English data, limiting their effectiveness for the languages native to India's more than one billion citizens. Thousands of regional languages and dialects, along with widespread language and code mixing, pose representation challenges that are exacerbated by sparse training data: Indic languages make up just 1% of Common Crawl corpora even though India accounts for 18% of the global population. Consequently, the lack of Indic language relevance and contextual representation leads current models to exhibit cultural and linguistic biases oriented towards Western contexts.
We present the Krutrim Large Language Model (LLM), a multilingual foundation model trained on 2 trillion tokens and designed to serve Indian demographic needs through equitable representation of the country's array of native tongues. The training data incorporates the largest known Indic language dataset, mitigating the data scarcity that hampers model parity across these languages. Evaluations demonstrate Krutrim's strong performance on Indic language benchmarks, matching or surpassing state-of-the-art models despite being trained with significantly fewer FLOPs. Krutrim also matches or exceeds the standards set on English benchmarks by models trained with comparable FLOPs (e.g., beating Llama-2 on 10 out of 16 tasks, with an average score of 0.57 vs. 0.55), evidencing flexible multilingual fluency. Through intentional design choices that redress endemic data imbalances, Krutrim represents meaningful progress in the pursuit of ethical, globally representative AI foundation models.
Krutrim-1 is natively multilingual and delivers promising performance on Indic language understanding and generation.
The model has been used as the backbone for our vision-language model, Chitrarth-1, as well as our speech-language model, Dhwani-1.
Cultural Sensitivity: Krutrim respects cultural practices and is unbiased, favouring no single religion or caste.
Translation: Krutrim-1 demonstrates proficiency in translation tasks across Indic languages.
The Krutrim model architecture is based on the standard decoder-only transformer framework. Key model parameters are listed in the table below.
We trained the 7B-parameter model with a context length of 4096 tokens. We use the ALiBi positional encoding method, which helps extend the context length, and grouped-query attention (GQA) for faster inference and a lower KV-cache memory footprint. QKV matrix values are clipped for stable training, and the standard ReLU activation function is used. An illustrative sketch of these attention components follows the table.
# | Hyperparameter | Value |
---|---|---|
1 | Layers | 32 |
2 | Max sequence length | 4096 |
3 | Vocab size | 70K |
4 | Number of KV heads | 8 |
5 | Number of attention heads | 48 |
6 | Hidden dimension | 4608 |
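To make the ALiBi and GQA choices above concrete, below is a minimal, illustrative PyTorch sketch of ALiBi bias construction and grouped-query attention with QKV value clipping. Parameter names such as `n_kv_heads` and `clip_val`, the simplified slope schedule, and the clipping threshold are assumptions for illustration only, not the actual Krutrim implementation.

```python
import math
import torch
import torch.nn.functional as F

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Per-head linear distance penalties added to attention scores (ALiBi)."""
    # Simplified slope schedule: geometric sequence 2^(-8*(i+1)/n_heads) per head.
    slopes = torch.tensor([2.0 ** (-8.0 * (i + 1) / n_heads) for i in range(n_heads)])
    pos = torch.arange(seq_len)
    distance = pos[None, :] - pos[:, None]           # (seq, seq), <= 0 for past positions
    return slopes[:, None, None] * distance[None]    # (heads, seq, seq)

def gqa_attention(q, k, v, n_kv_heads: int, clip_val: float = 6.0):
    """Grouped-query attention with clipped Q/K/V, a causal mask, and ALiBi bias.

    q: (batch, n_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim)
    """
    b, n_heads, s, d = q.shape
    group = n_heads // n_kv_heads
    # Clip activations for training stability (threshold is an assumed placeholder).
    q, k, v = (t.clamp(-clip_val, clip_val) for t in (q, k, v))
    # Each group of query heads shares one KV head, shrinking the KV cache.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    scores = (q @ k.transpose(-2, -1)) / math.sqrt(d)
    scores = scores + alibi_bias(n_heads, s).to(scores)
    causal = torch.triu(torch.ones(s, s, dtype=torch.bool, device=scores.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```

With the values in the table (48 attention heads, 8 KV heads, hidden dimension 4608), each KV head would be shared by 6 query heads with a head dimension of 96.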
Existing open-source tokenizers do not perform well on Indic languages, leading to a high token-to-word ratio. A sub-optimal tokenizer degrades both training and inference in terms of speed and accuracy. To address this, we trained a tokenizer from scratch, optimized for both English and Indic languages.
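As an illustration of the token-to-word ratio mentioned above, the snippet below trains a small SentencePiece BPE tokenizer and measures the ratio on held-out sentences. The corpus file, training options, and sample sentences are placeholders for illustration and do not reflect Krutrim's actual tokenizer training setup.

```python
import sentencepiece as spm

# Train a BPE tokenizer on a mixed English + Indic corpus ("corpus.txt" is a placeholder).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="indic_bpe",
    vocab_size=70_000,        # matches the 70K vocabulary size in the table above
    model_type="bpe",
    character_coverage=1.0,   # retain full coverage of Indic scripts
)

sp = spm.SentencePieceProcessor(model_file="indic_bpe.model")

def token_to_word_ratio(sentences):
    """Average number of tokens produced per whitespace-delimited word."""
    n_tokens = sum(len(sp.encode(s)) for s in sentences)
    n_words = sum(len(s.split()) for s in sentences)
    return n_tokens / n_words

held_out = ["यह एक उदाहरण वाक्य है।", "This is an example sentence."]
print(f"token-to-word ratio: {token_to_word_ratio(held_out):.2f}")
```

A lower ratio on Indic text means fewer tokens per word, which translates directly into faster training and inference and a longer effective context.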
We pre-trained the Krutrim-1 LLM on a dataset of 2 trillion tokens. This was followed by supervised fine-tuning for instruction following on a variety of tasks covering translation, summarization, general knowledge, coding, safety, and others.
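As a sketch of the supervised fine-tuning step, the snippet below shows one common way to assemble an instruction-following example so that the loss is computed only on the response tokens. The prompt template and the `tokenizer.encode` interface are illustrative assumptions, not Krutrim's actual SFT recipe.

```python
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def build_sft_example(tokenizer, instruction: str, response: str, eos_id: int):
    """Tokenize an (instruction, response) pair and mask the prompt in the labels."""
    # Hypothetical prompt template; the real template is a design choice of the SFT recipe.
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    prompt_ids = tokenizer.encode(prompt)
    response_ids = tokenizer.encode(response) + [eos_id]

    input_ids = prompt_ids + response_ids
    # Supervise only the response: prompt positions contribute no loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}
```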
We evaluate the fine-tuned Krutrim-1 model on various English benchmark tasks against Llama-2 7B Chat SFT, and on Indic language benchmarks against GPT-3.5, Airavata, Kan-LLaMA, and Tam-LLaMA. The table below reports the Indic results; column headers are language codes (bn = Bengali, gu = Gujarati, hi = Hindi, kn = Kannada, ml = Malayalam, mr = Marathi, ta = Tamil, te = Telugu). An illustrative sketch of log-likelihood scoring for COPA-style multiple-choice tasks follows the table.
Model | bn | gu | hi | kn | ml | mr | ta | te |
---|---|---|---|---|---|---|---|---|
IndicCOPA | ||||||||
Krutrim-1 | 0.89 | 0.83 | 0.86 | 0.88 | 0.88 | 0.87 | 0.89 | 0.89 |
GPT-3.5 | 0.77 | 0.73 | 0.77 | 0.74 | 0.75 | 0.70 | 0.72 | 0.75 |
Airavata | - | - | 0.74 | - | - | - | - | - |
Kan-LLaMA | - | - | - | 0.74 | - | - | - | - |
Tam-LLaMA | - | - | - | - | - | - | 0.77 | - |
IndicQA | ||||||||
Krutrim-1 | 0.65 | 0.64 | 0.64 | 0.60 | 0.66 | 0.58 | 0.75 | 0.83 |
Airavata | - | - | 0.62 | - | - | - | - | - |
Kan-LLaMA | - | - | - | 0.52 | - | - | - | - |
Tam-LLaMA | - | - | - | - | - | - | 0.35 | - |
IndicSentiment | ||||||||
Krutrim-1 | 0.95 | 0.96 | 0.96 | 0.95 | 0.96 | 0.97 | 0.94 | 0.95 |
GPT-3.5 | 0.50 | 0.81 | 0.96 | 0.60 | 0.75 | 0.88 | 0.51 | 0.53 |
Airavata | - | - | 0.84 | - | - | - | - | - |
Kan-LLaMA | - | - | - | 0.85 | - | - | - | - |
Tam-LLaMA | - | - | - | - | - | - | 0.78 | - |
IndicTranslation | ||||||||
Krutrim-1 | 0.88 | 0.89 | 0.95 | 0.88 | 0.89 | 0.92 | - | 0.88 |
Airavata | - | - | 0.94 | - | - | - | - | - |
Kan-LLaMA | - | - | - | 0.59 | - | - | - | - |
IndicXParaphrase | ||||||||
Krutrim-1 | 0.91 | - | 0.97 | 0.82 | 0.90 | 0.94 | - | 0.61 |
Airavata | - | - | 0.60 | - | - | - | - | - |
Kan-LLaMA | - | - | - | 0.59 | - | - | - | - |
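As referenced above, a common protocol for COPA-style multiple-choice benchmarks such as IndicCOPA is to score each candidate continuation by the log-likelihood the model assigns to it given the premise and pick the highest-scoring one. The sketch below illustrates this; the `model(ids).logits` call assumes a Hugging Face-style causal LM interface, and this is not necessarily the exact protocol behind the scores reported above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def choice_logprob(model, tokenizer, premise: str, choice: str) -> float:
    """Sum of log-probabilities the causal LM assigns to the choice tokens given the premise."""
    premise_ids = tokenizer.encode(premise)
    choice_ids = tokenizer.encode(choice)
    ids = torch.tensor([premise_ids + choice_ids])
    logits = model(ids).logits                        # (1, seq, vocab)
    logprobs = F.log_softmax(logits[0, :-1], dim=-1)  # position t predicts token t + 1
    start = len(premise_ids) - 1                      # first row that predicts a choice token
    targets = torch.tensor(choice_ids)
    return logprobs[start : start + len(choice_ids)].gather(1, targets[:, None]).sum().item()

def predict_copa(model, tokenizer, premise: str, choices: list[str]) -> int:
    """Return the index of the candidate continuation the model finds most likely."""
    scores = [choice_logprob(model, tokenizer, premise, c) for c in choices]
    return max(range(len(scores)), key=scores.__getitem__)
```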