
Krutrim-2

SoTA Indic LLM outperforming 5x-10x bigger models on Indic tasks

Text Generation Model
12B params

Description

The plurality of Indian languages and cultures poses unique challenges in building general-purpose AI models for India. For large language models, these include multilingual understanding across this linguistic diversity and the ability to respond while honoring cultural nuances and maintaining a personable conversational tone. Toward that end, we trained Krutrim-1, our first multilingual foundation model for India, in 2023 and released it to the public in January 2024. The model delivered promising performance on multiple Indic benchmarks. However, owing to its smaller size (7B parameters) and training on relatively fewer FLOPs, it left a lot to be desired for its users. Given the prevalence of synthetic web data in its training corpus and its limited capacity for alignment, it showed a tendency toward confirmation bias, leading to hallucinations such as claiming it was built by other AI labs.

Building upon our foundational work, we now present Krutrim-2, a best-in-class large language model for Indic languages. The model has been meticulously crafted to cater to various linguistic needs within India and beyond. Krutrim-2 is a 12-billion-parameter dense transformer model, built on the Mistral-NeMo architecture. It received comprehensive training on a rich dataset encompassing English, Indic languages (hundreds of billions of tokens), code, mathematics, literary works, and high-quality synthetically generated content. It is natively multilingual (English and 22 Indian languages) and supports a context window of 128K tokens.

The model delivers best-in-class performance across Indic tasks and promising performance on English benchmarks, on par with models 5-10x its size. We present details of the model architecture, pre-training, post-training, and evaluation results, and we publicly release the post-trained versions of the model. We are continuously improving the model through post-training techniques such as RLVR (reinforcement learning with verifiable rewards).

At Krutrim, we have consistently strived to create advanced models capable of understanding, processing, and generating content in multiple languages. With Krutrim-2, our journey takes another significant leap forward and paves the way toward our mission of building models for India.


Use Cases

Creative writing and more relevant responses in Indian languages

Krutrim-2 is natively multilingual, delivering state-of-the-art performance on Indic benchmarks. It also matches or outperforms models up to six times larger in multilingual tasks such as creative writing, summarization, and translation.

Long-form generation

The model supports a context window of 128K tokens, enhancing its ability to handle extensive inputs and maintain context over longer interactions. This makes it well suited for long-form generation, multi-turn conversations, document translation, coding, and other long-context tasks.

Multimodal applications

Its improved multilingual understanding and generation capabilities in Indian languages make it the model of choice as a backbone for large multimodal models targeting visual understanding, captioning, and speech applications in the Indian context.

Cost efficient AI applications

With best-in-class performance on Indic tasks, and performance better than or competitive with much larger models in tasks like coding and instruction following, the model offers a significant cost advantage when integrated into AI applications for India. Further, its enhanced multilingual understanding and generation capabilities can be distilled into much smaller models.

Demo

Engages in multi-turn conversation in Indic languages

Solves math problems in Indic languages

Explains code in your language

Understands the Indian context

Supports low-resource languages like Sanskrit

Works in agentic applications

Model Architecture and Training

Krutrim-2 is a 12B-parameter dense transformer model based on the Mistral-NeMo architecture. The model is pre-trained on high-quality data comprising a curated mix of English, Indic, code, math, books, and synthetic data. It is natively multilingual (English and Indian languages) and supports a context window of 128K tokens. We followed a multi-stage training procedure, varying the data mix, context size, and batch size at every stage, leading to stable and efficient training.

After pre-training, the model underwent supervised fine-tuning for cross-task instruction following, followed by direct preference optimization (DPO) for alignment.

| # | Hyperparameter | Value |
| --- | --- | --- |
| 1 | Layers | 40 |
| 2 | Max sequence length | 128K |
| 3 | Vocab size | 131K |
| 4 | Attention type | GQA (Grouped Query Attention) |
| 5 | Positional embeddings | RoPE |
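To make the table concrete, here is a back-of-the-envelope calculation of why Grouped Query Attention matters at a 128K context: it shrinks the KV cache by the ratio of query heads to KV heads. The layer count and context length come from the table above; the head counts and head dimension are assumptions based on the Mistral-NeMo architecture, not confirmed Krutrim-2 values.

```python
# Back-of-the-envelope KV-cache sizing for a 128K context window.
# Layer count and sequence length are from the hyperparameter table;
# head counts and head_dim are assumptions (Mistral-NeMo-like), not
# confirmed Krutrim-2 values.
n_layers = 40
n_query_heads = 32   # assumption
n_kv_heads = 8       # assumption: each KV head is shared by 4 query heads
head_dim = 128       # assumption
seq_len = 128_000    # 128K context window
bytes_per_value = 2  # bf16

def kv_cache_bytes(kv_heads: int) -> int:
    # 2x for keys and values, stored per layer, per head, per position
    return 2 * n_layers * kv_heads * head_dim * seq_len * bytes_per_value

mha_gib = kv_cache_bytes(n_query_heads) / 2**30  # if every head kept its own KV
gqa_gib = kv_cache_bytes(n_kv_heads) / 2**30     # with grouped-query attention
print(f"MHA KV cache: {mha_gib:.1f} GiB; GQA KV cache: {gqa_gib:.1f} GiB")
```

Under these assumed dimensions, GQA cuts the per-sequence KV cache from roughly 78 GiB to roughly 20 GiB, which is part of what makes 128K-token inference practical.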

Evaluation

EN Benchmarks

We use the LM Evaluation Harness to evaluate our model on the English benchmark tasks. Please note that at the time of writing this report, we were unable to use the evaluation framework for Llama-3.3-70B, Gemini-1.5 Flash, and GPT-4o, so we currently report the available published numbers for these models. We recognise that the prompt templates and few-shot settings may vary and are working to make these evaluations consistent.
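As a concrete illustration, a typical invocation of the harness looks like the following. The repository id is an assumption (check the Krutrim-2 HF page for the exact name); the flags follow the lm-evaluation-harness v0.4 CLI.

```shell
# Hypothetical example: evaluate a checkpoint on two of the benchmarks below.
# The pretrained= repository id is an assumption, not a confirmed path.
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=krutrim-ai-labs/Krutrim-2-instruct,dtype=bfloat16 \
  --tasks hellaswag,winogrande \
  --num_fewshot 0 \
  --batch_size auto
```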

| # | Benchmark | Metric | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | Llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Hellaswag (0-shot) | Common sense reasoning - Accuracy | 0.74 | 0.82 | 0.83 | 0.95 | 0.87 (10-shot) | 0.95 (10-shot) |
| 2 | Winogrande (0-shot) | Common sense reasoning and NLU - Accuracy | 0.67 | 0.74 | 0.77 | 0.85 (5-shot) | - | 0.88 (5-shot) |
| 3 | OpenBookQA (0-shot) | General knowledge, science - Accuracy | 0.45 | 0.46 | 0.49 | - | - | - |
| 4 | CommonSenseQA (0-shot) | Common sense reasoning - Accuracy | 0.74 | 0.70 | 0.74 | - | - | 0.85 |
| 5 | TruthfulQA (0-shot) | Factuality - Accuracy | 0.49 | 0.54 | 0.59 | - | - | 0.59 |
| 6 | MMLU (5-shot) | Language understanding - Accuracy | 0.47 | 0.68 | 0.63 | 0.82 | 0.79 | 0.86 |
| 7 | TriviaQA (5-shot) | Reading comprehension - EM | 0.44 | 0.72 | 0.62 | - | - | - |
| 8 | NaturalQuestions (5-shot) | EM | 0.15 | 0.28 | 0.26 | - | - | - |
| 9 | GSM8K (0-shot) | Math - EM | 0.07 | 0.74 | 0.71 | 0.93 (8-shot, CoT) | 0.86 (11-shot) | 0.89 |
| 10 | ARC-Challenge (0-shot) | Knowledge reasoning - Accuracy | 0.48 | 0.59 | 0.60 | 0.93 (25-shot) | - | 0.50 |
| 11 | ARC-Easy (0-shot) | Knowledge reasoning - Accuracy | 0.73 | 0.80 | 0.82 | - | - | - |
| 12 | HumanEval | Coding - Pass@10 | 0.00 | 0.23 | 0.80 | 0.88 | 0.74 (0-shot) | 0.90 |
| 13 | IFEval (0-shot) | Instruction following - Accuracy | 0.16 | - | 0.56 | 0.92 | - | 0.84 |

Indic Benchmarks

Average across 11 languages

| # | Benchmark | Metric (Accuracy) | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | Llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | IndicSentiment (0-shot) | Text classification | 0.65 | 0.70 | 0.95 | 0.96 | 0.99 | 0.98 |
| 2 | IndicCOPA (0-shot) | Commonsense causal reasoning | 0.51 | 0.58 | 0.80 | 0.83 | 0.88 | 0.91 |
| 3 | IndicXParaphrase (0-shot) | Generation | 0.67 | 0.74 | 0.88 | 0.87 | 0.89 | TBD |
| 4 | IndicXNLI (0-shot) | Language understanding | 0.47 | 0.54 | 0.55 | TBD | TBD | 0.67 |
| 5 | IndicQA (0-shot) | Generation | 0.90 | 0.91 | TBD | TBD | TBD | - |
| 6 | CrossSumIN (1-shot) | Cross-lingual summarization | 0.17 | 0.21 | 0.26 | 0.24 | TBD | - |
| 7 | FloresIN Translation xx-en (1-shot) | Translation | 0.50 | 0.58 | 0.60 | 0.62 | 0.63 | - |
| 8 | FloresIN Translation en-xx (1-shot) | Translation | 0.34 | 0.48 | 0.46 | 0.47 | 0.48 | - |
| 9 | IN22 Translation xx-en (0-shot) | Translation | 0.48 | 0.57 | 0.58 | 0.55 | 0.54 | - |
| 10 | IN22 Translation en-xx (0-shot) | Translation | 0.33 | 0.45 | 0.42 | 0.44 | 0.43 | - |

Bharat Bench

The existing Indic benchmarks are not natively in Indian languages; rather, they are translations of existing English benchmarks, and they do not sufficiently capture the linguistic nuances of Indian languages or aspects of Indian culture. To address this, Krutrim released BharatBench, a natively Indic benchmark that encompasses the linguistic and cultural diversity of the Indic region, ensuring that evaluations are relevant and representative of real-world use cases in India.

| # | Benchmark | Metric | Score Type | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | Llama-3.1-70B-Instruct | Gemma-2-27B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Indian Cultural Context (0-shot) | Generation | BERTScore | 0.86 | 0.56 | 0.88 | 0.88 | 0.87 | 0.89 |
| 2 | Grammar Correction (5-shot) | Language understanding | BERTScore | 0.96 | 0.94 | 0.98 | 0.98 | 0.96 | 0.97 |
| 3 | Multi Turn (0-shot) | Generation | BERTScore | 0.88 | 0.87 | 0.91 | 0.90 | 0.89 | 0.92 |
| 4 | Multi Turn Comprehension (0-shot) | Comprehension | BERTScore | 0.90 | 0.89 | 0.92 | 0.93 | 0.91 | 0.94 |
| 5 | Multi Turn Translation (0-shot) | Translation | BERTScore | 0.85 | 0.87 | 0.92 | 0.91 | 0.91 | 0.92 |
| 6 | Text Classification (5-shot) | Classification | Accuracy | 0.61 | 0.71 | 0.76 | 0.88 | 0.86 | 0.89 |
| 7 | Named Entity Recognition (5-shot) | NER | Accuracy | 0.31 | 0.51 | 0.53 | 0.61 | 0.65 | 0.65 |

Qualitative Evaluation

Below are the results from manual evaluation of prompt-response pairs across languages and task categories. Scores range from 1 to 5 (higher is better). Model names were anonymised during the evaluation.


How to access the model?

Chat Application

Users can directly access the model on our chat application here: chat.olakrutrim.com/home

API Integration

Developers can integrate the model into their applications via the model API available on Krutrim Cloud.
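As an illustration, the request shape for an OpenAI-style chat-completions endpoint can be sketched as below. The endpoint URL, model id, and header names are assumptions for illustration only; consult the Krutrim Cloud API documentation for the actual values.

```python
# Hypothetical sketch of calling Krutrim-2 through an OpenAI-style
# chat-completions API. BASE_URL and the model id are assumptions,
# not confirmed Krutrim Cloud values.
import json

BASE_URL = "https://cloud.olakrutrim.com/v1/chat/completions"  # assumption
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "Krutrim-2-instruct",  # assumed model id
    "messages": [
        {"role": "user", "content": "मुझे भारत के बारे में एक कविता लिखो"}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

body = json.dumps(payload).encode("utf-8")
print(json.dumps(payload, ensure_ascii=False, indent=2))
# To actually send the request with the standard library:
# import urllib.request
# req = urllib.request.Request(BASE_URL, data=body, headers=headers)
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```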

Run locally

Please visit the Krutrim-2 repository or Krutrim-2 HF page for details on running the model locally.
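A minimal local-inference sketch, assuming the model ships as a standard Hugging Face Transformers checkpoint (the repository id below is a guess; use the id listed on the Krutrim-2 HF page):

```shell
# Hypothetical sketch -- the repository id is an assumption.
pip install "transformers>=4.44" torch accelerate

python - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "krutrim-ai-labs/Krutrim-2-instruct"  # assumed HF repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="bfloat16"
)

messages = [{"role": "user", "content": "भारत की राजधानी क्या है?"}]
inputs = tok.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
PY
```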