
Krutrim-2

SoTA Indic LLM outperforming 5x-10x bigger models on Indic tasks

Text Generation Model
12B params

Description

The plurality of Indian languages and cultures poses unique challenges in building general-purpose AI models for India. For large language models, these include multilingual understanding across this linguistic diversity and the ability to respond while honoring cultural nuances and maintaining a personable conversational tone. Toward that end, we trained Krutrim-1, our first multilingual foundation model for India, in 2023 and released it to the public in January 2024. The model delivered promising performance on multiple Indic benchmarks. However, owing to its smaller size (7B parameters) and training on relatively fewer FLOPs, it left a lot to be desired for its users. Given the prevalence of synthetic web data in its training corpus and its limited capacity for alignment, it showed a tendency toward confirmation bias, leading to hallucinations such as claiming it was built by other AI labs.

Building upon our foundational work, we now present Krutrim-2, a best-in-class large language model for Indic languages. The model has been meticulously crafted to cater to various linguistic needs within India and beyond. Krutrim-2 is a 12-billion-parameter dense transformer model, built on the Mistral-NeMo architecture. It received comprehensive training on a rich dataset encompassing English, Indic languages (hundreds of billions of tokens), code, mathematics, literary works, and high-quality synthetically generated content. It is natively multilingual (English and 22 Indian languages) and supports a context window of 128K tokens.

The model delivers best-in-class performance across Indic tasks and promising performance on English benchmarks, on par with models 5-10x its size. We present details of the model architecture, pre-training, post-training, and evaluation results, and we publicly release the post-trained versions of the model. We are continuously improving the model through post-training techniques such as RLVR (reinforcement learning with verifiable rewards).

At Krutrim, we have consistently strived to create advanced models capable of understanding, processing, and generating content in multiple languages. With Krutrim-2, our journey takes another significant leap forward and paves the way toward our mission of building models for India.


Use Cases

Creative writing and more relevant responses in Indian languages

Krutrim-2 is natively multilingual, delivering state-of-the-art performance on Indic benchmarks. It also matches or outperforms models up to six times larger in multilingual tasks such as creative writing, summarization, and translation.

Long-form generation

The model supports a context window of 128K tokens, enhancing its ability to handle extensive inputs and maintain context over longer interactions. This makes it well suited for long-form generation, multi-turn conversations, document translation, coding, and other long-context tasks.

Multimodal applications

Its improved multilingual understanding and generation capabilities in Indian languages make it the model of choice as a backbone for large multimodal models targeting visual understanding, captioning, and speech applications in the Indian context.

Cost efficient AI applications

With best-in-class performance on Indic tasks, and performance better than or competitive with much larger models in tasks like coding and instruction following, the model offers a significant cost advantage when integrated into AI applications for India. Further, its enhanced multilingual understanding and generation capabilities can be distilled into much smaller models.

Demo

Engages in multi-turn conversation in Indic languages

Solves math problems in Indic languages

Explains code in your language

Understands the Indian context

Supports low-resource languages like Sanskrit

Works in agentic applications

Model Architecture and Training

Krutrim-2 is a 12B-parameter dense transformer model based on the Mistral-NeMo architecture. The model is pre-trained on high-quality data comprising a curated mix of English, Indic, code, math, books, and synthetic data. It is natively multilingual (English and Indian languages) and supports a context window of 128K tokens. We followed a multi-stage training procedure, varying the data mix, context size, and batch size at every stage, leading to stable and efficient training.

After pre-training, the model underwent supervised fine-tuning for cross-task instruction following, followed by direct preference optimization (DPO) for alignment.

| # | Hyperparameter | Value |
| --- | --- | --- |
| 1 | Layers | 40 |
| 2 | Max sequence length | 128K |
| 3 | Vocab size | 131K |
| 4 | Attention type | GQA (Grouped Query Attention) |
| 5 | Positional embeddings | RoPE |
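To make the table concrete, here is a back-of-the-envelope calculation of why Grouped Query Attention matters at a 128K context: it shrinks the KV cache by the ratio of query heads to KV heads. The layer count and context length come from the table above; the head counts and head dimension are assumptions based on the Mistral-NeMo architecture, not confirmed Krutrim-2 values.

```python
# Back-of-the-envelope KV-cache sizing for a 128K context window.
# Layer count and sequence length are from the hyperparameter table;
# head counts and head_dim are assumptions (Mistral-NeMo-like), not
# confirmed Krutrim-2 values.
n_layers = 40
n_query_heads = 32   # assumption
n_kv_heads = 8       # assumption: each KV head is shared by 4 query heads
head_dim = 128       # assumption
seq_len = 128_000    # 128K context window
bytes_per_value = 2  # bf16

def kv_cache_bytes(kv_heads: int) -> int:
    # 2x for keys and values, stored per layer, per head, per position
    return 2 * n_layers * kv_heads * head_dim * seq_len * bytes_per_value

mha_gib = kv_cache_bytes(n_query_heads) / 2**30  # if every head kept its own KV
gqa_gib = kv_cache_bytes(n_kv_heads) / 2**30     # with grouped-query attention
print(f"MHA KV cache: {mha_gib:.1f} GiB; GQA KV cache: {gqa_gib:.1f} GiB")
```

Under these assumed dimensions, GQA cuts the per-sequence KV cache from roughly 78 GiB to roughly 20 GiB, which is part of what makes 128K-token inference practical.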

Evaluation

EN Benchmarks

We use the LM Evaluation Harness to evaluate our model on the English benchmark tasks. Please note that at the time of writing this report, we were unable to use the evaluation framework for Llama-3.3-70B, Gemini-1.5 Flash, and GPT-4o, so we currently report the available published numbers for these models. We recognise that the prompt templates and few-shot settings may vary and are working to make these evaluations consistent.
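As a concrete illustration, a typical invocation of the harness looks like the following. The repository id is an assumption (check the Krutrim-2 HF page for the exact name); the flags follow the lm-evaluation-harness v0.4 CLI.

```shell
# Hypothetical example: evaluate a checkpoint on two of the benchmarks below.
# The pretrained= repository id is an assumption, not a confirmed path.
pip install lm-eval
lm_eval --model hf \
  --model_args pretrained=krutrim-ai-labs/Krutrim-2-instruct,dtype=bfloat16 \
  --tasks hellaswag,winogrande \
  --num_fewshot 0 \
  --batch_size auto
```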

| # | Benchmark | Metric | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | Llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Hellaswag (0-shot) | Common sense reasoning - Accuracy | 0.74 | 0.82 | 0.83 | 0.95 | 0.87 (10-shot) | 0.95 (10-shot) |
| 2 | Winogrande (0-shot) | Common sense reasoning and NLU - Accuracy | 0.67 | 0.74 | 0.77 | 0.85 (5-shot) | - | 0.88 (5-shot) |
| 3 | OpenBookQA (0-shot) | General knowledge, science - Accuracy | 0.45 | 0.46 | 0.49 | - | - | - |
| 4 | CommonSenseQA (0-shot) | Common sense reasoning - Accuracy | 0.74 | 0.70 | 0.74 | - | - | 0.85 |
| 5 | TruthfulQA (0-shot) | Factuality - Accuracy | 0.49 | 0.54 | 0.59 | - | - | 0.59 |
| 6 | MMLU (5-shot) | Language understanding - Accuracy | 0.47 | 0.68 | 0.63 | 0.82 | 0.79 | 0.86 |
| 7 | TriviaQA (5-shot) | Reading comprehension - EM | 0.44 | 0.72 | 0.62 | - | - | - |
| 8 | NaturalQuestions (5-shot) | EM | 0.15 | 0.28 | 0.26 | - | - | - |
| 9 | GSM8K (0-shot) | Math - EM | 0.07 | 0.74 | 0.71 | 0.93 (8-shot, CoT) | 0.86 (11-shot) | 0.89 |
| 10 | ARC-Challenge (0-shot) | Knowledge reasoning - Accuracy | 0.48 | 0.59 | 0.60 | 0.93 (25-shot) | - | 0.50 |
| 11 | ARC-Easy (0-shot) | Knowledge reasoning - Accuracy | 0.73 | 0.80 | 0.82 | - | - | - |
| 12 | HumanEval | Coding - Pass@10 | 0.00 | 0.23 | 0.80 | 0.88 | 0.74 (0-shot) | 0.90 |
| 13 | IFEval (0-shot) | Instruction following - Accuracy | 0.16 | - | 0.56 | 0.92 | - | 0.84 |

Indic Benchmarks

Average across 11 languages

| # | Benchmark | Metric (Accuracy) | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | Llama-3.3-70B | Gemini-1.5 Flash | GPT-4o |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | IndicSentiment (0-shot) | Text classification | 0.65 | 0.70 | 0.95 | 0.96 | 0.99 | 0.98 |
| 2 | IndicCOPA (0-shot) | Commonsense causal reasoning | 0.51 | 0.58 | 0.80 | 0.83 | 0.88 | 0.91 |
| 3 | IndicXParaphrase (0-shot) | Generation | 0.67 | 0.74 | 0.88 | 0.87 | 0.89 | TBD |
| 4 | IndicXNLI (0-shot) | Language understanding | 0.47 | 0.54 | 0.55 | TBD | TBD | 0.67 |
| 5 | IndicQA (0-shot) | Generation | 0.90 | 0.91 | TBD | TBD | TBD | - |
| 6 | CrossSumIN (1-shot) | Cross-lingual summarization | 0.17 | 0.21 | 0.26 | 0.24 | TBD | - |
| 7 | FloresIN Translation xx-en (1-shot) | Translation | 0.50 | 0.58 | 0.60 | 0.62 | 0.63 | - |
| 8 | FloresIN Translation en-xx (1-shot) | Translation | 0.34 | 0.48 | 0.46 | 0.47 | 0.48 | - |
| 9 | IN22 Translation xx-en (0-shot) | Translation | 0.48 | 0.57 | 0.58 | 0.55 | 0.54 | - |
| 10 | IN22 Translation en-xx (0-shot) | Translation | 0.33 | 0.45 | 0.42 | 0.44 | 0.43 | - |

Bharat Bench

The existing Indic benchmarks are not natively in Indian languages; rather, they are translations of existing English benchmarks, and they do not sufficiently capture the linguistic nuances of Indian languages or aspects of Indian culture. To address this, Krutrim released BharatBench, a natively Indic benchmark that encompasses the linguistic and cultural diversity of the Indic region, ensuring that evaluations are relevant and representative of real-world use cases in India.

| # | Benchmark | Metric | Score Type | Krutrim-1 7B | MN-12B-Instruct | Krutrim-2 12B | Llama-3.1-70B-Instruct | Gemma-2-27B-Instruct | GPT-4o |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | Indian Cultural Context (0-shot) | Generation | BERTScore | 0.86 | 0.56 | 0.88 | 0.88 | 0.87 | 0.89 |
| 2 | Grammar Correction (5-shot) | Language understanding | BERTScore | 0.96 | 0.94 | 0.98 | 0.98 | 0.96 | 0.97 |
| 3 | Multi Turn (0-shot) | Generation | BERTScore | 0.88 | 0.87 | 0.91 | 0.90 | 0.89 | 0.92 |
| 4 | Multi Turn Comprehension (0-shot) | Comprehension | BERTScore | 0.90 | 0.89 | 0.92 | 0.93 | 0.91 | 0.94 |
| 5 | Multi Turn Translation (0-shot) | Translation | BERTScore | 0.85 | 0.87 | 0.92 | 0.91 | 0.91 | 0.92 |
| 6 | Text Classification (5-shot) | Classification | Accuracy | 0.61 | 0.71 | 0.76 | 0.88 | 0.86 | 0.89 |
| 7 | Named Entity Recognition (5-shot) | NER | Accuracy | 0.31 | 0.51 | 0.53 | 0.61 | 0.65 | 0.65 |

Qualitative Evaluation

Below are the results from manual evaluation of prompt-response pairs across languages and task categories. Scores range from 1 to 5 (higher is better). Model names were anonymised during the evaluation.


How to access the model?

Chat Application

Users can directly access the model on our chat application here: chat.olakrutrim.com/home

API Integration

Developers can integrate the model into their applications via the model API available on Krutrim Cloud.
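As an illustration, the request shape for an OpenAI-style chat-completions endpoint can be sketched as below. The endpoint URL, model id, and header names are assumptions for illustration only; consult the Krutrim Cloud API documentation for the actual values.

```python
# Hypothetical sketch of calling Krutrim-2 through an OpenAI-style
# chat-completions API. BASE_URL and the model id are assumptions,
# not confirmed Krutrim Cloud values.
import json

BASE_URL = "https://cloud.olakrutrim.com/v1/chat/completions"  # assumption
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "Krutrim-2-instruct",  # assumed model id
    "messages": [
        {"role": "user", "content": "मुझे भारत के बारे में एक कविता लिखो"}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
}
headers = {
    "Authorization": f"Bearer {API_KEY}",
    "Content-Type": "application/json",
}

body = json.dumps(payload).encode("utf-8")
print(json.dumps(payload, ensure_ascii=False, indent=2))
# To actually send the request with the standard library:
# import urllib.request
# req = urllib.request.Request(BASE_URL, data=body, headers=headers)
# resp = json.load(urllib.request.urlopen(req))
# print(resp["choices"][0]["message"]["content"])
```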

Run locally

Please visit the Krutrim-2 repository or Krutrim-2 HF page for details on running the model locally.
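A minimal local-inference sketch, assuming the model ships as a standard Hugging Face Transformers checkpoint (the repository id below is a guess; use the id listed on the Krutrim-2 HF page):

```shell
# Hypothetical sketch -- the repository id is an assumption.
pip install "transformers>=4.44" torch accelerate

python - <<'PY'
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "krutrim-ai-labs/Krutrim-2-instruct"  # assumed HF repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="auto", torch_dtype="bfloat16"
)

messages = [{"role": "user", "content": "भारत की राजधानी क्या है?"}]
inputs = tok.apply_chat_template(
    messages, return_tensors="pt", add_generation_prompt=True
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
PY
```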