
Chitrarth-1

A multilingual Vision Language Model (VLM) that integrates a state-of-the-art multilingual Large Language Model (LLM) with a vision module.

image-text-to-text
7.5B parameters

Description

Chitrarth (Chitra: Image; Artha: Meaning) is a multilingual Vision Language Model (VLM) that integrates a state-of-the-art multilingual Large Language Model (LLM) with a vision module. This model is trained primarily on multilingual image-text data and is designed to work across 10 prominent Indian languages, including Hindi, Bengali, Telugu, Tamil, Marathi, Gujarati, Kannada, Malayalam, Odia, and Assamese, as well as English.

Chitrarth is developed as India's own foundational multimodal model, specifically tailored to the Indian context, languages, and culture. The motivation behind this model is to ensure that datasets are "for our country, of our country, and for our citizens," fostering inclusivity and equitable AI advancements. By incorporating a diverse multilingual and culturally rich dataset, Chitrarth aims to minimize biases, enhance accessibility, and provide robust performance in both English and Indic languages. This approach ensures that AI technologies are representative, fair, and beneficial for a wide range of users across India and the world.


Parameters and Architectures

  • Base Model: Chitrarth builds on Krutrim-7B (Large Language Model by Krutrim) as its backbone.
  • Vision Encoder: Uses a SIGLIP (siglip-so400m-patch14-384) model to extract visual features.
  • Architecture:
    • A pretrained vision encoder (SIGLIP) extracts image features.
    • These features are projected into the LLM’s token space using a trainable linear mapping layer.
    • The combined model is fine-tuned on instruction-following datasets of image-text pairs.
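The projection step above can be sketched as follows. The dimensions are illustrative assumptions, not the actual Chitrarth configuration: siglip-so400m-patch14-384 produces 729 patch embeddings of 1152 dimensions, and a 7B-class LLM typically has a hidden size of 4096.

```python
import numpy as np

# Illustrative dimensions (assumptions, not the actual Chitrarth config):
# siglip-so400m-patch14-384 yields 729 patch tokens of 1152 dims each;
# a 7B-class LLM typically uses a hidden size of 4096.
N_PATCHES, D_VISION, D_LLM = 729, 1152, 4096

rng = np.random.default_rng(0)

def project_image_features(vision_features, W, b):
    """Map frozen vision-encoder patch features into the LLM's token
    embedding space via a single trainable linear layer (the adapter
    trained in Stage 1)."""
    return vision_features @ W + b

# Frozen vision encoder output for one image (stand-in values).
vision_features = rng.standard_normal((N_PATCHES, D_VISION))

# The only trainable parameters during adapter pre-training.
W = rng.standard_normal((D_VISION, D_LLM)) * 0.02
b = np.zeros(D_LLM)

visual_tokens = project_image_features(vision_features, W, b)
print(visual_tokens.shape)  # (729, 4096): one pseudo-token per image patch
```

Each projected patch embedding is then prepended to the text token embeddings, so the LLM attends over visual and textual tokens jointly.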

Data

Chitrarth undergoes a two-stage training process, leveraging a diverse and multilingual dataset to enhance its vision-language capabilities.

Stage 1: Adapter Pre-Training (PT)

For the first stage of training, the adapter is pre-trained on a dataset selected for its superior performance in preliminary experiments relative to other candidate pre-training datasets.

The dataset is translated into multiple Indic languages supported by the Krutrim LLM using an open-source model. The pre-training data maintains a balanced split between English and the Indic languages, ensuring linguistic diversity, computational efficiency, and robust English performance while also developing capabilities in Indic languages.

The dataset composition prevents biases toward any single language, ensuring equitable model performance across all supported languages.
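The balancing described above can be sketched as a simple sampling routine. This is a minimal illustration of a 50/50 English/Indic mix with the Indic half spread evenly across languages; the function and data layout are assumptions, not the actual Chitrarth pipeline.

```python
import random

# Hypothetical language pools; in practice these would be image-text pairs
# translated by the open-source translation model mentioned above.
INDIC_LANGS = ["hi", "bn", "te", "ta", "mr", "gu", "kn", "ml", "or", "as"]

def balanced_pretraining_mix(english_pairs, indic_pairs_by_lang, seed=0):
    """Build a 50/50 English/Indic mix, spreading the Indic half evenly
    across languages so no single language dominates (a sketch of the
    balancing described above, not the actual pipeline)."""
    rng = random.Random(seed)
    per_lang = len(english_pairs) // len(indic_pairs_by_lang)
    indic = []
    for lang in INDIC_LANGS:
        pool = indic_pairs_by_lang[lang]
        indic.extend(rng.sample(pool, min(per_lang, len(pool))))
    mix = list(english_pairs) + indic
    rng.shuffle(mix)
    return mix

# Toy data: 100 English pairs and 50 candidate pairs per Indic language.
english = [("img_%d" % i, "caption %d" % i, "en") for i in range(100)]
indic = {lang: [("img_%d" % i, "caption", lang) for i in range(50)]
         for lang in INDIC_LANGS}
mix = balanced_pretraining_mix(english, indic)
print(len(mix))  # 200: 100 English + 10 per Indic language
```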

Stage 2: Instruction Tuning (IT)

The second stage involves fine-tuning the model on a more complex instruction dataset to enhance multimodal reasoning capabilities.

The core dataset is an English version of a widely used instruction-tuning dataset. Additionally, an instruction dataset is translated into multiple languages using the same methodology as in Stage 1.

A vision-language dataset is incorporated, containing academic tasks with in-house multilingual translations. A large collection of culturally diverse images from India is included, featuring:

  • Prominent personalities
  • Monuments
  • Artwork
  • Culinary dishes

These images are transformed into pluralistic instruction-tuning data similar to open-source English-based instruction-tuning datasets. The dataset also includes high-quality, proprietary English text data.

Together, these two stages ensure balanced and diverse representation, supporting complex multimodal reasoning across various domains and visual scenarios.
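Converting the culturally diverse images into instruction-tuning records can be sketched as below. The categories come from the list above; the templates, field names, and record layout are illustrative assumptions, not the actual Chitrarth data pipeline.

```python
# Category-specific instruction templates (illustrative assumptions).
TEMPLATES = {
    "personality": "Who is shown in this image? Summarise their work.",
    "monument": "Describe this monument and its historical significance.",
    "artwork": "Describe the style and origin of this artwork.",
    "culinary dish": "What dish is shown here, and how is it prepared?",
}

def to_instruction_record(image_path, category, reference_answer, lang="en"):
    """Pair an image with a category-specific instruction and a reference
    answer, mirroring common open-source instruction-tuning formats."""
    return {
        "image": image_path,
        "conversations": [
            {"role": "user", "content": TEMPLATES[category]},
            {"role": "assistant", "content": reference_answer},
        ],
        "language": lang,
    }

rec = to_instruction_record("images/charminar.jpg", "monument",
                            "The Charminar in Hyderabad, built in 1591.")
print(rec["conversations"][0]["content"])
```

A translation pass over the `content` fields would then yield the multilingual variants described above.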


Use Cases

Chitrarth can be used for any task that involves understanding an image and generating text from it. Example tasks include (but are not limited to):

1. Domain-specific image caption generation

Writing product descriptions and extracting fine-grained attributes for e-commerce articles. Target customers include e-commerce players such as Myntra, AJIO, and Nykaa.

2. UI/UX Screen Analysis and Interpretation

A use case where AI analyzes and explains application interface elements, layouts, and metrics for any digital product screen.


3. Monitoring and Anomaly detection

The model can describe scenes from visual feeds and flag unusual elements, supporting automated monitoring and anomaly-detection pipelines.

4. Creative writing

The VLM can generate poetic or narrative descriptions by interpreting visual elements in images.


Evaluation Benchmarks


Performance against SOTA VLMs on different academic multimodal tasks. Our model consistently outperforms IDEFICS 2 (7B) and PALO 7B across benchmarks while remaining competitive on TextVQA and VizWiz.

We introduce “BharatBench”, a comprehensive evaluation benchmark suite designed for 10 under-resourced Indic languages across 3 tasks. The table below reports Chitrarth's performance on the BharatBench evaluation framework; our model is unique in its ability to handle all included languages, setting a baseline for future research.

Language  | POPE  | LLaVA-Bench | MM-Vet
Telugu    | 79.9  | 54.8        | 43.76
Hindi     | 78.68 | 51.5        | 38.85
Bengali   | 83.24 | 53.7        | 33.24
Malayalam | 85.29 | 55.5        | 25.36
Kannada   | 85.52 | 58.1        | 46.19
Assamese  | 55.59 | 59.1        | 37.29
Tamil     | 83.28 | 58.3        | 34.31
Marathi   | 79.17 | 52.8        | 40.96
Gujarati  | 84.75 | 55.9        | 39.03
Odia      | 82.03 | 62.8        | 19.67
English   | 87.63 | 67.9        | 30.49
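The table can also be inspected programmatically. The snippet below is a small analysis sketch over the numbers reported above; the function name is illustrative.

```python
# BharatBench scores for Chitrarth, taken from the table above.
SCORES = {  # language: (POPE, LLaVA-Bench, MM-Vet)
    "Telugu": (79.9, 54.8, 43.76), "Hindi": (78.68, 51.5, 38.85),
    "Bengali": (83.24, 53.7, 33.24), "Malayalam": (85.29, 55.5, 25.36),
    "Kannada": (85.52, 58.1, 46.19), "Assamese": (55.59, 59.1, 37.29),
    "Tamil": (83.28, 58.3, 34.31), "Marathi": (79.17, 52.8, 40.96),
    "Gujarati": (84.75, 55.9, 39.03), "Odia": (82.03, 62.8, 19.67),
    "English": (87.63, 67.9, 30.49),
}

def best_language(task_index):
    """Return the language with the highest score on the given task
    (0 = POPE, 1 = LLaVA-Bench, 2 = MM-Vet)."""
    return max(SCORES, key=lambda lang: SCORES[lang][task_index])

print(best_language(0))  # English leads on POPE (87.63)
print(best_language(2))  # Kannada leads on MM-Vet (46.19)
```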

We also compared Chitrarth against Llama 3.2 11B Vision Instruct, which it surpasses on these benchmarks.

How to access this model (Hugging Face/GitHub/Krutrim Cloud)

# Clone the repository and set up a Python 3.10 environment
git clone https://github.com/ola-krutrim/Chitrarth.git
conda create --name chitrarth python=3.10
conda activate chitrarth
cd Chitrarth

pip install -e .

# Run inference on a sample image
python chitrarth/inference.py --model-path "krutrim-ai-labs/Chitrarth" --image-file "assets/govt_school.jpeg" --query "Explain the image."

Publications

1. “Chitrarth: Bridging Vision and Language for a Billion People”, presented at the NeurIPS Multimodal Algorithmic Reasoning workshop. Link: https://neurips.cc/virtual/2024/106643
2. “Chitranuvad: Adapting Multi-Lingual LLMs for Multimodal Translation” published in Proceedings of the Ninth Conference on Machine Translation. Link: https://aclanthology.org/2024.wmt-1.80/

License and Contributions

This code repository and the model weights are licensed under the Krutrim Community License. Chitrarth supports commercial use, allowing for any modifications and derivative works, including, but not limited to, fine-tuning and adaptation for specific NLP tasks.

Please note that:

The model has been optimized for Indic languages, leveraging parallel corpora, back-translation, and domain-specific data, ensuring improved performance for semantic search, chatbots, and recommendation systems.