A multilingual Vision Language Model (VLM) that integrates a state-of-the-art multilingual Large Language Model (LLM) with a vision module.
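As a rough illustration of this integration, the sketch below shows how features from a vision module could be projected into the LLM's embedding space in a LLaVA-style design. The encoder dimensions, projector shape, and module names are assumptions for illustration, not Chitrarth's exact architecture.

```python
# Minimal sketch of a LLaVA-style VLM connector: image features from a vision
# encoder are projected into the LLM embedding space and prepended to the text
# tokens. Dimensions and names below are illustrative assumptions.
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector mapping vision features to the LLM hidden size.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the vision module
        # text_embeds:    (batch, seq_len, llm_dim) from the LLM's embedding layer
        image_embeds = self.projector(image_features)
        # Concatenate projected image tokens before the text tokens for the LLM.
        return torch.cat([image_embeds, text_embeds], dim=1)
```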
Chitrarth is developed as India's own foundational multimodal model, specifically tailored to the Indian context, languages, and culture. The motivation behind this model is to ensure that datasets are "for our country, of our country, and for our citizens," fostering inclusivity and equitable AI advancements. By incorporating a diverse multilingual and culturally rich dataset, Chitrarth aims to minimize biases, enhance accessibility, and provide robust performance in both English and Indic languages. This approach ensures that AI technologies are representative, fair, and beneficial for a wide range of users across India and the world.
Chitrarth undergoes a two-stage training process, leveraging a diverse and multilingual dataset to enhance its vision-language capabilities.
In the first stage, the model is pre-trained on a dataset selected for its superior performance in preliminary experiments compared with other candidate pre-training datasets.
This dataset is translated into the Indic languages supported by the Krutrim LLM using an open-source translation model. The pre-training data maintains a balanced split between English and the Indic languages, ensuring linguistic diversity and computational efficiency, and preserving robust English performance while also developing capabilities in Indic languages.
The dataset composition prevents biases toward any single language, ensuring equitable model performance across all supported languages.
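As a rough illustration of this Stage 1 data preparation, the sketch below translates captions into the supported Indic languages and balances them against English. The `translate()` stub, the language list, the 50/50 ratio, and the record schema are assumptions for illustration, not the actual Chitrarth pipeline.

```python
# Illustrative sketch: translate part of a caption corpus into Indic languages
# and keep an English/Indic balance. All names and ratios here are placeholders.
import random

INDIC_LANGUAGES = ["hi", "bn", "te", "ta", "mr", "gu", "kn", "ml", "or", "as"]

def translate(text, target_lang):
    # Placeholder: a real pipeline would call an open-source translation model here.
    return f"[{target_lang}] {text}"

def build_balanced_pretraining_mix(samples, english_ratio=0.5, seed=0):
    """Keep `english_ratio` of captions in English and translate the rest,
    spreading them evenly across the supported Indic languages."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    split = int(len(shuffled) * english_ratio)
    mix = [{"image": s["image"], "caption": s["caption"], "lang": "en"}
           for s in shuffled[:split]]
    for i, s in enumerate(shuffled[split:]):
        lang = INDIC_LANGUAGES[i % len(INDIC_LANGUAGES)]
        mix.append({"image": s["image"],
                    "caption": translate(s["caption"], lang),
                    "lang": lang})
    return mix
```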
The second stage involves fine-tuning the model on a more complex instruction dataset to enhance multimodal reasoning capabilities.
The core dataset is an English version of a widely used instruction-tuning dataset. Additionally, an instruction dataset is translated into multiple languages using the same methodology as in Stage 1.
A vision-language dataset containing academic tasks with in-house multilingual translations is incorporated, along with a large collection of culturally diverse images from India.
These images are transformed into pluralistic instruction-tuning data similar to open-source English-based instruction-tuning datasets. The dataset also includes high-quality, proprietary English text data.
These two training stages ensure balanced and diverse representation, supporting complex multimodal reasoning across various domains and visual scenarios.
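One way such a multi-source instruction mixture could be assembled is sketched below: records are sampled with per-source weights so that no single source or language dominates. The source names and weights are placeholders rather than the actual Stage 2 composition.

```python
# Illustrative sketch of weighted sampling over multiple instruction-tuning
# sources. Source names and weights below are hypothetical placeholders.
import random

def sample_finetuning_mixture(sources, weights, num_samples, seed=0):
    """sources: dict mapping source name -> list of instruction records;
    weights: dict mapping source name -> relative sampling weight."""
    rng = random.Random(seed)
    names = list(sources)
    probs = [weights[name] for name in names]
    mixture = []
    for _ in range(num_samples):
        name = rng.choices(names, weights=probs, k=1)[0]
        record = rng.choice(sources[name])
        mixture.append({**record, "source": name})
    return mixture

# Hypothetical usage:
# mix = sample_finetuning_mixture(
#     sources={"english_instructions": [...], "translated_instructions": [...],
#              "academic_vl_tasks": [...], "india_cultural_images": [...],
#              "proprietary_english_text": [...]},
#     weights={"english_instructions": 3, "translated_instructions": 3,
#              "academic_vl_tasks": 2, "india_cultural_images": 1,
#              "proprietary_english_text": 1},
#     num_samples=100_000,
# )
```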
Chitrarth can be used for any task that combines image understanding with text generation on top of it. Example tasks include (but are not limited to) the following; illustrative prompts for each are sketched after the list:
Writing product descriptions and extracting fine-grained attributes for e-commerce articles. Target customers could be e-commerce players such as Myntra, AJIO, Nykaa, etc.
Analyzing and explaining application interface elements, layouts, and metrics for any digital product screen.
Generating poetic or narrative descriptions by interpreting the visual elements in an image.
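The snippet below lists hypothetical prompts for these use cases. The wording is illustrative; each string can be passed as the `--query` argument of the inference command shown further below.

```python
# Hypothetical example prompts for the use cases listed above.
USE_CASE_PROMPTS = {
    "ecommerce_catalog": ("Write a product description and list the colour, fabric, "
                          "pattern, and sleeve type of the garment in this image."),
    "screen_understanding": ("Explain the interface elements, layout, and key metrics "
                             "shown on this app screen."),
    "creative_writing": "Write a short poetic description inspired by this image.",
}
```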
Performance against SOTA VLMs on different academic multimodal tasks: our model consistently outperforms IDEFICS 2 (7B) and PALO 7B across benchmarks while remaining competitive on TextVQA and VizWiz.
We introduce “BharatBench”, a comprehensive evaluation benchmark suite designed for 10 under-resourced Indic languages across three tasks. The table below reports the performance of Chitrarth on the BharatBench evaluation framework. Our model is unique in its ability to handle all included languages, setting a baseline for future research.
| Language | POPE | LLaVA-Bench | MMVet |
|---|---|---|---|
| Telugu | 79.9 | 54.8 | 43.76 |
| Hindi | 78.68 | 51.5 | 38.85 |
| Bengali | 83.24 | 53.7 | 33.24 |
| Malayalam | 85.29 | 55.5 | 25.36 |
| Kannada | 85.52 | 58.1 | 46.19 |
| Assamese | 55.59 | 59.1 | 37.29 |
| Tamil | 83.28 | 58.3 | 34.31 |
| Marathi | 79.17 | 52.8 | 40.96 |
| Gujarati | 84.75 | 55.9 | 39.03 |
| Odia | 82.03 | 62.8 | 19.67 |
| English | 87.63 | 67.9 | 30.49 |
We also compared Chitrarth against Llama 3.2 11B Vision Instruct, which it surpasses.
```shell
git clone https://github.com/ola-krutrim/Chitrarth.git
conda create --name chitrarth python=3.10
conda activate chitrarth
cd Chitrarth
pip install -e .
```

```shell
python chitrarth/inference.py --model-path "krutrim-ai-labs/Chitrarth" --image-file "assets/govt_school.jpeg" --query "Explain the image. "
```
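For batch or programmatic use, a minimal sketch is to wrap the CLI above with `subprocess`. The script path and flags come from the command shown here; the helper name and default model path are assumptions.

```python
# Minimal sketch: call chitrarth/inference.py from Python for batches of images.
# Run from the repository root so the script path resolves.
import subprocess

def run_chitrarth(image_file, query, model_path="krutrim-ai-labs/Chitrarth"):
    """Invoke the inference CLI for one image/query pair and return its stdout."""
    result = subprocess.run(
        ["python", "chitrarth/inference.py",
         "--model-path", model_path,
         "--image-file", image_file,
         "--query", query],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

# Example: run the same prompt over several images.
# for path in ["assets/govt_school.jpeg"]:
#     print(run_chitrarth(path, "Explain the image."))
```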
This code repository and the model weights are licensed under the Krutrim Community License. Chitrarth supports commercial use, allowing for any modifications and derivative works, including, but not limited to, fine-tuning and adaptation for specific NLP tasks.
Please note that: