
Vyakyarth-1-Indic-Embedding

SoTA Indic text embedding model for use cases like RAG

text-to-embedding
270M params

Description

Current NLP models often struggle with cross-lingual understanding, particularly in Indic languages. Vyakyarth addresses this gap by leveraging contrastive learning on 10 major Indic languages, significantly improving semantic similarity and retrieval. This enables seamless multilingual communication, making it a powerful tool for AI-driven applications across diverse linguistic landscapes. The model is trained primarily on multilingual data and is designed to work across 10 prominent languages: Hindi, Bengali, Gujarati, Kannada, Malayalam, Marathi, Sanskrit, Tamil, and Telugu, as well as English.

Vyakyarth is a state-of-the-art multilingual sentence-transformer model designed for semantic textual similarity, search, clustering, and classification across over 100 languages. Built on the STSB-XLM-R-Multilingual architecture and fine-tuned using a contrastive loss objective, Vyakyarth efficiently maps sentences into a 768-dimensional dense vector space, making it ideal for cross-lingual NLP applications.

Parameters & Architecture

  • Base Model: sentence-transformers/stsb-xlm-r-multilingual
  • Maximum Sequence Length: 128 tokens
  • Output Dimensionality: 768
  • Similarity Function: Cosine Similarity
  • Pooling Strategy: Mean Token Pooling
  • Loss Function: MultipleNegativesRankingLoss
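
To see how these settings fit together in practice, here is a minimal sketch using the public Sentence Transformers API and the model id from the usage section below; the example sentences are illustrative.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")

# Each sentence (truncated at 128 tokens) is mean-pooled into one 768-dim vector
embedding = model.encode("नमस्ते, आप कैसे हैं?")  # "Hello, how are you?"
print(embedding.shape)  # (768,)

# Vectors are intended to be compared with cosine similarity
score = util.cos_sim(model.encode("good morning"), model.encode("सुप्रभात"))
print(float(score))  # "सुप्रभात" means "good morning"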

Use Cases

1. Natural Language Understanding

Vyakyarth enhances virtual assistants, AI chatbots, and automated response systems by ensuring accurate intent recognition and multilingual interaction.
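
As a hypothetical illustration of embedding-based intent recognition, the sketch below matches a Hindi user query against English intent prototypes by cosine similarity; the intent names and utterances are invented for the example.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")

# Hypothetical intent labels, each represented by one example utterance
intent_examples = {
    "order_status": "Where is my order?",
    "refund": "I want my money back",
    "greeting": "Hello, good morning",
}
intent_embeddings = model.encode(list(intent_examples.values()))

# A Hindi query is matched against English prototypes directly,
# since both languages share the same embedding space
query = model.encode("मेरा ऑर्डर कहाँ है?")  # "Where is my order?"
scores = util.cos_sim(query, intent_embeddings)[0]
print(list(intent_examples)[int(scores.argmax())])  # expected: "order_status"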

2. Cross-Lingual Semantic Search

Search engines and knowledge bases benefit from Vyakyarth's ability to retrieve contextually relevant results across multiple languages, moving beyond traditional keyword-based search.
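
A minimal sketch of this idea, assuming a small in-memory corpus: a Hindi query retrieves the matching English document through the shared embedding space, with sentence_transformers' util.semantic_search doing the ranking.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")

# Small illustrative corpus in English
corpus = [
    "The new budget increases healthcare spending.",
    "The cricket team won the series.",
    "Monsoon rains flooded several districts.",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# Query in Hindi: "Who won the cricket match?"
query_embedding = model.encode("क्रिकेट मैच कौन जीता?", convert_to_tensor=True)

# Rank corpus entries by cosine similarity and keep the best hit
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)[0]
print(corpus[hits[0]["corpus_id"]], hits[0]["score"])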

3. Multilingual Recommendation Systems

Vyakyarth powers content recommendations for e-commerce, OTT platforms, and news aggregators, enhancing engagement by understanding user preferences across languages.
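
One way such a recommender could be wired up, sketched with invented items: embed candidate descriptions once, embed the user's recent reading, and rank candidates by cosine similarity regardless of language.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")

# Candidate article descriptions in different languages (illustrative)
candidates = [
    "শেয়ার বাজারে বড় পতন",                        # Bengali: "Big fall in the stock market"
    "नई फिल्म का ट्रेलर रिलीज़ हुआ",                 # Hindi: "New movie trailer released"
    "Sensex drops sharply amid global selloff",   # English
]
candidate_embeddings = model.encode(candidates)

# The user last read an English finance article
history = model.encode("Stock markets tumble on rate-hike fears")

# Recommend the closest candidates, independent of language
scores = util.cos_sim(history, candidate_embeddings)[0]
ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1])
print(ranked[0][0])  # likely one of the two market articles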

4. AI-Powered Customer Support

Businesses can automate multilingual customer support with high intent accuracy, reducing the need for language-specific training.

5. Content Moderation & Sentiment Analysis

Vyakyarth ensures effective detection of toxic, misleading, or inappropriate content in multiple languages, making it essential for social media and content platforms.
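
A simple illustrative pattern (not a production moderation pipeline): compare an incoming post against exemplars that moderators have already flagged, and raise it for review when similarity crosses a threshold. The exemplars and the 0.7 threshold below are made up.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")

# Hypothetical exemplars of content already flagged by moderators
flagged_examples = model.encode([
    "Buy followers now, limited offer, click this link!",
    "This miracle cure works in one day, doctors hate it",
])

def looks_like_flagged(post: str, threshold: float = 0.7) -> bool:
    """Flag a post if it is semantically close to any known-bad exemplar."""
    emb = model.encode(post)
    return bool(util.cos_sim(emb, flagged_examples).max() >= threshold)

# Works across languages: a Hindi paraphrase of the spam pattern still matches
print(looks_like_flagged("अभी फॉलोअर्स खरीदें, इस लिंक पर क्लिक करें!"))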

Evaluation Benchmarks

Vyakyarth outperforms existing models on multiple Indic language benchmarks, achieving superior semantic similarity and retrieval accuracy. The results below are reported on the FLORES dataset.

Task: Retrieval (Indic to English). To evaluate the retrieval capabilities of models, the Indic portions of the FLORES 101/200 dataset (Goyal et al., 2022; Costa-jussà et al., 2022) included in IndicXTREME are used; a toy illustration of this retrieval setup follows the table below.

Language    MuRIL   IndicBERT   jina-embeddings-v3   Vyakyarth
Bengali     77.0    91.0        97.4                 98.7
Gujarati    67.0    92.4        97.3                 98.7
Hindi       84.2    90.5        98.8                 99.9
Kannada     88.4    89.1        96.8                 99.2
Malayalam   82.2    89.2        96.3                 98.7
Marathi     83.9    92.5        97.1                 98.8
Sanskrit    36.4    30.4        84.1                 90.1
Tamil       79.4    90.0        95.8                 97.9
Telugu      43.5    88.6        97.3                 97.5
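
For intuition about this evaluation, here is a toy sketch of the same Indic-to-English retrieval setup, using two hand-written Hindi-English pairs rather than actual FLORES data: each Hindi sentence is encoded and should retrieve its own English translation as the nearest neighbour by cosine similarity.

import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")

# Toy stand-in for FLORES: sentence i in one list translates sentence i in the other
hindi = ["मुझे संगीत पसंद है।", "वह स्कूल जाती है।"]   # "I like music." / "She goes to school."
english = ["I like music.", "She goes to school."]

hi_emb = model.encode(hindi, convert_to_tensor=True)
en_emb = model.encode(english, convert_to_tensor=True)

# Indic-to-English retrieval: the nearest English sentence should be the translation
sims = util.cos_sim(hi_emb, en_emb)
predictions = sims.argmax(dim=1).cpu().numpy()
accuracy = float(np.mean(predictions == np.arange(len(hindi))))
print(f"accuracy@1: {accuracy:.2f}")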

How to Use Vyakyarth

Vyakyarth can be easily integrated using Sentence Transformers.

Installation

pip install -U sentence-transformers

Usage

from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")


# Encode sentences 
sentences = ["मैं अपने दोस्त से मिला", "I met my friend"]  # Similar sentences
embeddings = model.encode(sentences)

# Output embeddings
print(embeddings)
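
As a quick follow-on, reusing the `embeddings` array from the snippet above, the two cross-lingual paraphrases can be compared directly with util.cos_sim and should score close to 1.0.

from sentence_transformers import util

# Row 0 is the Hindi sentence, row 1 its English translation
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(float(similarity))  # expected to be high for translation pairs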

How to Use Vyakyarth on Krutrim Cloud

import os
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity




krutrim_api_key = "<krutrim_api_key>"  # placeholder: set your key here or via KRUTRIM_API_KEY
krutrim_api_base = "https://cloud.olakrutrim.com/v1"


client = OpenAI(
   api_key=os.environ.get("KRUTRIM_API_KEY", krutrim_api_key),
   base_url=os.environ.get("KRUTRIM_BASE_URL", krutrim_api_base),
)


# Function to get embeddings
def get_embedding(sentence):
   response = client.embeddings.create(
       model="Bhasantarit-mini",
       input=sentence
   )
   return response.data[0].embedding 


# Compute cosine similarity
def cosine_sim(emb1, emb2):
   return cosine_similarity([emb1], [emb2])[0][0]




# ========= Test examples =========


# Test example 1 - Hindi
# Result:
# Similarity Score (Similar Sentences): 0.97
# Similarity Score (Dissimilar Sentences): 0.10
similar_sentence_1 = "आज मौसम बहुत सुहाना है।"  # "Today's weather is very pleasant."
similar_sentence_2 = "मौसम आज बहुत अच्छा है।"  # "The weather is very good today."


dissimilar_sentence_1 = "मैं फुटबॉल खेलना पसंद करता हूँ।"  # "I like to play football."
dissimilar_sentence_2 = "यह किताब बहुत रोचक है।"  # "This book is very interesting."


# Get embeddings
embedding_sim_1 = np.array(get_embedding(similar_sentence_1))
embedding_sim_2 = np.array(get_embedding(similar_sentence_2))
embedding_dis_1 = np.array(get_embedding(dissimilar_sentence_1))
embedding_dis_2 = np.array(get_embedding(dissimilar_sentence_2))


similarity_score_sim = cosine_sim(embedding_sim_1, embedding_sim_2)
similarity_score_dis = cosine_sim(embedding_dis_1, embedding_dis_2)
print(f"Similarity Score (Similar Sentences): {similarity_score_sim:.2f}")
print(f"Similarity Score (Dissimilar Sentences): {similarity_score_dis:.2f}")


# Classification
threshold = 0.8  # Define threshold for similarity
if similarity_score_sim > threshold:
    print("Sentence 1: " + similar_sentence_1)
    print("Sentence 2: " + similar_sentence_2)
    print("Similar Sentences: They are classified as similar ✅")
else:
    print("Sentence 1: " + similar_sentence_1)
    print("Sentence 2: " + similar_sentence_2)
    print("Similar Sentences: They are not classified as similar ❌")


if similarity_score_dis < 0.5:
    print("Dissimilar Sentences: They are correctly classified as different ✅")
else:
    print("Dissimilar Sentences: They are not classified correctly ❌")