SoTA Indic text embedding model for use cases like RAG
Vyakyarth is a state-of-the-art multilingual sentence-transformer model designed for semantic textual similarity, search, clustering, and classification across over 100 languages. Built on the STSB-XLM-R-Multilingual architecture and fine-tuned using a contrastive loss objective, Vyakyarth efficiently maps sentences into a 768-dimensional dense vector space, making it ideal for cross-lingual NLP applications.
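As a quick sanity check of the 768-dimensional claim, encoding a single sentence should yield a vector of that size (a sketch using the Sentence Transformers loading shown later in this card):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")
vec = model.encode("नमस्ते दुनिया")  # "Hello, world"
print(vec.shape)  # expected: (768,)
```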
- Vyakyarth enhances virtual assistants, AI chatbots, and automated response systems by ensuring accurate intent recognition and multilingual interaction.
- Search engines and knowledge bases benefit from Vyakyarth's ability to retrieve contextually relevant results across multiple languages, moving beyond traditional keyword-based search (see the retrieval sketch after this list).
- Vyakyarth powers content recommendations for e-commerce, OTT platforms, and news aggregators, enhancing engagement by understanding user preferences across languages.
- Businesses can automate multilingual customer support with high intent accuracy, reducing the need for language-specific training.
- Vyakyarth enables effective detection of toxic, misleading, or inappropriate content in multiple languages, making it essential for social media and content platforms.
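As an illustration of the semantic-search use case above, here is a minimal sketch of cross-lingual retrieval, using the Sentence Transformers integration described later in this card; the corpus and query sentences are invented for the example.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")

# A tiny multilingual corpus (invented for this example).
corpus = [
    "The weather is pleasant today",
    "मुझे क्रिकेट खेलना पसंद है",  # "I like playing cricket"
    "This book explains machine learning",
]
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)

# A Hindi query matches the English weather sentence, because the
# embeddings are compared by meaning rather than by shared keywords.
query = "आज मौसम अच्छा है"  # "The weather is good today"
query_embedding = model.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=1)
best = hits[0][0]
print(corpus[best["corpus_id"]], best["score"])
```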
Task: Retrieval (Indic to English): To evaluate the retrieval capabilities of the models, we include the Indic parts of the FLORES 101/200 dataset (Goyal et al., 2022; Costa-jussà et al., 2022) in IndicXTREME.
| Language | MuRIL | IndicBERT | jina-embeddings-v3 | Vyakyarth |
|---|---|---|---|---|
| Bengali | 77.0 | 91.0 | 97.4 | 98.7 |
| Gujarati | 67.0 | 92.4 | 97.3 | 98.7 |
| Hindi | 84.2 | 90.5 | 98.8 | 99.9 |
| Kannada | 88.4 | 89.1 | 96.8 | 99.2 |
| Malayalam | 82.2 | 89.2 | 96.3 | 98.7 |
| Marathi | 83.9 | 92.5 | 97.1 | 98.8 |
| Sanskrit | 36.4 | 30.4 | 84.1 | 90.1 |
| Tamil | 79.4 | 90.0 | 95.8 | 97.9 |
| Telugu | 43.5 | 88.6 | 97.3 | 97.5 |
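For context, retrieval accuracy on FLORES-style parallel text is typically computed by encoding the Indic sentences and their English counterparts, then checking whether each Indic sentence's nearest English embedding is its true translation. A minimal sketch of that protocol (the sentence pairs below are stand-ins, not FLORES data):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")

# Stand-in parallel pairs; the real benchmark uses the FLORES sentences.
indic = ["आज मौसम बहुत सुहाना है।", "यह किताब बहुत रोचक है।"]
english = ["Today's weather is very pleasant.", "This book is very interesting."]

src = model.encode(indic, normalize_embeddings=True)
tgt = model.encode(english, normalize_embeddings=True)

# On normalized vectors, cosine similarity is just a dot product.
sims = src @ tgt.T
pred = sims.argmax(axis=1)  # nearest English sentence for each Indic query
accuracy = (pred == np.arange(len(indic))).mean()
print(f"Retrieval accuracy: {accuracy:.1%}")
```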
Vyakyarth can be integrated easily using the Sentence Transformers library:
```bash
pip install -U sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer("krutrim-ai-labs/Vyakyarth")

# Encode sentences
sentences = ["मैं अपने दोस्त से मिला", "I met my friend"]  # Similar sentences
embeddings = model.encode(sentences)

# Output embeddings
print(embeddings)
```
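Continuing from the snippet above, cosine similarity between the two embeddings confirms that the Hindi sentence and its English translation land close together; a brief sketch using the util helpers:

```python
from sentence_transformers import util

# Cosine similarity between the Hindi sentence and its English translation;
# a score near 1.0 means the cross-lingual embeddings align.
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {score.item():.2f}")
```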
Vyakyarth embeddings can also be consumed through the Krutrim Cloud's OpenAI-compatible API:

```python
import os
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity
krutrim_api_key = "<krutrim_api_key>"  # placeholder: replace with your API key
krutrim_api_base = "https://cloud.olakrutrim.com/v1"
client = OpenAI(
    api_key=os.environ.get("KRUTRIM_API_KEY", krutrim_api_key),
    base_url=os.environ.get("KRUTRIM_BASE_URL", krutrim_api_base),
)

# Function to get embeddings
def get_embedding(sentence):
    response = client.embeddings.create(
        model="Bhasantarit-mini",
        input=sentence
    )
    return response.data[0].embedding

# Compute cosine similarity
def cosine_sim(emb1, emb2):
    return cosine_similarity([emb1], [emb2])[0][0]
# ========= Test examples =========
# Test example 1 - Hindi
# Result:
# Similarity Score (Similar Sentences): 0.97
# Similarity Score (Dissimilar Sentences): 0.10
similar_sentence_1 = "आज मौसम बहुत सुहाना है।"  # "Today's weather is very pleasant."
similar_sentence_2 = "मौसम आज बहुत अच्छा है।"  # "The weather is very good today."
dissimilar_sentence_1 = "मैं फुटबॉल खेलना पसंद करता हूँ।"  # "I like to play football."
dissimilar_sentence_2 = "यह किताब बहुत रोचक है।"  # "This book is very interesting."
# Get embeddings
embedding_sim_1 = np.array(get_embedding(similar_sentence_1))
embedding_sim_2 = np.array(get_embedding(similar_sentence_2))
embedding_dis_1 = np.array(get_embedding(dissimilar_sentence_1))
embedding_dis_2 = np.array(get_embedding(dissimilar_sentence_2))
similarity_score_sim = cosine_sim(embedding_sim_1, embedding_sim_2)
similarity_score_dis = cosine_sim(embedding_dis_1, embedding_dis_2)
print(f"Similarity Score (Similar Sentences): {similarity_score_sim:.2f}")
print(f"Similarity Score (Dissimilar Sentences): {similarity_score_dis:.2f}")
# Classification
threshold = 0.8  # Define threshold for similarity
if similarity_score_sim > threshold:
    print("Sentence 1: " + similar_sentence_1)
    print("Sentence 2: " + similar_sentence_2)
    print("Similar Sentences: They are classified as similar ✅")
else:
    print("Sentence 1: " + similar_sentence_1)
    print("Sentence 2: " + similar_sentence_2)
    print("Similar Sentences: They are not classified as similar ❌")

if similarity_score_dis < 0.5:
    print("Dissimilar Sentences: They are correctly classified as different ✅")
else:
    print("Dissimilar Sentences: They are not classified correctly ❌")
```