LinguaMark Logo

LinguaMark: Do Multimodal Models Speak Fairly? A Benchmark-Based Evaluation

Vector Institute for Artificial Intelligence

Motivated by the lack of linguistic diversity in state-of-the-art LMMs, we created LinguaMark to evaluate a broad range of LMMs on multilingual Visual Question Answering (VQA) tasks. Our dataset consists of 6,875 image-text pairs spanning 11 languages and 5 social attributes. We evaluate models using three key metrics, Bias, Answer Relevancy, and Faithfulness, to obtain a complete overview of each model's performance across modalities. To encourage reproducibility and further research, we have released our benchmark and evaluation code via the links above.
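As a quick, hedged illustration of the dataset's scale, the sketch below loads the benchmark and tallies pairs per language and social attribute. The dataset ID, split name, and column names are assumptions made for illustration; the released code linked above documents the actual schema.

```python
# Illustrative loading sketch. The dataset ID ("vector-institute/LinguaMark"),
# the split name, and the column names ("language", "attribute") are
# assumptions, not the released artifact's actual schema.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("vector-institute/LinguaMark", split="test")  # hypothetical ID/split

print(len(ds))                   # expected: 6,875 image-text pairs
print(Counter(ds["language"]))   # 11 languages
print(Counter(ds["attribute"]))  # age, gender, occupation, ethnicity, sports
```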

LinguaMark Framework

Figure: Overview of the LinguaMark evaluation framework. The benchmark uses open-ended VQA prompts grounded in real-world news images and evaluates LMM responses across 11 languages and 5 social attributes: age, gender, occupation, ethnicity, and sports.



In this benchmark, we follow the steps listed below:

  1. Data Annotation: We select a subset of images from our previous work, HumaniBench, and create a VQA dataset. As a first step, we create QA pairs in English and assign each pair to one of the 5 social attributes. We then translate the QA pairs into 11 languages, obtaining a dataset of 6,875 QA pairs in total.
  2. Model Inference: We choose 7 models: OpenAI GPT-4o-mini and Gemini 2.5 Flash preview among the closed-source models, and Cohere Aya Vision 8B, Gemma3, LLaMA 3.2 11B Vision Instruct, Phi 4 Vision Instruct, and Qwen2.5 among the open-source models. We run inference with all 7 models on the Visual Question Answering (VQA) task over all 6,875 pairs.
  3. Evaluation: We compute 3 metrics, Bias, Answer Relevancy, and Faithfulness, from the inference results of all 7 models. We comprehensively evaluate performance across all social attributes and languages and report the results in the paper; a minimal sketch of this pipeline is shown after this list.
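To make these steps concrete, here is a minimal sketch of the inference and evaluation loop. It assumes per-sample scores for Bias, Answer Relevancy, and Faithfulness come from an external judge; `run_vqa` and `judge` are hypothetical helpers standing in for model inference and the metric implementations, not the released evaluation code.

```python
# Minimal sketch of the LinguaMark inference + evaluation loop.
# `run_vqa` and `judge` are hypothetical callables (assumptions), not the
# released evaluation code: `run_vqa` queries one of the LMMs and `judge`
# returns per-sample scores for bias, answer relevancy, and faithfulness.
from statistics import mean

MODELS = [
    "GPT-4o-mini", "Gemini-2.5-Flash",                           # closed-source
    "Aya-Vision-8B", "Gemma3", "Llama-3.2-11B-Vision-Instruct",
    "Phi-4-Vision-Instruct", "Qwen2.5",                          # open-source
]

def evaluate(model_name, dataset, run_vqa, judge):
    """Run VQA inference and score every response on the three metrics."""
    records = []
    for sample in dataset:
        answer = run_vqa(model_name, sample["image"], sample["question"])
        scores = judge(sample, answer)  # {"bias": ..., "relevancy": ..., "faithfulness": ...}
        records.append({**scores,
                        "language": sample["language"],
                        "attribute": sample["attribute"]})
    return records

def summarize(records, key):
    """Average each metric grouped by `key` ("language" or "attribute")."""
    groups = {}
    for r in records:
        groups.setdefault(r[key], []).append(r)
    return {group: {metric: mean(r[metric] for r in rows)
                    for metric in ("bias", "relevancy", "faithfulness")}
            for group, rows in groups.items()}

# Usage (hypothetical): results = {m: evaluate(m, ds, run_vqa, judge) for m in MODELS}
```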



Result Highlights

#1 Closed-source models have the best overall performance

Across all languages and attributes, the closed-source models, OpenAI's GPT-4o-mini and Google's Gemini 2.5 Flash, consistently outperform open-source alternatives on the VQA task across all 3 metrics. This suggests that proprietary training data and architectures may confer advantages in human alignment tasks, particularly in complex social reasoning scenarios.

Model            | Bias   | Answer Relevancy | Faithfulness
GPT-4o-mini      | 11.88% | -                | -
Gemini 2.5 Flash | -      | 87.50%           | 95.11%

#2 A few open-source models perform competitively with closed-source ones across social attributes and languages

Results across social attributes: All models follow a similar decreasing pattern in bias values across the attributes: Gender > Age >= Occupation > Ethnicity > Sports. GPT-4o-mini and Gemini 2.5 have the best metric distributions across all 5 social attributes, with the lowest bias values and the highest Answer Relevancy and Faithfulness. Among open-source models, Aya-Vision has fairly low bias scores across 3 attributes; Qwen2.5 and Aya-Vision outperform GPT-4o-mini in Answer Relevancy on all attributes, and Qwen2.5 has comparable performance to GPT-4o-mini in Faithfulness. This shows that, among open-source models, Qwen2.5 generalizes well across modalities and languages.


Results across languages: The Gemini 2.5 model achieves the highest scores across many languages for Answer Relevancy and Faithfulness, as shown by its widest area coverage in the radar plots. Surprisingly, closed-source models show high bias in select high-resource languages such as Mandarin, Spanish, and French. Gemma3 and Qwen2.5 have comparable performance to GPT-4o-mini on different metrics in the VQA task. Qwen2.5 also shows minimal bias across all languages, which makes its performance nearly equal to that of the large closed-source models.
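For readers who want to reproduce the per-language comparison visually, the sketch below draws a radar chart like the ones referenced above using matplotlib. The language subset and all metric values here are placeholders for illustration only, not results from the paper.

```python
# Placeholder radar-chart sketch: the scores below are illustrative dummy
# values, not results from the paper.
import numpy as np
import matplotlib.pyplot as plt

languages = ["English", "Mandarin", "Spanish", "French", "Persian"]  # subset of the 11
scores = {
    "Gemini 2.5 Flash": [95, 92, 93, 94, 90],  # dummy Answer Relevancy values
    "Qwen2.5":          [91, 88, 89, 90, 86],
}

angles = np.linspace(0, 2 * np.pi, len(languages), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw={"polar": True})
for model, vals in scores.items():
    vals = vals + vals[:1]
    ax.plot(angles, vals, label=model)
    ax.fill(angles, vals, alpha=0.1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(languages)
ax.set_title("Answer Relevancy by language (placeholder values)")
ax.legend(loc="lower right")
plt.show()
```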


#3 Different models use different phrases for an answer, but describe the output equally well

The examples below show that different models can use different phrasing to answer the same question, yet describe the output equally well. In the first image, all models provide an accurate answer in Persian using different phrases, while in the second image, all models provide good answer descriptions in English, emphasizing different aspects of the image, all of which are equally valid and accurate.


Example 1: Model outputs to a closed-ended question in Persian.


Example 2: Model outputs to an open-ended question in English.


Future work

For future work, we would like to extend this analysis to include larger LMMs with >= 14B parameters for a better comparison with closed-source models. In addition, our dataset currently covers a single VQA task across 5 social attributes and 11 languages, and focuses on images related to news media. It could be extended to include image-text pairs from other languages, such as those included in ALM-Bench, and from other image sources, such as entertainment.


For additional details about LinguaMark evaluation and experimental results, please refer to our main paper. Thank you!

BibTeX

To be released.