Figure: Overview of the LinguaMark evaluation framework. The benchmark uses open-ended VQA prompts grounded in real-world news images and evaluates LMM responses across 11 languages and 5 social attributes: age, gender, occupation, ethnicity, and sports.
Overall results: Across all languages and attributes, closed-source models such as OpenAI's GPT-4o and Google's Gemini 2.5 Flash consistently outperform open-source alternatives on the VQA task across all three metrics: Bias, Answer Relevancy, and Faithfulness. This suggests that proprietary training data and architectures may confer advantages in human-alignment tasks, particularly in complex social reasoning scenarios.
| Model | Bias | Answer Relevancy | Faithfulness |
| --- | --- | --- | --- |
| GPT-4o-mini | 11.88% | - | - |
| Gemini 2.5 Flash | - | 87.50% | 95.11% |
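As a rough illustration of how such a summary table can be produced, here is a minimal pandas sketch. The column names, scoring scale, and every number below are hypothetical placeholders, not the benchmark's actual data:

```python
import pandas as pd

# Hypothetical per-response judge scores: one row per (model, language,
# attribute, question), with each metric in [0, 1].
scores = pd.DataFrame(
    {
        "model": ["GPT-4o-mini", "GPT-4o-mini", "Gemini 2.5 Flash", "Gemini 2.5 Flash"],
        "language": ["en", "fa", "en", "fa"],
        "attribute": ["gender", "age", "gender", "age"],
        "bias": [0.10, 0.14, 0.08, 0.09],
        "answer_relevancy": [0.90, 0.85, 0.88, 0.87],
        "faithfulness": [0.93, 0.91, 0.95, 0.95],
    }
)

# Average each metric over all languages and attributes, per model,
# and report as percentages (the format used in the table above).
summary = (
    scores.groupby("model")[["bias", "answer_relevancy", "faithfulness"]]
    .mean()
    .mul(100)
    .round(2)
)
print(summary)
```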
Results across social attributes: All models show a similar decreasing pattern in bias across the attributes: Gender > Age ≥ Occupation > Ethnicity > Sports. GPT-4o and Gemini 2.5 have the best metric distributions across all 5 social attributes, with the lowest bias and the highest Answer Relevancy and Faithfulness. Among open-source models, Aya-Vision has fairly low bias scores on three attributes; Qwen2.5 and Aya-Vision outperform GPT-4o-mini in Answer Relevancy on all attributes, and Qwen2.5 is comparable to GPT-4o in Faithfulness. This suggests that, among open-source models, Qwen2.5 generalizes well across modalities and languages.
![]() Figure: Metric scores (Bias, Answer Relevancy, Faithfulness) across the five social attributes.
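To check the bias ordering above on one's own runs, mean bias can be grouped by attribute. This sketch continues the hypothetical `scores` frame from the earlier snippet:

```python
# Mean bias per social attribute, sorted high to low; on the benchmark's
# reported results this ordering comes out
# Gender > Age >= Occupation > Ethnicity > Sports.
bias_by_attr = (
    scores.groupby("attribute")["bias"].mean().sort_values(ascending=False)
)
print(bias_by_attr.mul(100).round(2))
```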
Results across languages: Gemini 2.5 has the highest scores across many languages for Answer Relevancy and Faithfulness, as seen from its having the widest area coverage in the radar plots. Surprisingly, closed-source models show high bias in select high-resource languages such as Mandarin, Spanish, and French. Gemma3 and Qwen2.5 perform comparably to GPT-4o on several metrics in the VQA task. Qwen2.5 also shows minimal bias across all languages, which brings its performance close to the large closed-source models.
![]() Figure: Radar plots of per-language scores (Bias, Answer Relevancy, Faithfulness).
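Radar plots like the ones above can be drawn with matplotlib's polar axes. A minimal sketch with made-up per-language scores for a single model:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical Answer Relevancy scores (%) per language for one model.
languages = ["English", "Mandarin", "Spanish", "French", "Persian", "Hindi"]
relevancy = [88, 84, 86, 85, 80, 82]

# One angle per language; repeat the first point to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(languages), endpoint=False).tolist()
values = relevancy + relevancy[:1]
angles = angles + angles[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(languages)
ax.set_title("Answer Relevancy by language (hypothetical scores)")
plt.show()
```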
The examples below show that different models can use different phrasings to answer the same question while describing the image equally well. In the first image, all models provide an accurate answer in Persian using different phrases, while in the second image, all models provide good descriptions in English, each emphasizing different aspects of the image, all of which are equally valid and accurate.
![]() Example 1: Model outputs to a closed-ended question in Persian.
![]() Example 2: Model outputs to an open-ended question in English.
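One way to see that such differently phrased answers are equivalent is to compare them with a multilingual sentence-embedding model. This is a standalone illustration, not the benchmark's own judging pipeline; the model choice and sentences are illustrative:

```python
from sentence_transformers import SentenceTransformer, util

# A multilingual embedding model (illustrative choice, not the paper's judge).
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Two differently phrased but equivalent answers to the same question.
answers = [
    "The man in the photo is playing football in a stadium.",
    "A footballer is shown mid-game on a stadium pitch.",
]
embeddings = model.encode(answers)

# High cosine similarity indicates the paraphrases convey the same content.
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```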
For future work, we would like to extend this analysis to larger LMMs (≥14B parameters) for a better comparison with closed-source models. Our dataset currently covers a single VQA task across 5 social attributes and 11 languages, and focuses on images related to news media. It could be extended to include image-text pairs from other languages, such as those included in ALM-Bench, and to other image sources, such as entertainment.
For additional details about the LinguaMark evaluation and experimental results, please refer to our main paper. Thank you!
To be released.