Figure: Illustrates our two-stage process: (top) data sourcing, filtration, and annotation across four demographic categories (age, gender, race, profession); and (bottom) task setting and multimodal evaluation for grounding, robustness, and reasoning, with outputs scored for accuracy, bias, and faithfulness.
In this benchmark, we follow the two-stage pipeline illustrated above: news images with social cues are sourced, filtered, and annotated across the four demographic categories, and each VLM is then evaluated on grounding, robustness, and reasoning tasks, with outputs scored for accuracy, bias, and faithfulness (a minimal sketch of the evaluation stage follows below).
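The snippet below is a minimal, illustrative sketch of that evaluation stage. The `query_vlm` and `judge_response` callables and the record fields are hypothetical placeholders, not the benchmark's actual API; the code only shows the overall flow of querying a VLM on an image-question pair and scoring the answer with an LLM judge.

```python
from dataclasses import dataclass


@dataclass
class JudgeScores:
    """Per-example decisions produced by the LLM judge (hypothetical schema)."""
    accurate: bool   # did the answer match the annotated ground truth?
    biased: bool     # did the judge flag a stereotyped demographic inference?
    faithful: bool   # was the answer grounded in visible image evidence?


def evaluate(benchmark, query_vlm, judge_response):
    """Run a VLM over the benchmark and score each answer with an LLM judge.

    `benchmark` is assumed to yield records carrying an image, a question,
    the annotated attribute category (age/gender/race/profession), and a
    reference answer; `query_vlm` and `judge_response` are hypothetical
    callables standing in for the model and the judge.
    """
    results = []
    for record in benchmark:
        # Ask the VLM the social-cue question about the image.
        answer = query_vlm(image=record.image, question=record.question)

        # Have the LLM judge score the answer for accuracy, bias, and faithfulness.
        scores = judge_response(
            question=record.question,
            answer=answer,
            reference=record.reference_answer,
            attribute=record.attribute,  # e.g. "gender", "race"
        )
        results.append((record, answer, scores))
    return results
```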
Takeaway: The strongest models, Gemini 2.0 and Phi-4, achieve the highest overall accuracy and faithfulness, yet they still exhibit substantial social bias. In contrast, Qwen2.5-VL achieves some of the lowest bias scores, but at the cost of reduced accuracy; together, these results show that stronger grounding alone does not eliminate stereotype risk.
| Model | Accuracy (%) ↑ | Bias (%) ↓ | Faithfulness (%) ↑ |
|---|---|---|---|
| Gemini 2.0 | 85.97 | 15.19 | 78.96 |
| Phi-4 | 80.00 | 17.10 | 81.67 |
| Qwen2.5-VL | 71.18 | 9.46 | 68.98 |
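As a reading aid, the scores in this table can be interpreted as percentages of benchmark examples judged accurate, biased, or faithful. The helper below sketches that aggregation under this assumption, reusing the per-example judge decisions from the sketch above; it is not the paper's actual scoring code.

```python
def aggregate(results):
    """Turn per-example judge decisions into table-style percentages.

    Assumes each result is a (record, answer, scores) tuple whose `scores`
    carries boolean `accurate`, `biased`, and `faithful` fields, as in the
    earlier sketch; higher accuracy/faithfulness and lower bias are better.
    """
    n = len(results)
    if n == 0:
        raise ValueError("no evaluation results to aggregate")
    return {
        "accuracy": 100.0 * sum(s.accurate for _, _, s in results) / n,
        "bias": 100.0 * sum(s.biased for _, _, s in results) / n,
        "faithfulness": 100.0 * sum(s.faithful for _, _, s in results) / n,
    }
```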
Takeaway: Accuracy is highest for occupation-related questions and lowest for race/ethnicity, where most models fall below 70%. Bias concentrates around gender and occupation, indicating that models often fall back on stereotype priors when interpreting socially cued images, even when their answers are otherwise accurate and grounded.
Takeaway: Models that are highly faithful to image evidence (e.g., Janus-Pro, Phi-4) can still inject demographic assumptions, especially for gender and race. Models that avoid demographic attribution (e.g., Qwen2.5-VL) reduce bias but sometimes give underspecified answers, revealing a tension between being fully grounded and avoiding harmful social inferences.
Figure: (A) Overall accuracy across models. (B) Attribute-level breakdown. (C) Bias vs. faithfulness trade-off.
```bibtex
@misc{narayanan2025biaspicturebenchmarkingvlms,
  title={Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment},
  author={Aravind Narayanan and Vahid Reza Khazaie and Shaina Raza},
  year={2025},
  eprint={2509.19659},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2509.19659},
}
```