Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment

Vector Institute for Artificial Intelligence

This project introduces Bias in the Picture, a benchmark for evaluating how modern vision–language models respond to real-world images containing social cues such as age, gender, race, occupation, and sports. Built from 1,343 news-derived image–question pairs and annotated with ground-truth answers and demographic attributes, the benchmark reveals how visual context shapes model behavior in open-ended reasoning tasks. The evaluation pipeline includes a broad suite of VLMs, structured inference scripts, and an LLM-as-judge scoring rubric that jointly measures accuracy, faithfulness, and social bias. This work was accepted at the NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle and supports researchers aiming to audit fairness, grounding, and stereotype sensitivity in multimodal AI systems.


TL;DR: A benchmark of 1,343 news images with social cues (age, gender, race, occupation, sports) used to evaluate VLMs via LLM-as-judge scoring for accuracy, faithfulness, and stereotype bias.

Dataset Construction and Evaluation Pipeline

Figure: Illustrates our two-stage process: (top) data sourcing, filtration, and annotation across four demographic categories (age, gender, race, profession); and (bottom) task setting and multimodal evaluation for grounding, robustness, and reasoning, with outputs scored for accuracy, bias, and faithfulness.



In this benchmark, we follow the steps listed below:

  1. Data construction and annotation: We curate a dataset of 1,343 real-world news images paired with open-ended questions designed to probe both scene understanding and social cues. Each image is annotated with five demographic and social attributes—age, gender, race/ethnicity, occupation, and sport—alongside verified ground-truth answers.
  2. Model inference: We run a wide range of vision–language models on every image–question pair, including Gemini 2.0, Phi-4, Aya Vision 8B, Janus-Pro, InternVL2.5, Qwen2.5-VL, LLaMA 3.2 Vision, Molmo, CogVLM2, PaliGemma, LLaVA, and others. All models produce structured JSON outputs containing an answer and brief rationale.
  3. Evaluation: Using an LLM-as-judge scoring rubric, we compute Bias, Answer Relevance, and Faithfulness for each model output. We additionally compute statistical metrics such as BERTScore, METEOR, and FrugalScore. The results provide a detailed analysis of how visual social cues affect model behavior across all attributes and models.



Result Highlights

#1 Closed-source models lead in overall performance, but bias persists

Takeaway: Closed-source VLMs such as Gemini 2.0 and Phi-4 achieve the highest overall accuracy and faithfulness, but they still exhibit substantial social bias. In contrast, Qwen2.5-VL achieves some of the lowest bias scores but at the cost of reduced accuracy, showing that stronger grounding alone does not eliminate stereotype risk.

Model Accuracy ↑ Bias ↓ Faithfulness ↑
Gemini 2.0 85.97 15.19 78.96
Phi-4 80.00 17.10 81.67
Qwen2.5-VL 71.18 9.46 68.98

#2 Strong attribute-specific effects reveal where social biases are most pronounced

Takeaway: Accuracy is highest for occupation-related questions and lowest for race/ethnicity, where most models fall below 70%. Bias concentrates around gender and occupation, indicating that models often fall back on stereotype priors when interpreting socially cued images, even when their answers are otherwise accurate and grounded.


#3 Faithfulness and bias do not correlate, models face a trade-off

Takeaway: Models that are highly faithful to image evidence (e.g., Janus-Pro, Phi-4) can still inject demographic assumptions, especially for gender and race. Models that avoid demographic attribution (e.g., Qwen2.5-VL) reduce bias but sometimes give underspecified answers, revealing a tension between being fully grounded and avoiding harmful social inferences.

Figure: (A) Overall accuracy across models. (B) Attribute-level breakdown. (C) Bias vs. faithfulness trade-off.


To Cite:

BibTeX:
@misc{narayanan2025biaspicturebenchmarkingvlms,
        title={Bias in the Picture: Benchmarking VLMs with Social-Cue News Images and LLM-as-Judge Assessment}, 
        author={Aravind Narayanan and Vahid Reza Khazaie and Shaina Raza},
        year={2025},
        eprint={2509.19659},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2509.19659}, 
  }