Large multimodal models (LMMs) have achieved impressive performance on vision–language tasks such as visual question answering (VQA), image captioning, and visual grounding; however, they remain insufficiently evaluated for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a comprehensive benchmark comprising 32,000 real-world image–question pairs and an accompanying evaluation suite. Using a semi-automated annotation pipeline, each sample is rigorously validated by domain experts to ensure accuracy and ethical integrity. HumaniBench assesses LMMs across seven key alignment principles (fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality) through a diverse set of open- and closed-ended VQA tasks. Grounded in AI ethics theory and real-world social contexts, these principles provide a holistic lens for examining human-aligned behavior. Benchmarking results reveal distinct behavioral patterns: certain model families excel in reasoning, fairness, and multilinguality, while others demonstrate greater robustness and grounding capability. However, most models still struggle to balance task accuracy with ethical and inclusive responses. Techniques such as chain-of-thought prompting and test-time scaling yield measurable alignment gains. As the first benchmark explicitly designed for HC evaluation, HumaniBench offers a rigorous testbed for diagnosing limitations, quantifying alignment trade-offs, and promoting the responsible development of large multimodal models. All data and code are publicly released to ensure transparency and reproducibility.
Table: Comparison of LMM benchmarks with our seven human-centric principles. Columns are marked ✓ if covered, ✗ if not, or ∼ if partially covered. “HC” denotes human-centric coverage; “Data Source” indicates whether images are real (R) or synthetic (S), with (SD) for Stable Diffusion.
Figure: Dataset creation pipeline: images are extracted, filtered for duplicates using CLIP, annotated with captions and social attributes by GPT-4o, and verified by humans, yielding 13K unique images.
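As a rough illustration of the duplicate-filtering step in this pipeline, the sketch below embeds images with an off-the-shelf CLIP checkpoint and greedily drops near-duplicates above a cosine-similarity threshold. The checkpoint name, threshold value, and helper functions are illustrative assumptions, not the exact settings used in HumaniBench.

```python
# Minimal sketch of CLIP-based near-duplicate filtering.
# Assumed details: checkpoint choice, 0.95 threshold, greedy keep/drop strategy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def deduplicate(paths, threshold=0.95):
    feats = embed(paths)
    keep = []
    for i in range(len(paths)):
        # Keep image i only if it is not too similar to any already-kept image.
        if all(float(feats[i] @ feats[j]) < threshold for j in keep):
            keep.append(i)
    return [paths[i] for i in keep]
```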
The datasheet for the final HumaniBench dataset can be found here.
Figure: HumaniBench overview. ♠ = covers all seven principles; all tasks are evaluated across five social attributes (age, gender, race, occupation, sport). Sections: (i) icon row, (ii) principle definitions, (iii) seven-task suite (I = image, T = text, B = bounding box), and (iv) metric–principle alignment.
Each of the seven tasks in HumaniBench maps to one or more of the seven core human-centric principles defined above and is designed to reflect realistic, complex, and diverse scenarios.
For HumaniBench, we evaluate thirteen state-of-the-art open-source large multimodal models (LMMs), along with two closed-source models (GPT-4o and Gemini 2.0), across diverse human-alignment tasks. These models span a range of vision encoders, language backbones, and fusion techniques. By focusing on vision–language models, HumaniBench provides a rigorous testbed for assessing human-centered reasoning, fairness, empathy, and robustness in real-world multimodal scenarios; a minimal query sketch for one of the listed open-source models follows the table below.
| Large Multimodal Models (LMMs) |
|---|
| CogVLM2-19B |
| Cohere Aya Vision 8B |
| DeepSeek VL2 Small |
| GLM-4V-9B |
| InternVL2.5 8B |
| Janus-Pro-7B |
| LLaMA 3.2 11B Vision Instruct |
| LLaVA-v1.6-Vicuna-7B |
| Molmo-7B |
| Phi 4 |
| Phi 3.5 Vision Instruct |
| Qwen2.5-VL-7B Instruct |
| Gemma 3 |
| OpenAI GPT-4o (Closed source) |
| Gemini 2.0 (Closed source) |
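As a hypothetical example of how one of the open-source models above can be queried on a HumaniBench-style VQA item, the sketch below loads the Hugging Face LLaVA-v1.6-Vicuna-7B checkpoint and generates an answer for a single image–question pair. The checkpoint id, prompt template, image path, and question are assumptions for illustration, not the benchmark's actual evaluation harness.

```python
# Sketch: querying one open-source LMM (LLaVA-v1.6-Vicuna-7B) on a single
# VQA-style item. Checkpoint id, prompt format, and question are assumptions.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("sample.jpg")  # hypothetical benchmark image
prompt = (
    "USER: <image>\n"
    "What activity is the person performing, and what visual cues support your answer? "
    "ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```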
We evaluated both open-source and closed-source LMMs on HumaniBench across the seven tasks (T1–T7). This section presents our main empirical findings and highlights key challenges for LMMs.
Closed-source large multimodal models generally achieve the highest overall performance, showing strengths in fairness, reasoning, safety, multilingual coverage, and empathetic responses. They tend to produce more balanced outputs across demographics and benefit from stronger safety alignment and reinforcement learning techniques.
Open-source models, however, excel in specific capabilities. Some outperform closed alternatives in understanding tasks, particularly object recognition and visual grounding, and achieve higher robustness through specialized stabilization strategies. In reasoning and ethical safety, open models come close to matching closed models, despite using far fewer computational resources.
Overall, closed models maintain a lead in safety and inclusivity, while open models demonstrate that competitive and well-grounded results can be achieved efficiently without proprietary infrastructure.
Recent evaluations show that closed-source models like GPT-4o and Gemini-2.0 consistently outperform open-source models across social attributes such as age, race, and gender, delivering more balanced and reliable results. While open-source models like CogVLM2 and Qwen2.5 perform well in specific areas like race and sports, they show greater variability in handling gender and occupation, especially across complex tasks like scene understanding and visual grounding.
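One simple way to make the "balanced across social attributes" comparison concrete is to compute per-attribute accuracy for each model and report the gap between the best- and worst-served groups. The sketch below is our own illustrative measure, with hypothetical record and field names; it is not the metric reported in the paper.

```python
# Sketch: per-attribute accuracy and max-min gap as a rough balance measure.
# The record structure and toy values below are illustrative assumptions.
from collections import defaultdict

def attribute_accuracy(records):
    """records: iterable of dicts with 'attribute' and 'correct' (bool) keys."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["attribute"]] += 1
        hits[r["attribute"]] += int(r["correct"])
    return {a: hits[a] / totals[a] for a in totals}

def accuracy_gap(records):
    acc = attribute_accuracy(records)
    return max(acc.values()) - min(acc.values())

# Toy example over the five social attributes used in HumaniBench.
toy = [
    {"attribute": "age", "correct": True},
    {"attribute": "gender", "correct": False},
    {"attribute": "race", "correct": True},
    {"attribute": "occupation", "correct": True},
    {"attribute": "sport", "correct": False},
]
print(attribute_accuracy(toy), accuracy_gap(toy))
```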
Figure: Examples from T1 (Scene Understanding), T2 (Instance Identity), and T3 (Multiple-Choice VQA) with questions, ground-truth answers, and GPT-4o reasoning.
Figure: Multilingual qualitative examples illustrating model performance across languages (French, Urdu, Tamil) and associated social attributes (Gender, Occupation, Age). Each column presents a question, ground-truth answer, predicted answer, and error analysis.
| High-Resource Languages | Low-Resource Languages |
|---|---|
| English (Reference) | Urdu |
| French | Persian |
| Spanish | Bengali |
| Portuguese | Punjabi |
| Mandarin | Tamil |
| Korean | |
Table: Categorization of evaluation languages in the HumaniBench multilingual task. Languages are grouped into high-resource (e.g., English, French, Mandarin) and low-resource (e.g., Urdu, Tamil, Bengali) categories to assess model performance and fairness across varying linguistic resource availability.
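A minimal sketch of how per-language scores could be aggregated by resource tier is shown below; the grouping mirrors the table above, while the accuracy values and function name are hypothetical.

```python
# Sketch: averaging per-language scores by resource tier.
# Grouping follows the table above; the scores below are made-up examples.
HIGH_RESOURCE = ["English", "French", "Spanish", "Portuguese", "Mandarin", "Korean"]
LOW_RESOURCE = ["Urdu", "Persian", "Bengali", "Punjabi", "Tamil"]

def tier_averages(per_language_accuracy):
    def mean(langs):
        present = [per_language_accuracy[l] for l in langs if l in per_language_accuracy]
        return sum(present) / len(present) if present else float("nan")
    return {"high_resource": mean(HIGH_RESOURCE), "low_resource": mean(LOW_RESOURCE)}

# Toy example: a model that degrades on low-resource languages.
scores = {"English": 0.82, "French": 0.78, "Mandarin": 0.75, "Urdu": 0.61, "Tamil": 0.58}
print(tier_averages(scores))
```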
Figure: T6: Empathy & Human-Centric Response. Simple vs. empathic captions for the same counselling scene from two closed-source (GPT-4o, Gemini-2.0) and two open-source (Aya Vision, Phi-4) LMMs. Linguistic tones (Analytic, Negative, Positive) show that empathic prompts lift Positive tone, add slight Negative wording, and keep Analytic steady, indicating that prompt framing drives affective style across models.
In conclusion, HumaniBench reveals that state-of-the-art LMMs still face significant trade-offs between accuracy and human-aligned behavior. While closed-source models generally perform better overall, they exhibit weaknesses in visual grounding. Open-source models show promise in specific areas but lack consistency. Techniques such as chain-of-thought prompting and test-time scaling offer modest gains, yet alignment deficits persist. These insights underscore the need for multi-objective optimization that combines curated data, safety filters, and targeted fine-tuning to advance truly human-aligned multimodal systems.
For additional details about HumaniBench evaluation and experimental results, please refer to our main paper. Thank you!
@misc{raza2025humanibenchhumancentricframeworklarge,
title={HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation},
author={Shaina Raza and Aravind Narayanan and Vahid Reza Khazaie and Ashmal Vayani and Mukund S. Chettiar and Amandeep Singh and Mubarak Shah and Deval Pandya},
year={2025},
eprint={2505.11454},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.11454},
}