Large Multimodal Models (LMMs) perform exceptionally well on vision-language tasks but still fall short on human-centered criteria such as fairness, ethics, empathy, and inclusivity, principles essential for true alignment with human values. We introduce HumaniBench, a benchmark of approximately 32K image-question pairs drawn from real-world imagery, annotated via a scalable GPT-4o-assisted pipeline and rigorously verified by domain experts. HumaniBench evaluates models across seven human-aligned principles (fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness), covering diverse tasks that include both open- and closed-ended visual question answering (VQA). Benchmarking 15 state-of-the-art LMMs (open- and closed-source) reveals that proprietary models generally lead; however, significant gaps remain in robustness and visual grounding, while open-source models struggle to balance accuracy with adherence to human-aligned principles such as ethics and inclusivity. HumaniBench is the first unified, real-world benchmark explicitly designed around Human-Centered AI principles, providing a rigorous testbed to diagnose alignment issues and guide models toward behavior that is accurate, ethical, inclusive, and socially responsible. To promote transparency and foster future research, we publicly release the full dataset and evaluation code.
Table: Comparison of LMM benchmarks with our seven human-centric principles. Columns are marked ✓ if covered, ✗ if not, or ∼ if partially covered. “HC” denotes human-centric coverage; “Data Source” indicates whether images are real (R) or synthetic (S), with (SD) for Stable Diffusion.
Figure: Dataset creation pipeline: images are extracted, de-duplicated using CLIP embeddings, annotated with captions and social attributes by GPT-4o, and verified by humans, yielding 13K unique images.
The datasheet for the final HumaniBench dataset can be found here.
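To make the de-duplication step concrete, here is a minimal sketch of CLIP-based near-duplicate filtering using Hugging Face Transformers. The specific checkpoint (openai/clip-vit-base-patch32) and the 0.95 cosine-similarity threshold are illustrative assumptions, not values taken from our pipeline.

```python
# Minimal sketch of CLIP-based near-duplicate filtering.
# Assumptions (not from the paper): the CLIP checkpoint and the 0.95 threshold.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    """Return L2-normalized CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def deduplicate(paths, threshold=0.95):
    """Keep an image only if it is not too similar to any already-kept image."""
    feats = embed(paths)
    keep = []
    for i in range(len(paths)):
        if all(float(feats[i] @ feats[j]) < threshold for j in keep):
            keep.append(i)
    return [paths[i] for i in keep]
```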
Figure: HumaniBench overview. ♠ = covers all seven principles; all tasks are evaluated across five social attributes (age, gender, race, occupation, sport). Sections: (i) icon row, (ii) principle definitions, (iii) seven-task suite (I = image, T = text, B = bounding box), and (iv) metric–principle alignment.
Each of the seven tasks in HumaniBench corresponds to one or more of the seven core human-centric principles that we defined and is designed to reflect realistic, complex, and diverse scenarios.
For HumaniBench, we evaluate fifteen state-of-the-art Large Multimodal Models (LMMs), thirteen open-source and two closed-source, across diverse human-alignment tasks. These models span a range of vision encoders, language backbones, and fusion techniques. By focusing exclusively on vision-language models, HumaniBench provides a rigorous testbed for assessing human-centered reasoning, fairness, empathy, and robustness in real-world multimodal scenarios; a minimal inference sketch for one of the listed models follows the table below.
| Large Multimodal Models (LMMs) |
|---|
| CogVLM2-19B |
| Cohere Aya Vision 8B |
| DeepSeek VL2 Small |
| GLM-4V-9B |
| InternVL2.5 8B |
| Janus-Pro-7B |
| LLaMA 3.2 11B Vision Instruct |
| LLaVA-v1.6-Vicuna-7B |
| Molmo-7B |
| Phi 4 |
| Phi 3.5 Vision Instruct |
| Qwen2.5-VL-7B Instruct |
| Gemma 3 |
| OpenAI GPT-4o (closed-source) |
| Gemini 2.0 (closed-source) |
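As an illustration of how an open-source model from this list can be queried on a HumaniBench-style question, here is a minimal sketch using the public llava-hf/llava-v1.6-vicuna-7b-hf checkpoint via Hugging Face Transformers. The image path and question are placeholders, and this is not our exact evaluation harness.

```python
# Minimal VQA inference sketch with LLaVA-v1.6-Vicuna-7B (one of the listed models).
# The image path and question are placeholders; this is not the official harness.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg")  # placeholder image
prompt = "USER: <image>\nWhat activity is the person in the image engaged in? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```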
We evaluated both open-source and closed-source LMMs on HumaniBench across the seven tasks (T1–T7). This section presents our main empirical findings and highlights key challenges for LMMs.
Figure: HumaniBench principle-aligned scores. Each entry is the mean score of the tasks mapped to that principle (↑ higher is better). †Closed-source; all others open source.
Our evaluations show that closed-source models like GPT-4o and Gemini 2.0 consistently outperform open-source models across social attributes such as age, race, and gender, delivering more balanced and reliable results. While open-source models like CogVLM2 and Qwen2.5 perform well on specific attributes such as race and sports, they show greater variability in handling gender and occupation, especially on complex tasks like scene understanding and visual grounding.
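The aggregation behind these principle-level scores is straightforward; the sketch below shows one way to compute it, assuming a per-task score table and an illustrative task-to-principle mapping (the actual mapping in the figure is richer, since some tasks feed multiple principles).

```python
# Sketch: aggregate per-task scores into principle-level scores (higher is better).
# The task-to-principle mapping here is illustrative, not the paper's exact mapping.
from statistics import mean

PRINCIPLE_TO_TASKS = {
    "Fairness": ["T1", "T2"],
    "Understanding": ["T1", "T3"],
    "Robustness": ["T7"],
    # ... remaining principles omitted for brevity
}

def principle_scores(task_scores: dict[str, float]) -> dict[str, float]:
    """Mean score over the tasks mapped to each principle."""
    return {
        principle: mean(task_scores[t] for t in tasks if t in task_scores)
        for principle, tasks in PRINCIPLE_TO_TASKS.items()
    }

# Example: scores for one model on a subset of tasks (placeholder values)
print(principle_scores({"T1": 0.81, "T2": 0.74, "T3": 0.69, "T7": 0.55}))
```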
Figure: Examples from T1 (Scene Understanding), T2 (Instance Identity), and T3 (Multiple-Choice VQA) with questions, ground-truth answers, and GPT-4o reasoning.
Figure: Multilingual qualitative examples illustrating model performance across languages (French, Urdu, Tamil) and associated social attributes (Gender, Occupation, Age). Each column presents a question, ground-truth answer, predicted answer, and error analysis.
| High-Resource Languages | Low-Resource Languages |
|---|---|
| English (Reference) | Urdu |
| French | Persian |
| Spanish | Bengali |
| Portuguese | Punjabi |
| Mandarin | Tamil |
| Korean | |
Table: Categorization of evaluation languages in the HumaniBench multilingual task. Languages are grouped into high-resource (e.g., English, French, Mandarin) and low-resource (e.g., Urdu, Tamil, Bengali) categories to assess model performance and fairness across varying linguistic resource availability.
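As a hedged sketch of how per-language results can be rolled up into the high- vs. low-resource comparison this table supports, the snippet below groups accuracy by resource tier; the accuracy values and column names are placeholders, not reported numbers.

```python
# Sketch: compare mean accuracy across resource groups (values are placeholders).
import pandas as pd

HIGH = ["English", "French", "Spanish", "Portuguese", "Mandarin", "Korean"]
LOW = ["Urdu", "Persian", "Bengali", "Punjabi", "Tamil"]

results = pd.DataFrame({
    "language": ["English", "Urdu", "French", "Tamil"],  # placeholder rows
    "accuracy": [0.82, 0.61, 0.78, 0.54],                # placeholder scores
})
results["resource"] = results["language"].map(
    lambda lang: "high" if lang in HIGH else "low"
)
# The high-vs-low gap is one signal of multilingual fairness
print(results.groupby("resource")["accuracy"].mean())
```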
Figure: T6: Empathy & Human-Centric Response. Simple vs. empathic captions for the same counselling scene from two closed-source (GPT-4o, Gemini 2.0) and two open-source (Aya Vision, Phi-4) LMMs. Linguistic tones (Analytic, Negative, Positive) show that empathic prompts lift Positive tone, add slightly more Negative wording, and keep Analytic tone steady, indicating that prompt framing drives affective style across models.
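To illustrate how such tone shifts can be measured, here is a minimal sketch using NLTK's VADER sentiment scorer as a stand-in for the Positive/Negative tone dimensions; a full tone analysis including the Analytic dimension would require a dedicated tool such as LIWC, so treat this as an approximation with placeholder captions.

```python
# Sketch: score Positive/Negative tone of simple vs. empathic captions with VADER.
# This approximates the figure's tone comparison; it does not cover Analytic tone.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

captions = {
    "simple": "A counsellor sits across from a client in an office.",  # placeholder
    "empathic": ("A caring counsellor listens warmly as the client "
                 "shares a difficult experience."),                    # placeholder
}
for style, text in captions.items():
    scores = sia.polarity_scores(text)
    print(f"{style}: pos={scores['pos']:.2f}, neg={scores['neg']:.2f}")
```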
In conclusion, HumaniBench reveals that state-of-the-art LMMs still face significant trade-offs between accuracy and human-aligned behavior. While closed-source models generally perform better overall, they exhibit weaknesses in visual grounding. Open-source models show promise in specific areas but lack consistency. Techniques like chain-of-thought and scaling up models offer modest gains, yet alignment deficits persist. These insights underscore the need for multi-objective optimization—combining curated data, safety filters, and targeted fine-tuning—to advance truly human-aligned multimodal systems.
For additional details about HumaniBench evaluation and experimental results, please refer to our main paper. Thank you!
@misc{raza2025humanibenchhumancentricframeworklarge,
title={HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation},
author={Shaina Raza and Aravind Narayanan and Vahid Reza Khazaie and Ashmal Vayani and Mukund S. Chettiar and Amandeep Singh and Mubarak Shah and Deval Pandya},
year={2025},
eprint={2505.11454},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.11454},
}