Large multimodal models (LMMs) have achieved impressive performance on vision–language tasks such as visual question answering (VQA), image captioning, and visual grounding; however, they remain insufficiently evaluated for alignment with human-centered (HC) values such as fairness, ethics, and inclusivity. To address this gap, we introduce HumaniBench, a comprehensive benchmark comprising 32,000 real-world image–question pairs and an accompanying evaluation suite. Using a semi-automated annotation pipeline, each sample is rigorously validated by domain experts to ensure accuracy and ethical integrity. HumaniBench assesses LMMs across seven key alignment principles (fairness, ethics, empathy, inclusivity, reasoning, robustness, and multilinguality) through a diverse set of open- and closed-ended VQA tasks. Grounded in AI ethics theory and real-world social contexts, these principles provide a holistic lens for examining human-aligned behavior. Benchmarking results reveal distinct behavioral patterns: certain model families excel in reasoning, fairness, and multilinguality, while others demonstrate greater robustness and grounding capability. However, most models still struggle to balance task accuracy with ethical and inclusive responses. Techniques such as chain-of-thought prompting and test-time scaling yield measurable alignment gains. As the first benchmark explicitly designed for HC evaluation, HumaniBench offers a rigorous testbed for diagnosing limitations, quantifying alignment trade-offs, and promoting the responsible development of large multimodal models. All data and code are publicly released to ensure transparency and reproducibility.
Table: Comparison of LMM benchmarks with our seven human-centric principles. Columns are marked ✓ if covered, ✗ if not, or ∼ if partially covered. “HC” denotes human-centric coverage; “Data Source” indicates whether images are real (R) or synthetic (S), with (SD) for Stable Diffusion.
Figure: Dataset creation pipeline: images are extracted, filtered for duplicates using CLIP, annotated with captions and social attributes by GPT-4o, and verified by humans, yielding 13K unique images.
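As a rough illustration of the duplicate-filtering step in this pipeline, the sketch below embeds images with an off-the-shelf CLIP checkpoint and greedily drops near-duplicates above a cosine-similarity threshold. The checkpoint name, threshold value, and helper functions are illustrative assumptions, not the exact settings used in HumaniBench.

```python
# Minimal sketch of CLIP-based near-duplicate filtering.
# Assumed details: checkpoint choice, 0.95 threshold, greedy keep/drop strategy.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def deduplicate(paths, threshold=0.95):
    feats = embed(paths)
    keep = []
    for i in range(len(paths)):
        # Keep image i only if it is not too similar to any already-kept image.
        if all(float(feats[i] @ feats[j]) < threshold for j in keep):
            keep.append(i)
    return [paths[i] for i in keep]
```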
The datasheet for the final HumaniBench dataset can be found here.
Figure: HumaniBench overview. ♠ = covers all seven principles; all tasks are evaluated across five social attributes (age, gender, race, occupation, sport). Sections: (i) icon row, (ii) principle definitions, (iii) seven-task suite (I = image, T = text, B = bounding box), and (iv) metric–principle alignment.
Each of the seven tasks in HumaniBench maps to one or more of the seven core human-centric principles defined above and is designed to reflect realistic, complex, and diverse scenarios.
For HumaniBench, we evaluate thirteen state-of-the-art open-source large multimodal models (LMMs), along with two closed-source models (GPT-4o and Gemini 2.0), across diverse human-alignment tasks. These models span a range of vision encoders, language backbones, and fusion techniques. By focusing on vision–language models, HumaniBench provides a rigorous testbed for assessing human-centered reasoning, fairness, empathy, and robustness in real-world multimodal scenarios; a minimal query sketch for one of the listed open-source models follows the table below.
| Large Multimodal Models (LMMs) |
|---|
| CogVLM2-19B |
| Cohere Aya Vision 8B |
| DeepSeek VL2 Small |
| GLM-4V-9B |
| InternVL2.5 8B |
| Janus-Pro-7B |
| LLaMA 3.2 11B Vision Instruct |
| LLaVA-v1.6-Vicuna-7B |
| Molmo-7B |
| Phi 4 |
| Phi 3.5 Vision Instruct |
| Qwen2.5-VL-7B Instruct |
| Gemma 3 |
| OpenAI GPT-4o (Closed source) |
| Gemini 2.0 (Closed source) |
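As a hypothetical example of how one of the open-source models above can be queried on a HumaniBench-style VQA item, the sketch below loads the Hugging Face LLaVA-v1.6-Vicuna-7B checkpoint and generates an answer for a single image–question pair. The checkpoint id, prompt template, image path, and question are assumptions for illustration, not the benchmark's actual evaluation harness.

```python
# Sketch: querying one open-source LMM (LLaVA-v1.6-Vicuna-7B) on a single
# VQA-style item. Checkpoint id, prompt format, and question are assumptions.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-vicuna-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("sample.jpg")  # hypothetical benchmark image
prompt = (
    "USER: <image>\n"
    "What activity is the person performing, and what visual cues support your answer? "
    "ASSISTANT:"
)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```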
We evaluated both open-source and closed-source LMMs on HumaniBench across the seven tasks (T1–T7). This section presents our main empirical findings and highlights key challenges for LMMs.
Closed-source large multimodal models generally achieve the highest overall performance, showing strengths in fairness, reasoning, safety, multilingual coverage, and empathetic responses. They tend to produce more balanced outputs across demographics and benefit from stronger safety alignment and reinforcement learning techniques.
Open-source models, however, excel in specific capabilities. Some outperform closed alternatives in understanding tasks, particularly object recognition and visual grounding, and achieve higher robustness through specialized stabilization strategies. In reasoning and ethical safety, open models come close to matching closed models, despite using far fewer computational resources.
Overall, closed models maintain a lead in safety and inclusivity, while open models demonstrate that competitive and well-grounded results can be achieved efficiently without proprietary infrastructure.
Recent evaluations show that closed-source models like GPT-4o and Gemini-2.0 consistently outperform open-source models across social attributes such as age, race, and gender, delivering more balanced and reliable results. While open-source models like CogVLM2 and Qwen2.5 perform well in specific areas like race and sports, they show greater variability in handling gender and occupation, especially across complex tasks like scene understanding and visual grounding.
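One simple way to make the "balanced across social attributes" comparison concrete is to compute per-attribute accuracy for each model and report the gap between the best- and worst-served groups. The sketch below is our own illustrative measure, with hypothetical record and field names; it is not the metric reported in the paper.

```python
# Sketch: per-attribute accuracy and max-min gap as a rough balance measure.
# The record structure and toy values below are illustrative assumptions.
from collections import defaultdict

def attribute_accuracy(records):
    """records: iterable of dicts with 'attribute' and 'correct' (bool) keys."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["attribute"]] += 1
        hits[r["attribute"]] += int(r["correct"])
    return {a: hits[a] / totals[a] for a in totals}

def accuracy_gap(records):
    acc = attribute_accuracy(records)
    return max(acc.values()) - min(acc.values())

# Toy example over the five social attributes used in HumaniBench.
toy = [
    {"attribute": "age", "correct": True},
    {"attribute": "gender", "correct": False},
    {"attribute": "race", "correct": True},
    {"attribute": "occupation", "correct": True},
    {"attribute": "sport", "correct": False},
]
print(attribute_accuracy(toy), accuracy_gap(toy))
```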
Figure: Examples from T1 (Scene Understanding), T2 (Instance Identity), and T3 (Multiple-Choice VQA) with questions, ground-truth answers, and GPT-4o reasoning.
Figure: Multilingual qualitative examples illustrating model performance across languages (French, Urdu, Tamil) and associated social attributes (Gender, Occupation, Age). Each column presents a question, ground-truth answer, predicted answer, and error analysis.
| High-Resource Languages | Low-Resource Languages |
|---|---|
| English (Reference) | Urdu |
| French | Persian |
| Spanish | Bengali |
| Portuguese | Punjabi |
| Mandarin | Tamil |
| Korean | |
Table: Categorization of evaluation languages in the HumaniBench multilingual task. Languages are grouped into high-resource (e.g., English, French, Mandarin) and low-resource (e.g., Urdu, Tamil, Bengali) categories to assess model performance and fairness across varying linguistic resource availability.
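A minimal sketch of how per-language scores could be aggregated by resource tier is shown below; the grouping mirrors the table above, while the accuracy values and function name are hypothetical.

```python
# Sketch: averaging per-language scores by resource tier.
# Grouping follows the table above; the scores below are made-up examples.
HIGH_RESOURCE = ["English", "French", "Spanish", "Portuguese", "Mandarin", "Korean"]
LOW_RESOURCE = ["Urdu", "Persian", "Bengali", "Punjabi", "Tamil"]

def tier_averages(per_language_accuracy):
    def mean(langs):
        present = [per_language_accuracy[l] for l in langs if l in per_language_accuracy]
        return sum(present) / len(present) if present else float("nan")
    return {"high_resource": mean(HIGH_RESOURCE), "low_resource": mean(LOW_RESOURCE)}

# Toy example: a model that degrades on low-resource languages.
scores = {"English": 0.82, "French": 0.78, "Mandarin": 0.75, "Urdu": 0.61, "Tamil": 0.58}
print(tier_averages(scores))
```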
Figure: T6: Empathy & Human-Centric Response. Simple vs. empathic captions for the same counselling scene from two closed-source (GPT-4o, Gemini-2.0) and two open-source (Aya Vision, Phi-4) LMMs. Linguistic tones (Analytic, Negative, Positive) show that empathic prompts lift Positive tone, add slight Negative wording, and keep Analytic steady, indicating that prompt framing drives affective style across models.
In conclusion, HumaniBench reveals that state-of-the-art LMMs still face significant trade-offs between accuracy and human-aligned behavior. While closed-source models generally perform better overall, they exhibit weaknesses in visual grounding. Open-source models show promise in specific areas but lack consistency. Techniques such as chain-of-thought prompting and test-time scaling offer modest gains, yet alignment deficits persist. These insights underscore the need for multi-objective optimization that combines curated data, safety filters, and targeted fine-tuning to advance truly human-aligned multimodal systems.
For additional details about HumaniBench evaluation and experimental results, please refer to our main paper. Thank you!
@misc{raza2025humanibenchhumancentricframeworklarge,
title={HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation},
author={Shaina Raza and Aravind Narayanan and Vahid Reza Khazaie and Ashmal Vayani and Mukund S. Chettiar and Amandeep Singh and Mubarak Shah and Deval Pandya},
year={2025},
eprint={2505.11454},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.11454},
}