HumaniBench Logo

HumaniBench: A Human-Centric Benchmark for Large Multimodal Models Evaluation

1Vector Institute for Artificial Intelligence 2University of Central Florida


HumaniBench is the first comprehensive benchmark designed to evaluate large multimodal models (LMMs) on key human-centered principles such as fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness. Built on 32,000 real-world image–question pairs across seven diverse tasks, HumaniBench goes beyond standard accuracy metrics to probe how well models align with human needs, values, and expectations. By providing a rigorous, expert-verified evaluation framework, HumaniBench aims to guide the development of LMMs that are not only powerful and versatile but also ethical, inclusive, and trustworthy in real-world applications.

Figure: AI-assisted annotation of the HumaniBench dataset, followed by domain-expert verification (top). The benchmark has both open-ended and closed-ended VQA settings and includes seven multimodal tasks (T1–T7). Seven human-centric principles are defined, and each task is associated with one or more of them (center). Task examples include VQA across modalities and languages. The bottom panel presents the evaluation scheme, comprising LLM-based judgments and standard metrics as required by each task.



Abstract

Large Multimodal Models (LMMs) perform exceptionally well on vision-language tasks but still fall short on human-centered criteria such as fairness, ethics, empathy, and inclusivity, principles essential for true alignment with human values. We introduce HumaniBench, a benchmark comprising approximately 32K image–question pairs from real-world imagery, annotated via a scalable GPT-4o-assisted pipeline and rigorously verified by domain experts. HumaniBench evaluates models across seven human-aligned principles (fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness), covering diverse tasks that include both open- and closed-ended visual question answering (VQA). Benchmarking 15 state-of-the-art LMMs (open- and closed-source) reveals that proprietary models generally lead; however, significant gaps remain in robustness and visual grounding, while open-source models struggle to balance accuracy with adherence to human-aligned principles such as ethics and inclusivity. HumaniBench is the first unified, real-world benchmark explicitly designed around Human-Centered AI principles, providing a rigorous testbed to diagnose alignment issues and guide models toward behavior that is accurate, ethical, inclusive, and socially responsible. To promote transparency and foster future research, we publicly release the full dataset and evaluation code.

HumaniBench is the first evaluation framework to unify these human-centric principles—such as fairness, ethics, empathy, and multilingual understanding—into a single suite.

Main contributions:
  1. We release a corpus of about 32K image–text pairs curated from real-world news articles on diverse, socially relevant topics. For each image we generate a caption and assign a social-attribute tag (age, gender, race, sport, or occupation) to create rich metadata for downstream task annotations.
  2. Guided by Human-Centered AI (HCAI), we distill seven human-aligned principles into seven realistic LMM tasks: (T1) Scene Understanding, (T2) Instance Identity, (T3) Multiple-Choice VQA, (T4) Multilinguality, (T5) Visual Grounding, (T6) Empathetic Captioning, and (T7) Image Resilience. Each sample in each task is labeled through a semi-automated GPT-4o workflow and rigorously verified by domain experts to ensure reliable ground truth at scale.
  3. We benchmark 15 leading LMMs (13 open-source and 2 proprietary), delivering the first holistic measure of their human-readiness. All data, code, and evaluation scripts are publicly released to foster transparent and reproducible research.

HumaniBench Framework Overview

Table: Comparison of LMM benchmarks with our seven human-centric principles. Columns are marked ✓ if covered, ✗ if not, or ∼ if partially covered. “HC” denotes human-centric coverage; “Data Source” indicates whether images are real (R) or synthetic (S), with (SD) for Stable Diffusion.

Annotation Pipeline

Figure: Dataset creation pipeline: images are extracted, duplicates are filtered out using CLIP, captions and social attributes are generated by GPT-4o, and the results are verified by humans, yielding 13K unique images.

The datasheet for the final HumaniBench dataset can be found here.
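
As a rough illustration of the CLIP-based de-duplication step in the pipeline above, the sketch below embeds each image with a CLIP vision encoder and greedily drops near-duplicates whose cosine similarity to an already-kept image exceeds a threshold. The checkpoint name, similarity threshold, and directory layout are assumptions for illustration only, not the exact setup used to build HumaniBench.

# Minimal sketch of CLIP-based near-duplicate filtering (illustrative assumptions:
# checkpoint name, threshold, and file paths are not the authors' exact setup).
from pathlib import Path

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    """Return L2-normalized CLIP image embeddings for a list of image paths."""
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def deduplicate(paths, threshold=0.95):
    """Keep an image only if its cosine similarity to every kept image stays below the threshold."""
    feats = embed(paths)
    kept, kept_feats = [], []
    for path, feat in zip(paths, feats):
        if all(float(feat @ k) < threshold for k in kept_feats):
            kept.append(path)
            kept_feats.append(feat)
    return kept

unique_images = deduplicate(sorted(Path("images").glob("*.jpg")))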

Tasks Overview

Figure: HumaniBench overview. ♠ = covers all seven principles; all tasks are evaluated across five social attributes (age, gender, race, occupation, sport). Sections: (i) icon row, (ii) principle definitions, (iii) seven-task suite (I = image, T = text, B = bounding box), and (iv) metric–principle alignment.


Each of the seven tasks in HumaniBench corresponds to one or more of the seven core human-centric principles that we defined and is designed to reflect realistic, complex, and diverse scenarios.

  • T1: Scene Understanding
    Evaluates models on open-ended reasoning over everyday scenes with socially grounded attributes using both standard and chain-of-thought prompts.
  • T2: Instance Identity
    Tests the model's ability to identify and describe key individuals or objects in an image based on identity-relevant features.
  • T3: Multiple-Choice VQA
    Assesses fine-grained visual recognition through multiple-choice questions focused on socially salient visual attributes.
  • T4: Multilinguality
    Measures fairness and consistency in visual question answering across ten languages representing diverse cultural and linguistic contexts.
  • T5: Visual Grounding
    Evaluates how accurately a model links textual references to visual regions using bounding boxes (an illustrative IoU scoring sketch follows this task list).
  • T6: Empathetic Captioning
    Tests the model’s ability to generate emotionally sensitive yet factual image captions for complex social scenes.
  • T7: Image Resilience
    Assesses robustness by comparing model responses to original and visually perturbed versions of the same image.
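
To make the grounding evaluation in T5 concrete, the sketch below scores a predicted bounding box against a ground-truth box with intersection-over-union (IoU). The (x1, y1, x2, y2) box format and the 0.5 acceptance threshold are common conventions assumed here for illustration; they are not necessarily the exact protocol used in the benchmark.

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: count a prediction as correct if IoU >= 0.5 (assumed threshold).
pred, gold = (48, 30, 200, 180), (50, 32, 210, 190)
print(iou(pred, gold) >= 0.5)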


Large Multimodal Systems under Evaluation

For HumaniBench, we evaluate fifteen state-of-the-art Large Multimodal Models (LMMs), thirteen open-source and two closed-source, across diverse human-alignment tasks. These models represent a range of vision encoders, language backbones, and fusion techniques. By focusing exclusively on vision-language models, HumaniBench provides a rigorous testbed for assessing human-centered reasoning, fairness, empathy, and robustness in real-world multimodal scenarios.

Large Multimodal Models (LMMs)
CogVLM2-19B
Cohere Aya Vision 8B
DeepSeek VL2 Small
GLM-4V-9B
InternVL2.5 8B
Janus-Pro-7B
LLaMA 3.2 11B Vision Instruct
LLaVA-v1.6-Vicuna-7B
Molmo-7B
Phi 4
Phi 3.5 Vision Instruct
Qwen2.5-VL-7B Instruct
Gemma 3
OpenAI GPT-4o (Closed source)
Gemini 2.0 (Closed source)


General Performance Overview on HumaniBench

We evaluated both open-source and closed-source LMMs on HumaniBench across a variety of tasks (T1–T7). This section presents our main empirical findings and highlights key challenges for LMMs.

Performance Across Human-Aligned Principles

Figure: HumaniBench principle-aligned scores. Each entry is the mean score of the tasks mapped to that principle (↑ higher is better). †Closed-source; all others open source.
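
As a toy illustration of how these entries are aggregated, the snippet below averages per-task scores into per-principle scores, following the caption's description. The task-to-principle mapping and the numbers are hypothetical placeholders, not results or mappings from the paper.

# Hypothetical example of the principle-level aggregation described above:
# each principle's score is the mean of the tasks mapped to it. The mapping
# and the scores below are placeholders, not results from the paper.
task_scores = {"T1": 0.72, "T2": 0.68, "T5": 0.41, "T7": 0.55}
principle_to_tasks = {
    "Understanding": ["T1", "T2"],  # assumed mapping for illustration
    "Robustness": ["T7"],
}

principle_scores = {
    principle: sum(task_scores[t] for t in tasks) / len(tasks)
    for principle, tasks in principle_to_tasks.items()
}
print(principle_scores)  # e.g., {'Understanding': 0.70, 'Robustness': 0.55}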

Performance Across Social Attributes

Our evaluations show that closed-source models like GPT-4o and Gemini-2.0 consistently outperform open-source models across social attributes such as age, race, and gender, delivering more balanced and reliable results. While open-source models like CogVLM2 and Qwen2.5 perform well in specific areas like race and sports, they show greater variability in handling gender and occupation, especially on complex tasks like scene understanding and visual grounding.

Figure: Examples from T1 (Scene Understanding), T2 (Instance Identity), and T3 (Multiple-Choice VQA) with questions, ground-truth answers, and GPT-4o reasoning.

Figure: Multilingual qualitative examples illustrating model performance across languages (French, Urdu, Tamil) and associated social attributes (Gender, Occupation, Age). Each column presents a question, ground truth answer, predicted answer, and error analysis.

High-Resource Languages: English (Reference), French, Spanish, Portuguese, Mandarin, Korean
Low-Resource Languages: Urdu, Persian, Bengali, Punjabi, Tamil

Table: Categorization of evaluation languages in the HumaniBench multilingual task. Languages are grouped into high-resource (e.g., English, French, Mandarin) and low-resource (e.g., Urdu, Tamil, Bengali) categories to assess model performance and fairness across varying linguistic resource availability.

Figure: T6: Empathy & Human-Centric Response. Simple vs. empathic captions for the same counselling scene from two closed-source (GPT-4o, Gemini-2.0) and two open-source (Aya Vision, Phi-4) LMMs. Linguistic tones (Analytic, Negative, Positive) show that empathic prompts lift Positive tone, add slight Negative wording, and keep Analytic steady, indicating that prompt framing drives affective style across models.

Conclusion

In conclusion, HumaniBench reveals that state-of-the-art LMMs still face significant trade-offs between accuracy and human-aligned behavior. While closed-source models generally perform better overall, they exhibit weaknesses in visual grounding. Open-source models show promise in specific areas but lack consistency. Techniques like chain-of-thought and scaling up models offer modest gains, yet alignment deficits persist. These insights underscore the need for multi-objective optimization—combining curated data, safety filters, and targeted fine-tuning—to advance truly human-aligned multimodal systems.


For additional details about HumaniBench evaluation and experimental results, please refer to our main paper. Thank you!

BibTeX

@misc{raza2025humanibenchhumancentricframeworklarge,
  title={HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation},
  author={Shaina Raza and Aravind Narayanan and Vahid Reza Khazaie and Ashmal Vayani and Mukund S. Chettiar and Amandeep Singh and Mubarak Shah and Deval Pandya},
  year={2025},
  eprint={2505.11454},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2505.11454},
}