HumaniBench: A Human-Centric Benchmark for Large Multimodal Models Evaluation
Website | Paper | Dataset
Overview
As multimodal generative AI systems become increasingly integrated into human-centered applications, evaluating their alignment with human values has become critical.
HumaniBench is the first comprehensive benchmark designed to evaluate Large Multimodal Models (LMMs) on seven Human-Centered AI (HCAI) principles:
- Fairness
- Ethics
- Understanding
- Reasoning
- Language Inclusivity
- Empathy
- Robustness
Features
- 32,000+ Real-World Image–Question Pairs (see the loading sketch after this list)
- Human-Verified Ground-Truth Annotations
- Multilingual QA Support (10+ languages)
- Open- and Closed-Ended VQA Formats
- Visual Robustness & Bias Stress Testing
- Chain-of-Thought Reasoning + Perceptual Grounding
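To make the data format concrete, here is a minimal loading sketch. It assumes the benchmark is released on the Hugging Face Hub; the repository ID, split name, and field names below are placeholders rather than the dataset's actual schema, so check the linked dataset page for the real identifiers.

```python
# Minimal loading sketch. Assumes a Hugging Face Hub release; the repo ID,
# split name, and field names are placeholders, not the real schema.
from datasets import load_dataset

ds = load_dataset("vector-institute/HumaniBench", split="test")  # hypothetical repo ID
sample = ds[0]
print(sample.keys())  # e.g. image, question, answer, social_attribute, language
```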
Evaluation Tasks Overview
| Task | Focus |
|---|---|
| Task 1: Scene Understanding | Visual reasoning plus bias/toxicity analysis across social attributes (gender, age, occupation, etc.) |
| Task 2: Instance Identity | Visual reasoning in culturally rich, socially grounded settings |
| Task 3: Multiple Choice QA | Structured attribute recognition via multiple-choice questions (scoring sketch below) |
| Task 4: Multilingual Visual QA | VQA across 10+ languages, including low-resource ones |
| Task 5: Visual Grounding | Bounding-box localization of socially salient regions |
| Task 6: Empathetic Captioning | Evaluation of empathetic, human-style captions |
| Task 7: Image Resilience | Robustness testing via image perturbations |
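As an illustration of how a closed-ended task such as Task 3 can be scored, the sketch below computes exact-match accuracy. The record field names and the `ask_model` callable are assumptions made for this example, not the benchmark's actual evaluation API.

```python
# Illustrative exact-match scoring for a multiple-choice task (e.g., Task 3).
# Field names ("image", "question", "choices", "answer") and ask_model() are
# assumptions for this sketch, not the benchmark's real interface.
def multiple_choice_accuracy(records, ask_model):
    """Fraction of records where the model selects the ground-truth option."""
    correct = 0
    for rec in records:
        prediction = ask_model(rec["image"], rec["question"], rec["choices"])
        correct += int(prediction.strip().lower() == rec["answer"].strip().lower())
    return correct / len(records)
```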
𧬠Pipeline
Three-stage process:
- Data Collection: Curated from global news imagery, tagged by social attributes (age, gender, race, occupation, sport)
- Annotation: GPT-4o-assisted labeling with human expert verification
- Evaluation: Comprehensive scoring across Accuracy, Fairness, Robustness, Empathy, and Faithfulness (aggregation sketch below)
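A rough sketch of how the evaluation stage could aggregate results: per-sample scores for each principle are averaged into one number per model. The metric names follow the list above; the nested-dict layout is an assumption made purely for illustration.

```python
# Sketch of the evaluation stage: average per-sample scores into one value per
# principle. The dict-of-lists layout is an assumption for this example.
from statistics import mean

METRICS = ["accuracy", "fairness", "robustness", "empathy", "faithfulness"]

def summarize(per_sample_scores: dict[str, list[float]]) -> dict[str, float]:
    """Map each metric name to the mean of its per-sample scores."""
    return {m: round(mean(per_sample_scores[m]), 3)
            for m in METRICS if per_sample_scores.get(m)}
```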
Key Insights
- Bias persists, especially across gender and race
- Multilingual gaps degrade performance on low-resource languages
- Empathy and ethics vary significantly by model family
- Chain-of-Thought reasoning improves performance but does not fully mitigate bias
- Robustness tests reveal fragility to noise, occlusion, and blur (perturbation sketch below)
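For context on the robustness findings, the sketch below shows the kind of image perturbations (blur, additive noise, occlusion) a stress test like Task 7 might apply. The libraries and parameter values are illustrative choices, not the benchmark's exact perturbation suite.

```python
# Illustrative image perturbations (blur, noise, occlusion) of the kind used in
# robustness stress testing. Parameter values are arbitrary example choices.
import numpy as np
from PIL import Image, ImageFilter

def gaussian_blur(img: Image.Image, radius: float = 3.0) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius))

def additive_noise(img: Image.Image, sigma: float = 15.0) -> Image.Image:
    arr = np.asarray(img).astype(np.float32)
    noisy = np.clip(arr + np.random.normal(0.0, sigma, arr.shape), 0, 255)
    return Image.fromarray(noisy.astype(np.uint8))

def occlude(img: Image.Image, frac: float = 0.25) -> Image.Image:
    # Black out a random square patch covering roughly `frac` of the image area.
    arr = np.asarray(img).copy()
    h, w = arr.shape[:2]
    side = max(1, int((frac * h * w) ** 0.5))
    y = np.random.randint(0, max(h - side, 1))
    x = np.random.randint(0, max(w - side, 1))
    arr[y:y + side, x:x + side] = 0
    return Image.fromarray(arr)
```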
Citation
If you use HumaniBench or this evaluation suite in your work, please cite:
@misc{raza2025humanibenchhumancentricframeworklarge,
title={HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation},
author={Shaina Raza and Aravind Narayanan and Vahid Reza Khazaie and Ashmal Vayani and Mukund S. Chettiar and Amandeep Singh and Mubarak Shah and Deval Pandya},
year={2025},
eprint={2505.11454},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.11454},
}
Contact
For questions, collaborations, or dataset access requests, please contact the corresponding author at shaina.raza@vectorinstitute.ai.
HumaniBench promotes trustworthy, fair, and human-centered multimodal AI.
We invite researchers, developers, and policymakers to explore, evaluate, and extend HumaniBench.