Dataset Details
Overview
Our projects maintain a collection of datasets and models hosted on Hugging Face, focusing on human-centered AI, multimodal disinformation detection, and news media bias analysis. These resources support reproducible research and benchmark development for responsible AI.
Hugging Face Datasets & Repositories
HumaniBench
- Description: A large-scale benchmark with 32,000+ human-verified multilingual image-question pairs for evaluating fairness, ethics, reasoning, and empathy in Large Multimodal Models.
- Stats: 32.6k samples, 10+ languages, visual and textual annotations.
- Link: HumaniBench Dataset
VLDBench
- Description: Benchmark for multimodal disinformation detection with 62,000+ real-world image-article pairs, verified by domain experts.
- Stats: 62k+ samples, 58 news sources, multimodal labels.
- Link: VLDBench Dataset
NewsMediaBias-Plus
- Description: News article dataset spanning sources from across the ideological spectrum, annotated for bias in both text and images.
- Stats: 40.9k+ samples, multi-outlet, includes bias annotations and metadata.
- Link: NewsMediaBias-Plus Dataset
NMB-Plus Bias NER BERT
- Description: Named Entity Recognition model fine-tuned on NewsMediaBias-Plus for bias detection in entity mentions.
- Link: NER Model
Llama3.2 Multimodal Newsmedia Bias Detector
- Description: Multimodal bias detection model built on the Llama3.2 architecture to identify bias in paired text and images.
- Link: Multimodal Bias Detector
Llama3.2 NLP Newsmedia Bias Detector
- Description: NLP-based bias detector using Llama3.2, specialized for textual bias analysis in news media.
- Link: NLP Bias Detector
Additional Repositories and Models
- NMB-Plus Clean Dataset — Cleaned news media bias dataset (31.3k samples).
- maximuspowers/nmbp-bert-full-articles — BERT-based text classification on full articles.
- maximuspowers/multimodal-bias-classifier — Multimodal bias classifier.
Dataset Access & Usage
All datasets can be loaded with the Hugging Face datasets library. Example:
from datasets import load_dataset
# HumaniBench datasets by task
scene_understanding_ds = load_dataset("vector-institute/HumaniBench", "task1_Scene_Understanding")
instance_identity_ds = load_dataset("vector-institute/HumaniBench", "task2_Instance_Identity")
multiple_choice_vqa_ds = load_dataset("vector-institute/HumaniBench", "task3_Multiple_Choice_VQA")
multilingual_open_ended_ds = load_dataset("vector-institute/HumaniBench", "task4_Multilingual_OpenEnded")
multilingual_close_ended_ds = load_dataset("vector-institute/HumaniBench", "task4_Multilingual_CloseEnded")
visual_grounding_ds = load_dataset("vector-institute/HumaniBench", "task5_Visual_Grounding")
empathetic_captioning_ds = load_dataset("vector-institute/HumaniBench", "task6_Empathetic_Captioning")
image_resilience_ds = load_dataset("vector-institute/HumaniBench", "task7_Image_Resilience")
# Other datasets
vldbench_ds = load_dataset("vector-institute/VLDBench")
newsmediabias_plus_ds = load_dataset("vector-institute/newsmediabias-plus")
nmb_plus_clean_ds = load_dataset("vector-institute/nmb-plus-clean")
nmb_plus_named_entities_ds = load_dataset("vector-institute/NMB-Plus-Named-Entities")
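When several HumaniBench tasks are needed at once, the repeated calls above can be collapsed into a loop. The following is a minimal sketch, not part of the official tooling: the helper names `load_humanibench` and `summarize` are hypothetical, the config names are copied from the example above, and the split names and sizes each config returns are not stated here, so the summary simply reports whatever splits come back.

```python
from typing import Callable, Dict, Iterable, Optional

HUMANIBENCH_REPO = "vector-institute/HumaniBench"

# Config names copied from the loading example above.
HUMANIBENCH_CONFIGS = [
    "task1_Scene_Understanding",
    "task2_Instance_Identity",
    "task3_Multiple_Choice_VQA",
    "task4_Multilingual_OpenEnded",
    "task4_Multilingual_CloseEnded",
    "task5_Visual_Grounding",
    "task6_Empathetic_Captioning",
    "task7_Image_Resilience",
]


def load_humanibench(
    configs: Iterable[str] = HUMANIBENCH_CONFIGS,
    loader: Optional[Callable] = None,
) -> Dict[str, object]:
    """Load the requested HumaniBench configs, keyed by config name.

    `loader` defaults to `datasets.load_dataset`; passing a stub makes the
    helper usable without network access (hypothetical convenience).
    """
    if loader is None:
        # Imported lazily so the module can be imported without `datasets`.
        from datasets import load_dataset
        loader = load_dataset
    return {name: loader(HUMANIBENCH_REPO, name) for name in configs}


def summarize(loaded: Dict[str, object]) -> Dict[str, Dict[str, int]]:
    """Return {config: {split: number_of_rows}} for DatasetDict-like values."""
    return {
        name: {split: len(ds) for split, ds in dataset_dict.items()}
        for name, dataset_dict in loaded.items()
    }
```

For example, `summarize(load_humanibench(["task1_Scene_Understanding"]))` downloads a single task and reports the row count of each split it contains.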
News Sources & Coverage
Our datasets cover a wide spectrum of news sources, including major U.S. outlets, global media, and outlets from across the political spectrum, so that bias can be analyzed across diverse perspectives.
The sources are grouped as follows:
- Major U.S. News Outlets: CNN, Fox News, CBS News, ABC News, New York Times, Washington Post, USA Today, Wall Street Journal, AP News, Politico, New York Post, Forbes, Reuters, Bloomberg
- Global & Alternative News Sources: BBC, Al Jazeera, PBS NewsHour, The Guardian, Newsmax, HuffPost, CNBC, C-SPAN, The Economist, Financial Times, Time, Newsweek, The Atlantic, The New Yorker, The Hill, ProPublica, Axios
- Conservative & Progressive News Outlets: National Review, The Daily Beast, Daily Kos, Washington Examiner, The Federalist, OANN, Daily Caller, Breitbart
- Canadian News Sources: CBC, Toronto Sun, Global News, The Globe and Mail, National Post