The rise of AI-generated content has amplified the challenge of detecting multimodal disinformation—i.e., online posts or articles that combine images and text with fabricated information specifically designed to deceive. While prior AI safety benchmarks focus on bias and toxicity, multimodal disinformation detection remains underexplored. To address this challenge, we present the Vision-Language Disinformation Detection Benchmark \textbf{(VLDBench)}, the first comprehensive benchmark for detecting disinformation across both unimodal (text-only) and multimodal (text and image) content, comprising 31,000 news article-image pairs spanning 13 distinct categories for robust evaluation. \textbf{VLDBench} features a rigorous semi-automated data curation pipeline, with 22 domain experts dedicating 300+ hours to annotating all 31k samples and achieving strong inter-annotator agreement (Cohen’s $\kappa = 0.78$). We extensively evaluate state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs), demonstrating that integrating textual and visual cues in multimodal news posts improves disinformation detection accuracy by 5–35\% compared to unimodal models. Developed in alignment with AI governance frameworks such as the EU AI Act, NIST guidelines, and the MIT AI Risk Repository 2024, \textbf{VLDBench} is expected to become a standard benchmark for detecting disinformation in online multimodal content.
VLDBench is a comprehensive multimodal classification benchmark for disinformation detection in news articles. We categorized the data into 13 unique news categories by providing image-text pairs to GPT-4o.
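As a rough illustration of how such GPT-4o-based category assignment can be done, here is a minimal sketch using the OpenAI Python SDK. The prompt wording, truncation length, and helper name are assumptions for this sketch, not the exact VLDBench curation code.

```python
# Illustrative sketch: assign one of the 13 VLDBench news categories to an
# image-text pair with GPT-4o. Prompt text and helper names are assumptions.
from openai import OpenAI

CATEGORIES = [
    "Politics", "National", "Business & Finance", "International",
    "Local/Regional", "Entertainment", "Opinion/Editorial", "Health",
    "Other", "Sports", "Technology", "Weather & Environment", "Science",
]

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def categorize(headline: str, article_text: str, image_url: str) -> str:
    """Ask GPT-4o to pick exactly one category for an image-text pair."""
    prompt = (
        "Assign exactly one category to this news article from the list: "
        + ", ".join(CATEGORIES)
        + f".\nHeadline: {headline}\nArticle: {article_text[:2000]}\n"
        "Answer with the category name only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        temperature=0,
    )
    return response.choices[0].message.content.strip()
```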
Table: Comparison of VLDBench with contemporary datasets. Annotation: (👤) manual, (👤,⚙️) hybrid (human + AI). Access: (✔️) open-source, (❗) request required. Real: (✔️) real-world data, (❌) synthetic data. *Multiple includes Politics, National, Business & Finance, International, Local/Regional, Entertainment, Opinion/Editorial, Health, Other, Sports, Technology, Weather & Environment, and Science.
Figure: Summary statistics for VLDBench. Each article is annotated twice, once as text-only and once as text + image, yielding 62,678 labelled instances.
Figure: Disinformation Trends Across News Categories, generated by GPT-4o based on disinformation narratives and confidence levels.
Language-Only LLMs | Vision-Language Models (VLMs) |
---|---|
Phi-3-mini-128k-instruct | Phi-3-Vision-128k-Instruct |
Vicuna-7B-v1.5 | LLaVA-v1.5-Vicuna7B |
Mistral-7B-Instruct-v0.3 | LLaVA-v1.6-Mistral-7B |
Qwen2-7B-Instruct | Qwen2-VL-7B-Instruct |
InternLM2-7B | InternVL2-8B |
DeepSeek-V2-Lite-Chat | Deepseek-VL2-small |
GLM-4-9B-chat | GLM-4V-9B |
LLaMA-3.1-8B-Instruct | LLaMA-3.2-11B-Vision |
LLaMA-3.2-1B-Instruct | Deepseek Janus-Pro-7B |
- | Pixtral |
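To make the evaluation setup concrete, below is a hedged sketch of zero-shot disinformation classification with one of the listed VLMs (LLaVA-v1.5) via the Hugging Face `transformers` LLaVA interface. The checkpoint choice, prompt wording, and label parsing are illustrative assumptions, not the exact VLDBench evaluation harness.

```python
# Illustrative zero-shot disinformation prompt for LLaVA-v1.5 via Hugging Face
# transformers. Prompt wording and output parsing are assumptions for this sketch.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

def classify(article_text: str, image_path: str) -> str:
    """Return the model's raw answer to a binary disinformation prompt."""
    image = Image.open(image_path).convert("RGB")
    prompt = (
        "USER: <image>\n"
        "Does the following news article contain disinformation? "
        "Answer 'Likely' or 'Unlikely'.\n"
        f"Article: {article_text[:1500]}\n"
        "ASSISTANT:"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    inputs = inputs.to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=10, do_sample=False)
    answer = processor.decode(output[0], skip_special_tokens=True)
    return answer.split("ASSISTANT:")[-1].strip()
```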
Figure: Comparison of zero-shot vs. instruction-fine-tuned (IFT) performance, with 95% confidence intervals computed from three independent runs.
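For reference, one standard way to derive such intervals from a small number of runs is a t-based confidence interval over per-run accuracies; the sketch below uses placeholder values, not reported results.

```python
# One common way to compute a 95% confidence interval from a few independent
# runs. The accuracy values below are placeholders, not reported results.
import numpy as np
from scipy import stats

accuracies = np.array([0.71, 0.74, 0.72])  # placeholder per-run accuracies
mean = accuracies.mean()
sem = stats.sem(accuracies)                          # standard error of the mean
half_width = sem * stats.t.ppf(0.975, df=len(accuracies) - 1)
print(f"{mean:.3f} ± {half_width:.3f} (95% CI)")
```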
Figure: Textual Perturbations. We describe three controlled text perturbations—Synonym Substitution, Misspelling, and Negation—and analyse how each distorts meaning. Our evaluation shows that negation most often flips factual statements into disinformation, driving the largest drop in model accuracy.
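A minimal sketch of what such controlled text perturbations might look like; the word lists, swap rate, and negation rule are simplified assumptions, not the exact perturbation functions used for VLDBench.

```python
# Simplified illustrations of the three text perturbations. Word lists and
# rules are toy assumptions, not the exact VLDBench perturbation code.
import random
import re

SYNONYMS = {"fake": "fabricated", "claims": "asserts", "report": "account"}

def synonym_substitution(text: str) -> str:
    """Swap selected words for rough synonyms."""
    return " ".join(SYNONYMS.get(w.lower(), w) for w in text.split())

def misspell(text: str, rate: float = 0.1) -> str:
    """Randomly transpose adjacent characters inside some words."""
    words = text.split()
    for i, w in enumerate(words):
        if len(w) > 3 and random.random() < rate:
            j = random.randrange(1, len(w) - 2)
            words[i] = w[:j] + w[j + 1] + w[j] + w[j + 2:]
    return " ".join(words)

def negate(text: str) -> str:
    """Insert a negation after a common auxiliary, flipping a factual claim."""
    return re.sub(r"\b(is|was|has|are|were)\b", r"\1 not", text, count=1)
```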
Figure: Visual Perturbations. We introduce five image attacks—Gaussian Blur, Additive Noise, Resizing, Cross-Modal Mismatch (C-M), and Both-Modality (B-M)—and report that the cross-modal and combined attacks cause the greatest misclassification in the multimodal setting.
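An illustrative sketch of the image-level attacks using Pillow and NumPy; the parameter values and the description of the cross-modal pairing are assumptions for demonstration, not the exact VLDBench attack settings.

```python
# Illustrative image perturbations. Parameter values are assumptions for
# demonstration, not the exact VLDBench attack settings.
import numpy as np
from PIL import Image, ImageFilter

def gaussian_blur(img: Image.Image, radius: float = 2.0) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius))

def additive_noise(img: Image.Image, sigma: float = 15.0) -> Image.Image:
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def resize(img: Image.Image, scale: float = 0.25) -> Image.Image:
    w, h = img.size
    small = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    return small.resize((w, h))  # down- then up-sample to destroy fine detail

# Cross-Modal Mismatch (C-M): pair an article's text with an image drawn from
# a different article. Both-Modality (B-M): apply a text perturbation and an
# image perturbation (or mismatch) simultaneously.
```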
Figure: Human evaluation results on a 500-sample test set. Models were tasked with classifying disinformation and justifying their predictions. PC = prediction correctness, RC = reasoning clarity (mean ± std.).
Figure: Mapping of VLDBench components to risk mitigation strategies outlined in the MIT AI Risk and Responsibility Repository. Each pipeline component addresses specific risks related to privacy, misinformation, discrimination, robustness, and interpretability.
VLDBench addresses the urgent challenge of AI-era disinformation through responsible data stewardship, prioritizing human-centered design, ethical data sourcing, and governance-aligned evaluation (EU AI Act, NIST). Unlike existing benchmarks, it is the first to explicitly evaluate modern LLMs/VLMs on the disinformation detection task, with 62k multimodal samples spanning 13 topical categories (e.g., sports, politics). While compatible with traditional ML models, its design focuses on emerging multimodal threats. Some limitations merit discussion: (1) reliance on pre-verified news sources risks sampling bias, (2) hybrid AI-human annotations may inherit annotator biases, and (3) the English-only corpus limits multilingual applicability. Future work should expand to adversarial cross-modal attacks (e.g., deepfake text-image contradictions) and low-resource languages. Despite these constraints, VLDBench establishes a foundational step toward systematic disinformation benchmarking, enabling researchers to stress-test models against real-world deception tactics while adhering to AI governance frameworks.
For additional details about VLDBench evaluation and experimental results, please refer to our main paper. Thank you!
@misc{raza2025vldbenchvisionlanguagemodels,
title={VLDBench: Vision Language Models Disinformation Detection Benchmark},
author={Shaina Raza and Ashmal Vayani and Aditya Jain and Aravind Narayanan and Vahid Reza Khazaie and Syed Raza Bashir and Elham Dolatabadi and Gias Uddin and Christos Emmanouilidis and Rizwan Qureshi and Mubarak Shah},
year={2025},
eprint={2502.11361},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.11361},
}