VLDBench: Vision Language Models Disinformation Detection Benchmark

1Vector Institute for Artificial Intelligence 2University of Central Florida 3The University of Texas at Austin 4Sheridan College 5York University 6University of Groningen

Motivated by the growing influence of Generative AI in shaping digital narratives and the critical need to combat disinformation, we present the Vision-Language Disinformation Detection Benchmark (VLDBench). This comprehensive benchmark empowers researchers to evaluate and enhance the capabilities of AI systems in detecting multimodal disinformation, addressing the unique challenges posed by the interplay of textual and visual content. By bridging gaps in existing benchmarks, VLDBench sets the stage for building safer, more transparent, and equitable AI models that safeguard public trust in digital platforms.

Figure: Visual & Textual Disinformation Example: Amplifying fear (left: false biohazard imagery) and controversy (right: gender biases in sports), distorting perception through fabricated associations and emotional manipulation.



Abstract

The rapid rise of Generative AI (GenAI)-generated content has made detecting disinformation increasingly challenging. In particular, multimodal disinformation, i.e., online posts and articles that pair images with text containing fabricated information, is specifically designed to deceive. While existing AI safety benchmarks primarily address bias and toxicity, disinformation detection remains largely underexplored. To address this challenge, we present the Vision-Language Disinformation Detection Benchmark (VLDBench), the first comprehensive benchmark for detecting disinformation across both unimodal (text-only) and multimodal (text and image) content, comprising approximately 31,000 news article-image pairs spanning 13 distinct categories for robust evaluation. VLDBench features a rigorous semi-automated data curation pipeline, with 22 domain experts dedicating 300+ hours to annotation and achieving strong inter-annotator agreement (Cohen’s κ = 0.82). We extensively evaluate state-of-the-art Large Language Models (LLMs) and Vision-Language Models (VLMs), demonstrating that integrating textual and visual cues in multimodal news posts improves disinformation detection accuracy by 5–15% compared to unimodal models. Developed in alignment with AI governance frameworks such as the EU AI Act, NIST guidelines, and the MIT AI Risk Repository 2024, VLDBench is expected to become a standard benchmark for detecting disinformation in online multimodal content. Our code and data will be made publicly available.

VLDBench is the largest and most comprehensive human-verified disinformation detection benchmark, backed by over 300 hours of human verification.

Main contributions:
  1. VLDBench: Human-verified multimodal benchmark for disinformation detection. Curated from 58 diverse news sources, it contains 31.3k news article-image pairs spanning 13 distinct categories.
  2. Expert Annotation: Curated by 22 domain experts over 300+ hours, achieving strong inter-annotator agreement (Cohen’s κ = 0.82).
  3. Model Benchmarking: Evaluates LLMs and VLMs, identifying performance gaps and areas for improvement in addressing the challenges of multimodal disinformation in various contexts.

VLDBench Dataset Overview

VLDBench is a comprehensive multimodal classification benchmark for disinformation detection in news articles. We categorized the data into 13 unique news categories by providing image-text pairs to GPT-4o.

Table: Comparison of VLDBench with contemporary datasets.

Figure: Category distribution with overlaps. Total unique articles = 31,339. Percentages sum to > 100% due to multi-category articles.

Data Statistics

The data comprise 31,339 article-image samples curated from 58 news sources, ranging from the Financial Times, CNN, and The New York Times to Axios and The Wall Street Journal. VLDBench spans 13 unique categories: National, Business and Finance, International, Entertainment, Local/Regional, Opinion/Editorial, Health, Sports, Politics, Weather and Environment, Technology, Science, and Other, covering a broad range of disinformation domains.

Figure: Key Dataset Statistics

Three-Stage Pipeline

Figure: VLDBench is a multimodal disinformation detection framework, focusing on LLM/VLM benchmarking, human-AI collaborative annotation, and risk mitigation. It operates through a three-stage pipeline: (1) Data (collection, filtering, and quality assurance of text-image pairs), (2) Annotation (GPT-4o labeling with human validation), (3) Benchmarking (prompt-based evaluation and robustness testing).

Data

The collected dataset underwent a rigorous processing pipeline, including quality checks to remove incomplete or low-quality entries, duplicates, and irrelevant URLs. Articles were selected based on textual depth and image quality, with curated text-image news articles moved to the annotation phase for further analysis and model training.
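To make the filtering step concrete, below is a minimal Python sketch of the kind of quality checks described above. The record schema, thresholds, and helper names are hypothetical illustrations, not the exact pipeline used for VLDBench.

# Minimal sketch of the filtering step described above (hypothetical schema and thresholds).
from urllib.parse import urlparse
from PIL import Image

MIN_WORDS = 100          # assumed proxy for "textual depth"
MIN_IMAGE_SIDE = 224     # assumed proxy for "image quality"

def keep_record(record, seen_hashes):
    """Return True if a {"text", "image_path", "url"} record passes the quality checks."""
    text, image_path, url = record["text"], record["image_path"], record["url"]
    # Drop incomplete entries.
    if not text or not image_path or not url:
        return False
    # Drop irrelevant or malformed URLs.
    if urlparse(url).scheme not in ("http", "https"):
        return False
    # Drop near-empty articles.
    if len(text.split()) < MIN_WORDS:
        return False
    # Drop exact duplicates by hashing the normalized article text.
    h = hash(text.strip().lower())
    if h in seen_hashes:
        return False
    seen_hashes.add(h)
    # Drop low-resolution or unreadable images.
    try:
        with Image.open(image_path) as img:
            if min(img.size) < MIN_IMAGE_SIDE:
                return False
    except OSError:
        return False
    return True

# Usage (raw_records: iterable of dicts with "text", "image_path", "url"):
# seen = set()
# curated = [r for r in raw_records if keep_record(r, seen)]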

Annotation

After quality assurance, each article was classified by GPT-4o as either Likely or Unlikely to contain disinformation, with text-image alignment assessed three times per sample to ensure accuracy and resolve ties. The figure below shows an example of disinformation narratives analyzed by GPT-4o, highlighting confidence levels and reasoning.

Figure: Disinformation Trends Across News Categories generated by GPT-4o based on disinformation narratives and confidence levels.
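For illustration, the snippet below sketches how such a majority-vote labeling step could be implemented with the OpenAI Python client. The prompt, request settings, and tie-breaking details are simplified assumptions and do not reproduce the paper's exact annotation setup.

# Sketch of majority-vote labeling with GPT-4o (hypothetical prompt; not the paper's exact setup).
import base64
from collections import Counter
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

PROMPT = ("Given the news article text and its image, answer with exactly one word: "
          "Likely or Unlikely (to contain disinformation).")

def label_once(article_text, image_path):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"{PROMPT}\n\nArticle:\n{article_text}"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content.strip()

def label_with_majority(article_text, image_path, runs=3):
    # Query three times and keep the majority label, mirroring the three-pass assessment above.
    votes = Counter(label_once(article_text, image_path) for _ in range(runs))
    return votes.most_common(1)[0][0]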



LLMs and VLMs used for Benchmarking

After the annotation stage, we move to benchmarking, where we evaluate ten state-of-the-art open-source VLMs and nine LLMs on VLDBench: LLMs on text-only inputs and VLMs on multimodal inputs (text + images). We focus on open-source LLMs and VLMs to promote accessibility and transparency in our research. The evaluation includes both quantitative and qualitative assessments of prompt-based and fine-tuned models; a minimal prompt-based evaluation sketch follows the model list below.

Language-Only LLMs         | Vision-Language Models (VLMs)
---------------------------|------------------------------
Phi-3-mini-128k-instruct   | Phi-3-Vision-128k-Instruct
Vicuna-7B-v1.5             | LLaVA-v1.5-Vicuna7B
Mistral-7B-Instruct-v0.3   | LLaVA-v1.6-Mistral-7B
Qwen2-7B-Instruct          | Qwen2-VL-7B-Instruct
InternLM2-7B               | InternVL2-8B
DeepSeek-V2-Lite-Chat      | DeepSeek-VL2-small
GLM-4-9B-chat              | GLM-4V-9B
LLaMA-3.1-8B-Instruct      | LLaMA-3.2-11B-Vision
LLaMA-3.2-1B-Instruct      | DeepSeek Janus-Pro-7B
(n/a)                      | Pixtral
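
As a reference point, the following sketch shows a prompt-based (zero-shot) evaluation loop for one of the text-only models using Hugging Face transformers. The prompt wording, decoding settings, and answer parsing are illustrative assumptions rather than the exact benchmark harness.

# Sketch of zero-shot, prompt-based evaluation for a text-only model (illustrative prompt and parsing).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # one of the benchmarked LLMs
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto")

def classify(article_text):
    messages = [{"role": "user",
                 "content": ("Is the following news article Likely or Unlikely to contain "
                             f"disinformation? Answer with one word.\n\n{article_text}")}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
    output_ids = model.generate(input_ids, max_new_tokens=5, do_sample=False)
    answer = tokenizer.decode(output_ids[0, input_ids.shape[-1]:], skip_special_tokens=True)
    return "Unlikely" if answer.strip().lower().startswith("unlikely") else "Likely"

# predictions = [classify(example["text"]) for example in test_split]  # test_split: hypothetical

Accuracy, precision, recall, and F1 can then be computed by comparing the predicted labels against the human-verified annotations.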


Experimental results on VLDBench

Our investigation focuses on three core questions: (1) Does multimodal (text+image) data improve disinformation detection compared to text alone? (2) Does instruction-based fine-tuning enhance generalization and robustness? (3) How vulnerable are models to adversarial perturbations in text and images?

Multimodal Models Surpass Unimodal Baselines

VLMs generally outperform language-only LLMs. For example, LLaMA-3.2-11B-Vision outperforms LLaMA-3.2-1B (text-only). Similarly, Phi, LLaVA, Pixtral, InternVL, DeepSeek-VL, and GLM-4V perform better than their text-only counterparts. The performance gains between these two sets of models are quite pronounced: LLaVA-v1.5-Vicuna7B improves accuracy by 27% over its unimodal base (Vicuna-7B), highlighting the critical role of visual context. However, Qwen2-VL-7B lags marginally behind its text-only counterpart, suggesting that the effectiveness of modality integration can vary with model architecture. While top LLMs remain competitive, VLMs excel in recall, a vital trait for minimizing missed disinformation in adversarial scenarios.

Instruction Fine-Tuning Enhances Performance

Instruction fine-tuning (IFT) of models such as Phi, Mistral-LLaVA, Qwen, and LLaMA-3.2, along with their vision-language counterparts, on the training subset of VLDBench led to significant performance improvements over their zero-shot baselines across all models. For instance, Phi-3-Vision-IFT achieved a 7% increase in F1 score over its zero-shot baseline. This enhancement is not solely due to better output formatting; rather, it reflects the models' ability to adapt to and learn from disinformation-specific cues in the data. A minimal fine-tuning sketch is shown below.
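
The sketch below illustrates LoRA-based instruction fine-tuning on a VLDBench training split, assuming the Hugging Face datasets, peft, and trl libraries; the dataset path, hyperparameters, and trainer arguments are illustrative and vary across library versions.

# Sketch of LoRA-based instruction fine-tuning on the VLDBench training split (illustrative settings).
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Hypothetical path; each row is assumed to hold a formatted instruction/response string in a "text" column.
train_ds = load_dataset("json", data_files="vldbench_train.json")["train"]

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model="microsoft/Phi-3-mini-128k-instruct",  # one of the benchmarked text-only models
    train_dataset=train_ds,
    peft_config=lora,
    args=SFTConfig(output_dir="phi3-vldbench-ift", num_train_epochs=1,
                   per_device_train_batch_size=4, learning_rate=2e-4),
)
trainer.train()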

Figure: Comparison of zero-shot vs. instruction-fine-tuned (IFT) performance, with 95% confidence intervals computed from three independent runs.

Robustness to Adversarial Perturbations

Text and Image Attacks

We tested each model under controlled perturbations in the zero-shot setting. Textual perturbations include synonym substitution, misspellings, and negations. Image perturbations (I-P) include blurring, Gaussian noise, and resizing. We also include multi-modality attacks: cross-modal misalignment (C-M), e.g., mismatched image captions, and both-modality perturbations (B-P), i.e., simultaneous text and image distortions. The figures below show examples of these perturbations, followed by a short code sketch.

Figure: Examples of the text perturbations: Synonym substitution, Misspelling, and Negation. Our analysis shows that negation leads to the majority of disinformation cases.

Figure: Examples of the image perturbations: Blur, Noise, and Resizing, along with Cross-Modal Mismatch (C-M) and Both-Modality Perturbations (B-P). Our analysis shows that C-M and B-P lead to the majority of disinformation cases.
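
As referenced above, the following is a minimal sketch of how such perturbations can be generated; the function names and parameters are simple illustrative stand-ins, not the benchmark's actual perturbation tooling.

# Sketch of simple text and image perturbations of the kinds listed above (illustrative only).
import random
from PIL import Image, ImageFilter

def misspell(text, rate=0.05):
    """Randomly swap adjacent characters to simulate misspellings."""
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def negate(text):
    """Crude negation perturbation: flip a couple of common verb phrases."""
    return text.replace(" is ", " is not ", 1).replace(" has ", " has not ", 1)

def perturb_image(path, out_path, mode="blur"):
    """Apply one of the image perturbations: blur, noise, or down/up resizing."""
    img = Image.open(path).convert("RGB")
    if mode == "blur":
        img = img.filter(ImageFilter.GaussianBlur(radius=3))
    elif mode == "noise":
        noise = Image.effect_noise(img.size, 64).convert("RGB")
        img = Image.blend(img, noise, alpha=0.3)
    elif mode == "resize":
        img = img.resize((img.width // 4, img.height // 4)).resize((img.width, img.height))
    img.save(out_path)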

Combined Attacks

Combining text+image adversarial attacks can cause catastrophic performance drops in high-capacity models. These findings illustrate that multimodal methods, despite generally higher baseline accuracy, remain susceptible when adversaries deliberately target both modalities.

Human Evaluation Establishes Reliability and Reasoning Depth

We conducted a human evaluation of three IFT VLMs (LLaMA-3.2-11B, Pixtral, LLaVA-v1.6) on a balanced 500-sample test set (250 disinformation, 250 neutral). Each model classified samples and provided a rationale. Three independent reviewers, blinded to model identities, rated the outputs on Prediction Correctness (PC) and Reasoning Clarity (RC), both on a scale of 1–5. The figure below shows a representative example of model reasoning and highlights differences in explanatory quality.

Figure: Human evaluation results on a 500-sample test set. Models were tasked with classifying disinformation and justifying their predictions. PC = prediction correctness, RC = reasoning clarity (mean ± std.).
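
For reference, a minimal sketch of how the reported mean ± std. scores can be aggregated from the raw reviewer ratings is given below; the data layout is a hypothetical assumption.

# Sketch of aggregating 1-5 reviewer ratings into mean ± std. per model (hypothetical data layout).
import numpy as np

def summarize(ratings):
    # ratings[model][metric] -> array of shape (n_samples, n_reviewers), values in 1-5
    for model, metrics in ratings.items():
        for name in ("PC", "RC"):
            scores = np.asarray(metrics[name], dtype=float)
            per_sample = scores.mean(axis=1)  # average the three reviewers for each sample
            print(f"{model} {name}: {per_sample.mean():.2f} ± {per_sample.std(ddof=1):.2f}")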

Conclusion

VLDBench addresses the urgent challenge of disinformation through a design rooted in responsible data stewardship and human-centered principles, integrating best practices to meet key AI governance requirements. Unlike other benchmarks, VLDBench targets the complexity of disinformation in the post-ChatGPT era, where GenAI has amplified the scale and sophistication of false information. It is the first dataset explicitly designed to evaluate modern LLMs and VLMs on emerging disinformation challenges while maintaining a topical focus.

However, some limitations need attention. The reliance on pre-verified news sources introduces potential sampling bias, and the partially AI-based annotation process may inherit some of the underlying models' biases. More research is also needed on how adversarial attacks affect multimodal performance. The current focus on English-language content limits applicability to multilingual and culturally diverse contexts. Despite these limitations, VLDBench represents a substantial effort in benchmarking disinformation detection and opens avenues for collaboration among researchers and practitioners to address this challenge.


For additional details about VLDBench evaluation and experimental results, please refer to our main paper. Thank you!

Social Statement

Disinformation threatens democratic institutions, public trust, and social cohesion. Generative AI exacerbates the problem by enabling sophisticated multimodal campaigns that exploit cultural, political, and linguistic nuances, requiring solutions beyond technical approaches.

VLDBench addresses this challenge as the first multimodal benchmark for disinformation detection, combining text and image analysis with ethical safeguards. It prioritizes cultural sensitivity through regional annotations and mitigates bias with audits and human-AI hybrid validation. Ground-truth labels are sourced from fact-checked references with transparent provenance tracking.

As both a technical resource and a catalyst for collaboration, VLDBench democratizes access to cutting-edge detection tools by open-sourcing its benchmark and models. It highlights systemic risks, like adversarial attack vulnerabilities, to drive safer and more reliable systems. Designed to foster partnerships across academia, industry, journalism, and policymaking, VLDBench bridges the gap between research and real-world impact.

Ethical risks are carefully addressed through restricted access, exclusion of synthetic tools, and human oversight requirements. Representation gaps in non-English content are documented to guide future adaptations. Binding agreements prohibit harmful applications such as censorship, surveillance, or targeted disinformation campaigns.

By focusing exclusively on disinformation detection, VLDBench supports media literacy, unbiased fact-checking, and policy discussions on AI governance. Its ethical design and equitable access empower communities and institutions to combat disinformation while fostering trust in digital ecosystems.

BibTeX

@misc{raza2025vldbenchvisionlanguagemodels,
  title={VLDBench: Vision Language Models Disinformation Detection Benchmark},
  author={Shaina Raza and Ashmal Vayani and Aditya Jain and Aravind Narayanan and Vahid Reza Khazaie and Syed Raza Bashir and Elham Dolatabadi and Gias Uddin and Christos Emmanouilidis and Rizwan Qureshi and Mubarak Shah},
  year={2025},
  eprint={2502.11361},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.11361},
}