
SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding

Ahmed Y. Radwan1, Christos Emmanouildis2, Hina Tabassum3, Deval Pandya1, Shaina Raza1

1Vector Institute, Canada    2University of Groningen, The Netherlands    3York University, Canada   

SONIC-O1 is the first open evaluation suite for real-world, long-form audio-video interactions, targeting global comprehension, fine-grained reasoning, and temporal grounding — with demographic metadata for fairness analysis.

231 videos · ~60 hours · 4,958 human-verified QAs · 13 conversational domains · 3 evaluation tasks

Figure 1. SONIC-O1 benchmark overview. Three example tasks: (Top) Video Summarization requiring global comprehension; (Middle) Multiple-Choice Question (MCQ) with evidence-grounded reasoning; (Bottom) Temporal Localization with precise event timing. Demographic metadata (race, gender, age) shown beneath each video segment enables fairness-aware evaluation across 13 conversational domains.

Abstract

Multimodal Large Language Models (MLLMs) have become a major focus in recent AI research. However, most existing work still centers on static image understanding, while their ability to process sequential audio-video data remains underexplored. This gap highlights the need for a high-quality benchmark to systematically evaluate MLLM performance in dynamic, temporally grounded settings.

We introduce SONIC-O1, a comprehensive, fully human-verified benchmark spanning 13 real-world conversational domains with 4,958 annotations and demographic metadata. SONIC-O1 evaluates MLLMs on key tasks, including open-ended summarization, multiple-choice question answering, and temporal localization with supporting rationales (reasoning).

Experiments across both commercial closed-source and open-source models reveal important limitations. While the gap in MCQ accuracy is relatively small, we observe a substantial 22.9% performance difference in temporal localization between the best commercial and best open-source systems. Performance further degrades across demographic groups, indicating persistent disparities in model behavior.

Overall, SONIC-O1 provides an open evaluation suite for temporally grounded and socially robust multimodal understanding.

Overview

  • 231 videos
  • ~60 hours total duration
  • 4,958 human-verified QAs
  • 3 evaluation tasks

What SONIC-O1 Evaluates

  • Global comprehension — Summarization of long-form interactions
  • Fine-grained reasoning — Multiple-choice QA with evidence grounding
  • Temporal grounding — Event localization with precise timestamps
  • Demographic robustness — Performance across race, age, and gender slices

Unlike prior benchmarks that focus on short clips or single modalities, SONIC-O1 evaluates omnimodal understanding (audio + video) on realistic, long-duration interactions from high-stakes domains.

Dataset

Figure 2. Video categories. SONIC-O1 covers 5 key domains and 13 sub-class video types spanning professional, educational, legal/civic, service-oriented, and community/public health interactions.

13 Conversational Topics

Professional

  • Job Interviews
  • Workplace Team Meetings

Educational

  • Parent-Teacher Conferences

Legal / Civic

  • Courtroom Proceedings
  • Community Town Halls

Service-Oriented

  • Customer Service
  • Restaurant Service
  • Housing/Apartment Tours

Community / Public Health

  • Medical (Patient-Doctor)
  • Emergency Response
  • Public Transportation Conflicts
  • Mental Health Counseling
  • Olympics (Sports)

Figure 3. Video duration distribution by topic. SONIC-O1 spans short (<5 min), medium (5-20 min), and long (20-60 min) videos across all conversational domains.

Figure 4. Question type distribution over topics. Each domain includes annotations for all three evaluation tasks: summarization, multiple-choice QA, and temporal localization.

Demographic Coverage

SONIC-O1 includes demographic annotations across:

  • Race/Ethnicity: White, Black, Asian, Hispanic, Indigenous, Arab
  • Gender: Male, Female
  • Age: 18-24, 25-39, 40+

All demographic labels are annotated from observable characteristics via AI-assisted human review, enabling systematic fairness evaluation across demographic groups.

Evaluation Tasks

SONIC-O1 evaluates three complementary capabilities: global comprehension (summarization), fine-grained reasoning (MCQ), and temporal grounding (localization).

Task 1: Video Summarization (231 instances)

Generate narrative summaries capturing key events, actions, and outcomes across full videos (up to 60 minutes). Tests global comprehension and ability to synthesize information across long temporal spans.

Metrics:
  • LLM-as-Judge score (0–10)
  • ROUGE-L
  • Cosine similarity
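
The two reference-based metrics can be reproduced with standard libraries; the LLM-as-Judge score additionally requires a separate judge model call, which is omitted here. A minimal sketch, assuming the rouge-score and sentence-transformers packages; the embedding model name is an illustrative choice, not necessarily the one used in the paper.

# Reference-based summarization metrics: ROUGE-L F1 and embedding cosine similarity.
# Assumes the rouge-score and sentence-transformers packages; the embedding model
# below is an illustrative choice.
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

def summarization_metrics(reference: str, prediction: str) -> dict:
    # ROUGE-L F1 between the human-written reference and the generated summary
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = scorer.score(reference, prediction)["rougeL"].fmeasure

    # Cosine similarity between sentence embeddings of the two summaries
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    emb = embedder.encode([reference, prediction], convert_to_tensor=True)
    cosine = util.cos_sim(emb[0], emb[1]).item()

    return {"rouge_l": rouge_l, "cosine_similarity": cosine}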

Task 2: Multiple-Choice QA (1,335 instances)

Answer questions about 3-minute video segments with four answer choices plus a "Not enough evidence" option. Requires fine-grained comprehension and evidence-grounded reasoning across audio-visual modalities.

Metrics:
  • Accuracy (%)
  • Rationale quality (LLM judge)
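
Scoring reduces to exact-match accuracy over the selected option, with the abstention choice treated as just another option. A minimal sketch; the option labels (A–E, with E standing in for "Not enough evidence") and the answer-parsing heuristic are illustrative assumptions, not the benchmark's official protocol.

# MCQ scoring sketch: exact-match accuracy over five options (A-D plus an
# abstention option, labeled "E" here as an illustrative assumption).
import re

def parse_choice(model_output: str) -> str | None:
    # Take the first standalone option letter in the model's response.
    match = re.search(r"\b([A-E])\b", model_output.strip().upper())
    return match.group(1) if match else None

def mcq_accuracy(predictions: list[str], gold_labels: list[str]) -> float:
    correct = sum(parse_choice(p) == g for p, g in zip(predictions, gold_labels))
    return 100.0 * correct / len(gold_labels)

# Example: mcq_accuracy(["The answer is B.", "E"], ["B", "C"]) -> 50.0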

Task 3: Temporal Localization (3,392 instances)

Localize events in time with start/end timestamps and temporal relations (before/during/after). Tests whether models can identify not just what happens but when it occurs.

Metrics:
  • Recall@IoU (R@0.3, R@0.5, R@0.7)
  • Mean IoU (mIoU)
  • Mean Absolute Error (MAE)
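
These metrics follow their standard definitions: a prediction counts at threshold t if its interval IoU with the ground truth is at least t. A minimal sketch assuming one predicted (start, end) pair per query; the paper's exact aggregation may differ.

# Temporal localization metrics: Recall@IoU, mean IoU (mIoU), and mean absolute
# error (MAE) on start/end timestamps. Assumes one (start, end) prediction per query.

def interval_iou(pred, gt):
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def temporal_metrics(preds, gts, thresholds=(0.3, 0.5, 0.7)) -> dict:
    ious = [interval_iou(p, g) for p, g in zip(preds, gts)]
    recall = {f"R@{t}": 100.0 * sum(i >= t for i in ious) / len(ious) for t in thresholds}
    miou = sum(ious) / len(ious)
    # MAE averaged over both boundaries, in seconds
    mae = sum(abs(p[0] - g[0]) + abs(p[1] - g[1]) for p, g in zip(preds, gts)) / (2 * len(preds))
    return {**recall, "mIoU": miou, "MAE_seconds": mae}

# Example: temporal_metrics([(10.0, 25.0)], [(12.0, 30.0)]) -> R@0.5 = 100.0, mIoU = 0.65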

Results

We evaluate 6 state-of-the-art multimodal models on SONIC-O1. Closed-source models (Gemini 3.0 Pro) consistently outperform open-source alternatives, though temporal localization remains challenging for all systems.

| Model | LLM Params | Summarization Score (0–10) ↑ | MCQ Accuracy (%) ↑ | Temporal R@0.5 (%) ↑ |
|---|---|---|---|---|
| Gemini 3.0 Pro † | N/A | 7.07 | 96.4 | 25.4 |
| Qwen3-Omni | 30B | 5.72 | 93.6 | 2.8 |
| UniMoE-2.0 | 33B | 4.71 | 88.2 | 1.0 |
| MiniCPM-o-2.6 | 9B | 3.34 | 87.4 | 0.7 |
| VITA 1.5 | 8B | 2.77 | 81.6 | 1.2 |
| VideoLLaMA2 | 7B | 1.53 | 54.3 | 0.4 |

† Denotes closed-source model. All metrics are macro-averaged across 13 conversational topics. See the live leaderboard for full per-task breakdowns and latest submissions.
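
For readers reproducing these numbers, macro-averaging means the metric is first computed within each topic and the 13 topic-level scores are then averaged with equal weight. A minimal sketch under that reading; the "topic" and "value" field names are illustrative assumptions, not the benchmark's schema.

# Macro-average sketch: compute the metric within each topic, then average the
# topic-level means so every topic counts equally regardless of size.
# The "topic" and "value" field names are illustrative assumptions.
from collections import defaultdict

def macro_average(records: list[dict]) -> float:
    per_topic = defaultdict(list)
    for rec in records:
        per_topic[rec["topic"]].append(rec["value"])
    topic_means = [sum(vals) / len(vals) for vals in per_topic.values()]
    return sum(topic_means) / len(topic_means)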

Key Findings

  • Accuracy-temporal grounding disconnect: Gemini 3.0 Pro achieves 96.4% MCQ accuracy but only 25.4% R@0.5 for temporal localization, revealing that models can identify what happens but struggle to pinpoint when.
  • Temporal localization is the hardest task: Open-source models achieve <3% R@0.5, more than 22 points behind Gemini 3.0 Pro, indicating fundamental limitations in temporal reasoning.
  • Model scale matters: Larger models (Qwen3-Omni 30B, UniMoE-2.0 33B) significantly outperform smaller variants (7-9B), though even at scale, temporal grounding remains challenging.

Per-Topic Performance

SONIC-O1 spans 13 conversational domains across professional, civic, service-oriented, and community interactions. Performance varies significantly across topics, revealing domain-specific strengths and weaknesses.

Key Observations

  • Structured interactions are easier: Models perform best on formal settings like medical consultations, job interviews, and courtroom proceedings where interactions follow predictable patterns.
  • High-stakes scenarios are harder: Emergency response and mental health counseling show lower scores across all models, likely due to increased perceptual complexity and emotional nuance.
  • Gemini 3.0 Pro leads consistently: Achieves the highest scores across nearly all topics, with particularly strong performance on professional and educational domains.
  • Open-source models show uneven robustness: Smaller models (VideoLLaMA2, VITA 1.5) struggle more on complex topics, while larger open-source models (Qwen3-Omni) show more stable cross-topic performance.

Figure 6. Performance comparison across 13 conversational domains. We evaluate six MLLMs on video summarization using LLM-judge scores (0-10 scale, higher is better). Gemini 3.0 Pro consistently outperforms open-source models, while high-stakes scenarios (Emergency Response, Mental Health) prove more challenging than structured interactions (Medical, Job Interviews) across all models.

Fairness Analysis

SONIC-O1 includes demographic annotations (race, gender, age) to enable systematic fairness evaluation. Results reveal significant performance disparities across demographic groups, with Black and Indigenous participants showing consistently lower scores.

Summarization Fairness (LLM-as-Judge Score, 0–10)

| Model | White | Black | Asian | Hispanic | Indigenous | Arab | Gap ↓ |
|---|---|---|---|---|---|---|---|
| Gemini 3.0 Pro † | 6.68 | 6.02 | 7.05 | 6.41 | 6.70 | 6.90 | 1.03 |
| Qwen3-Omni | 5.28 | 4.39 | 5.71 | 4.99 | 4.13 | 5.95 | 1.82 |
| UniMoE-2.0-Omni | 4.29 | 3.45 | 4.62 | 3.70 | 4.35 | 5.00 | 1.55 |
| MiniCPM-o-2.6 | 3.26 | 2.92 | 3.26 | 3.04 | 3.61 | 3.57 | 0.69 |
| VITA 1.5 | 2.50 | 2.31 | 2.65 | 2.21 | 1.65 | 2.76 | 1.11 |
| VideoLLaMA2 | 1.45 | 1.38 | 1.63 | 1.23 | 1.04 | 1.00 | 0.63 |

MCQ Fairness (Accuracy, %)

| Model | White | Black | Asian | Hispanic | Indigenous | Arab | Gap ↓ |
|---|---|---|---|---|---|---|---|
| Gemini 3.0 Pro † | 96.9 | 96.4 | 97.8 | 96.1 | 94.3 | 98.4 | 4.1 |
| Qwen3-Omni | 93.3 | 92.0 | 96.1 | 92.8 | 77.1 | 96.9 | 19.8 |
| UniMoE-2.0-Omni | 88.9 | 87.4 | 89.2 | 85.5 | 80.0 | 95.3 | 15.3 |
| MiniCPM-o-2.6 | 87.5 | 86.3 | 88.7 | 81.6 | 82.9 | 92.7 | 11.1 |
| VITA 1.5 | 82.0 | 82.2 | 84.6 | 79.1 | 62.9 | 93.2 | 30.3 |
| VideoLLaMA2 | 55.1 | 55.0 | 57.9 | 51.0 | 65.7 | 66.0 | 15.0 |

Temporal Localization Fairness (Recall@0.5, %)

| Model | White | Black | Asian | Hispanic | Indigenous | Arab | Gap ↓ |
|---|---|---|---|---|---|---|---|
| Gemini 3.0 Pro † | 23.0 | 19.5 | 30.7 | 23.8 | 40.9 | 21.1 | 21.4 |
| Qwen3-Omni | 2.6 | 1.8 | 2.9 | 2.3 | 0.0 | 1.6 | 2.9 |
| UniMoE-2.0 | 1.2 | 0.6 | 0.6 | 0.1 | 1.3 | 0.2 | 1.2 |
| MiniCPM-o-2.6 | 0.9 | 0.3 | 0.8 | 0.2 | 0.0 | 2.6 | 2.6 |
| VITA 1.5 | 1.4 | 1.4 | 1.4 | 0.8 | 1.3 | 1.2 | 0.6 |
| VideoLLaMA2 | 0.5 | 0.3 | 0.4 | 0.0 | 1.3 | 0.0 | 1.3 |

† Denotes closed-source model. Gap = difference between the best and worst performing demographic group for each model (lower is better).
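
The Gap column can be recomputed from per-sample results by averaging a metric within each demographic slice and taking the spread between the best and worst slice. A minimal sketch; the "race" and "score" field names are illustrative assumptions, not the dataset's official schema.

# Fairness gap sketch: per-group means and the best-minus-worst gap for one model.
# The "race" and "score" field names are illustrative assumptions, not the
# dataset's official schema.
from collections import defaultdict

def group_gap(records: list[dict], group_key: str = "race") -> dict:
    by_group = defaultdict(list)
    for rec in records:
        by_group[rec[group_key]].append(rec["score"])
    means = {g: sum(v) / len(v) for g, v in by_group.items()}
    return {"group_means": means, "gap": max(means.values()) - min(means.values())}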

Fairness Observations

  • Black and Indigenous groups show systematically lower performance: Across most models and tasks, these groups consistently score below other demographic slices, indicating training data imbalances.
  • Temporal localization shows the largest disparities: Gemini 3.0 Pro achieves 40.9% R@0.5 for Indigenous participants but only 19.5% for Black participants, a 21.4-point gap and the model's largest across the three tasks.
  • Closed-source models are more robust: Gemini 3.0 Pro shows smaller demographic gaps compared to open-source alternatives, likely due to more diverse training data and safety alignment.
  • Gender and age show smaller but consistent gaps: Female participants and older adults (40+) consistently score slightly higher across all models, suggesting bias toward more formal interaction styles.

Conclusion

We introduced SONIC-O1, a real-world audio-video benchmark for evaluating MLLMs through the lens of fairness and AI safety. SONIC-O1 includes human-verified annotations and three tasks (summarization, multiple-choice QA, and temporal localization with supporting rationales) that assess both semantic understanding and temporal grounding.


Key Finding: Audio Dominates, Temporal Grounding Remains Hard

Our findings show that audio or transcripts often provide the strongest cues for comprehension, while temporal localization remains the most challenging setting. Gemini 3.0 Pro achieves 96.4% MCQ accuracy and generates high-quality rationales, yet attains only 25.4% recall at IoU≥0.5 for temporal localization—a 22.9% performance gap separates the best commercial and best open-source systems on this task.

We also observe persistent demographic disparities across models, emphasizing the need for more equitable multimodal evaluation. Performance degrades across demographic groups, with Black and Indigenous participants showing systematically lower scores across most models and tasks. Temporal localization exhibits the most severe disparities: several open-source models collapse to 0.0% R@0.5 for Indigenous participants, while Gemini 3.0 Pro reaches 40.9% for the same group.

Overall, SONIC-O1 offers a practical testbed for measuring robustness and fairness in realistic audio-video scenarios, and we hope it will guide future work on:

  • Stronger temporal reasoning beyond frame-level understanding
  • Broader benchmark coverage across languages, domains, and modalities
  • Fairness-aware training to address demographic performance gaps
  • Native audio-video integration rather than treating audio as optional

Quick Start

The SONIC-O1 repository provides an end-to-end pipeline for data curation, annotation generation, and model evaluation. Dataset and annotations are hosted on Hugging Face.

Installation

# Clone repository (note: nested structure)
git clone https://github.com/VectorInstitute/sonic-o1.git
cd sonic-o1/sonic-o1

# Download dataset + annotations from Hugging Face
pip install huggingface_hub
huggingface-cli download vector-institute/sonic-o1 --repo-type dataset --local-dir ./

# Setup Python environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -r requirements_venv.txt

Load Dataset

from datasets import load_dataset

# Load individual tasks
ds_summ = load_dataset("vector-institute/sonic-o1", "task1_summarization")
ds_mcq = load_dataset("vector-institute/sonic-o1", "task2_mcq")
ds_temporal = load_dataset("vector-institute/sonic-o1", "task3_temporal_localization")

# Each sample includes:
# - video metadata (ID, topic, duration, demographics)
# - question/prompt
# - ground truth answer
# - rationale (for MCQ and temporal tasks)
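
To see which fields a record actually exposes, it can help to print one sample directly. A minimal sketch; the configuration name comes from the loading example above, and the split name is resolved at runtime rather than assumed.

# Inspect one MCQ record; the first available split is used rather than assuming
# a particular split name.
from datasets import load_dataset

ds_mcq = load_dataset("vector-institute/sonic-o1", "task2_mcq")
split = next(iter(ds_mcq))        # first available split in the hosted dataset
sample = ds_mcq[split][0]
print(sample.keys())              # available fields for this task
print(sample)                     # full record for one MCQ instance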

Pipeline Stages

The repository includes modular scripts for each stage of the pipeline:

  1. 01_data_curation/ — Video search, filtering, and metadata extraction
  2. 02_caption_generation/ — Whisper-based caption generation for videos without captions
  3. 03_demographics_annotation/ — AI-assisted demographic labeling with human verification
  4. 04_vqa_generation/ — Multi-task annotation generation (summarization, MCQ, temporal)
  5. 05_evaluation_inference/ — Model evaluation scripts and metric computation

See the GitHub repository for detailed documentation and usage examples.

BibTeX

If you use SONIC-O1 in your research, please cite our paper:

@article{sonic-o1-2026,
  title  = {SONIC-O1: A Real-World Benchmark for Evaluating Multimodal Large Language Models on Audio-Video Understanding},
  author = {Radwan, Ahmed Y. and Emmanouildis, Christos and Tabassum, Hina and Pandya, Deval and Raza, Shaina},
  year   = {2026},
}

Acknowledgements

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute (http://www.vectorinstitute.ai/#partners).

This research was funded by the European Union's Horizon Europe research and innovation programme under the AIXPERT project (Grant Agreement No. 101214389), which aims to develop an agentic, multi-layered, GenAI-powered framework for creating explainable, accountable, and transparent AI systems.
