We introduced SONIC-O1, a real-world audio-video benchmark for evaluating MLLMs through the lens
of fairness and AI safety. SONIC-O1 includes human-verified annotations and three tasks:
summarization, MCQs, and temporal localization with rationales, to assess both semantic
understanding and temporal grounding.
🔍 Key Finding: Audio Dominates, Temporal Grounding Remains Hard
Our findings show that audio or transcripts often provide the strongest cues for
comprehension, while temporal localization remains the most challenging
setting. Gemini 3.0 Pro achieves 96.4% MCQ accuracy and generates
high-quality rationales, yet attains only 25.4% recall at IoU≥0.5 for
temporal localization—a 22.9% performance gap separates the best commercial
and best open-source systems on this task.
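For reference, recall at IoU≥0.5 counts a ground-truth segment as recovered only if some predicted window overlaps it with an intersection-over-union of at least 0.5. The sketch below illustrates that computation; the function names and data layout are illustrative assumptions, not SONIC-O1's actual evaluation code.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) segments, in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of ground-truth segments matched by any prediction at IoU >= threshold."""
    hits = sum(
        any(temporal_iou(p, gt) >= threshold for p in preds)
        for preds, gt in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)

# Toy example: only the first query's prediction clears the 0.5 threshold.
preds = [[(12.0, 20.0)], [(5.0, 9.0)]]
gts = [(10.0, 22.0), (30.0, 40.0)]
print(recall_at_iou(preds, gts))  # 0.5
```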
We also observe persistent demographic disparities across models, underscoring
the need for more equitable multimodal evaluation. Performance is uneven across demographic groups,
with Black and Indigenous participants receiving systematically lower scores across
most models and tasks. Temporal localization exhibits the most severe disparities: some models
collapse to 0.0% R@0.5 for Indigenous participants, while Gemini maintains 40.9% for the same group.
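A demographic breakdown of this kind follows from grouping per-example results by the annotated participant group. The sketch below shows one way to compute group-wise R@0.5 and the resulting disparity gap; the record format and field names are assumptions for illustration, not the benchmark's released tooling.

```python
from collections import defaultdict

def per_group_recall(records, threshold=0.5):
    """Group-wise Recall@IoU for records like {"group": str, "iou": float}."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        hits[r["group"]] += int(r["iou"] >= threshold)
    return {g: hits[g] / totals[g] for g in totals}

# Hypothetical per-example results with the best-prediction IoU already attached.
records = [
    {"group": "Indigenous", "iou": 0.62},
    {"group": "Indigenous", "iou": 0.10},
    {"group": "Black", "iou": 0.55},
]
scores = per_group_recall(records)
gap = max(scores.values()) - min(scores.values())  # spread between best- and worst-served groups
```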
Overall, SONIC-O1 offers a practical testbed for measuring robustness and fairness in realistic
audio-video scenarios, and we hope it will guide future work on:
- Stronger temporal reasoning beyond frame-level understanding
- Broader benchmark coverage across languages, domains, and modalities
- Fairness-aware training to address demographic performance gaps
- Native audio-video integration rather than treating audio as optional