
SONIC-O1 Multi-Agent System

A compound multi-agent system for audio-video understanding with Qwen3-Omni.

SONIC-O1 is not a single model: it is a coordinated system of specialized agents that plan, reason, ground, and reflect to produce accurate, temporally grounded answers from video and audio content.


Key Features

  • Multi-Agent Coordination


    Specialized agents for planning, reasoning, and reflection work together through a LangGraph workflow with conditional branching.

  • Chain-of-Thought Reasoning


    Step-by-step analysis with explicit reasoning traces, self-verification at each step, and confidence scoring.

  • Temporal Grounding


    A frame-captioning index splits the video into segments, captions each segment from its time-sliced video and audio, and injects the resulting timestamped index into prompts so that answers can cite times to the second.

  • Self-Reflection


    Automatic quality assessment with iterative refinement until a confidence threshold is met. Detects gaps and hallucinations.

  • Multi-Step Planning


    Automatic decomposition of complex queries into sub-tasks with sequential execution and context passing.

  • Efficient Processing


    vLLM backend with tensor parallelism, prompt caching, smart video segmentation, and per-segment audio/video slicing.
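
The temporal grounding idea above can be illustrated with a small sketch. This is not the project's implementation: the `Segment` class, the function names, the fixed segment length, and the `caption_fn` callable (a stand-in for whatever model call produces a caption) are all hypothetical, chosen only to show how a timestamped index might be assembled before it is injected into a prompt.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    caption: str = ""


def split_into_segments(duration: float, segment_len: float) -> list[Segment]:
    """Cut [0, duration] into consecutive windows of at most segment_len seconds."""
    if duration <= 0 or segment_len <= 0:
        raise ValueError("duration and segment_len must be positive")
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + segment_len, duration)
        segments.append(Segment(start, end))
        start = end
    return segments


def format_timestamp(seconds: float) -> str:
    """Render whole seconds as MM:SS."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def build_index(segments: list[Segment],
                caption_fn: Callable[[float, float], str]) -> str:
    """Caption each segment and join the results into one timestamped block.

    caption_fn is a hypothetical hook: in a real system it would slice the
    video and audio for [start, end] and ask a model for a description.
    """
    lines = []
    for seg in segments:
        seg.caption = caption_fn(seg.start, seg.end)
        span = f"[{format_timestamp(seg.start)}-{format_timestamp(seg.end)}]"
        lines.append(f"{span} {seg.caption}")
    return "\n".join(lines)
```

Calling `build_index(split_into_segments(75, 30), my_captioner)` would yield three lines, one per segment, each prefixed with its time span. Prepending that text to the question gives the model explicit timestamps to cite rather than leaving it to estimate them.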


Quick Start

# Edit config with your model path
vim configs/agent_config.yaml

# Edit SLURM script with your video/audio paths
vim slurm/run_sonic_agent_native.sh

# Submit job
sbatch slurm/run_sonic_agent_native.sh

# Monitor
tail -f logs/sonic_agent_*.out

See the User Guide for detailed usage examples and configuration options, or the Architecture page for a deep dive into the multi-agent workflow and temporal grounding strategy.


Agent Modes

| Mode | Agents Active | Use Case | Flag |
|------|---------------|----------|------|
| Direct | Planner + Temporal Index | Simple queries | (default) |
| Reasoning | Planner + Temporal Index + Reasoner | Complex analysis | --reasoning |
| Reflective | Planner + Temporal Index + Reflection | Quality-critical | --reflection |
| Multi-Step | Planner (Advanced) + Temporal Index | Comparisons, decomposition | --multi-step |
| Full | All agents | Maximum capability | --all-features |
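
As a rough illustration of how the flags in the table could map to active agents, the sketch below uses Python's standard `argparse`. The flag names come from the table; everything else is an assumption: the parser, the `active_agents` helper, the reading of "All agents" as the advanced planner plus the reasoner and reflection agents, and the choice to let flags combine. The real entry point may work differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the four mode flags listed in the table."""
    parser = argparse.ArgumentParser(description="Mode selection sketch")
    parser.add_argument("--reasoning", action="store_true")
    parser.add_argument("--reflection", action="store_true")
    parser.add_argument("--multi-step", action="store_true")
    parser.add_argument("--all-features", action="store_true")
    return parser


def active_agents(args: argparse.Namespace) -> list[str]:
    """Return the agents a given flag set would enable (assumed mapping)."""
    everything = args.all_features
    # Assumption: multi-step mode swaps in the advanced planner.
    planner = "Planner (Advanced)" if (args.multi_step or everything) else "Planner"
    agents = [planner, "Temporal Index"]
    if args.reasoning or everything:
        agents.append("Reasoner")
    if args.reflection or everything:
        agents.append("Reflection")
    return agents
```

With no flags, `active_agents(build_parser().parse_args([]))` returns the Direct-mode pair of planner and temporal index; passing `["--all-features"]` enables every agent in this assumed mapping.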

Citation

@software{sonic_o1_multi_agent,
  author = {Radwan, Ahmed Y.},
  title = {Sonic O1: A Multi-Agent System for Audio-Video Understanding},
  year = {2026},
  publisher = {Vector Institute},
  note = {Compound AI system with planning, reasoning, and reflection agents}
}

Acknowledgments

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

This research was funded by the European Union's Horizon Europe research and innovation programme under the AIXPERT project (Grant Agreement No. 101214389).

Built on: