SONIC-O1 Multi-Agent System

A compound multi-agent system for audio-video understanding with Qwen3-Omni.

SONIC-O1 is not a single model -- it is a coordinated system of specialized agents that plan, reason, ground, and reflect to produce accurate, temporally grounded answers from video and audio content.

Key Features

Multi-Agent Coordination

Specialized agents for planning, reasoning, and reflection work together through a LangGraph workflow with conditional branching.

Chain-of-Thought Reasoning

Step-by-step analysis with explicit reasoning traces, self-verification at each step, and confidence scoring.

Temporal Grounding

A frame-captioning index splits the video into segments, captions each segment from its time-sliced video and audio, and injects the resulting timestamped index into prompts so that answers can cite times to the second.

Self-Reflection

Automatic quality assessment that detects gaps and hallucinations and refines the answer iteratively until a confidence threshold is met.

Multi-Step Planning

Automatic decomposition of complex queries into sub-tasks with sequential execution and context passing.

Efficient Processing

vLLM backend with tensor parallelism, prompt caching, smart video segmentation, and per-segment audio/video slicing.
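To make the temporal grounding idea concrete, here is a minimal sketch of how a timestamped index could be built and injected into a prompt. It is not the project's implementation: `Segment`, `split_into_segments`, `build_temporal_index`, and the fixed 30-second window are hypothetical names and values chosen for illustration, and `caption_segment` is a stand-in for the real model call that would caption each time slice.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    caption: str = ""


def split_into_segments(duration: float, segment_len: float) -> list[Segment]:
    """Cover [0, duration] with consecutive windows of at most segment_len seconds."""
    if duration <= 0 or segment_len <= 0:
        raise ValueError("duration and segment_len must be positive")
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + segment_len, duration)
        segments.append(Segment(start, end))
        start = end
    return segments


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, truncating to whole seconds."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def caption_segment(segment: Segment) -> str:
    """Hypothetical placeholder: a real system would caption the sliced audio/video here."""
    return f"(caption for {segment.start:.0f}s to {segment.end:.0f}s)"


def build_temporal_index(segments: list[Segment]) -> str:
    """Join one '[MM:SS-MM:SS] caption' line per segment."""
    return "\n".join(
        f"[{format_timestamp(s.start)}-{format_timestamp(s.end)}] {s.caption}"
        for s in segments
    )


# Assumed example values: a 75-second clip split into 30-second windows.
segments = split_into_segments(duration=75.0, segment_len=30.0)
for seg in segments:
    seg.caption = caption_segment(seg)

index = build_temporal_index(segments)
prompt = f"Timestamped index:\n{index}\n\nQuestion: When does the speaker first appear?"
print(prompt)
```

Placing the index ahead of the question gives the model explicit time anchors to cite, which is the role the timestamped index plays in the description above.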

Quick Start

```bash
# Edit config with your model path
vim configs/agent_config.yaml

# Edit SLURM script with your video/audio paths
vim slurm/run_sonic_agent_native.sh

# Submit job
sbatch slurm/run_sonic_agent_native.sh

# Monitor
tail -f logs/sonic_agent_*.out
```

See the User Guide for detailed usage examples and configuration options, or the Architecture page for a deep dive into the multi-agent workflow and temporal grounding strategy.

Agent Modes

| Mode | Agents Active | Use Case | Flag |
| --- | --- | --- | --- |
| Direct | Planner + Temporal Index | Simple queries | (default) |
| Reasoning | Planner + Temporal Index + Reasoner | Complex analysis | `--reasoning` |
| Reflective | Planner + Temporal Index + Reflection | Quality-critical | `--reflection` |
| Multi-Step | Planner (Advanced) + Temporal Index | Comparisons, decomposition | `--multi-step` |
| Full | All agents | Maximum capability | `--all-features` |
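The mapping from flags to active agents in the table above can be sketched with `argparse`. This is an illustration only, not the project's command-line code: the flag names come from the table, but the agent identifiers (`planner`, `planner_advanced`, `temporal_index`, `reasoner`, `reflection`) are hypothetical. The sketch also assumes that individual flags can be combined and that "All agents" includes the advanced planner, neither of which the table states.

```python
import argparse

# Agents active in every mode, per the table (identifiers are hypothetical).
BASE_AGENTS = ["planner", "temporal_index"]


def build_parser() -> argparse.ArgumentParser:
    """Declare the four mode flags listed in the table as boolean switches."""
    parser = argparse.ArgumentParser(description="Sketch of SONIC-O1 mode flags")
    parser.add_argument("--reasoning", action="store_true")
    parser.add_argument("--reflection", action="store_true")
    parser.add_argument("--multi-step", action="store_true")
    parser.add_argument("--all-features", action="store_true")
    return parser


def active_agents(args: argparse.Namespace) -> list[str]:
    """Resolve parsed flags to the list of agents that would run."""
    agents = list(BASE_AGENTS)
    # Assumption: multi-step mode swaps in the advanced planner.
    if args.multi_step or args.all_features:
        agents[0] = "planner_advanced"
    if args.reasoning or args.all_features:
        agents.append("reasoner")
    if args.reflection or args.all_features:
        agents.append("reflection")
    return agents


parser = build_parser()
direct = active_agents(parser.parse_args([]))
reasoning = active_agents(parser.parse_args(["--reasoning"]))
full = active_agents(parser.parse_args(["--all-features"]))
print(direct)
print(reasoning)
print(full)
```

With no flags the sketch resolves to the Direct row, and `--all-features` enables every optional agent at once, mirroring the Full row.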

Citation

```bibtex
@software{sonic_o1_multi_agent,
  author = {Radwan, Ahmed Y.},
  title = {Sonic O1: A Multi-Agent System for Audio-Video Understanding},
  year = {2026},
  publisher = {Vector Institute},
  note = {Compound AI system with planning, reasoning, and reflection agents}
}
```

Acknowledgments

Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

This research was funded by the European Union's Horizon Europe research and innovation programme under the AIXPERT project (Grant Agreement No. 101214389).

Built on:

Qwen3-Omni -- Multimodal foundation model
vLLM -- Efficient LLM inference
LangGraph -- Workflow orchestration
Vector Institute AI Engineering Template