SONIC-O1 Multi-Agent System
A compound multi-agent system for audio-video understanding with Qwen3-Omni.
SONIC-O1 is not a single model -- it is a coordinated system of specialized agents that plan, reason, ground, and reflect to produce accurate, temporally-grounded answers from video and audio content.
Key Features
- Multi-Agent Coordination -- Specialized agents for planning, reasoning, and reflection work together through a LangGraph workflow with conditional branching.
- Chain-of-Thought Reasoning -- Step-by-step analysis with explicit reasoning traces, self-verification at each step, and confidence scoring.
- Temporal Grounding -- Frame-captioning index splits video into segments, captions each with time-sliced video and audio, and injects a timestamped index into prompts for accurate second-level citations.
- Self-Reflection -- Automatic quality assessment with iterative refinement until a confidence threshold is met. Detects gaps and hallucinations.
- Multi-Step Planning -- Automatic decomposition of complex queries into sub-tasks with sequential execution and context passing.
- Efficient Processing -- vLLM backend with tensor parallelism, prompt caching, smart video segmentation, and per-segment audio/video slicing.
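The temporal grounding idea above can be illustrated with a minimal sketch. It is not SONIC-O1's actual implementation: the names `Segment`, `split_into_segments`, `fmt_ts`, `build_index`, and `build_prompt`, the fixed segment length, and the prompt wording are all hypothetical, and captions are supplied by hand where the real system would obtain them from the model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float       # segment start, in seconds
    end: float         # segment end, in seconds
    caption: str = ""  # hypothetical: produced by a captioning step in a real system


def split_into_segments(duration: float, segment_len: float) -> List[Segment]:
    """Cut [0, duration) into consecutive windows of at most segment_len seconds."""
    if duration <= 0 or segment_len <= 0:
        raise ValueError("duration and segment_len must be positive")
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + segment_len, duration)  # last window may be shorter
        segments.append(Segment(start, end))
        start = end
    return segments


def fmt_ts(seconds: float) -> str:
    """Render a time offset as MM:SS, truncated to whole seconds."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def build_index(segments: List[Segment]) -> str:
    """One line per segment: [start-end] caption."""
    return "\n".join(
        f"[{fmt_ts(s.start)}-{fmt_ts(s.end)}] {s.caption}" for s in segments
    )


def build_prompt(query: str, segments: List[Segment]) -> str:
    """Prepend the timestamped index to the user query (assumed prompt layout)."""
    return (
        "Timestamped index of the video:\n"
        f"{build_index(segments)}\n\n"
        f"Question: {query}\n"
        "Cite timestamps from the index in your answer."
    )
```

Calling `split_into_segments(25.0, 10.0)` yields three windows, the last one shortened to end at 25 seconds; after captions are filled in, `build_prompt` produces the text that would be sent alongside the query.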
Quick Start
# Edit config with your model path
vim configs/agent_config.yaml
# Edit SLURM script with your video/audio paths
vim slurm/run_sonic_agent_native.sh
# Submit job
sbatch slurm/run_sonic_agent_native.sh
# Monitor
tail -f logs/sonic_agent_*.out
See the User Guide for detailed usage examples and configuration options, or the Architecture page for a deep dive into the multi-agent workflow and temporal grounding strategy.
Agent Modes
| Mode | Agents Active | Use Case | Flag |
|---|---|---|---|
| Direct | Planner + Temporal Index | Simple queries | (default) |
| Reasoning | Planner + Temporal Index + Reasoner | Complex analysis | --reasoning |
| Reflective | Planner + Temporal Index + Reflection | Quality-critical | --reflection |
| Multi-Step | Planner (Advanced) + Temporal Index | Comparisons, decomposition | --multi-step |
| Full | All agents | Maximum capability | --all-features |
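As a rough illustration of how the flags in the table could select agents, here is a small sketch using Python's standard `argparse`. Only the flag names come from the table; the functions `build_parser` and `active_agents`, the agent labels, the behaviour when several flags are combined, and the choice to use the advanced planner in Full mode are assumptions made for this example, not the project's actual CLI.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser exposing the mode flags listed in the table."""
    parser = argparse.ArgumentParser(description="Illustrative mode selector")
    parser.add_argument("--reasoning", action="store_true")
    parser.add_argument("--reflection", action="store_true")
    parser.add_argument("--multi-step", action="store_true")
    parser.add_argument("--all-features", action="store_true")
    return parser


def active_agents(args: argparse.Namespace) -> List[str]:
    """Map parsed flags to the agents the table lists for each mode."""
    # Assumption: multi-step (or full) mode swaps in the advanced planner.
    advanced = args.all_features or args.multi_step
    agents = ["planner_advanced" if advanced else "planner", "temporal_index"]
    if args.all_features or args.reasoning:
        agents.append("reasoner")
    if args.all_features or args.reflection:
        agents.append("reflection")
    return agents


if __name__ == "__main__":
    print(active_agents(build_parser().parse_args()))
```

With no flags this returns the Direct-mode pair (planner and temporal index); each flag adds the agent shown in its table row.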
Citation
@software{sonic_o1_multi_agent,
author = {Radwan, Ahmed Y.},
title = {Sonic O1: A Multi-Agent System for Audio-Video Understanding},
year = {2026},
publisher = {Vector Institute},
note = {Compound AI system with planning, reasoning, and reflection agents}
}
Acknowledgments
Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.
This research was funded by the European Union's Horizon Europe research and innovation programme under the AIXPERT project (Grant Agreement No. 101214389).
Built on:
- Qwen3-Omni -- Multimodal foundation model
- vLLM -- Efficient LLM inference
- LangGraph -- Workflow orchestration
- Vector Institute AI Engineering Template