Architecture¶
Overview¶
SONIC-O1 is a compound multi-agent system built on LangGraph. Each component is a specialised node in a state-machine workflow. The graph executes sequentially with conditional branching based on user-selected agent modes.
Multi-Agent Roles¶
| Agent | Module | Responsibility |
|---|---|---|
| Planner | agents/planner.py |
Parse temporal references, detect query type, determine segmentation, compute video duration |
| Temporal Index Builder | processors/temporal_index.py |
Split video into segments, caption each via the VLM, assemble a timestamped text index |
| Reasoner | agents/reasoner.py |
Chain-of-Thought: understand query, plan approach, execute analysis, verify, refine |
| Reflection | agents/reflection.py |
Evaluate response quality, score confidence, iteratively refine if below threshold |
| Prompt Builder | processors/prompt_builder.py |
Construct query-type-aware prompts with temporal grounding directives |
| Multimodal Engine | core/multimodal_engine.py |
Orchestration functions (backward compatible) |
| Video Processor | core/video_processor.py |
Video frame sampling (PyAV), metadata extraction |
| Audio Processor | core/audio_processor.py |
Audio loading with time-range slicing, chunking |
| Multimodal Utils | core/multimodal_utils.py |
Shared utilities, constants, math helpers |
Workflow Graph¶
┌──────────────┐
│ Planning │
└──────┬───────┘
│
┌──────▼───────┐
│ Segmentation │
└──────┬───────┘
│
┌──────▼───────────┐
│ Temporal Indexing│
└──────┬───────────┘
│
┌──────────────┼──────────────┐
│ route: multi_step / │
│ reasoning / direct │
└──┬───────────┬──────────┬───┘
│ │ │
┌──────▼──┐ ┌─────▼────┐ ┌──▼──────┐
│MultiStep│ │ Reasoning│ │ Direct │
└──────┬──┘ └─────┬────┘ └──┬──────┘
│ │ │
└───────────┼──────────┘
│
┌───────┴───────┐
│use_reflection?│
└───┬───────┬───┘
│ │
┌───────▼──┐ ┌─▼──────┐
│Reflection│ │ Cleanup│
└───────┬──┘ └───┬────┘
│ │
┌───▼──────────▼───┐
│ Cleanup │
└────────┬─────────┘
│
END
State¶
All nodes read from and write to a shared SonicState TypedDict that flows
through the graph. Key fields include:
| Field | Type | Set By |
|---|---|---|
query |
str |
User input |
plan |
dict |
Planning node |
actual_video_path |
str |
Segmentation node |
temporal_index |
str |
Temporal indexing node |
response |
str |
Reasoning or Direct node |
evidence |
dict |
Direct node |
reflection |
dict |
Reflection node |
Temporal Grounding¶
Temporal grounding ensures the model cites accurate timestamps in seconds rather than vague references. The strategy depends on video length.
Short Videos (< min_duration_sec)¶
For videos below the configured threshold (default 180 s), the temporal indexing step is skipped. The prompt builder injects a sampling-context hint telling the model the video duration, frame count, and sampling interval so it can estimate timestamps from visual progression.
Long Videos (>= min_duration_sec)¶
A frame-captioning index is built before inference:
-
Segment computation -- The video is divided into N equal-length segments (capped by
num_segments, minimum 30 s per segment). -
Per-segment captioning -- For each segment, the VLM receives only the time-sliced video frames (
video_start/video_end) and the matching audio chunk (audio_start/audio_end). This avoids loading the full file for every segment. -
Index assembly -- Captions are combined into a plain-text index:
-
Prompt injection -- The index is placed at the top of the main inference prompt, before the user query, with a directive:
IMPORTANT: A timestamped content index of the video is provided below. Use it to cite accurate timestamps in seconds when describing events.
This approach is inspired by the retrieve-then-read paradigm: expensive multimodal perception is performed once per segment (cheap, parallel-ready), and the downstream reasoning step operates on text -- which LLMs handle with high fidelity.
Audio Slicing¶
Both video and audio support time-range parameters:
- Video:
video_start/video_endrestrict frame decoding viacontainer.seek()in PyAV. - Audio:
audio_start/audio_endmap to the existingoffset/durationparameters inload_audio_pyav, loading only the relevant samples.
When no time range is specified (the main inference pass), the full file is loaded as before.
Directory Structure¶
sonic-o1-agent/
├── src/sonic_o1_agent/
│ ├── api.py # FastAPI demo server (vLLM server mode)
│ ├── demo/ # Demo UI (HTML/CSS/JS)
│ ├── agents/ # Multi-Agent System
│ │ ├── sonic_agent.py # Main orchestrator
│ │ ├── planner.py # Planning agent
│ │ ├── planner_advanced.py # Multi-step decomposition
│ │ ├── reasoner.py # Chain-of-Thought agent
│ │ └── reflection.py # Self-reflection agent
│ ├── models/ # Model wrappers
│ │ └── qwen_model.py # Qwen3-Omni + vLLM
│ ├── core/ # Multimodal processing
│ │ ├── multimodal_engine.py # Orchestration (backward compatible)
│ │ ├── video_processor.py # Video processing (PyAV)
│ │ ├── audio_processor.py # Audio processing (PyAV)
│ │ └── multimodal_utils.py # Shared utilities & constants
│ ├── processors/ # Prompt engineering & indexing
│ │ ├── prompt_builder.py # Dynamic prompt construction
│ │ └── temporal_index.py # Frame-captioned segment grounding
│ ├── workflows/ # LangGraph orchestration
│ │ ├── state.py # SonicState (workflow state schema)
│ │ ├── nodes.py # All workflow node functions
│ │ └── graph.py # StateGraph with conditional edges
│ └── utils/ # Utilities
│ └── segmenter.py # Video/audio segmentation (ffmpeg)
├── configs/
│ └── agent_config.yaml # System configuration
├── scripts/
│ ├── run_agent.py # CLI entry point
│ └── verify_setup.py # Setup verification
├── slurm/
│ └── run_sonic_agent_native.sh # SLURM job script
├── docs/ # MkDocs documentation
└── tests/ # Unit & integration tests
Model Backend¶
SONIC-O1 uses Qwen3-Omni served through vLLM for efficient multi-GPU inference.
| Feature | Detail |
|---|---|
| Model | Qwen3-Omni-30B-A3B-Instruct |
| Serving | vLLM with tensor parallelism |
| Context | Up to 65 536 tokens |
| Modalities | Native video + audio (no transcription needed) |
| Decoding | Greedy (temperature=0.0) for deterministic output |
| Caching | Prefix caching enabled for follow-up efficiency |
The model wrapper (qwen_model.py) handles:
- Lazy loading and engine recovery after crashes or OOM
- Chat template formatting via
Qwen3OmniMoeProcessor - Text-only generation for intermediate reasoning steps
- Multimodal generation with time-range-aware video and audio slicing
Dual Backend: Embedded vs Server Mode¶
The model wrapper (qwen_model.py) supports two backends, selected by a
single config key:
| Backend | When | How |
|---|---|---|
| Embedded | vllm_base_url absent or empty |
Loads model in-process via vllm.LLM. Original behaviour. |
| Server | vllm_base_url set (e.g. http://localhost:8080/v1) |
Connects to a running vllm serve via OpenAI SDK. |
In server mode, video frames are extracted client-side using the same
process_video_with_metadata logic as embedded mode (respecting
max_frames, min_frames, and time-range slicing), then sent as base64
JPEGs. This gives identical frame sampling behaviour regardless of backend
while enabling vLLM's continuous batching for concurrent requests.
The temporal index builder automatically parallelises segment captioning
in server mode via ThreadPoolExecutor, sending multiple caption requests
simultaneously to the vLLM server.