Architecture¶

Overview¶

SONIC-O1 is a compound multi-agent system built on LangGraph. Each component is a specialised node in a state-machine workflow. The graph executes sequentially with conditional branching based on user-selected agent modes.

Multi-Agent Roles¶

Agent	Module	Responsibility
Planner	`agents/planner.py`	Parse temporal references, detect query type, determine segmentation, compute video duration
Temporal Index Builder	`processors/temporal_index.py`	Split video into segments, caption each via the VLM, assemble a timestamped text index
Reasoner	`agents/reasoner.py`	Chain-of-Thought: understand query, plan approach, execute analysis, verify, refine
Reflection	`agents/reflection.py`	Evaluate response quality, score confidence, iteratively refine if below threshold
Prompt Builder	`processors/prompt_builder.py`	Construct query-type-aware prompts with temporal grounding directives
Multimodal Engine	`core/multimodal_engine.py`	Orchestration functions (backward compatible)
Video Processor	`core/video_processor.py`	Video frame sampling (PyAV), metadata extraction
Audio Processor	`core/audio_processor.py`	Audio loading with time-range slicing, chunking
Multimodal Utils	`core/multimodal_utils.py`	Shared utilities, constants, math helpers

Workflow Graph¶

                    ┌──────────────┐
                    │   Planning   │
                    └──────┬───────┘
                           │
                    ┌──────▼───────┐
                    │ Segmentation │
                    └──────┬───────┘
                           │
                    ┌──────▼───────────┐
                    │ Temporal Indexing│
                    └──────┬───────────┘
                           │
            ┌──────────────┼──────────────┐
            │  route: multi_step /        │
            │  reasoning / direct         │
            └──┬───────────┬──────────┬───┘
               │           │          │
        ┌──────▼──┐  ┌─────▼────┐  ┌──▼──────┐
        │MultiStep│  │ Reasoning│  │  Direct │
        └──────┬──┘  └─────┬────┘  └──┬──────┘
               │           │          │
               └───────────┼──────────┘
                          │
                  ┌───────┴───────┐
                  │use_reflection?│
                  └───┬───────┬───┘
                      │       │
              ┌───────▼──┐  ┌─▼──────┐
              │Reflection│  │ Cleanup│
              └───────┬──┘  └───┬────┘
                      │         │
                  ┌───▼──────────▼───┐
                  │     Cleanup      │
                  └────────┬─────────┘
                           │
                          END

State¶

All nodes read from and write to a shared SonicState TypedDict that flows through the graph. Key fields include:

Field	Type	Set By
`query`	`str`	User input
`plan`	`dict`	Planning node
`actual_video_path`	`str`	Segmentation node
`temporal_index`	`str`	Temporal indexing node
`response`	`str`	Reasoning or Direct node
`evidence`	`dict`	Direct node
`reflection`	`dict`	Reflection node

Temporal Grounding¶

Temporal grounding ensures the model cites accurate timestamps in seconds rather than vague references. The strategy depends on video length.

Short Videos (< `min_duration_sec`)¶

For videos below the configured threshold (default 180 s), the temporal indexing step is skipped. The prompt builder injects a sampling-context hint telling the model the video duration, frame count, and sampling interval so it can estimate timestamps from visual progression.

Long Videos (>= `min_duration_sec`)¶

A frame-captioning index is built before inference:

Segment computation -- The video is divided into N equal-length segments (capped by num_segments, minimum 30 s per segment).
Per-segment captioning -- For each segment, the VLM receives only the time-sliced video frames (video_start / video_end) and the matching audio chunk (audio_start / audio_end). This avoids loading the full file for every segment.

Index assembly -- Captions are combined into a plain-text index:

[0s - 92s] Speaker introduces the case and outlines the charges.
[92s - 184s] Defense attorney presents opening statement.
[184s - 277s] Prosecution shows exhibit A, a security recording.
...

Prompt injection -- The index is placed at the top of the main inference prompt, before the user query, with a directive:

IMPORTANT: A timestamped content index of the video is provided below. Use it to cite accurate timestamps in seconds when describing events.

This approach is inspired by the retrieve-then-read paradigm: expensive multimodal perception is performed once per segment (cheap, parallel-ready), and the downstream reasoning step operates on text -- which LLMs handle with high fidelity.

Audio Slicing¶

Both video and audio support time-range parameters:

Video: video_start / video_end restrict frame decoding via container.seek() in PyAV.
Audio: audio_start / audio_end map to the existing offset / duration parameters in load_audio_pyav, loading only the relevant samples.

When no time range is specified (the main inference pass), the full file is loaded as before.

Directory Structure¶

sonic-o1-agent/
├── src/sonic_o1_agent/
│   ├── agents/                       # Multi-Agent System
│   │   ├── sonic_agent.py            # Main orchestrator
│   │   ├── planner.py                # Planning agent
│   │   ├── planner_advanced.py       # Multi-step decomposition
│   │   ├── reasoner.py               # Chain-of-Thought agent
│   │   └── reflection.py             # Self-reflection agent
│   ├── models/                       # Model wrappers
│   │   └── qwen_model.py             # Qwen3-Omni + vLLM
│   ├── core/                         # Multimodal processing
│   │   ├── multimodal_engine.py      # Orchestration (backward compatible)
│   │   ├── video_processor.py        # Video processing (PyAV)
│   │   ├── audio_processor.py       # Audio processing (PyAV)
│   │   └── multimodal_utils.py     # Shared utilities & constants
│   ├── processors/                   # Prompt engineering & indexing
│   │   ├── prompt_builder.py         # Dynamic prompt construction
│   │   └── temporal_index.py         # Frame-captioned segment grounding
│   ├── workflows/                    # LangGraph orchestration
│   │   ├── state.py                  # SonicState (workflow state schema)
│   │   ├── nodes.py                  # All workflow node functions
│   │   └── graph.py                  # StateGraph with conditional edges
│   └── utils/                        # Utilities
│       └── segmenter.py              # Video/audio segmentation (ffmpeg)
├── configs/
│   └── agent_config.yaml             # System configuration
├── scripts/
│   ├── run_agent.py                  # CLI entry point
│   └── verify_setup.py              # Setup verification
├── slurm/
│   └── run_sonic_agent_native.sh     # SLURM job script
├── docs/                             # MkDocs documentation
└── tests/                            # Unit & integration tests

Model Backend¶

SONIC-O1 uses Qwen3-Omni served through vLLM for efficient multi-GPU inference.

Feature	Detail
Model	Qwen3-Omni-30B-A3B-Instruct
Serving	vLLM with tensor parallelism
Context	Up to 65 536 tokens
Modalities	Native video + audio (no transcription needed)
Decoding	Greedy (`temperature=0.0`) for deterministic output
Caching	Prefix caching enabled for follow-up efficiency

The model wrapper (qwen_model.py) handles:

Lazy loading and engine recovery after crashes or OOM
Chat template formatting via Qwen3OmniMoeProcessor
Text-only generation for intermediate reasoning steps
Multimodal generation with time-range-aware video and audio slicing