Abstract

Agentic AI systems extend large language models with planning, tool use, memory, and multi-step control loops, shifting deployment from a single predictive model to a behavior-producing system. We argue that interpretability has not made the corresponding shift: prevailing methods remain model-centric, explaining isolated outputs rather than diagnosing long-horizon plans, tool-mediated actions, and multi-agent coordination. This gap limits auditability and accountability because failures emerge from interactions among planning, memory updates, delegation, and environmental feedback. \textbf{We advance the position that interpretability for agentic AI must be system-centric, focusing on trajectories, responsibility assignment, and lifecycle dynamics rather than only internal model mechanisms.} To operationalize this view, we propose the Agentic Trajectory and Layered Interpretability Stack (ATLIS), spanning real-time behavioral monitoring, mechanistic analysis, abstraction bridging, multi-agent coordination analysis, and safety and alignment oversight. We map these layers to a five-stage agent lifecycle and motivate risk-aware activation of high-cost analyses during incidents alongside continuous low-overhead monitoring in production.

Introduction

(Add introduction here.)

Position

The interpretability field is solving the wrong problem for the agentic era. Current methods explain how individual models compute outputs but cannot explain why an agent selected a particular plan, how coordination failed, or where accountability lies. We argue three points: (1) interpretability methods must co-evolve with agentic capabilities rather than follow them, embedding transparency into planning, tool use, and memory from the outset; (2) agentic opacity arises at distinct layers (behavioral, mechanistic, coordination, and safety), each requiring tailored methods; and (3) interpretability must be integrated across the full agent development lifecycle rather than serve as a one-time audit.

Core Claims

  1. Coevolution over reaction. Interpretability methods must co-evolve with agentic capabilities rather than follow them, embedding transparency into planning, tool use, and memory from the outset.
  2. Layered decomposition. Agentic opacity arises at distinct layers (behavioral, mechanistic, coordination, and safety), each of which requires tailored methods.
  3. Lifecycle integration. Interpretability must be integrated across the full agent development lifecycle rather than serve as a one-time audit.

Alternate Positions and Counterarguments

(Add counterpositions here.)

Conceptual Framework

(Briefly introduce ATLIS and the lifecycle + layered stack figure)

The Agentic Trajectory and Layered Interpretability Stack (ATLIS) couples an agentic deployment lifecycle with an integrated interpretability stack. The framework integrates five interpretability layers across the agent lifecycle: (1) Real-Time Behavioral Monitoring tracks observable agent actions; (2) Mechanistic Circuit Analysis examines internal model representations; (3) Abstraction-Level Bridging connects low-level circuits to high-level reasoning; (4) Multi-Agent Analysis evaluates coordination dynamics; and (5) Safety and Alignment oversight enforces adherence to predefined objectives. The framework also incorporates two feedback loops: blue arrows in the figure denote the monitoring refinement loop, while orange arrows denote the safety and alignment revision loop. Computational overhead ranges from low (Layer 1, continuous monitoring) to high (Layer 2, full circuit extraction during incident response).
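
To make the layered stack and its risk-aware activation concrete, the following minimal Python sketch encodes the five layers with overhead tiers and a simple activation policy: continuous, low-overhead layers run in production, and high-cost analyses such as full circuit extraction activate only during incident response or when an elevated risk signal is present. The overhead tiers assigned to Layers 3 through 5, the risk_score input, and the 0.8 threshold are illustrative assumptions rather than commitments of the framework.

from dataclasses import dataclass
from enum import Enum


class Overhead(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Layer:
    # One ATLIS interpretability layer; names follow the framework text.
    name: str
    overhead: Overhead
    continuous: bool  # runs continuously in production vs. only on demand


# Overhead tiers for Layers 3-5 are assumptions made for this sketch.
ATLIS_LAYERS = [
    Layer("Real-Time Behavioral Monitoring", Overhead.LOW, continuous=True),
    Layer("Mechanistic Circuit Analysis", Overhead.HIGH, continuous=False),
    Layer("Abstraction-Level Bridging", Overhead.MEDIUM, continuous=False),
    Layer("Multi-Agent Analysis", Overhead.MEDIUM, continuous=True),
    Layer("Safety and Alignment", Overhead.LOW, continuous=True),
]


def active_layers(incident: bool, risk_score: float, risk_threshold: float = 0.8):
    # Risk-aware activation: continuous low-overhead layers always run;
    # high-cost analyses are added during incidents or under elevated risk.
    escalate = incident or risk_score >= risk_threshold
    return [layer for layer in ATLIS_LAYERS if layer.continuous or escalate]


if __name__ == "__main__":
    # Routine production step: only the continuous layers run.
    print([layer.name for layer in active_layers(incident=False, risk_score=0.2)])
    # Incident response: the full stack, including circuit extraction, runs.
    print([layer.name for layer in active_layers(incident=True, risk_score=0.2)])

In a deployment, the boolean incident flag and scalar risk score could be replaced by signals produced by Layer 1 monitoring, which is one way to realize the monitoring refinement loop described above.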

Illustrative Case Study

(Add illustrative hospital use case: using ATLIS for disease diagnosis.)

Call to Action

Discussion and Limitations

(Add discussion of implementation limitations of ATLIS.)

BibTeX

@inproceedings{atlis2026position,
  title     = {Position: As we move from models to systems, we need and should use the Agentic Trajectory and Layered Interpretability Stack (ATLIS)},
  author    = {Zhu, Judy and Gandhi, Dhari and Mianroodi, Ahmad Rezaie and Ramachandran, Dhanesh and Raza, Shaina and Kocak, Sedef Akinli},
  booktitle = {International Conference on Machine Learning (ICML) Position Paper},
  year      = {2026},
  note      = {Under review}
}

Update venue/year/status when submission details are finalized.