Accountable Deployment of Agentic AI Demands Layered, System-Level Interpretability

Authors: Judy Zhu1, Dhari Gandhi1, Ahmad Rezaie Mianroodi1,2, Dhanesh Ramachandram1, Sedef Akinli Kocak1, Shaina Raza1
Affiliations: 1Vector Institute, 2Dalhousie University
πŸ“„ Read Paper

When an agentic system fails, who is accountable and how do we determine why? Unlike static models that produce isolated predictions, agentic systems plan, use tools, maintain memory, and coordinate actions over time. Failures may arise not only from incorrect outputs, but from evolving internal state and errors across components that are difficult to trace.

⚠️
Real-World Failures
Semi-autonomous driving incidents spanning perception, planning, and human-machine handoff. A NYC chatbot advising employers to take workers' tips, violating local law. Root causes: inadequate constraint enforcement, poor grounding, absent lifecycle oversight.
πŸ”
The Interpretability Gap
Model-centric methods (feature attributions, saliency maps, circuit analysis) explain individual outputsβ€”not how decisions emerge from interactions across multiple components over time.

Position

Agentic AI systems behave through trajectories: they plan, invoke tools, update memory, and coordinate over multiple steps. However, interpretability remains largely model-centric, focused on explaining single predictions rather than tracing long-horizon behavior and responsibility across interacting components. As a result, critical failures, such as tool misuse, coordination breakdowns, or goal drift, often evade existing audits until harm occurs.

🎯
Position Statement

The interpretability field is solving the wrong problem for the agentic era. Current methods explain how individual models compute outputs but cannot explain why an agent selected a particular plan, how multi-agent coordination failed, or where accountability lies within a system. We argue three points: (1) interpretability methods must co-evolve with agentic capabilities rather than follow them, embedding transparency into planning, tool use, and memory from the outset; (2) agentic opacity occurs at distinct layers such as behavioral, mechanistic, coordination, and safety, each requiring tailored methods; and (3) interpretability must integrate across the full agent development lifecycle rather than serve as a one-time audit.

Claim 1: Coevolution over reaction
Agentic systems change quickly as teams add new tools, memory designs, and orchestration. Their failure modes change with them. If interpretability is added only after deployment, teams often realize too late that they did not collect the right logs or traces to reconstruct what happened. Coevolution means designing agents so that plans, tool calls, and memory updates are recorded and reviewable from the start, and improving that instrumentation as the system evolves.
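As a minimal sketch of what such built-in instrumentation could look like (the names `AgentTraceLog` and `TraceEvent` are illustrative, not part of the paper), an append-only trace recorder that every module boundary writes to:

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TraceEvent:
    """One reviewable step: a plan, tool call, or memory update."""
    kind: str      # "plan" | "tool_call" | "memory_update"
    payload: dict
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

class AgentTraceLog:
    """Append-only log so trajectories can be reconstructed after the fact."""
    def __init__(self):
        self.events = []

    def record(self, kind, **payload):
        event = TraceEvent(kind=kind, payload=payload)
        self.events.append(event)
        return event.event_id

    def replay(self, kind=None):
        """Filter events for review, e.g. all tool calls in order."""
        return [e for e in self.events if kind is None or e.kind == kind]

    def export(self):
        """Serialize the full trajectory for offline audit."""
        return json.dumps([asdict(e) for e in self.events])

# Usage: plans, tool calls, and memory updates all land in one trace
log = AgentTraceLog()
log.record("plan", goal="summarize report", steps=["fetch", "summarize"])
log.record("tool_call", tool="fetch", args={"target": "report"})
log.record("memory_update", key="last_report", op="write")
```

The design choice is deliberate: events are recorded at the moment they happen, so a post-incident review never depends on logs that were only added after deployment.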
Claim 2: Layered decomposition
Agentic failures rarely have a single cause. A bad outcome can result from behavioral drift, misleading internal representations, incorrect tool selection, coordination breakdowns between agents, or unmet safety constraints, and these factors often interact. A layered approach makes investigation manageable. It clarifies where to look first, what evidence to collect at each level, and how to connect evidence across levels so that explanations are grounded in traceable system behavior.
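One way to picture "where to look first" is a triage routine that checks each layer's evidence in a fixed order. This is a hypothetical sketch, with made-up signal names and thresholds, not the paper's method:

```python
# Hypothetical checks, ordered from cheapest evidence to most expensive.
LAYER_CHECKS = [
    ("behavioral",   lambda ev: ev.get("output_drift", 0.0) > 0.2),
    ("tool_use",     lambda ev: ev.get("tool_error_rate", 0.0) > 0.1),
    ("coordination", lambda ev: ev.get("handoff_failures", 0) > 0),
    ("safety",       lambda ev: not ev.get("constraints_met", True)),
]

def triage(evidence: dict) -> list:
    """Return the layers whose checks fire, in investigation order."""
    return [layer for layer, check in LAYER_CHECKS if check(evidence)]

# A trajectory with output drift and failed inter-agent handoffs implicates
# the behavioral and coordination layers, narrowing the investigation.
flagged = triage({"output_drift": 0.35, "handoff_failures": 2})
```

The point of the sketch is the ordering: each layer contributes its own evidence, and an investigator starts from whichever checks fired rather than searching the whole system at once.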
Claim 3: Lifecycle integration
Interpretability is not a one-time report; it is an operational requirement across the full lifecycle. It is needed to establish expected behavior before launch, monitor drift during runtime, support incident response when failures occur, and feed lessons back into testing and governance. Lifecycle integration ensures interpretability supports accountability in practice, not only after harm but also through ongoing prevention and improvement.

Alternative Views

1
Interpretability should follow, not co-evolve

Interpretability for agentic systems should come after mature architectures are developed. Investing too early may not help, since future agents could rely on very different representations. This invokes Polanyi's Paradoxβ€”agents can be competent without being able to explicitly explain how they do so.

Response: Interpretability research has historically lagged behind capability development, leaving practitioners dependent on explanations that are difficult to validate. Retrofitting interpretability after deployment is harder than building it in from the start. Co-evolution ensures that transparency mechanisms are tested and refined alongside capabilities.
2
Behavioral control is sufficient

Detailed interpretability is not required for safe deployment. Alignment can be achieved through RLHF, constitutional objectives, and rigorous black-box evaluation. Strong benchmark performance and behavioral checks are sufficient to build trust.

Response: Behavioral testing may show that a system performs well on average, but cannot explain why a multi-step plan failed or how errors propagated across components. Without system-level interpretability, failures cannot be reliably traced, audited, or corrected. Interpretability is not opposed to behavioral control but complementary to it.
3
Holistic rather than layered interpretability

Agentic systems should be explained as integrated wholes rather than decomposed into separate layers. Separating behavioral, mechanistic, coordination, and safety layers may obscure tightly coupled interactions and produce fragmented explanations that miss cross-layer dependencies.

Response: Layered interpretability does not treat layers as isolated silos. It decomposes the system into analytically distinct levels while preserving the ability to trace interactions across them. Failures often originate in specific subsystems; distinguishing a tool-routing error from a planning failure requires layer-specific analysis before tracing how the error cascaded.
4
Lifecycle-integrated interpretability is computationally impractical

Continuous integration across the deployment lifecycle is impractical. Runtime logging, monitoring, and tracing can add computational overhead, increase latency, and complicate deployment in multi-agent systems operating under real-time constraints.

Response: Practical constraints are real, but restricting interpretability to pre-deployment audits or post-failure analysis is insufficient. ATLIS addresses this through lightweight, risk-aware monitoring: Layer 1 behavioral tracking runs continuously with low overhead, while deeper analysis (Layers 2–4) activates only when anomalies are detected. This graduated approach balances interpretability with computational cost.
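The graduated pattern described here can be sketched in a few lines: a constant-time check runs on every step, and the expensive analysis fires only when the check flags an anomaly. The class name, baseline numbers, and z-score threshold are illustrative assumptions, not part of ATLIS itself:

```python
import statistics

class GraduatedMonitor:
    """Cheap continuous tracking; escalate to deep analysis only on anomalies."""
    def __init__(self, baseline, z_threshold=3.0):
        self.mean = statistics.mean(baseline)
        self.stdev = statistics.stdev(baseline)
        self.z_threshold = z_threshold
        self.escalations = []

    def observe(self, metric: float) -> bool:
        """Layer-1-style check: one z-score comparison per step."""
        z = abs(metric - self.mean) / self.stdev
        if z > self.z_threshold:
            self.escalations.append(self.deep_analysis(metric, z))
            return True
        return False

    def deep_analysis(self, metric, z):
        # Placeholder for expensive deeper work (e.g. circuit extraction,
        # coordination tracing); runs only for flagged episodes.
        return {"metric": metric, "z": round(z, 2)}

monitor = GraduatedMonitor(baseline=[1.0, 1.1, 0.9, 1.0, 1.05])
normal = monitor.observe(1.02)   # within baseline, no escalation
anomaly = monitor.observe(3.0)   # far outside baseline, triggers deep analysis
```

The runtime cost of the continuous path is a single subtraction, division, and comparison; the heavy machinery is paid for only when risk thresholds are exceeded.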

Our Proposal: ATLIS Framework

To operationalize this position, we introduce ATLIS (Agentic Trajectory and Layered Interpretability Stack), a framework integrating five interpretability layers across a five-stage deployment lifecycle. ATLIS enables lightweight continuous monitoring with risk-aware escalation to deeper system-level analysis when incidents are detected. ATLIS provides a blueprint for closing the growing gap between agentic capabilities and the interpretability infrastructure needed to govern them.

ATLIS Framework Overview
Figure 1. ATLIS (Agentic Trajectory & Layered Interpretability Stack) is a deployment lifecycle and integrated interpretability stack for agentic systems. This framework integrates five interpretability layers across the agentic system lifecycle: (1) Real-Time Behavioral Monitoring tracks observable agent actions; (2) Mechanistic Circuit Analysis examines internal model representations; (3) Abstraction-Level Bridging connects low-level circuits to high-level reasoning; (4) Multi-Agent Analysis evaluates coordination dynamics; and (5) Safety and Alignment ensures adherence to predefined objectives. The framework incorporates two loops: blue arrows denote the monitoring refinement feedback loop, while orange arrows denote the safety and alignment revision loop. Computational overhead ranges from low (Layer 1 continuous monitoring) to high (Layer 2 full circuit extraction during incident response).

Example: Palliative Care Referral

To ground ATLIS in practice, consider a hospital deploying an agentic system designed to support palliative care referral for patients with treatment-resistant cancer. The hospital runs two separate deployments, one per disease site: lung and gastrointestinal (GI). Each site uses the same agentic referral workflow, with recommendations routed to clinicians for approval and audit logging. Figure 2 illustrates this diagnostic pathway, showing how ATLIS layers activate across the lifecycle to detect, trace, and resolve the referral timing divergence.

πŸ”΄ The Divergence

Despite identical referral criteria and subagent architectures, the lung site deployment begins referring patients later than the GI site deployment for matched risk profiles. In one system, short-term symptom improvements stored in memory delay escalation; in the other, accumulated hospitalizations trigger earlier referral. The divergence emerges only over time, due to differences in how longitudinal evidence is stored and propagated through monitoring and planning.

🟒 How ATLIS Surfaces the Divergence

ATLIS surfaces this drift before it causes harm: Layer 1 flags a systematic delay in lung referrals relative to GI for comparable patients. It then traces the root cause through Layers 2–4 using inter-agent coordination signals and mechanistic analysis, pinpointing the divergence to differences in memory weighting and inter-agent handoff dynamics.
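The Layer-1 comparison could be as simple as contrasting referral-delay distributions between the two sites. The function name, threshold, and all delay values below are synthetic illustrations, not clinical data or the paper's actual detector:

```python
import statistics

def referral_delay_gap(site_a_days, site_b_days, min_gap_days=7.0):
    """Flag a systematic delay: compare median days from eligibility to
    referral between two deployments, for matched risk profiles."""
    gap = statistics.median(site_a_days) - statistics.median(site_b_days)
    return gap, gap > min_gap_days

# Synthetic delays (days from eligibility to referral) for matched cohorts
lung = [30, 34, 28, 41, 36]
gi   = [18, 22, 20, 25, 19]
gap, flagged = referral_delay_gap(lung, gi)
```

A check like this costs almost nothing per patient, which is what lets it run continuously and trigger the deeper Layer 2–4 tracing only once the gap is flagged.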

Layer 3 maps these differences in symptom versus hospitalization weighting to divergent clinical reasoning pathways, while Layer 5 evaluates whether referral timing remains within prescribed safety bounds and escalates borderline cases for human-in-the-loop clinician review.

Because ATLIS is embedded across the deployment lifecycle, these findings feed back into updated simulations (pre-deployment), refined runtime baselines (operations), and recalibrated orchestration policies (post-incident learning), enabling continuous system correction rather than one-time diagnosis.

⚠️ Why Model-Centric Methods Would Fail Here

Model-centric interpretability would likely miss this failure. Feature attribution methods (e.g., SHAP or Integrated Gradients) explain individual predictions, but the divergence arises from longitudinal interactions between monitoring, memory, planning, and referral timing. Chain-of-thought inspection or single-agent circuit analysis may find each agent's reasoning locally consistent, while the true cause lies upstream in memory updates and cross-agent handoff dynamics. Without behavioral drift baselines and coordination-level analysis, the system-level mechanism would remain hidden.

Illustrative Case Study Diagram
Figure 2. This is an example of implementing ATLIS for healthcare referrals. Two agentic systems, with identical architectures and referral criteria, begin producing different referral timing for patients with comparable risk profiles. This divergence emerges from how accumulated hospitalizations trigger earlier referral in one system.

Call to Action and Implementation Plan

Adopting system-centric interpretability for agentic systems raises several priority research directions and has broader implications requiring concerted effort from interdisciplinary groups.

πŸ”¬ Research Directions

🧭

System-Level Attribution

Current interpretability methods attribute outputs to model internals, but agentic outcomes emerge from multi-step trajectories spanning planning, memory, tool use, and delegation. New attribution methods are needed that track how decisions propagate across components over time, including formal causal frameworks for responsibility assignment in multi-component systems and techniques for reconstructing decision paths under partial observability.
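One ingredient of such attribution is reconstructing the chain of steps behind an outcome. As a hedged sketch (the `caused_by` linkage and event shapes are assumptions, not a proposed standard), a backward walk over a causally linked trace:

```python
def decision_path(events, outcome_id):
    """Walk 'caused_by' links backward from an outcome event to
    reconstruct the chain of steps that produced it."""
    by_id = {e["id"]: e for e in events}
    path, cursor = [], outcome_id
    while cursor is not None:
        event = by_id[cursor]
        path.append(event["id"])
        cursor = event.get("caused_by")
    return list(reversed(path))  # earliest step first

# A toy trajectory: plan -> tool call -> memory update -> referral decision
events = [
    {"id": "plan1", "kind": "plan",          "caused_by": None},
    {"id": "tool1", "kind": "tool_call",      "caused_by": "plan1"},
    {"id": "mem1",  "kind": "memory_update",  "caused_by": "tool1"},
    {"id": "out1",  "kind": "referral",       "caused_by": "mem1"},
]
chain = decision_path(events, "out1")
```

Real trajectories have branching and partial observability, which is exactly why the formal causal frameworks called for above are needed; this linear walk only shows the minimal bookkeeping such frameworks would build on.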

πŸ“ˆ

Scalable Runtime Monitoring

Mechanistic analysis remains computationally prohibitive for continuous deployment. Lightweight alternatives are needed, such as multi-resolution observability that logs coarse artifacts continuously while reserving expensive analysis for flagged episodes, minimal sufficient statistics for drift detection, and selective activation strategies that trigger deeper interpretability only when risk thresholds are exceeded.
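The "minimal sufficient statistics" idea can be illustrated with Welford's online algorithm: drift detection needs only a running mean and variance, not stored trajectories. A small sketch (the class is illustrative, not a proposed component):

```python
class RunningStats:
    """Welford's online mean/variance: O(1) state per metric, so drift
    can be detected without retaining full trajectory logs."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x: float):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance of everything seen so far."""
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [1.0, 2.0, 3.0, 4.0]:
    stats.update(x)
# Three floats summarize the stream; comparing them against a baseline
# is the cheap continuous check that gates any expensive analysis.
```

Welford's update is also numerically stable, which matters when these statistics accumulate over long-running deployments.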

πŸ§ͺ

Benchmarks & Evaluation

There is a lack of standardized resources for evaluating system-centric interpretability. Progress requires shared benchmarks capturing agentic phenomena (coordination failures, goal drift, emergent behaviors), common logging schemas, trajectory datasets with ground-truth failure annotations, and evaluation protocols for multi-agent settings.

🌐 Broader Implications

πŸ”¬

For Academia

Formalize this stack as a modular research object to test methods across orchestration and memory layers, and publish reproducible reference architectures and logging standards.

πŸ‘©β€πŸ’»

For Practitioners

Build traceability instrumentation at each module boundary to clarify why specific plans or tool actions occur. This supports guardrail implementation around tool use under uncertainty.

βš–οΈ

For Regulators

Shift compliance standards to require system-level evidence, such as mandating tracing requirements, full lifecycle monitoring plans, and decision provenance as part of approval or procurement processes, rather than relying solely on model performance metrics.

🏒

For Organizations

Treat agentic systems as continuously governed entities, adopting layered auditing and incident-response loops as core acceptance criteria. Adopt shared learning mechanisms, such as standardized incident taxonomies, to encourage proactive interpretability integration within deployed agentic systems and earlier disclosure of failures.

Citation

Use the BibTeX below to cite this work.

@article{zhu2026interpretability,
  title={Accountable Deployment of Agentic AI Demands Layered, System-Level Interpretability},
  author={Judy Zhu and Dhari Gandhi and Ahmad Rezaie Mianroodi and Dhanesh Ramachandram and Sedef Akinli Kocak and Shaina Raza},
  journal={TechRxiv},
  doi={10.36227/techrxiv.177069676.68687733/v1},
  year={2026}
}