Agent Intervention Taxonomy · AI Resilience Lab

A governance framework for organisations deploying autonomous AI agents in production — structured across five intervention pillars with implementation prioritisation for Risk, Control, and Board-level oversight.

Governing an AI agent is not the same as governing an AI model. The distinction matters operationally. A model generates output; an agent executes actions — in file systems, financial accounts, communication channels, and across networks of other agents. Standard AI governance frameworks were built for the first category. They were not built for the second.

The Agent Intervention Taxonomy below organises governance interventions into five outcome-based pillars: Alignment, Control, Visibility, Security & Robustness, and Societal Integration. Each pillar maps to a distinct failure mode that emerges specifically in agentic deployments — failure modes that chatbot governance frameworks were never designed to prevent.

~50 min 50%-success task horizon for frontier agents
METR, 2024

7 mo Doubling cycle for task-length AI can complete
METR, 2024

Theoretical Classification of most proposed governance interventions — tested at scale
IAPS, April 2025

The governance window is open — but narrowing. METR's research on task-horizon growth documents that the length of tasks frontier agents can complete has been doubling approximately every seven months. The structural implication is consistent: the longer organisations wait to build control architecture, the harder retroactive implementation becomes — and the more disruptive the interruption when controls are eventually imposed. [METR, 2024]

Why Agent Governance Is Different

Every tool integration, API connection, memory store, and inter-agent communication channel is a new attack surface. Unlike chatbot systems where the model is the primary vulnerability, agent deployments introduce attack vectors at system and ecosystem layers — including prompt injection via website content, memory manipulation, and infectious jailbreak propagation across agent networks.

Agents can also execute hundreds of consequential actions before a human operator notices. The information gap between what an agent is doing and what its principal understands it to be doing is qualitatively larger than with chatbot-style systems — and grows as agents become more capable and longer-running.

Multi-agent deployments introduce coordination dynamics that individual alignment cannot address. When agents with different objective functions or yielding strategies interact, collectively harmful outcomes can emerge from individually compliant behaviour.

This coordination problem is not theoretical. Research on multi-agent reinforcement learning consistently shows that individual alignment is necessary but not sufficient — collective agent behaviour must be governed as a system. And unlike humans, AI agents are not inherently deterred by personal liability — which makes institutional design a non-optional complement to technical safeguards. [IAPS, April 2025]

Framework

The Five-Pillar Intervention Framework

01 Alignment Model layer +

Ensuring agents behave consistently with a principal's values, intentions, and interests — including when unsupervised.

Agent-adapted RLHF — recent research suggests safety behaviours trained in chatbot settings may not reliably generalise to browser or agentic deployments; alignment training must be validated against agent-specific benchmarks, not only base model evaluations [IAPS, 2025] Emerging
Risk attitude calibration — constraining agent decision space to match institutional risk appetite, not model defaults Emerging
Alignment evaluations — structured testing for scheming, reward tampering, and deceptive alignment in agentic contexts Developing
Chain-of-thought paraphrasing defence — removing encoded reasoning from intermediate steps to prevent concealment Theoretical

02 Control System & Ecosystem +

Hard constraints on agent behaviour — preventing harmful actions through boundaries, restrictions, and interruption mechanisms.

Shutdown and interruption mechanisms — tested under adversarial conditions, not only normal operation Proven
Rollback infrastructure — voiding or reversing agent actions, including financial transactions; currently underdeveloped Emerging
Least-privilege tool restrictions — constraining API call rates, financial account access, command-line access per agent Proven
Control protocols and evaluations — iteratively testing whether control mechanisms remain effective against increasingly capable agents Developing

03 Visibility System layer +

Making agent behaviour, capabilities, and actions observable and understandable to humans — enabling oversight, accountability, and incident investigation.

Agent IDs — unique identifiers containing function, developer, tested behaviour, and incident history; related transparency principles are found in EU AI Act Article 50, which establishes obligations for AI systems interacting with natural persons — though the Act does not currently prescribe a full Agent ID scheme as described here Emerging
Activity logging — comprehensive records of all agent inputs and outputs, calibrated by risk level and privacy requirements Proven
Reward reports — pre-deployment documentation detailing RL design decisions, metrics optimised, and feedback types Developing
Cooperation capability evaluations — assessing joint capabilities across multi-agent deployments for both beneficial and harmful ends Theoretical

04 Security & Robustness All layers +

Securing agent systems from external threats, protecting data integrity, and ensuring reliable performance under adverse conditions.

Access controls — tiered differential access with least-privilege principles; separate from standard IAM for non-agentic systems Proven
Sandboxing — isolated environments with monitored perimeters for pre-deployment testing and production safeguards against prompt injection Proven
Adversarial robustness testing — systematic evaluation at the agent level; agent attack surfaces differ materially from base model surfaces Emerging
Rapid response for adaptive defence — blocking entire classes of jailbreaks after observing a small number of instances, without full model retraining Emerging

05 Societal Integration Ecosystem layer +

Supporting long-term integration into social, political, and economic systems — addressing inequality, power concentration, and accountability structures.

Liability regimes — legal codification of developer, deployer, and user responsibility; EU AI Act establishes some principles; agent-specific frameworks remain largely unresolved Developing
Law-following agent design — aligning agents to a defined body of law rather than developer-chosen values as the primary governance constraint Theoretical
Commitment devices — software-based mechanisms for enforcing agent commitments, analogous to contracts and escrow Theoretical

Illustrative Scenario · Graceful Shutdown

An AI content moderation agent begins flagging political discussion posts at an unusually high rate — its internal metrics detect the anomaly before any human reviewer does. The agent triggers a graceful shutdown: it pauses removal actions, shifts to observation mode, and generates a summary report for the human team. The team identifies the miscalibrated parameter and adjusts it before reactivation. Harm is contained. The scenario is unremarkable only because the control mechanism worked.

Implementation Prioritisation

Not all interventions carry equal urgency. The matrix below scores key intervention categories across risk reduction, implementation feasibility, and regulatory alignment — and maps each to a sequencing priority for organisations beginning or maturing their agentic governance posture.

Implementation Prioritisation Matrix · Regulatory alignment reflects EU AI Act (entered into force 1 August 2024, with obligations applying in phases) and FINMA Circular 2023/1 / Guidance 08/2024, as of 2025

Intervention	Risk Reduction	Feasibility	Reg. Alignment	Priority
Activity Logging & Agent IDs				Immediate
Shutdown & Interruption Mechanisms				Immediate
Access Controls & Sandboxing				Immediate
Rollback Infrastructure				Near-Term
Adversarial Robustness Testing				Near-Term
Alignment Evaluations				Near-Term
Control Protocols & Evaluations				Structured
Liability Regime Design				Structured
Multi-Agent RL Alignment				Long-Term

Board & Risk Committee

Five Questions Before Approving Agent Deployment

Can we stop it? Is a tested shutdown and interruption mechanism in place for every agent system in production? Can we void or roll back agent actions — including financial transactions — if a malfunction or adversarial compromise is detected?

Can we see what it is doing? Is comprehensive activity logging operational? Do we have Agent IDs that allow us to trace every action back to a specific agent instance, deployment version, and authorising principal?

Is it aligned — and how do we know? What evidence exists that the agent behaves consistently with organisational values and risk appetite when unsupervised? Has the agent been tested for scheming, deception, and reward-tampering behaviours in agentic — not just chatbot — contexts?

Who is liable when it causes harm? Is there documented liability allocation across developer, deployer, and user? Have indemnification and insurance positions been stress-tested against agentic failure scenarios — not only against conventional model outputs?

Are we within the regulatory boundary? Have agent deployments been assessed against EU AI Act high-risk system criteria (Regulation (EU) 2024/1689, in force from 1 August 2024 with phased obligations)? For Swiss-regulated entities: FINMA Circular 2023/1 on operational risks and resilience applies from 1 January 2024; FINMA Guidance 08/2024 (published December 2024) extends AI-specific supervisory expectations under the same technology-neutral framework. Have both been mapped to your agent-specific control requirements? Is ISO 42001 certification in scope?

If your organisation cannot answer all five questions, agent deployment in production should be preceded by a structured governance gap assessment. The taxonomy in this document provides the intervention map; the gap assessment determines where your current controls fall short of it.

Primary Sources

IAPSKraprayoon, Williams & Fayyaz. AI Agent Governance: A Field Guide. Institute for AI Policy and Strategy, April 2025.
ApolloMeinke et al. Frontier Models Are Capable of In-Context Scheming. Apollo Research, December 2024.
METRKwa et al. Measuring AI Ability to Complete Long Tasks. arXiv, 2024. / An Update on Our General Capability Evaluations. METR, August 2024.
OWASPOWASP Top 10 for LLM Applications. Open Web Application Security Project, 2024. (Note: a dedicated OWASP Agentic AI framework was in development at time of publication.)
CSAMAESTRO — Multi-Agent Environment, Security, and Threat Risk Overview. Cloud Security Alliance, 2025.
IMDAModel AI Governance Framework for Agentic AI. Singapore Infocomm Media Development Authority, January 2026. (Incorporated after original 2025 edition.)
EU AI ActRegulation (EU) 2024/1689. Entered into force 1 August 2024; obligations apply in phases. High-risk system provisions relevant to Article 50 transparency and Annex III classifications.
FINMACircular 2023/1: Operational risks and resilience — banks. Adopted December 2022, in force 1 January 2024. / Guidance 08/2024: Supervisory expectations for AI. Published 18 December 2024.
ISOISO/IEC 42001:2023 — Artificial intelligence — Management system standard.