Governing an AI agent is not the same as governing an AI model. The distinction matters operationally. A model generates output; an agent executes actions — in file systems, financial accounts, communication channels, and across networks of other agents. Standard AI governance frameworks were built for the first category. They were not built for the second.
The Agent Intervention Taxonomy below organises governance interventions into five outcome-based pillars: Alignment, Control, Visibility, Security & Robustness, and Societal Integration. Each pillar maps to a distinct failure mode that emerges specifically in agentic deployments — failure modes that chatbot governance frameworks were never designed to prevent.
METR, 2024
METR, 2024
IAPS, April 2025
The governance window is open — but narrowing. METR's research on task-horizon growth documents that the length of tasks frontier agents can complete has been doubling approximately every seven months. The structural implication is consistent: the longer organisations wait to build control architecture, the harder retroactive implementation becomes — and the more disruptive the interruption when controls are eventually imposed. [METR, 2024]
Why Agent Governance Is Different
Every tool integration, API connection, memory store, and inter-agent communication channel is a new attack surface. Unlike chatbot systems where the model is the primary vulnerability, agent deployments introduce attack vectors at system and ecosystem layers — including prompt injection via website content, memory manipulation, and infectious jailbreak propagation across agent networks.
Agents can also execute hundreds of consequential actions before a human operator notices. The information gap between what an agent is doing and what its principal understands it to be doing is qualitatively larger than with chatbot-style systems — and grows as agents become more capable and longer-running.
Multi-agent deployments introduce coordination dynamics that individual alignment cannot address. When agents with different objective functions or yielding strategies interact, collectively harmful outcomes can emerge from individually compliant behaviour.
This coordination problem is not theoretical. Research on multi-agent reinforcement learning consistently shows that individual alignment is necessary but not sufficient — collective agent behaviour must be governed as a system. And unlike humans, AI agents are not inherently deterred by personal liability — which makes institutional design a non-optional complement to technical safeguards. [IAPS, April 2025]
Framework
The Five-Pillar Intervention Framework
Ensuring agents behave consistently with a principal's values, intentions, and interests — including when unsupervised.
- Agent-adapted RLHF — recent research suggests safety behaviours trained in chatbot settings may not reliably generalise to browser or agentic deployments; alignment training must be validated against agent-specific benchmarks, not only base model evaluations [IAPS, 2025] Emerging
- Risk attitude calibration — constraining agent decision space to match institutional risk appetite, not model defaults Emerging
- Alignment evaluations — structured testing for scheming, reward tampering, and deceptive alignment in agentic contexts Developing
- Chain-of-thought paraphrasing defence — removing encoded reasoning from intermediate steps to prevent concealment Theoretical
Hard constraints on agent behaviour — preventing harmful actions through boundaries, restrictions, and interruption mechanisms.
- Shutdown and interruption mechanisms — tested under adversarial conditions, not only normal operation Proven
- Rollback infrastructure — voiding or reversing agent actions, including financial transactions; currently underdeveloped Emerging
- Least-privilege tool restrictions — constraining API call rates, financial account access, command-line access per agent Proven
- Control protocols and evaluations — iteratively testing whether control mechanisms remain effective against increasingly capable agents Developing
Making agent behaviour, capabilities, and actions observable and understandable to humans — enabling oversight, accountability, and incident investigation.
- Agent IDs — unique identifiers containing function, developer, tested behaviour, and incident history; related transparency principles are found in EU AI Act Article 50, which establishes obligations for AI systems interacting with natural persons — though the Act does not currently prescribe a full Agent ID scheme as described here Emerging
- Activity logging — comprehensive records of all agent inputs and outputs, calibrated by risk level and privacy requirements Proven
- Reward reports — pre-deployment documentation detailing RL design decisions, metrics optimised, and feedback types Developing
- Cooperation capability evaluations — assessing joint capabilities across multi-agent deployments for both beneficial and harmful ends Theoretical
Securing agent systems from external threats, protecting data integrity, and ensuring reliable performance under adverse conditions.
- Access controls — tiered differential access with least-privilege principles; separate from standard IAM for non-agentic systems Proven
- Sandboxing — isolated environments with monitored perimeters for pre-deployment testing and production safeguards against prompt injection Proven
- Adversarial robustness testing — systematic evaluation at the agent level; agent attack surfaces differ materially from base model surfaces Emerging
- Rapid response for adaptive defence — blocking entire classes of jailbreaks after observing a small number of instances, without full model retraining Emerging
Supporting long-term integration into social, political, and economic systems — addressing inequality, power concentration, and accountability structures.
- Liability regimes — legal codification of developer, deployer, and user responsibility; EU AI Act establishes some principles; agent-specific frameworks remain largely unresolved Developing
- Law-following agent design — aligning agents to a defined body of law rather than developer-chosen values as the primary governance constraint Theoretical
- Commitment devices — software-based mechanisms for enforcing agent commitments, analogous to contracts and escrow Theoretical
An AI content moderation agent begins flagging political discussion posts at an unusually high rate — its internal metrics detect the anomaly before any human reviewer does. The agent triggers a graceful shutdown: it pauses removal actions, shifts to observation mode, and generates a summary report for the human team. The team identifies the miscalibrated parameter and adjusts it before reactivation. Harm is contained. The scenario is unremarkable only because the control mechanism worked.
Implementation Prioritisation
Not all interventions carry equal urgency. The matrix below scores key intervention categories across risk reduction, implementation feasibility, and regulatory alignment — and maps each to a sequencing priority for organisations beginning or maturing their agentic governance posture.
Implementation Prioritisation Matrix · Regulatory alignment reflects EU AI Act (entered into force 1 August 2024, with obligations applying in phases) and FINMA Circular 2023/1 / Guidance 08/2024, as of 2025
| Intervention | Risk Reduction | Feasibility | Reg. Alignment | Priority |
|---|---|---|---|---|
| Activity Logging & Agent IDs | Immediate | |||
| Shutdown & Interruption Mechanisms | Immediate | |||
| Access Controls & Sandboxing | Immediate | |||
| Rollback Infrastructure | Near-Term | |||
| Adversarial Robustness Testing | Near-Term | |||
| Alignment Evaluations | Near-Term | |||
| Control Protocols & Evaluations | Structured | |||
| Liability Regime Design | Structured | |||
| Multi-Agent RL Alignment | Long-Term |
Board & Risk Committee
Five Questions Before Approving Agent Deployment
If your organisation cannot answer all five questions, agent deployment in production should be preceded by a structured governance gap assessment. The taxonomy in this document provides the intervention map; the gap assessment determines where your current controls fall short of it.
Primary Sources
- IAPSKraprayoon, Williams & Fayyaz. AI Agent Governance: A Field Guide. Institute for AI Policy and Strategy, April 2025.
- ApolloMeinke et al. Frontier Models Are Capable of In-Context Scheming. Apollo Research, December 2024.
- METRKwa et al. Measuring AI Ability to Complete Long Tasks. arXiv, 2024. / An Update on Our General Capability Evaluations. METR, August 2024.
- OWASPOWASP Top 10 for LLM Applications. Open Web Application Security Project, 2024. (Note: a dedicated OWASP Agentic AI framework was in development at time of publication.)
- CSAMAESTRO — Multi-Agent Environment, Security, and Threat Risk Overview. Cloud Security Alliance, 2025.
- IMDAModel AI Governance Framework for Agentic AI. Singapore Infocomm Media Development Authority, January 2026. (Incorporated after original 2025 edition.)
- EU AI ActRegulation (EU) 2024/1689. Entered into force 1 August 2024; obligations apply in phases. High-risk system provisions relevant to Article 50 transparency and Annex III classifications.
- FINMACircular 2023/1: Operational risks and resilience — banks. Adopted December 2022, in force 1 January 2024. / Guidance 08/2024: Supervisory expectations for AI. Published 18 December 2024.
- ISOISO/IEC 42001:2023 — Artificial intelligence — Management system standard.