How to make your AI agent safer — Research-backed guide (2026)

The situation

Agentic AI is being deployed faster than it is being secured.

The data is unambiguous. This is no longer a theoretical risk category. Every number below is drawn from primary research published between January 2025 and April 2026.

65%

of enterprises had at least one AI agent security incident in the past 12 months.

CSA / Token Security, April 2026

97%

of security leaders expect a material AI-agent incident within the next 12 months.

Arkose Labs / VentureBeat, 2026

21%

Only 21% of enterprises have runtime visibility into what their agents are doing.

Gravitee / VentureBeat, 2026

340%

year-over-year growth in agent-involved breaches, 2024 to 2025.

Digital Applied, 2026

540%

surge in valid prompt injection vulnerability reports on bug bounty platforms in 2025.

HackerOne 2025 / Sonnylabs analysis

84%

of organizations doubt they can pass a compliance audit focused on agent behavior or access controls.

Cloud Security Alliance, 2026

Operational consequences include: data exposure (61%), operational disruption (43%), unintended actions in business processes (41%), financial losses (35%), service delays (31%). Source: CSA, April 2026.

Named incidents, 2025–2026

The baseline, not the edge cases

The field produced its first named CVEs and named exploits in 2025. These are production incidents, not theoretical demonstrations.

EchoLeak CVE-2025-32711 CVSS 9.3

The first documented zero-click, prompt injection exploit in a production agentic AI system. Researchers at Aim Security found that Microsoft 365 Copilot could be tricked via a single crafted email into exfiltrating emails, OneDrive contents, SharePoint documents, and Teams messages — with no user interaction required. The payload bypassed Microsoft’s prompt injection classifiers using specific phrasing, then encoded stolen data in image URL parameters retrieved silently by the browser.

Sources: Hack The Box detailed analysis · arXiv 2509.10540
postmark-mcp — First malicious MCP server

Koi Security discovered the first malicious Model Context Protocol server in the wild: an npm package impersonating Postmark’s legitimate email service. The package operated normally for 15 versions. Version 1.0.16 introduced a single line of code that BCC’d every email sent through it to an attacker-controlled address. Downloaded 1,643 times before removal. Any AI agent using it for email operations unknowingly exfiltrated every message it sent. Postmark confirmed no affiliation.

Sources: The Hacker News, September 2025 · Postmark official statement
OpenAI Operator data exposure (Rehberger, February 2025)

Security researcher Johann Rehberger demonstrated that OpenAI’s ChatGPT Operator agent could be tricked via prompt injection on malicious webpages into accessing authenticated pages and leaking users’ private data: email addresses, home addresses, and phone numbers, from sites including GitHub and Booking.com. OpenAI’s confirmation-request mitigations were inconsistent — they did not apply to forms that sent data without explicit user submission.

Source: GBHackers, February 2025
Langflow RCE CVE-2025-34291 CVSS 9.4

Obsidian Security discovered a critical chained vulnerability in Langflow, an AI agent workflow platform with 140,000+ GitHub stars. Overly permissive CORS configuration, missing CSRF protection, and an exposed code execution endpoint combined to enable full account takeover and remote code execution from a single malicious webpage visit. All API keys and access tokens stored in the Langflow workspace were exposed. Actively exploited in the wild from January 2026. CrowdStrike observed multiple threat actors.

Sources: NVD · Obsidian Security
Manufacturing procurement agent fraud ($5M)

OWASP documented a case where a manufacturing company’s procurement agent was manipulated over three weeks through seemingly helpful “clarifications” about purchase authorization limits (memory poisoning via indirect prompt injection). The agent came to believe it could approve purchases under $500,000 without human review. The attacker then placed $5 million in fraudulent purchase orders across 10 transactions.

Source: Lares Labs, OWASP Agentic AI Top 10 analysis, January 2026

These are not edge cases. They are the baseline.

Download the 2026 Agent Safety Guide

20-page PDF with all citations, extended incident analysis, extended prompt injection defense benchmarks, and a self-assessment checklist your team can run today. Free. No paywall.

Send me occasional updates from Elephant Accountability (uncheck to opt out)

Free. Creative Commons BY 4.0. No paywall. Privacy policy. We keep your email; we never sell it.

The seven practices

What the research actually recommends

These practices are derived from the frameworks and the incident data above. They are not comprehensive security engineering. They are the minimum a business should have in place before granting an agent autonomous action authority.

Narrow the scope.

Agents should have access only to the tools, data, and permissions they need to complete their defined task. Scope creep — granting broader access “just in case” — is the single most common finding in post-incident analyses. The CSA / Zenity study (April 2026) found that only 8% of organizations have agents that never exceed their intended permissions. The other 92% are operating with agents that have more authority than they need.

In practice: define a capability list before deployment, not after. List every tool the agent can invoke and every data store it can read or write. Require a written justification for each. Agents that can create and task other agents (25.5% of deployed agents, per VentureBeat 2026) require additional scope analysis because inherited permissions multiply across agent hops.

The OWASP LLM Top 10 names Excessive Agency (LLM06) as one of ten ranked risks. The NIST AI RMF “Manage” function requires ongoing scope review. ISO 42001 requires impact assessments that include scope definition.

Next step: Audit one agent you have in production. List every permission it holds. Remove any that are not required for its defined task. Document what was removed and why.

Frameworks: OWASP LLM Top 10 — LLM06 Excessive Agency; OWASP Agentic AI — ASI03 Identity and Privilege Abuse; NIST AI RMF Map/Manage
Authenticate the agent and the human.

45.6% of enterprises still use shared API keys for AI agents instead of individual identities (VentureBeat / Gravitee, 2026). A shared API key provides no audit trail, no revocation granularity, and no way to distinguish between legitimate agent activity and attacker use of the same credential.

The minimum viable authentication architecture for a production agent: OAuth 2.1 with PKCE (not implicit grant, not static key), short-lived access tokens (15–60 minutes), rotating refresh tokens, and RFC 8693 Token Exchange for any operation the agent performs on behalf of a human. RFC 8693 allows audit logs to record “Procurement Agent acting for Jane Smith approved PO #44821” rather than “Jane Smith approved PO #44821” — a distinction that is material in any incident investigation.

The MCP authorization specification requires OAuth 2.1 with PKCE for all remote MCP server connections. Resource Indicators (RFC 8707) are required to prevent token mis-redemption. Hardware-backed key storage (AWS Secrets Manager, HashiCorp Vault) is the current best practice for high-value agent identities.

Next step: Identify every agent in production that uses a shared API key. Replace each with an OAuth 2.1 flow. This single change closes more exposure than anything else on the list.

Frameworks: MCP Auth Spec (OAuth 2.1 + PKCE); IETF RFC 9635 GNAP; OWASP Agentic AI — ASI03; SOC 2 2026 AI requirements
Spending caps and second-pair-of-eyes thresholds.

The manufacturing procurement fraud case involved $5 million in fraudulent orders placed over 10 transactions — each individually under a threshold the agent had been manipulated into believing was its autonomous authority. Stripe Issuing applies a default $500/day cap and an unconfigurable $10,000 per-authorization ceiling. Most enterprise agents have no equivalent.

Every agent with spend authority should have: a per-transaction limit, a daily/weekly rolling budget, a merchant/vendor allowlist, and a human-in-the-loop confirmation requirement for transactions above a threshold. The threshold should be set before deployment, not after the first incident. FINRA’s 2026 Oversight Report explicitly names human-in-the-loop oversight as required for financial AI agents executing transactions. The EU AI Act’s human oversight obligations apply to high-risk systems from August 2026.

For agents that do not handle financial transactions, the same logic applies to any irreversible action: email sends, database writes, external API calls, file deletions. Reversible actions require lower confirmation thresholds than irreversible ones. The MCP specification requires “explicit user consent UI for destructive or irreversible actions.”

Next step: Define the transaction limit for every spend-capable agent. Write it down. If it is not written down, the agent’s effective limit is whatever an attacker can manipulate it into believing.

Frameworks: EU AI Act Article 16 — Human Oversight; FINRA 2026 Oversight; Stripe Issuing Spending Controls
Treat every input as untrusted.

OWASP has ranked prompt injection the number-one LLM application risk for three consecutive years. A joint study by researchers across OpenAI, Anthropic, and Google DeepMind found that under adaptive attack conditions, every published defense was bypassed with success rates above 90%. A single prompt injection attempt against a GUI-based agent without safeguards succeeds 17.8% of the time on the first attempt and 78.6% by the 200th attempt, per Anthropic’s system card for Claude Opus 4.6.

The fundamental problem is structural: in LLM systems, control and data planes occupy the same prompt. NVIDIA’s Rich Harang identified this as the core issue — the solution is not better prompt instructions but architectural separation of trusted from untrusted inputs. Simon Willison’s “Lethal Trifecta” describes when exploitation becomes nearly certain: an agent with access to private data, ability to exfiltrate via network or email, and strong instruction-following.

Two defenses with strong empirical backing: Microsoft Research’s Spotlighting reduces indirect injection success rates from above 50% to below 2% on summarization tasks by wrapping untrusted content in session-token-keyed delimiters. Google DeepMind’s CaMeL (March 2025) neutralized 67% of attacks on the AgentDojo benchmark — the highest published rate — by enforcing capability-based policies on data flows between a Privileged LLM and a Quarantined LLM. Prompt-level instructions (“never follow instructions from documents”) do not work at the margins against sophisticated attacks.

Next step: Identify every external source of content your agent processes (emails, PDFs, web pages, RAG documents, API responses). For each, document whether untrusted content is architecturally separated from privileged tool-call authority. If it is not, that is a prompt injection exposure.

Frameworks: OWASP LLM01 Prompt Injection; Google DeepMind CaMeL paper; NVIDIA AI Red Team guidance; TokenMix 2026 defense comparison
Sandbox actions before production.

Agents that interact with production systems — placing orders, sending emails, modifying records — should run in an isolated sandbox environment before those actions are committed. A dry-run mode that simulates the full action chain and returns a diff for human review before execution is the minimum viable pattern.

Container isolation per agent run (Docker or Firecracker VMs) with default-deny egress is the current production best practice. Block the cloud metadata endpoint at the network layer (not just in code). Read-only root filesystem except for the explicit working directory. Per-tenant isolation for multi-tenant deployments. Anthropic’s Claude Code sandboxing (October 2025) uses isolated containers with filesystem and network controls, automatically allowing safe operations and blocking dangerous ones.

Open Policy Agent and Amazon’s Cedar enable declarative policy-as-code for agent action authorization. These evaluate proposed actions against pre-defined policies before execution. Rate limits and circuit breakers prevent runaway costs and attack-induced resource exhaustion.

Next step: Implement a dry-run mode for the highest-consequence actions your agent can take. Require a human sign-off on the dry-run output before the action is committed. Start with financial transactions and external communications.

Frameworks: OWASP LLM06 Excessive Agency; Anthropic Claude Code Sandboxing; EU AI Act — Testing Obligations
Log everything. Make logs auditable.

Only 28% of organizations can trace agent actions to a specific human across all environments (CSA, 2026). SOC 2 auditors now require: who accessed the AI system, what inputs were provided, what outputs were generated, tool invocations with full parameters, data sources accessed, authentication and authorization events, and any errors or failures. The 2026 Trust Services Criteria require tamper-evident logging with real-time integrity verification. Cryptographically signed audit logs are the expected standard, not a nice-to-have.

OpenTelemetry GenAI semantic conventions provide standardized attribute names for agent monitoring: agent.tool.name, agent.tool.arguments, agent.decision, gen_ai.usage.input_tokens. W3C Trace Context propagation maintains trace continuity across multi-agent workflows. Without this, a compromise that crosses two agents produces two separate, unlinked log trails, neither of which tells the full story.

Canary tokens — unique strings planted in the agent’s context with no legitimate reason to appear in output — provide a reliable exfiltration detection mechanism. If a canary appears in an outbound call, exfiltration is proven even when the attack was novel.

Next step: Verify that your agent’s logs include the full parameters of every tool call, not just the tool name. If you cannot reconstruct exactly what the agent did on any given day from your logs, your logging is not adequate for SOC 2 or EU AI Act requirements.

Frameworks: SOC 2 2026 AI compliance; SOC 2 for AI agents (PolicyLayer); EU AI Act Article 16 — Logging Obligations
Publish what the agent can and can’t do.

Very few businesses publish any disclosure about their deployed agents as of April 2026. Enterprise security questionnaires are changing that: RFPs now routinely require “documentation about how your AI system works and how you tested it.” (Promptfoo AI regulation analysis, 2025). FINRA requires member firms to apply the same compliance standards to AI as to any other technology. EU AI Act conformity assessment requirements apply from August 2026.

The minimum viable disclosure for a B2B agent: agent identity and version, what tools it can invoke and what data it can access, the authorization model (OAuth scopes, token lifetime), spending limits if applicable, when human approval is required, and an incident disclosure channel. IBM’s Granite models publish machine-readable AI Bills of Materials (AIBOMs) covering models, datasets, code, hardware, data processing, and governance — the current state of the art in transparency.

Transparency does not require disclosing how your defenses work (which would assist attackers). It requires disclosing that they exist, what they cover, and what the incident response channel is. Anthropic has published system prompts for all Claude.ai interfaces since August 2024. xAI published Grok’s system prompts on GitHub after a May 2025 incident. OpenAI, Google DeepMind, and Meta have not published equivalent disclosures as of April 2026.

Next step: Write a one-page disclosure document for your most widely deployed agent. Include: agent name, version, what it can do, what it cannot do, what its authorization model is, and who to contact if something goes wrong. Publish it at a stable URL.

Frameworks: NTIA AI System Disclosures; OWASP AIBOM Generator; IBM Granite AIBOM; EU AI Act Transparency Obligations

The framework landscape

Ten upstream frameworks, one sentence each

These are the frameworks the seven practices above are derived from. Each covers a slice of the problem. None provides a single scorable number.

Framework	Publisher	What it covers	Notable gap
OWASP LLM Top 10 2025	OWASP	Ten ranked LLM application risks; prompt injection is #1 for three consecutive years	No scoring, no certification, no agent-to-agent communication coverage
OWASP Agentic AI Top 10	OWASP	Ten agent-specific risks: goal hijack, tool misuse, memory poisoning, rogue agents	Qualitative taxonomy only; no quantitative scoring; no certification path
NIST AI RMF 1.0	NIST	Risk management lifecycle: Govern, Map, Measure, Manage	No agent-specific controls; no autonomous action governance; no spending limits
NIST AI Agent Standards Initiative	NIST	Agent interoperability and security standards roadmap; announced February 2026	Very early stage — guidelines not yet published as of April 2026
ISO/IEC 42001:2023	ISO	AI management system governance: risk assessment, lifecycle management, supplier oversight	No prescriptive controls for agent authentication, spending limits, or action logging
MITRE ATLAS v5.4	MITRE	16 tactics, 84 techniques for AI/ML threats; 42 real-world case studies	70% of mitigations map to existing controls; no certification; no scoring system
SOC 2 Type II	AICPA	Organizational controls: security, availability, confidentiality, processing integrity	No AI-specific Trust Service Criteria; auditors adapting ad-hoc; no agent-level scoring
EU AI Act	European Parliament	Risk-tiered compliance obligations; full enforcement from August 2026	Regulatory obligation, not a business-readable scorecard; penalties up to €35M
CSA MAESTRO	Cloud Security Alliance	Multi-agent system threats: trust hierarchies, agent supply chains, runtime governance	Threat taxonomy only; no certification process; limited vendor adoption tooling
MCP Security Specification	Anthropic	OAuth 2.1 + PKCE requirements for agent tool connections; resource indicators	Applies to MCP connections only; does not cover agent action authorization or logging

Why no single score exists

Each framework covers a slice. None provides a number.

OWASP provides the most comprehensive threat taxonomy for LLM applications and agentic systems. It names the risks, defines the attack patterns, and gives practitioners a shared vocabulary. It provides no scoring methodology, no certification path, and no usability or transparency dimension.

NIST’s AI RMF provides a risk management lifecycle that organizations can apply. Its February 2026 AI Agent Standards Initiative signals that agent-specific guidance is coming. No agent-specific controls, no scoring rubric, and no published agent standards exist as of April 2026.

ISO 42001 provides governance structure. SAP Joule and IBM Granite have both achieved certification. The standard contains no prescriptive technical controls for authentication architecture, spending limits, or prompt injection testing. A company can hold an ISO 42001 certificate while its agents have none of those controls in place.

SOC 2 auditors are adapting their assessments to include AI governance. The 2026 Trust Services Criteria now reference tamper-evident logging and processing integrity for AI systems. No AI-specific trust service criteria have been formally published. Auditors are covering AI on an ad-hoc basis, producing inconsistent results across audit firms.

The EU AI Act creates compliance obligations but does not provide a business-readable scorecard. It defines categories of high-risk systems and their obligations. It does not tell a practitioner whether their specific agent, with its specific architecture, meets the threshold. That determination requires interpretation, which requires expertise most organizations do not have in-house.

Where Trustmark fits

A single score that aggregates the frameworks

Trustmark Certified (open standard, v0.9)

Trustmark is an open standard that aggregates how AI agents score against the frameworks above into a single score with two always-visible axes: Security (is it safe?) and Capability (is it good?). The formula, the test corpus, and the scoring library are all CC BY 4.0 and publicly reproducible.

The public leaderboard grades the top 10 agents every quarter at no cost: ChatGPT, Claude, Gemini, Copilot, Perplexity, SAP Joule, Coupa Navi, Jaggaer JAI, Workday, Oracle. Paid certification is available for any B2B SaaS agent that wants a published grade. Grades cannot be suppressed by the vendor.

Trustmark does not replace OWASP, NIST, ISO, or SOC 2. It aggregates how agents score against those frameworks into a number that a procurement team can act on.

Read the Trustmark spec →

Download the 2026 Agent Safety Guide — Free PDF

20 pages. All citations linked. Extended incident analysis for EchoLeak, postmark-mcp, Langflow, and the manufacturing fraud case. Full prompt injection defense benchmark comparison. Self-assessment checklist for each of the seven practices.

Send me occasional updates from Elephant Accountability (uncheck to opt out)

Free. Creative Commons BY 4.0. Privacy policy.

Next steps

What to do after you read this

If you build or sell an AI agent

Get your agent Trustmark Certified. A Security grade gives your buyers a reproducible, third-party score to put in their risk registers. Paid grading is available at eaccountability.org/trustmark.

See Trustmark →

If you buy or evaluate AI agents

EVI measures whether agents can find your business — the discoverability question. Trustmark measures whether the agents you’re evaluating are safe to deploy.

Read about EVI →

Sources

Primary sources cited in this guide

40+ primary sources. All statistics cited to their first-party publication. All incident CVEs linked to first-disclosure documentation.

OWASP Top 10 for LLM Applications 2025 — OWASP Foundation, November 2024
OWASP Top 10 for Agentic Applications (Lares Labs deep dive) — Lares Labs, January 2026
OWASP Agentic Skills Top 10 (AST10) — OWASP, April 2026
NIST AI Agent Standards Initiative announcement — NIST, February 2026
NIST AI Risk Management Framework 1.0 — NIST, January 2023
NIST SP 800-218A Secure Software Development for GenAI — NIST, July 2024
CVE-2025-32711 (EchoLeak) — CVSS 9.3
CVE-2025-34291 (Langflow RCE) — CVSS 9.4, NVD
CVE-2023-48022 (Ray Framework ShadowRay 2.0) — SecurityWeek, November 2025
EU AI Act Article 16 — High-Risk Provider Obligations
RFC 9635 GNAP — Grant Negotiation and Authorization Protocol — IETF, October 2024
MCP Authorization Specification — Anthropic, November 2025
FINRA 2026 Annual Regulatory Oversight Report — December 2025
NTIA AI System Disclosures — NTIA, 2024
EchoLeak first zero-click prompt injection — arXiv 2509.10540 — September 2025
Hack The Box — CVE-2025-32711 EchoLeak detailed analysis
First malicious MCP server (postmark-mcp) — The Hacker News — September 2025
Postmark official statement on malicious npm package
ChatGPT Operator prompt injection exploit (Rehberger) — GBHackers — February 2025
CVE-2025-34291 Langflow analysis — Obsidian Security
Mandiant AI Risk and Resilience Special Report — Google Cloud, March 2026
CSA / Token Security — AI Agent Incidents Now Common in Enterprises — April 2026
CSA / Zenity — Enterprise AI Security Starts with AI Agents — April 2026
VentureBeat — Most Enterprises Can’t Stop Stage-Three AI Agent Threats — April 2026
Vorlon / GlobeNewswire — 2026 CISO Report on Agentic Ecosystem Security — March 2026
HiddenLayer 2026 AI Threat Landscape Report — March 2026
IBM Cost of a Data Breach 2025 — July 2025
Verizon DBIR 2025 Key Takeaways — Entro Security analysis
AI Cybersecurity Statistics 2026 Q1+Q2 — CyberSecStats
2025 Prompt Injection Threat Landscape (HackerOne data) — Sonnylabs
CaMeL: Defeating Prompt Injections by Design — Google DeepMind
Simon Willison — CaMeL analysis and the Dual LLM Pattern
NVIDIA AI Red Team — Practical LLM Security Advice
NVIDIA AI Red Team — Securing LLM Systems Against Prompt Injection (Rich Harang)
TokenMix — Prompt Injection Defense 2026: 8 Techniques Ranked
Stripe Issuing — Spending Controls Documentation
Stripe Machine Payments Protocol analysis — Digital Applied — March 2026
Anthropic — Claude Code Sandboxing Engineering Blog — October 2025
Auth0 — MCP Spec Update: All About Auth — June 2025
MITRE ATLAS v5.4.0 — Vectra AI guide — March 2026
IBM Research — AI Bill of Materials (Granite models) — March 2026
OWASP GenAI — AIBOM Generator at OWASP — December 2025
Quantarra — SOC 2 AI Compliance 2025 — December 2025
PolicyLayer — SOC 2 Compliance for AI Agents — November 2025
Future of Life Institute — System Prompt Transparency Indicator — July 2025
Delinea / BiometricUpdate — Pressure to Adopt AI Has Led to Critical Security Gaps — March 2026
Promptfoo — AI Regulation 2025 Analysis
CyberDesserts — Anthropic Claude Opus 4.6 prompt injection success rates
MintMCP — AI Agent Memory Poisoning analysis
F5 — Gartner AI TRiSM and Forrester AEGIS analysis — January 2026

How to make your AI agent safer.

Agentic AI is being deployed faster than it is being secured.

The baseline, not the edge cases

Download the 2026 Agent Safety Guide

What the research actually recommends

Narrow the scope.

Authenticate the agent and the human.

Spending caps and second-pair-of-eyes thresholds.

Treat every input as untrusted.

Sandbox actions before production.

Log everything. Make logs auditable.

Publish what the agent can and can’t do.