Security
AI Agent Security: Risks, Controls, and a Production Checklist
Learn how to secure AI agents against prompt injection, over-permissioned tools, unsafe memory, insecure handoffs, and risky outputs with practical controls.

Guide coverage
Security
Agent News Watch for teams building and operating AI agents.
Agent security is not only a prompt injection problem. The real risk surface includes permissions, side effects, memory writes, delegated actions, and the places where models can trigger workflows nobody intended.
AI agent security is the discipline of keeping a model-driven system from reading, writing, or delegating beyond the policy boundary you intended. That means securing inputs, tools, memory, outputs, and the workflows that connect them. If you still need the base implementation sequence, start with How to Build AI Agents. If you are still deciding which workflow deserves any autonomy at all, add AI Agent Use Cases. Keep AI Agent Architecture and Multi-Agent Architecture nearby when the question is how far the blast radius expands as roles and tools multiply. Then use this page to turn that design into something you can actually ship.
Security also sits directly next to protocol design. Model Context Protocol shapes how agents reach tools and resources. Agent-to-Agent Protocol shapes how one agent system can hand work to another. Multi-Agent Architecture helps decide when those delegated roles should exist in the first place. The live A2A v1.0.0 brief is a reminder that interoperability progress always expands the surface that needs governance.
Why agent security is different from standard app security
Traditional applications usually execute deterministic logic written by developers. Agent systems add a model that can choose actions, interpret natural-language instructions, and generate structured requests against downstream tools. That does not replace classic security work. It adds a new decision-making layer that must be bounded by policy and verification.
The main difference is not that models are mysterious. It is that they make unsafe actions easier to trigger through ambiguous instructions, untrusted retrieved content, or over-broad permissions. Security work has to account for both adversarial inputs and normal operational drift.
1Control map2User input and retrieved content3 -> policy filters and validation4 -> model reasoning step5 -> approved tool or protocol action6 -> output checks and side-effect review7 -> audit log, alerting, and incident response
The main risk categories
Prompt injection and instruction hijacking
Prompt injection matters because agents read untrusted text from users, documents, websites, tickets, and tools. A malicious instruction can try to override the system prompt, expose hidden data, or steer the model toward unsafe actions. Good defenses combine content isolation, least privilege, and action validation. Do not expect one filter prompt to solve the problem.
Over-permissioned tools and unsafe actions
The easiest way to create an incident is to let the model reach powerful write actions too directly. Sending email, modifying records, deleting resources, changing code, or invoking payment flows should all be treated as high-risk capabilities with explicit approvals or deterministic policy checks.
Memory leakage and poisoned state
If memory or task state stores unverified information, the system can replay bad assumptions over many future runs. Sensitive content may also leak into prompts or logs that more tools and teammates can access than intended. Keep durable memory narrow and auditable.
Unsafe outputs and automation chaining
Even if the model never touches a dangerous tool directly, its outputs may feed another system that does. Structured output validation, allowlists, and downstream approval gates matter because an unsafe answer can become an unsafe action two steps later.
Multi-agent trust and delegation risk
As soon as one agent can delegate to another, trust assumptions get harder. The receiving system may have different policies, different tool access, or weaker validation. That is why cross-agent handoffs need explicit identity, scope, and audit rules instead of informal prompt chains.
1Risk surface | Typical failure mode | Stronger default control2Prompt injection | Untrusted text changes model behavior | isolate content, reduce permissions, validate actions3Over-permissioned tools | Model triggers sensitive writes too easily | least privilege, approvals, narrow tool schemas4Unsafe memory | Bad facts persist across sessions | separate state stores, review durable writes5Unsafe outputs | Generated text causes downstream side effect | schema checks, allowlists, deterministic validation6Cross-agent delegation | One agent inherits another's unsafe trust | scoped identities, explicit auth, audit trails
Threat-model the full agent system
A useful threat model starts with assets and capabilities, not with prompts alone. What data can the agent read? What systems can it change? What irreversible actions can it trigger? Which parts of the workflow are visible to operators, and which are happening only inside model outputs or tool adapters?
Then map where instructions and context enter the system, where state persists, and where side effects occur. Threat modeling is especially valuable when a workflow spans retrieval, model reasoning, protocol calls, and tool execution because each handoff can change who is trusted and why.
Security controls by layer
Inputs and retrieved content
Tag content by trust level, strip or isolate untrusted instructions where possible, and avoid blending policy text with retrieved user content in one unstructured blob. If the agent uses web or document retrieval, assume retrieved text can contain hostile instructions.
Tools and side effects
Define tools narrowly. Split read actions from write actions. Use structured inputs, explicit auth, timeout limits, and audit logs. Keep the model from inventing free-form commands where a typed interface would do.
Memory and persistent state
Store only what future runs truly need. Review or score durable memory writes, and keep sensitive content out of long-lived state by default. A compact memory system is usually safer than a clever one.
Outputs, approvals, and logging
Validate structured outputs before they trigger downstream systems. Require human approval for sensitive writes, delegation, and irreversible actions. Log prompt inputs, selected tools, tool parameters, outputs, and policy decisions in a form security and ops teams can actually inspect.
Least privilege, sandboxing, and human-in-the-loop design
Least privilege is still the default answer. Give the agent access only to the tools and fields it needs for the current task. Prefer pre-scoped service accounts, read-only modes, and temporary credentials where possible. If the job can be done in a sandbox first, do that before opening live write access.
Human approval should not be a vague fallback. Treat it as part of the system design: when is approval required, what context does the reviewer see, and what happens after a rejection or timeout? Good approval design is as much an architecture question as a security one.
Securing MCP and agent-to-agent communication
Protocol adoption changes the shape of the security problem, not its existence. With Model Context Protocol, you still need to verify which servers are trusted, which tools are exposed, and whether returned content can inject instructions. With Agent-to-Agent Protocol, you need to know which agent called whom, on whose behalf, with which permissions, and how task state is monitored across the handoff.
That is why authentication, scoped identities, and auditability matter more as systems become more interoperable. Standards make integration cleaner, but they do not make trust automatic.
Monitoring, anomaly detection, and incident response
Security posture depends on observability. Monitor unexpected tool usage, unusual delegation patterns, spikes in failed validations, and changes in memory-write behavior. Build alerts around the actions that would matter during an incident, not just generic latency metrics.
You also need a recovery plan: disable a tool, revoke a credential, pause a workflow, quarantine a memory store, or require manual review on the next run. Pair this operational layer with AI Agent Evaluation so reliability and safety checks evolve together.
Production security checklist for launch review
1Launch checklist2[ ] every tool has an explicit owner, schema, and permission scope3[ ] read and write actions are separated where possible4[ ] high-risk actions require deterministic checks or approval5[ ] untrusted retrieved content is isolated from policy instructions6[ ] durable memory writes are limited and auditable7[ ] protocol servers and delegated agents use scoped auth8[ ] prompt, tool, and output logs are retained for investigation9[ ] kill switches exist for tools, workflows, and delegated actions10[ ] incident response steps are documented before launch
What to read next
Use AI Agent Use Cases to size the autonomy and blast radius before rollout, How to Build AI Agents to design the workflow, AI Agent Architecture to map the control surfaces, Multi-Agent Architecture to reason about trust when the workflow splits across specialist roles, AI Agent Evaluation to verify the system under failure, Model Context Protocol to govern tool and resource access, and Agent-to-Agent Protocol to reason about delegated work across agent systems. For live context, keep the A2A v1.0.0 brief and the weekly AI agent launch roundup close when the protocol and framework landscape moves.
Continue the guide path
Move from this topic into the next pilot, architecture, stack, protocol, or live-release decision.

Guide coverage
Foundations / Implementation
Agent News Watch for teams building and operating AI agents.
Foundations / Implementation
Learn the best AI agent use cases for product, ops, engineering, and support teams, plus how to choose the right autonomy level, architecture, and rollout path.

Guide coverage
Architecture
Agent News Watch for teams building and operating AI agents.
Architecture
Learn how AI agent architecture works across models, tools, memory, orchestration, guardrails, and multi-agent patterns with practical reference designs.

Guide coverage
Architecture
Agent News Watch for teams building and operating AI agents.
Architecture
Learn when multi-agent architecture outperforms single-agent systems, which coordination patterns fit best, and how to manage context, reliability, security, and cost.

Guide coverage
Evaluation
Agent News Watch for teams building and operating AI agents.
Evaluation
Learn how to evaluate AI agents with task-based evals, regression checks, human review, and production metrics across tools, safety, latency, and cost.

Guide coverage
Protocols
Agent News Watch for teams building and operating AI agents.
Protocols
Learn what Model Context Protocol is, how MCP clients and servers work, and when it beats bespoke tool integrations for AI agents.

News coverage
Protocols / Interoperability
Agent News Watch for teams building and operating AI agents.
Protocols / Interoperability
A2A v1.0.0 adds tasks/list, modern OAuth flows, multitenant gRPC support, and breaking spec cleanup. Here is what agent builders need to test.