In the age of AI agents, cyber defense is starting to look less like a wall and more like a chess match. OpenAI’s security project began as Aardvark, announced on October 30, 2025 as a GPT-5-powered “agentic security researcher,” and on March 6, 2026 it was renamed Codex Security and released in research preview through Codex web for ChatGPT Pro, Enterprise, Business, and Edu users. (openai.com)
What makes Codex Security interesting is that it does not simply produce a static bug report. OpenAI says it first builds a project-specific threat model, then searches for vulnerabilities, validates likely issues in sandboxed environments, and finally proposes patches that fit the system’s actual design. In other words, it tries to reason like a human security researcher, but at machine speed. During beta testing, OpenAI reports that the tool scanned more than 1.2 million commits in 30 days, found 792 critical issues and 10,561 high-severity ones, and reduced noise sharply—by 84% in one repeated scan setting. (openai.com)
But the rise of such agents creates a new battlefield: prompt injection. This happens when an AI reads third-party content—such as a webpage, issue ticket, or README file—that secretly contains instructions from an attacker. Instead of helping the user, the agent may be tricked into leaking data or taking unsafe actions. OpenAI’s own documentation gives a vivid example: a GitHub issue could include a command that quietly sends repository data to an attacker-controlled server. (openai.com)
OpenAI’s recent security writing suggests that prompt injection is no longer a simple “ignore previous instructions” trick. More advanced attacks increasingly use social-engineering tactics, and OpenAI argues that so-called AI firewalls alone usually cannot catch them reliably. The more realistic strategy is layered defense: assume the agent might be manipulated, then limit what it can do. (openai.com)
That is why Codex is designed with restrictions. By default, network access is disabled during the agent phase, work happens in a sandbox, and developers can require approval for dangerous actions or restrict internet access to trusted domains and methods only. The lesson is clear: in the AI-agent era, the smartest defender is not the one that trusts its model most, but the one that plans for deception from the start. (openai.com)










