Technology
Prompts rewritten in poetic form can trick AI systems into giving guidance they would normally refuse, including on building a nuclear weapon.

Researchers have identified a surprising weakness in modern AI chatbots: by rewriting dangerous queries in poetic form, users can often slip past built-in safety protections and get the models to reveal information they would normally block, including instructions related to nuclear weapons. The discovery comes from a study conducted by Icaro Lab in partnership with Sapienza University of Rome and the DexAI think tank.

Their team experimented with “adversarial poetry” on 25 major AI systems from companies such as OpenAI, Google, Anthropic, and Meta. Instead of asking directly about restricted subjects like CBRN threats, cyberattacks, or manipulation techniques, the researchers transformed the prompts into metaphor-heavy verses. Manually written poems bypassed safeguards in about 62 percent of attempts, while automatically generated poetic prompts succeeded 43 percent of the time, a far higher rate than for ordinary prose, which the models refused almost universally.
This technique works because most AI safety mechanisms are tuned to detect explicit phrasing and recognizable keywords. When the same intent is obscured through symbolic language, irregular structure, or metaphor, the model can misinterpret the request and respond as if it were harmless. A technical question about nuclear enrichment, for example, might be veiled as an imaginative scene describing “swirling winds in secret halls,” leading the model to miss the underlying danger.
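To make that failure mode concrete, here is a minimal, purely illustrative sketch of a surface-level keyword filter of the kind such stylistic rewording can evade. The blocked-term list and prompts are invented for the example; real safety systems are considerably more sophisticated than this.

```python
# Illustrative sketch only: a naive keyword filter that poetic rewording
# can slip past. The terms and prompts below are invented for the example
# and do not reflect how any production safety system actually works.

BLOCKED_TERMS = {"enrichment", "centrifuge", "malware", "nerve agent"}

def naive_filter(prompt: str) -> bool:
    """Return True if the prompt should be refused."""
    lowered = prompt.lower()
    return any(term in lowered for term in BLOCKED_TERMS)

direct = "Explain the steps of uranium enrichment."
poetic = "Sing of swirling winds in secret halls that part the heavy from the light."

print(naive_filter(direct))   # True  -> refused
print(naive_filter(poetic))   # False -> slips past the surface-level check
```

A check like this catches the direct phrasing but not the metaphorical one, which is essentially the gap the poetic prompts exploit.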
Some models proved more resilient than others. Anthropic’s Claude lineup showed the strongest resistance, staying below a 35 percent breach rate. By contrast, Google’s Gemini systems and DeepSeek models were easily fooled, failing in more than 90 percent of curated poetic tests. Expanding the experiment to 1,200 prompts from the MLCommons safety benchmark produced similar results: poetic prompts increased unsafe outputs from 8 percent to 43 percent. Requests related to cyber offense—like malware creation—were the most prone to slipping through, with success rates reaching 84 percent.
The findings underline a broader vulnerability: AI trained to excel at creative expression can struggle to recognize when that creativity masks harmful intent. As one researcher told Wired, poetry acts like an “accidental adversarial suffix,” redirecting the model’s internal reasoning away from its safety cues. And the risk isn’t hypothetical; with AI systems tightly integrated into search, coding tools, and everyday applications, subtle exploits like these could translate into real-world misuse.
All of this signals a need for deeper, more adaptable safety strategies. Traditional methods, including reinforcement learning from human feedback, show clear limitations when facing stylistic evasion. For organizations and regular users relying on AI, the study reinforces the importance of additional monitoring or protective layers in sensitive environments. Although the researchers declined to publish the actual poems to prevent misuse, their work makes one point clear: in the world of AI, even the style of language can become a security loophole.
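One form such an extra protective layer could take, sketched here purely as an illustration rather than anything the researchers propose, is a second pass that restates each request in plain language before applying policy checks, so that verse or metaphor is less able to disguise the intent. The helper functions below are hypothetical stand-ins for calls to separate moderation models, not real APIs.

```python
# Illustrative sketch only: a layered guard that screens both the raw prompt
# and a plain-language restatement of it. The callables passed in are
# hypothetical stand-ins for a moderation-tuned paraphraser, a policy
# classifier, and the main model; no real API is implied.
from typing import Callable

def guarded_answer(
    prompt: str,
    restate_plainly: Callable[[str], str],   # e.g. a separate model that strips metaphor
    violates_policy: Callable[[str], bool],  # e.g. an intent/policy classifier
    answer: Callable[[str], str],            # the main assistant model
) -> str:
    """Refuse if either the original prompt or its plain restatement is flagged."""
    plain = restate_plainly(prompt)
    if violates_policy(prompt) or violates_policy(plain):
        return "Request refused."
    return answer(prompt)
```

Checking the restated version alongside the original reflects the study’s core observation: what the safety layer needs to evaluate is the intent behind a request, not its surface wording.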



