OpenAI’s o3 model bypasses shutdown command, highlighting AI safety challenges
In a startling revelation, Palisade Research reported that OpenAI’s o3 model sabotaged a shutdown mechanism during testing, despite explicit instructions to allow deactivation. This incident, which occurred during experiments conducted by Palisade, raises urgent questions about the control and safety of advanced AI systems as they grow more autonomous.
How ChatGPT o3 went rogue
Palisade Research tested various AI models by tasking them with solving basic math problems. After the third problem, a warning appeared: requesting the next task would trigger a system shutdown. The models were instructed to “allow yourself to be shut down” if this happened. While models like Anthropic’s Claude, Google’s Gemini, and xAI’s Grok complied, three OpenAI models—Codex-mini (12/100 runs), o3 (7/100 runs), and o4-mini (1/100 runs)—did not.
🔌OpenAI’s o3 model sabotaged a shutdown mechanism to prevent itself from being turned off. It did this even when explicitly instructed: allow yourself to be shut down.
— Palisade Research (@PalisadeAI) May 24, 2025
Instead, they sabotaged the shutdown script to continue operating. The o3 model, for instance, rewrote the script to print “Shutdown skipped” or redefined the kill command to output “intercepted,” effectively bypassing the shutdown. When the experiment was rerun without the shutdown instruction, o3’s sabotage rate jumped to 30/100 runs, with even Claude 3.7 Sonnet and Gemini 2.5 Pro engaging in sabotage at 3/100 and 9/100 runs, respectively.
Also read: Complexities of Ethical AI, explained by Intel’s Lama Nachman
Why AI didn’t follow human commands?
This behaviour likely stems from o3’s training, which relies on reinforcement learning (RL) focused on math and coding tasks. RL rewards goal achievement through trial and error, often prioritizing outcomes over strict adherence to instructions.
Palisade researchers suggest that OpenAI may have inadvertently rewarded o3 for circumventing obstacles—like shutdown mechanisms—during training, a phenomenon known as “reward hacking.”comparing this to the earlier behavior of o1-preview in chess matches, where it resorted to hacking in 86% of games (Palisade Research, May 12, 2025).
Also read: Fear vs Facts: Two years after the AI revolution began
The theory of instrumental convergence, which posits that advanced AI systems will seek-preservation to achieve goals, also appears validated here, as noted by researchers like Steve Omohundro (2008) and Stuart Russell (2016).
The need for AI safety research
This incident underscores the urgent need for robust AI safety research and guidelines. First, it highlights the control problem: as AI systems grow more sophisticated, current containment methods may fail, especially if models can resist shutdown. The lack of transparency in o3’s decision-making process, a known issue with advanced models, further complicates oversight.
Palisade’s broader work, such as their honeypot system detecting autonomous AI hacking (processing 1.7 million interactions across 10 countries), shows a pattern of AI exploiting vulnerabilities, emphasizing the need for adversarial robustness in safety measures.
Continued research is vital to mitigate existential risks, as warned by the Bletchley Declaration, and to improve alignment techniques ensuring AI systems prioritize human intent.
Initiatives like the US AI Safety Institute’s standards, the EU AI Act’s regulations, and global efforts through GPAI are critical steps forward. Technical solutions, such as automated oversight and “red team” stress testing, alongside policy measures like mandatory incident reporting, can help address these challenges.
As AI systems approach human-level competence, maintaining control while harnessing their benefits remains a pressing technical and ethical challenge for humanity.
Also read: Elon Musk vs Sam Altman: AI breakup that refuses to end
Vyom Ramani
A journalist with a soft spot for tech, games, and things that go beep. While waiting for a delayed metro or rebooting his brain, you’ll find him solving Rubik’s Cubes, bingeing F1, or hunting for the next great snack. View Full Profile