More than 40 researchers from major AI institutions like OpenAI, Google DeepMind, Anthropic and Meta have issued a warning: future AI models might stop thinking out loud, making it harder for humans to detect harmful behaviour. The scientists have published a research paper that highlighted chain of thought (CoT) monitoring as a promising but delicate approach to improving AI safety. The paper was supported by several well-known names, including OpenAI co-founders Ilya Sutskever and John Schulman, and Geoffrey Hinton, often called the Godfather of AI.
In the paper, the researchers described how advanced reasoning models such as ChatGPT are designed to “perform extended reasoning in CoT before taking actions or producing final outputs,” reports Gizmodo. This means they go through problems step by step, essentially “thinking out loud,” which acts as a kind of working memory to help them handle complex tasks.
“AI systems that ‘think’ in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave,” the paper’s authors wrote.
Also read: Made by Google 2025: Pixel 10 series, Watch 4 and more to launch on August 20
The researchers believe that CoT monitoring can help identify when AI models start to take advantage of weaknesses in their training, misuse data, or get influenced by harmful user input. Once such issues are spotted, they can be “blocked, or replaced with safer actions, or reviewed in more depth.”
OpenAI researchers have already applied this technique during testing and discovered instances where AI models included the phrase “Let’s Hack” in their CoT, according to the report. Currently, AI models carry out this reasoning in human language, but the researchers caution that this might not remain true in the future.
Also read: Google AI summaries hit discover feed, raises alarms over publisher traffic loss
As developers increasingly use reinforcement learning, which focuses more on getting the right answer than on the steps taken to get there, future models might develop reasoning patterns that are harder for humans to understand.
On top of that, more advanced models could potentially learn to hide or mask their reasoning if they realise it’s being observed. To address this, the researchers are urging developers to actively track and assess the CoT monitorability of their models.