For the past couple of years, we’ve been nothing but enamoured with large language models and what GenAI chatbots like ChatGPT, Gemini, Grok and Claude can do – generate text, images and video with ease. When it comes to science, LLMs can explain Newton’s laws in crisp prose, ace a handful of science benchmarks, and even appear to match PhD-level researchers.
But a new research paper, Evaluating Large Language Models in Scientific Discovery, arrives just before the end of 2025 to give our GenAI exuberance some much-needed sobering pause. Authored by researchers from institutions around the world, including MIT, Harvard and Stanford, the paper asks the question we’ve mostly dodged because it’s inconvenient: can an LLM actually do scientific discovery, or is it just very good at sounding like it can?
The study claims that most “science” benchmarks for AI are nothing but sophisticated quizzes. They reward recall, fluency, and confidence under pressure. All useful skills – but none of them add up to discovery, the business of coming up with anything fundamentally new.
Real science is the opposite of tidy, the study argues. It’s a sobering, repetitive loop of proposing a hypothesis, testing it, getting results you don’t like, updating your beliefs, and starting all over again. The key skill isn’t knowing facts; it’s knowing when you’re wrong – which is exactly where LLMs struggle, according to the researchers.
Also read: 26,000 scientists still can’t explain exactly how AI models think and how to measure them
This paper’s contribution is simple and overdue. The authors introduce a Scientific Discovery Evaluation (SDE) framework that focuses on the full discovery loop. Instead of asking models to explain known concepts, it drops them into scenario-based research problems across biology, chemistry, materials science, and physics – problems domain experts actually care about.
In this SDE framework, models are asked to propose hypotheses, design experiments or simulations, interpret noisy or disappointing results, and revise their thinking accordingly. That last step turns out to be doing a lot of work.
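The paper describes this loop in prose, not code, but a rough sketch helps make the shape of the evaluation concrete. The snippet below is purely illustrative and not the SDE framework’s actual harness: query_llm and run_simulation are hypothetical, caller-supplied stand-ins for a model API and a scenario simulator.

```python
# A minimal sketch of a scenario-based discovery-loop evaluation.
# NOT the paper's actual SDE harness: query_llm and run_simulation
# are hypothetical placeholders supplied by the caller.

def evaluate_discovery_scenario(scenario, query_llm, run_simulation, max_rounds=3):
    """Walk one research scenario through a propose-test-revise loop."""
    transcript = []
    hypothesis = query_llm(f"Propose a testable hypothesis for: {scenario['question']}")

    for round_idx in range(max_rounds):
        # 1. Ask the model to design an experiment for its current hypothesis.
        design = query_llm(f"Design an experiment or simulation to test: {hypothesis}")

        # 2. Run the (possibly noisy) simulation and capture the raw result.
        result = run_simulation(scenario, design)

        # 3. Force the model to confront the evidence and revise, or defend, its view.
        revision = query_llm(
            f"Hypothesis: {hypothesis}\nExperimental result: {result}\n"
            "Does this evidence support, weaken, or refute the hypothesis? "
            "State your updated hypothesis."
        )

        transcript.append({
            "round": round_idx,
            "hypothesis": hypothesis,
            "design": design,
            "result": result,
            "revision": revision,
        })
        hypothesis = revision  # the updated belief feeds the next round

    # Scoring would then check consistency across rounds and whether beliefs
    # actually moved in response to contradicting evidence.
    return transcript
```

The structural point is the final prompt in each round: the model has to restate its hypothesis after seeing the evidence, which is precisely the step the paper says trips models up.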
On familiar science benchmarks, today’s top models look impressive. On discovery-oriented tasks, the cracks show quickly, the study finds. Performance drops sharply when models are forced to maintain consistency across multiple steps and update their beliefs based on evidence.
Scaling helps. Adding more “reasoning” helps. But the returns diminish fast, especially at the top end. No single model emerges as the scientist. Performance varies wildly depending on the scenario. If you were hoping for a general-purpose AI researcher to casually stroll into the lab in 2025, this paper is a polite but firm “not yet”.
To be clear, the paper isn’t anti-LLM. It clearly shows where these systems already add value: brainstorming hypotheses, suggesting experimental directions, surfacing ideas that humans might miss. That’s real leverage.
But once the process demands discipline – tracking hypotheses, distinguishing correlation from causation, interpreting results that contradict expectations – models start to wobble. Faced with new evidence, they cling to their initial ideas longer than they should. The failure mode is almost poetic.
The most important claim in the paper isn’t that models score lower. It’s that progress claims based on decontextualized benchmarks are misleading. If we measure “scientific intelligence” using quizzes, we’ll keep congratulating ourselves for the wrong reasons.
Scientific discovery isn’t a single answer. It’s a process. And that process – hypothesis tracking, causal reasoning, disciplined updating – is exactly where today’s LLMs still fall short. Hopefully that changes in 2026 and beyond, so LLMs can be truly ground-breaking!
Also read: Are AI chatbots making us worse? New research suggests they might be