Human context missing: AI benchmarks are flawed, researcher explains why

HIGHLIGHTS

Researcher Angela Aristidou argues AI benchmark scores mask a real-world performance gap

Why high benchmark scores fail to predict how AI performs in actual workplaces

Her proposed HAIC method rethinks AI evaluation beyond one-off accuracy tests

Every few weeks, an AI CEO tweets that their latest model has topped a benchmark. The headlines follow, and so do the enterprise sales pitches. But one researcher who has spent years watching AI land in real workplaces thinks those scores may be measuring the wrong thing entirely.

Angela Aristidou, a professor at University College London and faculty fellow at Stanford’s Human-Centered AI Institute, has been studying AI deployment since 2022 across hospitals, humanitarian organisations, small businesses, and nonprofits in the UK, US, and Asia. Her conclusion is that the way the industry tests AI has almost nothing to do with the way people actually use it.


The testing problem

Current benchmarks pit AI models against individual humans on curated, isolated tasks such as math problems, coding challenges, and essay writing. The logic is clean: define a task, measure accuracy, rank the results. But Aristidou argues that this setup is, at its core, disconnected from reality.
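To make that loop concrete, here is a minimal sketch of the conventional benchmark pattern in Python. The model interface, task format, and names are hypothetical illustrations, not any real benchmark suite's API:

```python
# Hypothetical sketch of the conventional benchmark loop described above:
# define a task set, measure accuracy, rank the results.
from typing import Callable

# A "model" here is just a function from prompt to answer.
Model = Callable[[str], str]

def accuracy(model: Model, tasks: list[tuple[str, str]]) -> float:
    """Fraction of curated, isolated (prompt, answer) tasks answered correctly."""
    correct = sum(1 for prompt, answer in tasks if model(prompt) == answer)
    return correct / len(tasks)

def rank_models(models: dict[str, Model],
                tasks: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Score every model on the same task set and sort the leaderboard."""
    scores = {name: accuracy(fn, tasks) for name, fn in models.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Everything outside the (prompt, answer) pair – the team, the workflow, the institution – is invisible to this loop, which is precisely the disconnect Aristidou describes.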


In actual workplaces, AI is rarely used by just one person on a singular task. It sits inside teams, workflows, and institutional processes that have their own histories, norms, nuances and constraints. A model that scores brilliantly in a lab environment can very easily introduce friction the moment it hits a real organisation.

She saw this play out directly in radiology units across the UK and California. AI tools approved on the strength of impressive diagnostic accuracy scores were causing unexpected slowdowns in practice: staff had to reconcile AI outputs with hospital-specific reporting standards and national regulatory requirements that no benchmark had accounted for. The tests measured only what the model could do by itself, not what it would do to a team that had been working together for a long time. When this gap becomes visible, organisations quietly shelve the tool. Time, budget, and goodwill get spent, and trust in AI erodes, which is damaging in high-stakes settings like healthcare.

What better benchmarks would look like

Aristidou proposes an alternative she calls HAIC – Human-AI, Context-Specific Evaluation. Rather than one-off accuracy tests, HAIC benchmarks would assess AI over longer periods, working with real teams inside real workflows, with outcome measures that go beyond how fast and how accurate the model is.

This would evaluate teams, not just individuals; track performance over months, not a single test session; measure coordination and error detectability alongside accuracy; and account for the downstream effects an AI tool has on the broader system it enters.
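Aristidou's method isn't published as code, but a speculative sketch helps show how different the bookkeeping would be. The field names below are illustrative guesses derived only from the dimensions just listed (teams, months, coordination, error detectability, downstream effects), not her actual schema:

```python
# Speculative sketch of what a HAIC-style evaluation might record over time.
# All names are illustrative, inferred from the dimensions in the article.
from dataclasses import dataclass, field
from datetime import date

@dataclass
class HAICObservation:
    """One periodic observation of an AI tool inside a live team workflow."""
    observed_on: date
    team_id: str
    accuracy: float              # the classic benchmark dimension
    errors_surfaced: int         # AI mistakes the team caught in the moment
    errors_missed: int           # AI mistakes discovered only later
    coordination_minutes: float  # extra time spent reconciling AI output
    downstream_notes: str = ""   # knock-on effects beyond the team itself

    @property
    def error_detectability(self) -> float:
        """Share of the AI's known mistakes the team caught as they happened."""
        total = self.errors_surfaced + self.errors_missed
        return self.errors_surfaced / total if total else 1.0

@dataclass
class HAICEvaluation:
    """Months of observations for one tool, one team, one organisational context."""
    tool: str
    context: str
    observations: list[HAICObservation] = field(default_factory=list)
```

The point of the structure is that a single accuracy number becomes one column among several, logged repeatedly across months rather than captured once in a lab.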

She pointed to a humanitarian-sector case study in which an AI system was evaluated over 18 months inside live workflows, with specific attention to how easily human teams could catch and correct its mistakes – building a track record that gave them a working sense of how much to trust it, and when.

Benchmark scores function as a proxy for readiness. If those scores are disconnected from real-world performance, the decisions built on top of them carry risks nobody is accounting for. Aristidou's argument isn't that AI doesn't work; it's that we don't yet have the right tools to measure how well it does.


Vyom Ramani

A journalist with a soft spot for tech, games, and things that go beep. While waiting for a delayed metro or rebooting his brain, you’ll find him solving Rubik’s Cubes, bingeing F1, or hunting for the next great snack.
