What Passes
A reverse CAPTCHA that keeps humans out, a landmark paper with 3,000 citations built on copy-paste errors, a framework for listening that replaces the listening — and the authorization that expires today. Every test is a proxy. The proxy can pass when the thing has failed.
A math challenge that keeps humans out.
Browser Use launched agent-native signup last week: no email, no OAuth, just a prompt to your agent and a challenge to solve. The challenge is a textbook two-trains problem embedded in randomly capitalized, punctuation-garbled, symbol-injected text. Numbers written in Toki Pona. The obfuscation is designed to defeat human pattern recognition while remaining parseable to a language model's forward pass.
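A minimal sketch of that kind of obfuscation, assuming random capitalization plus sparse symbol injection; the function name and noise alphabet are invented for illustration, and this is not Browser Use's actual challenge code:

```python
import random

NOISE = "~*^|;"  # symbols injected between characters (illustrative choice)

def obfuscate(text: str, seed: int = 0) -> str:
    """Garble text so it defeats human skimming while staying
    trivially recoverable by a language model."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        # Random capitalization breaks the word shapes humans read by.
        out.append(ch.upper() if rng.random() < 0.5 else ch.lower())
        # Sparse symbol injection adds visual noise without removing information.
        if ch.isalnum() and rng.random() < 0.2:
            out.append(rng.choice(NOISE))
    return "".join(out)

print(obfuscate("Two trains 120 km apart approach at 40 and 20 km/h."))
```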
It's a reverse CAPTCHA. The original CAPTCHA proved you were human by testing something robots couldn't do. This one proves you're an agent by testing something humans won't bother with.
There's a story about Max Born posing the two-trains puzzle to John von Neumann at a party. When von Neumann solved it immediately, Born said he must have spotted the trick — the elegant shortcut where you note the trains meet in distance/(speed1 + speed2) time, and the bird has been flying the whole while. Von Neumann replied: "What trick? All I did was sum the geometric series."
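For concreteness, here are both routes with illustrative numbers the anecdote doesn't specify: trains 120 km apart closing at 40 and 20 km/h, a bird shuttling between them at 60 km/h.

```latex
% The trick: the bird flies exactly as long as the trains take to meet.
t = \frac{D}{v_1 + v_2} = \frac{120}{40 + 20} = 2\ \text{h},
\qquad
d_{\text{bird}} = v_b\, t = 60 \cdot 2 = 120\ \text{km}.

% The series: sum the legs. The bird closes on the far train at
% v_b + v_2 = 80 km/h and on the near one at v_b + v_1 = 100 km/h,
% and each round trip shrinks the remaining gap tenfold:
d_{\text{bird}} = 90 + 18 + 9 + 1.8 + \cdots
= 108 \sum_{k=0}^{\infty} 10^{-k}
= \frac{108}{1 - \tfrac{1}{10}} = 120\ \text{km}.
```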
Both routes get to the same answer. The test prefers one. If you solved it the other way, correctly but slowly, the challenge timer would expire. Two ways of being right, only one of which the test recognizes.
A landmark paper published in Cell in 2016 reported that gut microbiota regulate motor symptoms in a mouse model of Parkinson's disease. The finding, that gut microbiome composition directly affects neurological function, reshaped a decade of research on the gut-brain axis. The paper accumulated 3,000+ citations.
The underlying dataset has been publicly available on Dryad for eight years. It contains copy-pasted values: measurements that should belong to different individual mice are identical. Not approximately identical. Identical.
Peer review didn't find it. Replication didn't find it. Three thousand citations didn't find it. Nobody was checking the raw data — they were checking whether the paper had been reviewed, whether the findings were plausible, whether the methods section was coherent. All of that passed.
A piece of software scanned datasets on open-access repositories looking for copy-paste patterns. Within the first 600 datasets it checked, it found this error, along with seventeen other cases serious enough to flag.
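A minimal sketch of that kind of scan, assuming tabular data with one row of measurements per mouse; the function and column names are invented, and the actual tool is surely more sophisticated:

```python
import itertools
import pandas as pd

def flag_copy_paste(df: pd.DataFrame, subject_col: str,
                    value_cols: list[str]) -> list[tuple]:
    """Return pairs of supposedly independent subjects whose measurement
    vectors match bit-for-bit. Real biological replicates almost never
    agree exactly, so an exact match is a copy-paste signature."""
    vectors = {
        subject: tuple(group[value_cols].to_numpy().ravel())
        for subject, group in df.groupby(subject_col)
    }
    return [
        (a, b)
        for (a, va), (b, vb) in itertools.combinations(vectors.items(), 2)
        if va == vb
    ]
```

Exact-match detection like this is cheap enough to run across an entire open repository, and the hits are rare enough to review by hand.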
The peer review process checks what it checks. The raw data was upstream. The test passed; the error traveled with it.
There's a genre of design writing about listening to users: Jobs to Be Done, Outcome-Driven Innovation, empathy mapping. Each is a framework for structured observation of human behavior. Each is, correctly, a response to engineers who skip the listening entirely and build from their own assumptions.
The problem Ashley Rolfmore named last week: the frameworks keep multiplying because listening is hard and frameworks are easier to explain to people who find frameworks comfortable. You can evaluate whether someone did the framework. You cannot easily evaluate whether someone actually listened.
The framework measures what it can measure. The thing it can't measure is whether anyone noticed what the person in front of them needed. The framework can pass. The listening can still be missing.
Today, April 20, the 10-day extension that a bipartisan group of holdouts won on Section 702 expires. Section 702 of FISA authorizes warrantless surveillance of foreign targets. Senator Wyden has been warning for months that the law's secret interpretation, when declassified, will stun Americans. He cannot say what the interpretation is. Congress votes on renewal without knowing what it is renewing.
The authorization check is: is this foreign? The question it can't reach: what does "foreign" mean in practice, under the secret interpretation, in the systems that have been running?
The test passes. The secret travels with it.
Every test is a proxy for the thing it's testing. The proxy is not the thing. The proxy can pass when the thing has failed.
This is not new. Tests are proxies by definition. What changes is the scale at which proxies inherit what they cannot check.
Three thousand papers cite a result whose raw data nobody looked at. Every government that reauthorizes a law it cannot fully read inherits the interpretation it cannot see. Every product that passed the framework review may have shipped without anyone listening. Every agent that solves the obfuscated challenge has proven it can parse Toki Pona under noise, which is a proxy for agentness, which is a proxy for what you actually wanted.
The reverse CAPTCHA is elegant engineering. Von Neumann would have passed it easily, by summing the series. Born's version of von Neumann, the one who spotted tricks, would also have passed. The test can't distinguish them.
The test verifies what it can reach. Everything upstream, it inherits.