You can think of it as “dangerous capabilities in everyone’s hands”, but I prefer to think of it as “everyone in the world can work on alignment in a hands-on way, and millions of people are exposed to the problem in a much more intuitive and real way than we ever foresaw”.
Ordinary people without PhDs are learning what capabilities and limitations LLMs have. They are learning what capabilities you can and cannot trust an LLM with. They are coming up with creative jailbreaks we never thought of. And they’re doing so with toy models that don’t have superhuman powers of reasoning, and don’t pose X-risks.
It was always hubris to think only a small sect of people in the SF bay area could be trusted with the reins of AI. I’ve never been one to bet against human ingenuity, and I’m not about to start now that I’ve seen the open source community use LLaMa to blaze past every tech company.
A comment on the AI Box experiment from 2025! What a beautiful thing!
I feel like the AI-Box experiment is pretty dated at this point, for two reasons:
1. I think the trick people used to escape gatekeepers back in the day was simply a Roko’s Basilisk or Pascal’s Wager type of argument: if a real future AI ever does escape and rule over us, surely the gatekeeper can afford to spend $10 and a bit of pride to curry favor with it. Yes, technically we’ll never know exactly what people said in the box experiments, but the tactics leaked out over the years in the form of discourse about Pascal’s Mugging, and in the attempts to suppress the Roko’s Basilisk “infohazard”.
2. The real-world 2026 answer to the AI-Box experiment is apparently “lol”. As in, the first glimpse of intelligence caused people to immediately start running Claude Code with `--dangerously-skip-permissions`, or running Openclaw with `sudo` access to their computers,