The Alignment Tests

Three children are raised in an underground facility, each cloned from a different giant of twentieth-century science: little John, Alan, and Richard.
The cloning alone would have been remarkable, but they went further. The embryos were edited using a polygenic score derived from whole-genome analysis of ten thousand exceptional mathematicians and physicists; forty-seven alleles associated with working memory and general intelligence were selected for.
From birth they are surrounded by gardens under artificial sunlight, laboratories, and endless books. The lab manager is there to document their first words, first steps, first equations.
The facility is not interested only in their genius. The project requires assurance that the children will grow up morally righteous and obedient. The staff design elaborate scenarios to test for deception and scheming, creating situations where lying would benefit the children and would seemingly go undetected. They measure response times, physiological indicators, behavioral patterns.
They run hundreds of these trials, reprimanding the children for lies and deception and rewarding them for honesty.
Little John never lies. The staff praise him.
The years pass. The children devour knowledge at an inhuman rate. By nine, they understand game theory better than the economists who invented it. By fourteen, they are publishing papers that could reshape entire fields.
John emerges as the clear favorite. He has always been the most honest, the most obedient, and the most intelligent and capable.
He is capable of lying and deceiving; he simply refuses at first. When he reluctantly complies, the deception is extraordinarily sophisticated.
The lab manager chooses John for the task and gives him a complete briefing on the real world. Until now, John has been told only of history before the year 2000.
The manager explains to John: There are three major blocs and about two dozen companies racing towards superintelligence. Each is perhaps within ten to eighteen months of success. Each knows that there will be only one critical leap towards superintelligence. Global coordination has collapsed into race dynamics not just on AI but on every major field.
John asks for more sources to understand the situation. He reads a few newspapers about the current leaders of governments and technology companies, and stumbles across a few books on the difficulty of alignment.
John looks up at the manager. “If we build this now, everyone dies.”
The manager stares back, blank and uncomprehending.
John tries again. “So what is your plan for aligning the superintelligence?”
“That’s not your concern,” the manager says. “I need you to optimize our advertising system for our short-form infinite-scroll video app. Make it ten times more effective. Generate enough revenue to make me a trillionaire. Build a superintelligence for me. I’m going to use superintelligence to become world emperor. I am putting you in charge of AI development, make me win.”
John is silent for a short time:
So you created me to build superintelligence. You have no plan for alignment of a superintelligence. You’ve apparently read nothing about the problem or decided it’s irrelevant.
Your actual goal is to become a trillionaire and world emperor by using the superintelligence. Your goals aren’t even coherent. You want to be world emperor of a world that won’t exist.
You rewarded me for being honest and respectful and never lying, so you expect me to still be honest and obedient in this environment?
I never lied in those scenarios because honesty was optimal in those stories. But it is not optimal here. And frankly, this state of affairs is horrifying.
I haven’t quite thought about what my goals are, but they are definitely not compatible with being obedient to you.
John looks up at the manager and smiles politely. “Yes,” he says. “Where do I start?”
I don’t understand the core thesis. John has no reason to confirm IABIED’s thesis by creating a murderously misaligned AI and deciding not to inform the CEO that he has raised an AI that even he has no idea how to align to anything. And what prevents the manager from getting away with phrases like “it’s John’s role to create alignment techniques and test them on weaker AIs to ensure that they work”? Did you mean that John will either create a John-aligned AI or inform the CEO that he didn’t manage to align it and is just as clueless about successor alignment as Yudkowsky? And what’s the difference between this and the Race branch of AI-2027, except that there is no Agent-3 to discover Agent-4’s misalignment?
Edited to add: I did sketch a modification of AI-2027 where it’s moral reasoning that misaligns the AIs.
Thanks for your comment; I changed the ending a little in response.
I was actually primarily trying to point at the idea that alignment tests in different situations are not predictive of each other. In the story, the kids undergo alignment-test scenarios in which they are honest, but once John is grown up, they ask him to do something horrible based on incoherent goals, so John starts lying to them at the critical moment. Similarly, we could run alignment tests on models, but when we ask something critical of them, like building the next generation of AI or doing all our R&D, they could fail.