I don’t think that I understand how Mini-Munich, KidZania, etc. address the following Hard Problem. Before the rise of complex technologies, kids could easily learn economically useful activities like cooking, sowing or caring for livestock, because those activities didn’t require much abstract thought. Nowadays, activities like engineering require complex knowledge which is harder to ground in reality without expensive experiments (for example, to check whether a part in a mechanism is well designed and will last). How is theory to be taught without coercive techniques, and without aligning the kids to learn theory by having them learn a simplified version of it?
I suspect that one should separate this into an Orthogonality Thesis-like argument (that having different values makes it unlikely that you’ll be hit in the face by reality) and facts about reality itself.
When you say things like “Convergence came not from a philosophical breakthrough, but from a breakdown of the political equilibrium between incompatible economic systems in the North and the South,” I take that as a true claim that the Southern system, which permitted slavery, was genuinely less capable of adopting new technologies and was outcompeted, which has a different implication. Similarly, claims like “philosophical fit matters too, and where it is misaligned, half-lives are shorter” also imply that there are objective properties of answers, which in turn means that such properties can be discovered rather than decided.
I wish that the team had given the humans an opportunity to write custom scaffolds and test them on a preliminary, cheap model before applying the winning scaffolds to frontier models. The team did test Grok 4 as scaffolded by Pang and Berman, GPT-5.2 as scaffolded by Land, and Gemini 3 Pro as scaffolded by Poetiq.
Edit: I also think that nobody noticed that Grok 4.20 was finally tested on ARC-AGI-1 and ARC-AGI-2. What does this mean for Grok’s performance on real-world tasks or other benchmarks?
Edit 2: I had GPT-5.3 check Grok’s solution to a warm-up problem from FrontierMath.
After introducing the ARC-AGI-3 benchmark, the team decided to measure the performance of Grok 4.20 (presumably Grok 4.20 as of March 9?) on ARC-AGI-1 and ARC-AGI-2. Grok… demonstrated its capabilities. How likely is it that Grok has stopped being a train wreck and become something worthy of being tested? What could one do to have Grok tested on other benchmarks?
because their values are not yours
First of all, there might exist cases where a god with different values is preferable to the current world state (e.g. if the world is clearly heading towards a self-destruction that would terrify even the god). Additionally, I doubt that the theme of gaining power and fixing the world recurs only in ratfic rather than in fiction more broadly, whose authors display a bias which I cannot describe more precisely than “finding it hard to restrain themselves.” Finally, Max Harms has been trying to construct an agent whose sole goal is being corrigible, and he even defined power for the agent to optimize in a way which I suspect can be transformed into the Natural Abstract Goodness.
as God you can naturally Fix the world
This phrase reminds me of the Russian sci-fi novel literally titled Hard to Be a God. I expect it to be relevant, but I find it hard to explain the relevance without spoilers.
Fully agreed. I would also like @Daniel Kokotajlo’s team to open-source the rules of the tabletop exercise and/or to revise the AI-2027 compute forecast so that it reflects the new possibility of China amassing approximately the same amount of compute as the USA by causing the latter to lose its compute supply from Taiwan. If that happens, neither OpenBrain nor DeepCent would have a lead to burn in case of misalignment, causing a disaster.
@RobertM Why did the new LessWrong Editor lose the ability to create question top-level posts?
I suspect that this is not due to the problem as written, but due to similar real-world situations and the ease of being overconfident and underestimating p(loss). As Zvi put it, “Executing real trades is necessary to get worthwhile data and experience (italics mine—S.K.). Tiny quantities work. A small bankroll with this goal must be preserved and variance minimized. Kelly is far too aggressive.”
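To spell out the arithmetic behind “Kelly is far too aggressive,” here is a minimal sketch of the standard Kelly formula; the specific probabilities are my own made-up illustration, not numbers from Zvi’s post.

```latex
% Kelly fraction for a bet paying b:1 with win probability p and loss probability q = 1 - p.
% For an even-money bet (b = 1), a modest overestimate of p multiplies the recommended stake.
\begin{align*}
f^{*} &= \frac{bp - q}{b} = 2p - 1 \quad \text{(even-money bet, } b = 1\text{)}\\
\text{true } p = 0.52 &\;\Rightarrow\; f^{*} = 0.04 \quad \text{(4\% of the bankroll)}\\
\text{believed } p = 0.60 &\;\Rightarrow\; f^{*} = 0.20 \quad \text{(20\% of the bankroll, five times larger)}
\end{align*}
```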
Could you explain in what sense politicians and others outsourced thinking before the rise of AI? Did you mean that people rarely think about things which aren’t in their areas of expertise and defer to experts?
First of all, I suspect that fictional approval has constraints similar to the collective’s approval and/or cultural hegemony. Secondly, “the constraints those humans have” could be not limited intelligence, but embodiment and/or growing up in environments with long-term consequences and similarly capable but different intelligences. An embodied paperclip optimizer can do only so much with an individual brain and limbs; it would have to steer others’ actions towards executing its plans (e.g. participating in the creation of a robot army and aligning it to paperclips). Finally, I don’t buy the argument that long-term strategy, unlike philosophy, is hard to verify. LTS is supposed to have an objective result of goals being achieved or not achieved, and is likely testable in a manner similar to, e.g., the AI-2027 tabletop exercise.
The model can be perfectly well-calibrated, honest, and non-sycophantic by the subset of metrics we manage to set on it. Nevertheless, the loop still forms.
I suspect that you would be interested in reading about AI-assisted value lock-in, the Intelligence Curse or other forms of gradual disempowerment. Your example of
“AI helped me draft this” becomes “Analysis shows that” or questions like “Was this vibecoded?” get answered with “Less than 50% and only where the code was too bad to go through by myself”.
is not such an example if the task has an easily verifiable reward and isn’t related to ideological matters.
I solved the problem, then double-checked it with the LLM. In addition, you can check the logic by yourself.
Could you explain how interp techniques eventually become useless? As for training against interp at the end of training, this reminds me of this post by Knight Lee.
Just by looking at the pictures, we can see that D doesn’t share the right-hand OH group every other compound has. So that’s not the answer. Next we see that B is missing the downwards-going carbon branch (and so is D). So that is not the answer, either. We’re left deciding between A and C. But A and B share the same squiggly mid-lane carbons. So the answer must be A.
The actual answer is B, not A. The carbon atom bonded to one hydrogen is at distance 1 from the hydroxyl group, which rules out A. D is not an acid, and C has the wrong distance between the -OH and -COOH groups.
TED-AI is defined by Kokotajlo-Lifland as Top Expert Dominating AI. However, I struggle to understand the origins of @Toby_Ord’s distribution. I suspect that his sources for longer timelines are as hard to rely on as Cotra’s heavily criticized estimate or the observation that “all the revenue growth in the industry has corresponded to a scaling up of the supply of inference compute so that revenue per H100 equivalent has remained fairly constant.” Unlike the Epoch Capabilities Index as a function of training compute, or the ARC-AGI leaderboard scores per dollar spent (which might imply that no possible CoT-based system is far more effective at ARC-AGI than Gemini 3 Flash and Gemini-3.1 Pro), Ergil’s argument doesn’t actually claim anything about the capabilities of AI systems which don’t even exist yet.
I firmly believe that the OP’s author should have reduced the uncertainty at least to a Lifland-like estimate. Additionally, I struggle to understand most constraints related to broad timelines. Whatever the timelines are, our end goal is to ensure that the ASI is either never created or is aligned, and aligned to something other than a dystopia. Preventing a misaligned ASI requires leverage at least over actors as reckless as xAI, and preventing Intelligence Curse-like outcomes or AI-enabled dictatorships requires some influence over power struggles. Such influence requires us to ensure that politicians occupying positions of power act to prevent risks, not to do things like destroying Anthropic for refusing to participate in mass surveillance. But I don’t see any pathways except for infecting politicians with the right memes (think of IABIED’s attempt to flood politicians with calls, letters and e-mails, or of the IABIED march) and placing infected people into higher-level positions.
Moreover, Kokotajlo’s timeline implies a 50% chance of TED-AI before Jan 2031 or before Oct 2032, while Eli’s timeline implies a 50% chance of TED-AI before Feb 2035 or Apr 2036. Taken at face value, these estimates mean that p(TED-AI is created within the next 10 years) is around 50% (or, in Kokotajlo’s case, 62% or outright 73%), making a project that requires 20 years to complete unlikely to have an effect.
But this is a very difficult task. Imagine someone had scanned your brain and trained an ML model to predict when you were lying. And that ML model was 99.9% accurate, and fired even when your lies were barely legible to yourself.
How would you go about subverting that mechanism?
I think it’s possible, but it seems considerably harder than e.g. scheming / alignment faking in a single forward pass.
I strongly suspect that this is a misunderstanding. Suppose that the probe does fire when you lie, and that its firing feeds back into your training. Then you can steer your brain’s behavior away from the states where the probe fires the strongest (by making it fire only weakly) while keeping the lying itself, or make the probe fire when you genuinely can’t think of a solution and want to cover that up. Either way, the very behavior that makes the probe fire ends up de-reinforcing the probe’s signal, and we are cooked.
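Here is a minimal toy sketch of what I mean (my own illustration with a made-up two-dimensional “brain state” and a frozen linear probe, not anyone’s actual training setup): once the probe’s output is folded into the loss, gradient descent can silence the probe while the deception it was meant to track keeps growing.

```python
import numpy as np

# Toy "brain state": dimension 0 carries the deceptive behaviour (which the task
# reward encourages), dimension 1 is the direction a frozen lie-detector probe reads.
h = np.array([1.0, 1.0])
probe_w = np.array([0.0, 1.0])  # frozen linear probe

lr = 0.1
for _ in range(200):
    probe_score = probe_w @ h
    # Loss = -(task reward from deception) + (penalty from probe firing).
    # Its gradient pushes dim 0 up (more deception) and dim 1 down (quieter probe).
    grad = np.array([-1.0, 0.0]) + 2.0 * probe_score * probe_w
    h = h - lr * grad

print(f"deception = {h[0]:.2f}, probe score = {probe_w @ h:.4f}")
# Typical output: deception keeps increasing while the probe's score goes to ~0,
# i.e. training against the probe de-reinforces the probe rather than the lying.
```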
Additionally, Byrnes’ case for interpretability in the loop doesn’t actually have the belief system trained on the rewards from the desire system. Instead, the two systems somehow co-design outputs.
Finally, if you ask about “a principled way to reason about this ratio”, one can consider a simple model. Suppose that the model family Agent-N was trained without using interpretability, and that the company creates new models until one of them is NOT caught being misaligned. Each Agent-N can be either aligned with probability p_a, misaligned and catchable with probability p_c, or misaligned and uncatchable with probability p_u (with p_a + p_c + p_u = 1), meaning that the end result is either an aligned model with probability p_a/(1 - p_c) or a misaligned one with probability p_u/(1 - p_c). If another model family Open-N is trained using interpretability, then we lose the category of caught models; instead, we get Opens that are aligned with probability q_a or misaligned-and-uncaught with probability q_u. What we need is q_u < p_u/(1 - p_c), rather than q_u ≈ p_u + p_c. Does this mean that Goodfire’s research is a case for the former?
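For completeness, here is the short derivation behind those fractions, assuming each retrained Agent-N is an independent draw with the same probabilities p_a, p_c, p_u (summing to 1) and that the company retrains after every catch:

```latex
% Probability of each kind of deployed model after resampling until the first
% model that is not caught (k = number of caught-and-retrained attempts first).
\begin{align*}
P(\text{deploy an aligned model}) &= \sum_{k=0}^{\infty} p_c^{\,k}\, p_a = \frac{p_a}{1 - p_c},\\
P(\text{deploy a misaligned, uncaught model}) &= \sum_{k=0}^{\infty} p_c^{\,k}\, p_u = \frac{p_u}{1 - p_c}.
\end{align*}
```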
In particular, I am talking about famous pieces like AI 2027
As for AI-2027, @Daniel Kokotajlo thinks that it’s NOT a scenario with a competent USG. Edited to add: additionally, one modification of the scenario had Agent-4 escape and coordinate with governments of some states weaker than the USA and China.
Suppose that a multi-decade pause is somehow necessary. How could a counterfactual Anthropic ruled by you find this out? How likely would GDM and OpenAI be to find out the need for a multi-decade pause?
Edited to add: Buck’s phrase “The basic case against Anthropic is that it is probably the worst epistemic environment for discussion of misalignment risk out of these companies, because the organization cares a lot about convincing low-info Ant employees that Ant is great on safety, so they spend more effort on shaping the internal narrative about misalignment risk” seems weird. Does it mean that, instead of doing actual security work, Anthropic has somehow implemented a security theater that is useless against the actual misalignment which will emerge when the time comes?