The nearest unblocked strategy problem combines with the adversarial non-robustness of learned systems to make a much worse problem than either alone. So I expect constraints to often break given more thinking or learning than was present in training.
Separately, reflective stability of constraint-following feels like a pretty specific target that we don’t really have ways of selecting for, so we’d be relying on a lot of luck for that. Almost all my deontological-ish rules are backed up by consequentialist reasoning, and I find it pretty likely that you need some kind of valid and true supporting reasoning like this to make sure an agent fully endorses its constraints.
Fortunately, things will probably be more gradual than that; we can try to control/steer AI systems not that different from today's, and then get their help controlling AI systems smarter than them, and so forth in a long chain all the way to ASI.
Why do you believe this reasoning at 50%? Is this due to new evidence or arguments? I’ve never heard it supported by anything other than bad high-level analogies.
There are a couple of factors that make the chain weaker as it goes along.
The level of human understanding decreases with each iteration; otherwise we would just do it without AI help. So any failure of helpfulness becomes less and less noticeable.
With this kind of test-driven engineering loop, especially in the current ML paradigm, it’s easier to hide problems than it is to deeply fix them. Accidentally hiding problems is easy today, but it becomes even easier as the AI becomes more complex and intelligent.
The only way I can see it going well is if an early AI realizes how bad of an idea this chain is and convinces the humans to go develop a better paradigm.
(I wrote a post about similar stuff recently, trying to explain the generators behind my dislike of prosaic research)