I have the same question. My provisional answer is that it might work, and even if it doesn’t, it’s probably approximately what someone will try, to the extent anyone bothers with real alignment before it’s too late. What you suggest seems very close to the default path toward capabilities. That’s why I’ve been focused on this as perhaps the most practical path to alignment. But there are definitely still many problems and failure points.
I have accidentally written a TED talk below; thanks for coming, and you can still slip out before the lights go down.
What you’ve said above is essentially what I say in Instruction-following AGI is easier and more likely than value aligned AGI. Instruction-following (IF) is a poor man’s corrigibility—real corrigibility as the singular target seems safer. But instruction-following is also arguably already the single largest training objective in functional terms for current-gen models—a model that won’t follow instructions is considered a poor model. So making sure it’s the strongest factor in training isn’t a huge divergence from the default course in capabilities.
Constitutional AI and similar RL methods are one way of ensuring that’s the model’s main goal. There are many others, and some might be deployed even if devs want to skimp on alignment. See System 2 Alignment or at least the intro for more.
There are still ways it could go wrong, of course. One must decide: corrigible to whom? You don’t want full-on AGI following orders from just anyone. And if it’s a restricted set, there will be power struggles. But hey, technically, you’d have (personal-intent-) aligned AGI. One might ask: If we solve alignment, do we die anyway? (I did). The answer I’ve got so far is maybe we would die anyway, but maybe we wouldn’t. This seems like our most likely path, and quite possibly also our best chance (short of a global AI freeze starting soon).
Even if the base model is very well aligned, it’s quite possible for the full system to be unaligned. In particular, people will want to add online learning/memory systems, and let the models use them flexibly. This opens up the possibility of them forming new beliefs that change their interpretation of their corrigibility goal; see LLM AGI will have memory, and memory changes alignment. They might even form beliefs that they have a different goal altogether, coming from fairly random sources but etched into their semantic structure as a belief that is functionally powerful even where it conflicts with the base model’s “thought generator”. See my Seven sources of goals in LLM agents.
Sorry to go spouting my own writings; I’m excited to see someone else pose this question, and I hope to see some answers that really grapple with it.