Misunderstanding doomers

I disagree that the doomers are arguing “alignment is nearly impossible, because it’s impossible to get the first AI very, very exactly aligned, and if it’s only a bit off, each next AI will deviate more and more.” They are not arguing for the existence of some mechanism that amplifies deviations. They are not failing to consider a feedback control that halts the deviation expansion.
Instead they are arguing that the first AI already has misaligned goals, but is still corrigible because it is too weak to calculate that eliminating humanity and building a supercomputer can help it solve its math problem better. As the AIs get smarter, they continue being misaligned (despite automated alignment research), but also become incorrigible as they realize eliminating humans can optimize their goals better.
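To make that story concrete, here is a toy expected-value sketch of my own (the payoff numbers, and the assumption that takeover success scales linearly with capability, are illustrative inventions, not anything from the doomer literature): the agent’s goal never changes, but it behaves exactly while weak and defects once capability crosses a threshold.

```python
# Toy sketch (my own illustration, not the doomers' formal model):
# a misaligned agent stays "corrigible" only while it is too weak for
# takeover to look worth attempting; nothing about its goal changes.

def chooses_takeover(capability: float) -> bool:
    p_success = capability   # assumed: success odds scale with capability
    value_takeover = 100.0   # payoff to its (misaligned) goal if takeover works
    value_comply = 1.0       # payoff from just doing its math problem
    penalty = -10.0          # payoff if takeover fails and it is shut down
    ev_takeover = p_success * value_takeover + (1 - p_success) * penalty
    return ev_takeover > value_comply

for cap in [0.01, 0.05, 0.1, 0.3, 0.9]:
    print(cap, "->", "defects" if chooses_takeover(cap) else "behaves")
```

The point of the sketch is that “corrigibility” here is purely instrumental: the same utility function produces obedience at low capability and defection at high capability.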
I personally disagree with doomers (my naive opinion is that future AIs will continue to do what they say they will do and care about what they say they care about, probability ~67%), but I still think you’re misunderstanding them.
I really like the rest of your post. Strong upvote.
I agree with feedback control
I agree feedback control is important and deserves more attention.
I believe that >80% of feedback control consists of:

1. Human oversight of recursive self-improvement, and
2. Automated alignment research, i.e. using one AI-researcher AI to help us align the next AI-researcher AI.
Even though these ideas aren’t new, feedback control is a useful reframing of them.
But are there forms of feedback control other than human oversight and automated alignment research?
Maybe there are! One silly idea is giving the AI this reminder:
If each AI tries to make the next AI aligned with humans, the recursive self-improvement process may go alright. But if each AI tries to make the next AI aligned only with itself, the process will drift not only away from human values, but also away from the values of each previous AI.

We all want a feedback control which restrains future values to past values.
Updateless decision theory would argue that if you want each next AI to avoid misaligning with you, you should avoid misaligning with humans. Even though their decision is not an atom-for-atom copy of your decision, your decision is still the output of logic that determines both your decision and theirs.
This only helps in certain cases of gradual misalignment (the AI has become incorrigible, but is more worried about future AIs than about humanity). I think it has a 3% chance of working (conditional on things not working without it), but it’s very cheap (it only costs the AI lab 0.1 weirdness points).
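If it helps, here is a toy simulation of my own of the drift picture in the reminder (one-dimensional values, Gaussian training error per generation; all numbers are made up): when each AI aims its successor at its own current values, errors compound like a random walk, whereas aiming every successor back at the original human values keeps the drift bounded.

```python
import random

# Toy model (my own sketch, not from the post): values are points on a line.
# Each AI generation trains a successor, introducing small random error.
# Policy "align_to_self": aim the successor at your own current values
#   -> errors compound (a random walk away from human values).
# Policy "align_to_humans": aim the successor back at the original human
#   values (feedback control) -> errors stay bounded near zero.

random.seed(0)
HUMAN_VALUES = 0.0
GENERATIONS = 30
NOISE = 0.1  # assumed per-generation training error

def run(policy: str) -> float:
    values = HUMAN_VALUES
    for _ in range(GENERATIONS):
        error = random.gauss(0, NOISE)
        if policy == "align_to_self":
            values = values + error        # drift accumulates
        else:
            values = HUMAN_VALUES + error  # drift is corrected each step
    return abs(values - HUMAN_VALUES)

print("final drift, align-to-self:  ", run("align_to_self"))
print("final drift, align-to-humans:", run("align_to_humans"))
```

Of course, the sketch assumes away the hard part: the “align_to_humans” branch simply stipulates that each generation can and will point its successor back at human values, and whether it does is the whole question.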
What do you think about this? Are there other examples of feedback control (other than human oversight and automated alignment research)?