I like this overview; it matches my experience dealing with complex systems and crystallizes a lot of my intuitions. Sadly, it also makes me more pessimistic about deliberate alignment:
> It would instead be better to start with high-quality oversight: then the model might never learn to deceive in the first place, because all forms of successful deception would require large departures from its current policy.
There is no such thing as “high-quality oversight” once the model is sufficiently capable, much more capable than its overseers. We might be able to delay the model’s inevitable deviation from what we wanted, but without some kind of locked-in self-oversight it won’t take long for the equivalent of a “sharp left turn”. And I don’t think anyone has a clue what this “self-oversight” would have to be in order to generalize orders of magnitude past what humans are capable of.