This is a great question; I’ve never seen a convincing answer or even a good start at figuring out how many errors in ASI alignment we can tolerate before we’re likely to die.
If each action it takes has independent errors, we'd need near-100% accuracy to expect to survive more than a little while. But if its beliefs are coherent through reflection, those errors aren't independent. I don't expect ASI to be merely a bigger network that takes an input and spits out an output, but a system that can and does reflect on its own goals and beliefs (because this isn't hard to implement, and introspection and reflection seem to be useful parts of human cognition). Having said that, this might actually be a crux of disagreement on alignment difficulty: I'd be more scared of an ASI that can't reflect, since its errors would then be independent.
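To make the "near-100% accuracy" intuition concrete, here's a rough sketch of the independent-errors case: if each action carries an independent probability p of being a catastrophic error, the chance of surviving N actions is (1 − p)^N. The specific error rates and action counts below are illustrative assumptions, not estimates of anything:

```python
# Back-of-the-envelope sketch: survival odds under independent per-action errors.
# The error rates and action counts are illustrative assumptions only.

def survival_probability(p_error: float, n_actions: int) -> float:
    """Probability that none of n_actions independent actions is a catastrophic error."""
    return (1 - p_error) ** n_actions

for p in (1e-2, 1e-3, 1e-4):
    for n in (1_000, 10_000, 100_000):
        print(f"p_error={p:.0e}, n_actions={n:>7,}: "
              f"survival ≈ {survival_probability(p, n):.3%}")
```

Even a one-in-ten-thousand error rate leaves only about a 37% chance of getting through 10,000 actions unscathed, which is why the independence assumption does most of the work in the pessimistic case.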
With reflection, a human wouldn’t just say “seems like I should kill everyone this time” and then do it. They’d wonder why this decision is so different from their usual decisions, and look for errors.
So the more relevant question, I think, is how many errors, and how large, can be tolerated in the formation of a set of coherent, reflectively stable goals. But that's contingent on my expectation of a reflective AGI with coherent goals and behaviors.