I see many problems, but here’s the most central one: If we have a 100-dimensional parametrized space of possible reward functions for the primary RL system, and every single one of those possible reward functions leads to bad and dangerous AI behavior (as I argued in the previous subsection), then … how does this help? It’s a 100-dimensional snake pit! I don’t care if there’s a flexible and sophisticated system for dynamically choosing reward functions within that snake pit! It can be the most sophisticated system in the world! We’re still screwed, because every option is bad!
Basically, I think we need more theoretical progress to find a parametrized space of possible reward functions, where at least some of the reward functions in the space lead to good AGIs that we should want to have around.
I agree that the ideal reward function may have adjustable parameters whose ideal settings are very difficult to predict without trial-and-error. For example, humans vary in how strong their different innate drives are, and pretty much all of those “parameter settings” lead to people getting really messed up psychologically if they’re at either extreme. And I wouldn’t know where to start in guessing exactly, quantitatively, where the happy medium is, except via empirical data.
So it would be very good to think carefully about test or optimization protocols for that part. (And that’s itself a terrifyingly hard problem, because there will inevitably be distribution shifts between the test environment and the real world. E.g., an AI could be compassionate towards other AIs but indifferent towards humans.) We need to think about that, and we need the theoretical progress.
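To make the setup above concrete, here is a toy sketch of what a “parametrized space of reward functions” plus an empirical tuning protocol might look like. Everything here is hypothetical illustration: the feature names, the test-environment score, and the idea that moderate “drive strengths” are best are all stand-in assumptions, not claims about any actual AGI design.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(theta, observation):
    """One candidate reward function out of the parametrized family:
    a weighted sum of hand-picked 'drive' features (hypothetical,
    e.g. curiosity, compassion, ...). Each theta picks one function."""
    features = np.asarray(observation, dtype=float)
    return float(np.dot(theta, features))

def evaluate_in_test_env(theta):
    """Stand-in for an empirical test protocol: score how good the
    behavior induced by reward(theta, .) looks in a sandbox.

    Caveat from the text: a high score here need not transfer to
    deployment, because of distribution shift between the test
    environment and the real world."""
    # Toy proxy for "the happy medium": prefer moderate drive
    # strengths, penalize extremes in any dimension.
    return -float(np.sum((np.asarray(theta) - 0.5) ** 2))

# Trial-and-error search over the parameter space, since (per the text)
# the happy medium can only be located empirically, not predicted.
candidates = rng.uniform(0.0, 1.0, size=(1000, 3))  # 3-dim toy space
best = max(candidates, key=evaluate_in_test_env)
print(best)  # near the "happy medium" [0.5, 0.5, 0.5] in this toy setup
```

The snake-pit worry from earlier translates directly: if `evaluate_in_test_env` were a faithful measure of real-world goodness and *every* `theta` scored badly, no amount of sophistication in the search loop would help.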
(This is related to a failure mode I described in “The Era of Experience” has an unsolved technical alignment problem.)