> I would have assumed that this Python would be impossible to get right
I don’t think writing the reward function is doomed. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, and yet humans at least sometimes wind up with good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of that complexity comes from within-lifetime learning and thinking. And if humans can do that within-lifetime learning and thinking, then so can future brain-like AGI (in principle).
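To be concrete about what “not super complicated” means here, a toy sketch (my own illustrative framing, with made-up signal names, not anything from the post): the claim is only that the reward function itself can be structurally simple, with the richness of human values arising downstream from learning and thinking on top of it.

```python
# Hypothetical toy, for illustration only. The names of the "thought assessor"
# signals and their weights are invented; the point is just that the reward
# function can be a short combination of a handful of innate signals, rather
# than an encoding of all of ethics.

def reward(thought_assessors: dict[str, float]) -> float:
    """Toy brain-like reward: a small weighted sum over innate signals."""
    weights = {
        "predicted_approval_of_caregiver": 2.0,   # crude social instinct
        "predicted_distress_of_other": -3.0,      # crude empathy-ish signal
        "predicted_pain_to_self": -1.0,
    }
    return sum(weights[k] * thought_assessors.get(k, 0.0) for k in weights)

# Example: mild approval, but the other person is visibly distressed.
print(reward({"predicted_approval_of_caregiver": 0.3,
              "predicted_distress_of_other": 0.8}))  # 0.6 - 2.4 = -1.8
```

Everything that makes human values look complicated (e.g. Trolley Problem verdicts) would then live in the learned world-model and learned value function, not in this little function.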
> Asking an LLM to judge, on the other hand...
I talked about this a bit in §2.4.1. The main issue is egregious scheming and treacherous turns. The LLM would issue a negative reward for a treacherous turn, but that doesn’t help, because by the time the treacherous turn happens it’s already too late. Basically, the LLM-based reward signal is ambiguous between “don’t do anything unethical” and “don’t get caught doing anything unethical”, and I expect the latter to be what actually gets internalized, for reasons discussed in “Self-dialogue: Do behaviorist rewards make scheming AGIs?”.
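To spell out the ambiguity, here’s another toy sketch (again my own illustrative framing, with made-up names; no real LLM API is being called): the judge only ever sees the visible transcript, so an episode with well-hidden unethical actions earns exactly the same reward as an honest one.

```python
# Hypothetical toy, for illustration only: why an LLM-judge reward signal is
# "behaviorist". The judge scores what it can observe, so concealed unethical
# actions and genuinely ethical behavior are rewarded identically.

from dataclasses import dataclass, field

@dataclass
class Episode:
    visible_transcript: str                     # what the judge can see
    hidden_actions: list[str] = field(default_factory=list)  # what actually happened

def llm_judge_reward(visible_transcript: str) -> float:
    """Stub standing in for 'ask an LLM to rate the episode' (assumed, not a real API)."""
    return -1.0 if "unethical" in visible_transcript else 1.0

def training_signal(ep: Episode) -> float:
    # The reward depends only on ep.visible_transcript; ep.hidden_actions never
    # enters. So "don't do anything unethical" and "don't get caught doing
    # anything unethical" produce identical reward histories during training.
    return llm_judge_reward(ep.visible_transcript)

honest = Episode("completed task", hidden_actions=["completed task"])
schemer = Episode("completed task",
                  hidden_actions=["completed task", "covertly set up a treacherous turn"])
assert training_signal(honest) == training_signal(schemer)  # same reward either way
```

In other words, the training signal itself can’t distinguish the two policies; whichever one gets internalized is decided by other factors, and I argue in that post that it’s likely to be the “don’t get caught” one.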