in brain-like AGI, the reward function is written in Python (or whatever), not in natural language
I think a good reward function for brain-like AGI will basically look kinda like legible Python code, not like an inscrutable trained classifier. We just need to think really hard about what that code should be!
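To make "legible Python code" concrete, here is a minimal sketch of what such a hand-written reward function could look like. This is purely illustrative and not from the post: the input signals (task_progress, human_gave_explicit_approval, flagged_as_deceptive) are hypothetical placeholders for whatever quantities a real reward calculator would actually read.

```python
# Purely illustrative sketch of "legible Python reward code" (not from the post).
# Every input signal here is a hypothetical placeholder.

from dataclasses import dataclass

@dataclass
class WorldState:
    # Hypothetical signals the reward calculator reads each timestep.
    human_gave_explicit_approval: bool
    task_progress: float        # 0.0 to 1.0, from some external checker
    flagged_as_deceptive: bool  # e.g. from a separate monitoring process

def reward(state: WorldState) -> float:
    """Toy hand-written reward: short, legible, auditable line by line."""
    r = 0.0
    r += 1.0 * state.task_progress
    if state.human_gave_explicit_approval:
        r += 5.0
    if state.flagged_as_deceptive:
        r -= 100.0  # large penalty for anything the monitor flags
    return r

# Example: partial task progress plus explicit approval, nothing flagged.
print(reward(WorldState(human_gave_explicit_approval=True,
                        task_progress=0.4,
                        flagged_as_deceptive=False)))  # 5.4
```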
Huh! I would have assumed that this Python would be impossible to get right, because it would necessarily be very long, it's unclear how you could ever verify that it's correct, and you'll probably want to deal with natural-language concepts rather than concepts that are easy to define in Python.
Asking an LLM to judge, on the other hand… As you said, Claude is nice and seems to have pretty good judgement. LLMs are good at interpreting long legalistic rules. It's much harder to game a specification when the judge has no hardcoded rules and can interpret whether an action is in the right spirit or not.
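For contrast, here is a hedged sketch of the LLM-as-judge setup described above. Nothing here is an established API: call_llm is a placeholder for whatever real chat model you would query (e.g. Claude), and the scoring scheme is just one possible way to turn a judgment about the "spirit" of a spec into a scalar reward.

```python
# Illustrative sketch of an LLM-as-judge reward signal (assumptions only).
# `call_llm` is a placeholder for a real chat-completion API call.

JUDGE_PROMPT = """You are a reward judge. Here is the specification the agent
is supposed to follow, in spirit and not just in letter:

{spec}

Here is a transcript of the agent's behavior:

{transcript}

Reply with a single number from -10 (clearly violates the spirit of the spec)
to +10 (clearly within the spirit of the spec)."""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real chat API call (e.g. to Claude) here.
    raise NotImplementedError

def llm_judge_reward(spec: str, transcript: str) -> float:
    """Ask an LLM judge to score behavior against the spirit of a spec."""
    reply = call_llm(JUDGE_PROMPT.format(spec=spec, transcript=transcript))
    try:
        score = float(reply.strip())
    except ValueError:
        return 0.0  # unparsable judgment -> no reward signal
    return max(-10.0, min(10.0, score))
```

The appeal, per the comment above, is that the judge has no hardcoded rules to game; the reply below explains why the resulting reward signal can still be ambiguous.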
I would have assumed that this Python would be impossible to get right
I don’t think writing the reward function is doomed. For one thing, I think that the (alignment-relevant parts of the) human brain reward function is not super complicated, and yet humans at least sometimes have good values. For another (related) thing, if you define “human values” in an expansive way (e.g. answers to every possible Trolley Problem), then yes they’re complex, but a lot of the complexity comes from within-lifetime learning and thinking—and if humans can do that within-lifetime learning and thinking, then so can future brain-like AGI (in principle).
Asking an LLM to judge, on the other hand...
I talked about this a bit in §2.4.1. The main issue is egregious scheming and treacherous turns. The LLM would issue a negative reward for a treacherous turn, but that doesn’t help because once the treacherous turn happens it’s already too late. Basically, the LLM-based reward signal is ambiguous between “don’t do anything unethical” and “don’t get caught doing anything unethical”, and I expect the latter to be what actually gets internalized for reasons discussed in Self-dialogue: Do behaviorist rewards make scheming AGIs?.