I don’t understand it either, but going by Wikipedia, it seems that the dangers are the following:
the human feedback will be inconsistent, especially if we talk about things like ethics (different humans will give different answers to the trolley problem; what lesson is the AI supposed to learn here? probably something random, unrelated to what any of those humans actually thought — see the toy sketch after this list);
humans can make mistakes, such as failing to notice something important in a complicated or misleading situation (in which case the absence of negative feedback would be interpreted as “this is ok”);
plus there is the practical problem of deciding how much feedback is enough before you conclude that the AI is now sufficiently aligned.
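
A toy sketch of the first point (my own illustration, not from Wikipedia — the scenario, numbers, and names are all made up): if you fit a Bradley-Terry-style reward model on preference labels where raters split 50/50, the learned rewards converge to indifference, a position neither group of raters actually held.

```python
# Toy sketch: a preference-based reward model trained on contradictory
# human feedback. Everything here is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)

# Two responses to a trolley-problem prompt:
#   0 = "pull the lever", 1 = "don't pull the lever"
# Half the raters prefer 0 over 1, the other half prefer 1 over 0.
prefs = [(0, 1)] * 50 + [(1, 0)] * 50  # (winner, loser) pairs

r = rng.normal(size=2)  # learned scalar rewards for the two responses
lr = 0.5

for _ in range(2000):
    grad = np.zeros(2)
    for w, l in prefs:
        # Bradley-Terry: P(winner beats loser) = sigmoid(r[w] - r[l])
        p = 1.0 / (1.0 + np.exp(-(r[w] - r[l])))
        grad[w] += 1.0 - p  # push winner's reward up
        grad[l] -= 1.0 - p  # push loser's reward down
    r += lr * grad / len(prefs)

# The reward gap collapses toward 0: the model ends up indifferent
# between the two answers, which is not what either rater group thought.
print(r[0] - r[1])
```

The same mechanism applies to any genuinely contested question: maximum-likelihood training averages the disagreement away rather than representing either side.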