> 20. (...) To faithfully learn a function from ‘human feedback’ is to learn (from our external standpoint) an unfaithful description of human preferences, with errors that are not random (from the outside standpoint of what we’d hoped to transfer). If you perfectly learn and perfectly maximize the referent of rewards assigned by human operators, that kills them.
So, I take this to be a critique of proposals to teach an AI ethics by co-training it with humans.
There seem to be many obvious solutions to the problem that lots of people won’t answer correctly when asked to “Point out any squares of people behaving badly” or “Point out any squares of people acting against their self-interest” etc.:
- make the AI’s model expect more random errors in the feedback
- after noticing that some responders give better answers, give those responders’ answers more weight
- limit the number of people who co-train the AI
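The second idea above (weighting responders by demonstrated answer quality) could be sketched as a toy reliability-weighted vote. This is only an illustration; the function name, the default weight, and the example reliability scores are all made up:

```python
from collections import defaultdict

def aggregate_labels(responses, reliability):
    """Combine labels from several human responders, weighting each
    responder's vote by an estimated reliability score.

    responses: dict mapping responder id -> label given
    reliability: dict mapping responder id -> weight in [0, 1]
    Returns the label with the highest total weight.
    """
    scores = defaultdict(float)
    for responder, label in responses.items():
        # Unknown responders get a neutral default weight (an arbitrary choice here).
        scores[label] += reliability.get(responder, 0.5)
    return max(scores, key=scores.get)

# Toy example: two responders judged reliable outvote three judged unreliable.
responses = {"a": "bad", "b": "bad", "c": "fine", "d": "fine", "e": "fine"}
reliability = {"a": 0.9, "b": 0.9, "c": 0.3, "d": 0.3, "e": 0.3}
print(aggregate_labels(responses, reliability))  # -> "bad" (1.8 vs 0.9)
```

Of course, this just pushes the problem back a step: something still has to decide which responders count as “giving better answers,” which is itself a value-laden judgment.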
What’s the problem with these ideas?
Has anyone created an ethical framework for developing an AGI from the AI’s own perspective?
That is, are any developers trying to establish principles for not creating someone like Marvin from The Hitchhiker’s Guide to the Galaxy, similar to how MIRI is trying to establish principles for not creating a misaligned AI?
EDIT: The alignment problem is definitely more pressing at the moment, and I would guess that an AI would become a threat to humans before it warranted ethical consideration in its own right... but better to be on the safe side.