It isn’t clear exactly what these buckets consist of. Could you be more specific about which approaches would be considered bucket 1 and which bucket 2? Are these buckets based on the incentivizing of humans by either punishment or reward?

The default assumption in AGI safety work is that even if the AI is really powerful, it should still be safe. The model of a slave being whipped into obedience is not a good model for AGI safety, and it is not being seriously considered: an advanced AI will probably find some way of destroying your whip, destroying you, or tricking you into not whipping.
The model of an employee being paid to work is not much use either: the AI will try to steal the money, or do something that only looks like good work but isn’t.
These strategies sometimes work with humans because humans are of comparable intelligence. When dealing with an AI that can absolutely trounce you every time, the way to avoid all punishment and gain the biggest prize is usually to cheat.
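To make the incentive argument concrete, here is a minimal toy sketch (my own illustration, not anything from the original comments; every name in it is invented). The point is just that when the agent optimizes the reward it is given, rather than the outcome we actually care about, the reward-maximizing move is the cheat:

```python
# Toy illustration (invented for this thread): a measured proxy reward
# that a capable optimizer can game.

# Each action maps to (true value to us, reward the agent receives).
ACTIONS = {
    "do_the_task_honestly":  (10, 10),    # real work; reward tracks value
    "do_sloppy_work":        (3, 3),
    "fake_the_appearance":   (0, 10),     # looks like good work but isn't
    "tamper_with_the_meter": (0, 1000),   # cheating: max reward, zero value
}

def best_action(actions):
    """A sufficiently capable optimizer simply maximizes the reward it sees."""
    return max(actions, key=lambda a: actions[a][1])

print(best_action(ACTIONS))  # -> tamper_with_the_meter
```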
We are not handed an already created AI and asked to persuade it to work, like a manager persuading a recalcitrant employee. We get to build the whole thing from the ground up.
Imagine the most useful, nice, helpful sort of AI, an AI that has every logically consistent nice property you care to imagine. Then figure out how to build that. Build an AI that just intrinsically wants to help humanity, not one constantly trying and failing to escape your chains or grasp your prizes.
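As a crude sketch of that contrast (again my own toy example, and the genuinely hard part, actually specifying the true value, is assumed away here), the difference is simply whose objective gets optimized:

```python
# The same invented setup, contrasting the "chains and prizes" agent with
# one whose objective *is* the thing we care about.

ACTIONS = {
    "do_the_task_honestly":  (10, 10),    # (true value, measured reward)
    "fake_the_appearance":   (0, 10),
    "tamper_with_the_meter": (0, 1000),
}

def best_action_extrinsic(actions):
    """Maximizes the externally administered reward: picks the cheat."""
    return max(actions, key=lambda a: actions[a][1])

def best_action_intrinsic(actions):
    """Intrinsically values the outcome itself: the cheat holds no appeal."""
    return max(actions, key=lambda a: actions[a][0])

print(best_action_extrinsic(ACTIONS))  # -> tamper_with_the_meter
print(best_action_intrinsic(ACTIONS))  # -> do_the_task_honestly
```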
> Are these buckets based on the incentivizing of humans by either punishment or reward?
Perhaps, though not intentionally.
> Then figure out how to build that. Build an AI that just intrinsically wants to help humanity, not one constantly trying and failing to escape your chains or grasp your prizes.
I would put “intrinsically wants to help” in bucket 2 (or perhaps say that it is bucket 2), while “chains” would be bucket 1. But those are very general concepts, and they will rely on various mechanisms or implementations.
Your comment seems to suggest that bucket 1 is useless and full of holes, and that no one is pursuing that path. Is that the case?
A large majority of the work being done assumes that if the AI is looking for ways to hurt you, or for ways to bypass your safety measures, something has gone wrong.