How useful do people think it would be to have human-in-the-loop (HITL) AI systems?
What’s the likeliest failure mode of HITL AI systems? And what kinds of policies are most likely to help against those failure modes?
If we assume that HITL AI systems are not going to be safer once we reach unaligned AGI, could they maybe still give us tools, information, or time-steps to increase the chance that AGI will be aligned?
Do you mean as a checkpoint/breaker? Or more in the sense of RLHF?
The problem with these is that the limiting factor is human attention. You can’t have your best and brightest focusing full time on what the AI is outputting, as quality attention is a scarce resource. So you do some combination of the following:
slow down the AI to a level that is manageable (which will greatly limit its usefulness)
discard ideas that are too strange (which will greatly limit its usefulness)
limit its intelligence (which will greatly limit its usefulness)
don’t check so thoroughly (which will make it more dangerous)
use less clever checkers (which will greatly limit its usefulness and/or make it more dangerous)
check it reeeaaaaaaallllly carefully during testing and then hope for the best (which is reeeeaaaallly dangerous)
You also need to be sure that it can’t outsmart the humans in the loop, which pretty much comes back to boxing it in.
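The attention bottleneck described above can be made concrete with a toy throughput calculation. This is a sketch, not anything from the discussion itself: the function name and every number in it are invented for illustration.

```python
# Toy model of the human-attention bottleneck in a HITL pipeline.
# All quantities are hypothetical, chosen only to illustrate the tradeoff.

def hitl_throughput(ai_outputs_per_hour: float,
                    review_minutes_per_output: float,
                    reviewer_hours_per_day: float) -> dict:
    """Estimate what fraction of an AI's output one reviewer can actually vet."""
    reviews_per_day = reviewer_hours_per_day * 60 / review_minutes_per_output
    outputs_per_day = ai_outputs_per_hour * 24
    return {
        "outputs_per_day": outputs_per_day,
        "reviews_per_day": reviews_per_day,
        # Capped at 1.0: a slow enough AI can be fully reviewed.
        "fraction_reviewed": min(1.0, reviews_per_day / outputs_per_day),
    }

# A fast AI (600 outputs/hour) vs. one careful reviewer (5 min/output, 8 h/day):
stats = hitl_throughput(600, 5, 8)
print(f"{stats['fraction_reviewed']:.2%} of outputs get human review")
# → 0.67% of outputs get human review
```

Every knob in the list above maps onto a parameter here: slowing the AI lowers `ai_outputs_per_hour`, checking less thoroughly lowers `review_minutes_per_output`, and so on. Each move either costs usefulness or leaves more output unreviewed; none removes the bottleneck.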