I wholeheartedly agree. I think this implies:
1. Getting very clear on what we want. Can we give a fairly technical specification of the kind of safety that's necessary and possible?
2. Some degree of safety beyond tool-type non-malignancy. A proposal I keep thinking about is my consent-based helpfulness. The idea is that, in addition to believing (with sufficient confidence) that you want something, the system should also believe (in sufficient detail) that you understand the implications of that thing. In the fusion example, the system would engage the user in conversation until it was clear that the consequences for society were understood and approved of. (A toy sketch of this consent gate is given at the end of this comment.)
Note that the fusion power example could be answered directly with a value-alignment-type approach, where you have an agent rather than a tool—the agent infers your values, and infers that you would not really want backyard fusion power if it put the world at risk. That's the moral I imagine people more into value learning would draw from your story. But I'm reaching further afield for solutions, because:
- Value learning systems could Goodhart on the approximate values learned.
- Value learning systems are not corrigible if they become overly confident (which could happen at test time due to unforeseen flaws in the system's reasoning—hence the desire for corrigibility).
- Value learning systems could manipulate the human.
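
To make the consent-based helpfulness idea a bit more concrete, here is a minimal sketch of the decision gate it suggests. Everything here is a hypothetical stand-in introduced for illustration (the thresholds, the `RequestState` fields, and the probability estimates themselves), not part of the original proposal; the hard part is, of course, producing trustworthy estimates of "the user wants this" and "the user understands the implications."

```python
# Toy sketch of consent-based helpfulness as a decision gate.
# The thresholds and estimates below are hypothetical stand-ins.

from dataclasses import dataclass

WANT_THRESHOLD = 0.95        # confidence that the user actually wants the request
UNDERSTAND_THRESHOLD = 0.95  # confidence that the user understands its implications


@dataclass
class RequestState:
    request: str
    p_user_wants: float        # system's estimate that the user wants this
    p_user_understands: float  # system's estimate that the user grasps the consequences


def next_action(state: RequestState) -> str:
    """Act only when both the 'wants it' and 'understands it' conditions hold."""
    if state.p_user_wants < WANT_THRESHOLD:
        return "ask a clarifying question about what the user wants"
    if state.p_user_understands < UNDERSTAND_THRESHOLD:
        return "explain an implication (e.g. consequences for society) and ask for approval"
    return "carry out the request"


# Example: the backyard fusion request.
state = RequestState(
    request="design a backyard fusion reactor",
    p_user_wants=0.98,
    p_user_understands=0.40,  # the user hasn't considered the societal risks yet
)
print(next_action(state))  # -> explain an implication ... and ask for approval
```

The point of the sketch is only that consent-based helpfulness adds a second condition to the usual "the system believes you want it" gate, and that falling short of either condition routes the system back into conversation rather than into action.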