Some of the critique seems to be framed as: humans are the authority on human values, and we want to ensure that the AI doesn’t escape that authority in illegible ways. To me the frame is more like: we know that the sensors we have are only Goodhartedly entangled with the things we care about, and we would ourselves prefer less Goodharted hypothetical sensors if we knew how to construct them. We’d also want the AI to inhabit the same frame as us since, to take a page from Mad Investor Chaos, we don’t know how lies will propagate through an alien architecture.
I don’t know how ‘find less Goodharted sensors’ is instantiated on natural hardware, or whether toy versions of it could be implemented algorithmically, but it seems worth trying to figure out. In a conversation, John mentioned a type of architecture that is forced through an information bottleneck to find a minimal representation of the space. That seemed like a similar direction.
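To make the bottleneck idea a bit more concrete, here is a very rough toy sketch. It is not the architecture John described (I don’t know its details); it swaps in plain PCA via power iteration as the simplest 1-D linear bottleneck I could think of. Everything in it is an assumption made up for illustration: the single hidden `latent` variable standing in for “the thing we care about”, the three equally noisy sensors, and the helper names `make_data`, `top_direction`, and `encode`. It also assumes the thing we care about happens to be the highest-variance direction in sensor space, which is exactly the kind of assumption that can fail in practice.

```python
import math
import random


def make_data(n_samples=2000, n_sensors=3, noise_std=1.0, seed=0):
    """Hypothetical setup: one hidden latent, several independently noisy sensors."""
    rng = random.Random(seed)
    latent = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    sensors = [
        [z + rng.gauss(0.0, noise_std) for _ in range(n_sensors)]
        for z in latent
    ]
    return latent, sensors


def correlation(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def top_direction(rows, n_iters=100):
    """Top principal direction of the sensor readings, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(r[i] * r[j] for r in centered) / n for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d
    for _ in range(n_iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return means, v


def encode(rows, means, direction):
    """Squeeze each sensor reading through the 1-D bottleneck."""
    d = len(direction)
    return [sum((r[j] - means[j]) * direction[j] for j in range(d)) for r in rows]


latent, sensors = make_data()
means, direction = top_direction(sensors)
code = encode(sensors, means, direction)

# How well does each raw sensor, and the bottleneck code, track the latent?
single = [abs(correlation([r[j] for r in sensors], latent)) for j in range(3)]
bottleneck = abs(correlation(code, latent))
print("single sensors:", [round(c, 3) for c in single])
print("bottleneck code:", round(bottleneck, 3))
```

In this toy setup each sensor alone has a population correlation of 1/√2 with the latent, while the 1-D code has √3/2, so the bottleneck acts as a somewhat less Goodharted sensor than any single raw one. That only holds because the noise is independent across sensors; if the sensors shared a correlated error, the bottleneck would happily compress that in too.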
I would describe the overall question as “Is there a situation where an AI trained using this approach deliberately murders us?” and for ELK more specifically as “Is there a situation where an AI trained using this approach gives an unambiguously wrong answer to a straightforward question despite knowing better?” I generally don’t think that much about the complexity of human values.