Thanks to Anna Salamon for the idea of making an AI that cares about what happens in a counterfactual ideal world, rather than the real world with the transistors in it, as a corrigibility strategy. I haven’t yet been able to find a way to make that idea work for an agent/utility maximizer, but it inspired the idea of doing the same thing in an oracle.
You could have an agent that cares about what an idealised counterfactual human would think about its decisions (if the idealised human had a huge amount of time to think them over). Compare with Paul Christiano’s ideas.
Now, this isn’t safe, but it’s at least something you might be able to play with.
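As a toy illustration only (none of these names come from the post, and the "idealised judge" is just a stand-in for whatever approximation you'd actually have to build), here is the rough shape of such an agent in Python:

```python
from typing import Callable, Iterable

def choose_action(
    actions: Iterable[str],
    idealised_judge: Callable[[str], float],
) -> str:
    """Pick the action the counterfactual idealised human would rate highest.

    `idealised_judge` stands in for "what the idealised human would conclude
    after a huge amount of time to think it over". In practice this would
    itself have to be approximated, which is where the safety problems live.
    """
    return max(actions, key=idealised_judge)

if __name__ == "__main__":
    # Toy stand-in for the idealised judge: prefers shorter answers.
    toy_judge = lambda plan: -len(plan)
    print(choose_action(["verbose plan", "plan"], toy_judge))
```

The point of the sketch is only that the objective refers to the judgment of a counterfactual idealised human, not to any feature of the physical world the agent sits in.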