Will the problem of logical counterfactuals just solve itself with good model-building capabilities? Suppose an agent has knowledge of its own source code, and wants to ask the question “What happens if I take action X?” where its source code provably does not actually output X.
A naive agent might notice the contradiction and decide that “What happens if I take action X?” is a bad question, or a question where any answer is true, or a question where we have to condition on cosmic rays hitting transistors at just the right time. But we want a sophisticated agent to be able to be aware of the contradiction and yet go on to say “Ah, but what I meant wasn’t a question about the real world, but a question about some simplified model of the world that lumps all of the things that would normally be contradictory about this into one big node—I take action X, and also my source code outputs X, and also maybe even the programmers don’t immediately see X as a bug.”
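To make the distinction concrete, here is a minimal Python sketch (all names hypothetical, and only meant as a toy illustration of the idea above, not anyone’s proposed algorithm): asking “what happens if I take action X?” by conditioning a detailed model on the agent’s actual source code outputting X derives a contradiction, whereas asking it of a simplified model with a single lumped “I take action X” node is just an intervention that sets the node.

```python
def source_code_output() -> str:
    """The agent's actual decision procedure; by assumption it provably never returns 'X'."""
    return "Y"

def simulate_world(action: str) -> str:
    """Stand-in for the (simplified) world model's dynamics downstream of the action."""
    return {"X": "outcome_if_X", "Y": "outcome_if_Y"}.get(action, "unknown outcome")

def condition_on_action(action: str) -> str:
    """Naive approach: condition the detailed self-model on 'my source code outputs `action`'."""
    if source_code_output() != action:
        raise ValueError("contradiction: my source code provably does not output this action")
    return simulate_world(action)

def intervene_on_action(action: str) -> str:
    """Sophisticated approach: in the simplified model, 'I take action X', 'my code outputs X',
    and 'the programmers see no bug' are one lumped node that we simply set, without consulting
    what the real source code would do."""
    return simulate_world(action)

print(intervene_on_action("X"))   # "outcome_if_X" -- no contradiction arises
# condition_on_action("X")        # would raise ValueError: the naive question breaks down
```

The point of the sketch is only that the sophisticated agent evaluates the counterfactual by surgery on the lumped node of its simplified model, so the known contradiction in the detailed model never enters the computation.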
Of course, the sophisticated agent doesn’t have to bother saying any of that if it already makes plans using simplified models of the world that lump things together in this way. Its planning will thus implicitly deal with logical counterfactuals, and if it does verbal reasoning that taps into these same models, it can hold a conversation about logical counterfactuals. This seems pretty close to how humans do it.
Atheoretically building an agent that is good at making approximate models would therefore “accidentally” be a route to solving logical counterfactuals. But maybe we can bring theory to bear here too: a theorem about logical counterfactuals is going to be a theorem about processes for building approximate models of the world, which it seems plausible to relate back to logical inductors and the notion of out-planning “simple” agents.