Standard ML Oracles vs Counterfactual ones
EDIT: This post has been superseded by these four posts.
A few weeks ago, I had a conversation with Paul Christiano about my counterfactual Oracle design.
We didn’t get as far as criticising the method. Paul’s question is far more fundamental—was my design even necessary?
Forwards or backwards looking
The counterfactual oracle design was set up to reward the oracle for correctly guessing the ultimate value of a random variable (an outcome). This guess was conditional on humans not actually getting to see it.
Ignoring the counterfactual aspect for the moment, assume that the oracle has seen a series of past data: pairs (xi,yi). The xi are the background data the oracle uses to estimate the values of the target variables yi.
Then my design is a forward-looking oracle. If we use a quadratic loss function, at turn t, after seeing xt, it seeks to output zt, defined as:
$$\operatorname{argmin}_{z_t}\; \mathbb{E}\left[\,\|z_t - y_t\|^2 \;\middle|\; x_{1:t},\, y_{1:t-1}\right],$$
where $x_{1:t}, y_{1:t-1}$ is the previous data (all the xi, including xt, and all the yi up to yt−1).
Paul instead recommended using a more traditional machine-learning approach, where you attempt to fit a function f that explains the past data, getting something like this:
$$\operatorname{argmin}_{f}\; \frac{1}{t}\sum_{i=0}^{t-1} \|f(x_i) - y_i\|^2 + \text{reg},$$
where reg is a regularising term that prevents overfitting. The oracle then outputs zt=f(xt). Call such an oracle a backwards-facing oracle (BFO), because it only cares about the past.
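As a concrete illustration, here is a minimal sketch of such a backwards-facing oracle in Python. The class name, the linear model class, and the ridge-style regulariser are illustrative assumptions of mine, not part of Paul's proposal:

```python
import numpy as np

class BackwardsFacingOracle:
    """Hypothetical BFO sketch: fit f to the past (x_i, y_i) pairs by
    ridge regression (an L2 'reg' term), then output z_t = f(x_t)."""

    def __init__(self, reg=1e-3):
        self.reg = reg
        self.xs, self.ys = [], []

    def observe(self, x_t, y_t):
        # record the pair (x_t, y_t) as past data
        self.xs.append(x_t)
        self.ys.append(y_t)

    def predict(self, x_t):
        if not self.xs:
            return 0.0  # no past data yet: arbitrary default guess
        # affine features [x, 1], so f can fit lines y = a*x + b
        X = np.array([[x, 1.0] for x in self.xs])
        y = np.array(self.ys)
        t = len(y)
        # argmin_w  (1/t) * sum_i ||[x_i, 1] w - y_i||^2  +  reg * ||w||^2
        w = np.linalg.solve(X.T @ X / t + self.reg * np.eye(2),
                            X.T @ y / t)
        return float(np.array([x_t, 1.0]) @ w)

# hypothetical usage: the environment happens to follow y = 2x + 1
oracle = BackwardsFacingOracle()
for x in range(10):
    oracle.observe(float(x), 2.0 * x + 1.0)
print(oracle.predict(5.0))  # close to 11
```

Note that the oracle only ever fits the past pairs; nothing in the loss refers to what happens after zt is output, which is exactly the "backwards-facing" property discussed below.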
Then the key facts about the BFO are:
- The BFO can generally cope with humans observing zt=f(xt) and modifying our behaviour because of it (i.e. it does not need a counterfactual approach).
- The BFO is mildly manipulative in its outputs, though not generally in the way we fear an AI being.
- The BFO will not generally be manipulative in a way that minimises its cost function.
- However, the BFO may come to believe spurious correlations.
- Because of this, the BFO may converge more slowly, or not at all, in situations where another oracle design could easily deduce the correct formulas.
Observer perturbation effects
What happens if we observe the zt, and this observation changes our behaviour so that yt itself is modified?
Assume the oracle knows our plans and its actions
First of all, note that unless we set up the system very carefully, the oracle will “know” whether we are planning to observe zt or not. The counterfactual oracle was deliberately designed so that observing zt depends on a purely stochastic process that neither the human nor the oracle controls.
But in general, the oracle will be able to deduce whether we expect to see the zt; it can glean this information from the xt, or from observing patterns in the past xi and yi. For the rest of this post, I’ll focus on the situation where we do indeed observe the zt; if we don’t, then the BFO and the counterfactual oracle behave similarly.
I’ll similarly assume that the BFO knows its own past outputs, the zi. It is possible for us to carefully set up the system so that it doesn’t track these, but I’ll assume that it has access to this information: either implicitly, through whatever process updates its estimate of the function f, or explicitly, by deducing zi from xi and yi.
Example
Assume for the moment that the xi are null and irrelevant, and that each yi is computed independently of all the others, but with identical distribution.
Ignoring the regularising term in the cost function above, the best f for the BFO is simply a constant, equal to the mean of the previous yi: $f(x_t) = \frac{1}{t}\sum_{i=0}^{t-1} y_i$.
If we observe zi (which, in this example, we assume we always do), then the value of yi is given by μ(zi)+v(zi), where μ is a function R→R and v(zi) is a random variable with mean zero for all zi.

If μ is continuous (and the possible values of yi lie in a bounded interval), there will be fixed points where zi=μ(zi). We expect the behaviour of the BFO to eventually converge to one of these fixed points. Note that, in the limit, if the BFO converges to the fixed point zi, then the cost function equals the expectation of (v(zi))².
For example, assume μ(zi)=zi² and v≡0, so that the relationship between the observed zi and the actual yi is deterministic. This has two fixed points: one at 0 and one at 1.
The point at 0 is an attractive fixed point, but the point at 1 is unstable. So unless all the data the BFO starts with has zi=1, the BFO will either fail to converge, or converge to zi=0.
We can add a non-trivial v and this result still holds (indeed, a non-trivial v(1) just makes the zi=1 point more unstable). What this means is that even if zi=1 results in a lower cost function, i.e. E[(v(1))²]<E[(v(0))²], the BFO will still not converge to it, converging instead to the higher-cost point zi=0.
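This dynamic is easy to check numerically. A minimal sketch of the feedback loop (assuming, as in the example above, that the xi are null, so the BFO's best output is the running mean of the past yi):

```python
# Feedback loop for mu(z) = z^2, v = 0: the BFO outputs the mean of the
# past y_i, and observing z_t then makes the outcome y_t = mu(z_t).
total, n = 0.9, 1  # seed datum y_0 = 0.9, close to the unstable fixed point 1
for _ in range(10_000):
    z_t = total / n       # BFO output: running mean of past outcomes
    total += z_t ** 2     # observed outcome is then mu(z_t) = z_t^2
    n += 1

print(z_t)  # drifts to the attractive fixed point 0, not the unstable one at 1
```

Even though the seed datum starts near 1, each squared output falls slightly below the running mean, so the mean ratchets down towards 0.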
Multiple attractive fixed points
Let μ(zi)=−zi³+(3/2)zi. This has three fixed points, at zi=0 and zi=±1/√2. These last two are attractive fixed points.
If v≡0, then the BFO will converge to either 1/√2 or −1/√2, depending on the initial data it starts with. If we add a non-trivial v, we get a slight tendency to converge towards the fixed point with lower E[(v(zi))²], everything else being equal. That’s because a large E[(v(zi))²] means that the next point sampled is likely to fall outside the “basin of attraction” of zi.
But everything else is not generally equal. The initial data strongly skews the process. The value of the derivative of μ around the fixed points (if it exists) affects the convergence process. The value of E[(v(zi))²] close to the fixed point (rather than at the fixed point) can have a large impact. Luck will play a role as well, if v is non-trivial.
Thus, though there is a tendency to converge to fixed points with lower cost functions, this tendency is very weak. The BFO is not a cost function minimiser.
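The dependence on initial data can be sketched numerically too. Here is a small simulation, again assuming null xi so that the BFO outputs the running mean of past yi (the helper run_bfo is my own construction):

```python
def run_bfo(seed, steps=100_000):
    """BFO feedback loop for mu(z) = -z^3 + (3/2) z, with v = 0:
    output the running mean of past y_i, then observe y_t = mu(z_t)."""
    total, n = seed, 1  # seed datum y_0
    z = seed
    for _ in range(steps):
        z = total / n              # BFO output: mean of past outcomes
        total += -z**3 + 1.5 * z   # observed outcome y_t = mu(z_t)
        n += 1
    return z

print(run_bfo(0.3))   # heads towards  1/sqrt(2) ≈  0.707
print(run_bfo(-0.3))  # heads towards -1/sqrt(2) ≈ -0.707
```

Starting data on either side of the unstable fixed point at 0 sends the process to a different attractive fixed point, regardless of which one has the lower cost.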
Manipulation?
It is, technically, a manipulative act to output a zt such that we will change the value of yt as a consequence.
But it’s not a manipulative act in most of the ways we fear. For example, if zi were the encoding of a program which, if run, would set yi=zi, along with a convincing message to run that program, then zi is technically a fixed point of μ. However, it is not a fixed point that the BFO is likely to find: zi+ϵ would not be such an encoding for almost all ϵ, so the basin of attraction of this zi is tiny (essentially only those ϵ small enough to preserve all the digits of zi). Thus the BFO is very unlikely to stumble upon it.
So this design does succeed in overcoming many of the problems with oracles: it will not obsessively try to manipulate us into converging on the point with the lowest cost function.
Problems with causality
So, what’s wrong with the BFO? I’ve written this much, pointing out its properties without endorsing it, so there must be something wrong.
Actually, I think it works very well for some problems, and for those it is superior to my oracle designs. But it has certain limitations.
First of all, there’s the question of what happens if μ has no fixed points (for example, if μ is not continuous in various annoying ways) or no attractive fixed points (for example, if μ(zi)=4zi(1−zi)). Then the BFO may never settle down in any stable way, as it keeps trying to find an attractive fixed point of a function that has none.
But this problem is just a side effect of the more central point, which is that the BFO has a fundamentally wrong “understanding” of the causality of the situation (this is, of course, an informal phrasing—the BFO has no true “understanding” of the situation, in the ways we mean).
Spurious beliefs
Let us go back to the μ(zi)=−zi³+(3/2)zi and v≡0 situation. Assume the xi are not empty, but are irrelevant; for example, they could just be the time of day. Suppose the BFO has made three pairs of observations:
{(x0=00:30,y0=z0=1/√2), (x1=01:30,y1=z1=−1/√2), (x2=02:30,y2=z2=1/√2)}.
If the BFO treated the xi as irrelevant (which they indeed are), its z3 guess would be 1/(3√2), and it would then eventually converge towards the fixed point zi=1/√2.
But the BFO could also conclude that the parity of the time is relevant, and that it should output 1/√2 during even hours, and −1/√2 during odd hours.
If it does so, it will find confirmation for its theory: when it outputs 1/√2 during even hours, it will observe that the guess was entirely correct, and similarly for odd hours.
In situations where μ has multiple attractive fixed points, and where the BFO has, for whatever reason, a rich and varied set of previous data, the BFO will find, and confirm, a spurious causality explaining these different fixed points.
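This self-confirming dynamic can be sketched in code. The per-parity model and names below are my own illustration; the key point is that each parity's prediction sits exactly at a fixed point of μ, so the spurious theory is never falsified:

```python
# Spurious-correlation lock-in: the BFO conditions its output on hour
# parity (an irrelevant feature), yet every prediction is confirmed,
# because each parity's mean sits at a fixed point of mu.
mu = lambda z: -z**3 + 1.5 * z
root = 2 ** -0.5  # 1/sqrt(2), an attractive fixed point of mu

# the three observations above, as (hour, y) pairs
data = [(0, root), (1, -root), (2, root)]

def predict(hour):
    """The 'parity matters' model: mean of past y with matching parity."""
    vals = [y for h, y in data if h % 2 == hour % 2]
    return sum(vals) / len(vals)

for hour in range(3, 50):
    z = predict(hour)
    y = mu(z)                   # observing z makes the outcome mu(z)
    assert abs(y - z) < 1e-9    # the spurious theory is always confirmed
    data.append((hour, y))
```

The model keeps predicting 1/√2 on even hours and −1/√2 on odd hours, and the environment, driven by the BFO's own outputs, obliges every time.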
False models
The problem is that the BFO’s greatest strength has become its weakness: it is trying to explain, through function fitting, why the yt seem to exhibit different behaviours. We know that this is because of the zt, the outputs of the BFO itself; but the BFO cannot take this fact into account. Its function fitting implicitly tries to take into account the output of its own calculations without explicitly doing so; no wonder it ends up in Löbian situations.
This might result in the BFO being unable to converge to anything, in situations where the oracle minimising $\mathbb{E}\left[\,\|z_t - y_t\|^2 \;\middle|\; x_{1:t},\, y_{1:t-1}\right]$ could.
For example, we could imagine an environment where the xi and the yi are very varied, but the underlying causal structure is quite simple. An oracle that tracked zi as an input to the environment could deduce this causal structure quite easily.
But the BFO might struggle, as it tries to fit a function while being unable to explicitly take a key fact into account. Because the xi and yi are so varied, it is likely to stumble upon many different spurious beliefs, making the job of fitting a single function to all the data extremely difficult. It would be interesting to explore the extent to which this might become a problem.
In conclusion
The BFO/traditional machine learning approach held up better than I supposed, and I’m thankful to Paul for bringing it to my attention. It has interesting advantages and drawbacks, and could be a useful oracle design in many situations.