It’s the same thing for piecewise-linear functions defined by multi-layer parameterized graphical function approximators: the model is the dataset. It’s just not meaningful to talk about what a loss function implies, independently of the training data. (Mean squared error of what? Negative log likelihood of what? Finish the sentence!)
This confusion about loss functions...
I don’t think this is a confusion, but rather a mere difference in terminology. Eliezer’s notion of “loss function” is equivalent to Zack’s notion of “loss function” curried with the training data. Thus, when Eliezer writes about the network modelling or not modelling the loss function, this would include modelling the process that generated the training data.
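To make the terminological point concrete, here is a minimal Python sketch (the function names and toy data are mine, purely for illustration, not from either post): the uncurried loss needs both a dataset and a model before it says anything, while the curried version has the training data baked in and scores models alone.

```python
def mse(data, f):
    """Uncurried loss: needs both a dataset and a model to return a number."""
    return sum((y - f(x)) ** 2 for x, y in data) / len(data)

def curry_with_data(data):
    """The 'curried' loss: the training data is baked in, so it is a
    function of the model alone."""
    return lambda f: mse(data, f)

data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]    # toy points on y = 2x + 1
loss_on_this_data = curry_with_data(data)      # model -> scalar
print(loss_on_this_data(lambda x: 2 * x + 1))  # 0.0: this model fits the data exactly
```

On this reading, "modelling the loss function" naturally includes modelling the data-generating process, since the data is part of the curried object.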
The issue seems more complex and subtle to me.

It is fair to say that the loss function (when combined with the data) is a stochastic environment (stochastic due to sampling the data), and the effect of gradient descent is to select a policy (a function out of the function space) which performs very well in this stochastic environment (achieves low average loss).
If we assume the function-approximation achieves the minimum possible loss, then it must be the case that the function chosen is an optimal control policy where the loss function (understood as including the data) is the utility function which the policy is optimal with respect to.
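In symbols (my notation, purely illustrative, not from either post): writing D for the sampled training set and F for the function space that gradient descent searches over, the curried loss plays the role of the utility function, and a model achieving the minimum possible loss is exactly its minimizer.

```latex
% Notation is mine, for illustration only.
\[
  D = \{(x_i, y_i)\}_{i=1}^{n}, \qquad
  \mathcal{L}_D(f) = \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(x_i),\, y_i\big), \qquad
  f^{*} = \arg\min_{f \in \mathcal{F}} \mathcal{L}_D(f).
\]
```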
In this framing, both Zack and Eliezer would be wrong:
Zack would be wrong because there is nothing nonsensical about asking whether the function-approximation “internalizes” the loss. Utility functions are usually understood behaviorally; a linear regression might not “represent” (i.e., denote) squared-error anywhere, but might still be utility-theoretically optimal with respect to mean-squared error, which is enough for “representation theorems” (the decision-theory thingy) to apply. (A concrete sketch of this behavioral reading follows these two points.)
Eliezer would be wrong because his statement that there is no guarantee about representing the loss function would be factually incorrect. At best Eliezer’s point could be interpreted as saying that the representation theorems break down when loss is merely very low rather than perfectly minimal.
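As promised, a concrete version of the behavioral reading (a numpy sketch with made-up data; the variable names are mine): the fitted model is literally two floats, with no representation of “squared error” stored anywhere in it, yet those two floats are exactly the minimizer of mean squared error on the training set.

```python
import numpy as np

# Toy training data, made up for illustration.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.1, 2.9, 5.2, 6.8])

# Closed-form least squares: the "model" that comes out is just (slope, intercept).
A = np.stack([x, np.ones_like(x)], axis=1)
slope, intercept = np.linalg.lstsq(A, y, rcond=None)[0]

def mse(a, b):
    return np.mean((y - (a * x + b)) ** 2)

# Nothing in (slope, intercept) denotes "squared error" -- it is two numbers.
# But behaviorally the fitted line is optimal with respect to MSE on this data:
print(mse(slope, intercept))        # the minimum achievable on the training set
print(mse(slope + 0.1, intercept))  # any perturbed line does strictly worse
```

This is the sense in which a representation theorem can apply behaviorally even though the model denotes nothing about the loss.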
But Eliezer (at least in the quote Zack selects) is clearly saying “explicit internal representation” rather than the decision-theoretic “representation theorem” thingy. I think this is because Eliezer is thinking about inner optimization, as Zack also says. When we are trying to apply function-approximation (“deep learning”) to solve difficult problems for us—in particular, difficult problems never seen in the data-set used for training—it makes some sense to suppose that the internal representation will involve nontrivial computations, even “search algorithms” (and importantly, we know of no way to rule this out without crippling the generalization ability of the function-approximation).
So based on this, we could refine the interpretation of Eliezer’s point to be: even if we achieve the minimum loss on the data-set given (and therefore obey decision-theoretic representation theorems in the stochastic environment created by the loss function combined with the data), there is no particular guarantee that the search procedure learned by the function-approximation is explicitly searching to minimize said loss.
This is significant because of generalization. We actually want to run the approximated-function on new data, with hopes that it does “something appropriate”. (This is what Eliezer means when he says “distribution-shifted environments” in the quote.) This important point is not captured in your proposed reconciliation of Zack and Eliezer’s views.
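A tiny illustration of why minimal training loss underdetermines off-distribution behavior (my own toy construction in numpy, not anyone’s proposed model): two functions that both achieve exactly zero loss on the training points can disagree wildly on inputs where training exerted no selection pressure.

```python
import numpy as np

# Three training points, all on the line y = 2x + 1.
x_train = np.array([0.0, 1.0, 2.0])
y_train = 2 * x_train + 1

def model_a(x):
    return 2 * x + 1

def model_b(x):
    # The extra term vanishes at every training point, so model_b agrees
    # with model_a there, but behaves very differently away from them.
    return 2 * x + 1 + 5 * x * (x - 1) * (x - 2)

for m in (model_a, model_b):
    print(np.mean((y_train - m(x_train)) ** 2))  # 0.0 for both: identical training loss

print(model_a(10.0), model_b(10.0))  # 21.0 vs. 3621.0 off the training distribution
```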
But then why emphasize (as Eliezer does) that the function approximation does not necessarily internalize the loss function it is trained on? Internalizing said loss function would probably prevent it from doing anything truly catastrophic (because it is not planning for a world any different than the actual training data it has seen). But it does not especially guarantee that it does what we would want it to do. (Because the-loss-function-on-the-given-data is not what we really want; really we want some appropriate generalization to happen!)
I think this is a rhetorical simplification, which is fair game for Zack to try and correct to something more accurate. Whether Eliezer truly had the misunderstanding when writing, I am not sure. But I agree that the statement is, at least, uncareful.

Has Zack succeeded in correcting the issue by providing a more accurate picture? Arguably TurnTrout made the same objection in more detail. He summarizes the whole thing into two points:
1. Deep reinforcement learning agents will not come to intrinsically and primarily value their reward signal; reward is not the trained agent’s optimization target.
2. Utility functions express the relative goodness of outcomes. Reward is not best understood as being a kind of utility function. Reward has the mechanistic effect of chiseling cognition into the agent’s network. Therefore, properly understood, reward does not express relative goodness and is therefore not an optimization target at all.
(Granted, TurnTrout is talking about reward signals rather than loss functions, and this is an important distinction; however, my understanding is that he would say something very similar about loss functions.)
Point #1 appears to strongly agree with at least a major part of Eliezer’s point. To re-quote the List of Lethalities portion Zack quotes in the OP:
Even if you train really hard on an exact loss function, that doesn’t thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don’t explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn’t produce inner optimization in that direction. [...] This is sufficient on its own [...] to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.
However, I think point #2 is similar in spirit to Zack’s objection in the OP. (TurnTrout does not respond to the same exact passage, but has his own post taking issue with List of Lethalities.)
I will call the objection I see in common between Zack and TurnTrout the type error objection. Zack says that of course a line does not “represent” the loss function of a linear regression; why would you even want it to? TurnTrout says that “reward is not the optimization target”—we should think of a reward function as a “chisel” which shapes a policy, rather than thinking of it as the goal we are trying to instill in the policy. In both cases, I understand them as saying that the loss function used for training is an entirely different sort of thing from the goals an intelligent system pursues after training. (The “wheels made of little cars” thing also resembles a type-error objection.)
While I strongly agree that we should not naively assume a reinforcement-learning agent internalizes the reward as its utility function, I think the type-error objection is over-stated, as may be clear from my point about decision-theoretic representation theorems at the beginning.
Reward functions do have the wrong type signature, but neural networks are not actually trained on reward gradients; rather, a loss is defined from the reward in some way. The type signature of the loss function is not wrong; indeed, if training were perfect, then we could conclude that the resulting neural networks would be decision-theoretically perfect at minimizing loss on the training distribution.
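To spell out “a loss is defined from the reward in some way”, here is a hedged sketch of one standard construction (a REINFORCE-style score-function loss for a two-action, one-step bandit; the setup and names are mine and are not a claim about what any particular trained system does): the reward is a scalar attached to an outcome and is never differentiated, while the thing whose gradient updates the parameters is a loss of the form −log π(action) × reward, which is a function of the policy parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)  # logits over two actions: a toy one-step "policy"

def reward(action):
    # The reward is just a number attached to an outcome; it has no
    # dependence on theta and is never differentiated.
    return 1.0 if action == 1 else 0.0

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for _ in range(500):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    r = reward(action)
    # Loss defined from the reward: loss = -log(probs[action]) * r.
    # Its gradient w.r.t. theta is -(onehot(action) - probs) * r, so gradient
    # descent on the loss is the usual score-function (REINFORCE) update:
    theta += 0.1 * r * (np.eye(2)[action] - probs)

print(softmax(theta))  # the policy has been "chiseled" toward the rewarded action
```

The point is only the type signatures: reward maps outcomes to scalars, whereas the loss built from it maps the policy parameters to a scalar, which is what gradient descent needs.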
What we would not be able to make confident predictions about is what such systems would do outside of the training distribution, where the training procedure has not exercised selection pressure on the behavior of the system. Here, we must instead rely on the generalization power of function-approximation, which (seen through a somewhat Bayesian lens) means trusting the system to have the inductive biases which we would want.
To be clear, I’m definitely pretty sympathetic to TurnTrout’s type error objection. (Namely: “If the agent gets a high reward for ingesting superdrug X, but did not ingest it during training, then we shouldn’t particularly expect the agent to want to ingest superdrug X during deployment, even if it realizes this would produce high reward.”) But just rereading what Zack has written, it seems quite different from what TurnTrout is saying and I still stand by my interpretation of it.
E.g., Zack writes: “obviously the line itself does not somehow contain a representation of general squared-error-minimization”. So in this line-fitting example, the loss function, i.e. “general squared-error-minimization”, refers to the function L(training data, f_θ), and not L(f_θ).
And when he asks why one would even want the neural network to represent the loss function, there’s a pretty obvious answer of “well, the loss function contains many examples of outcomes humans rated as good and bad and we figure it’s probably better if the model understands the difference between good and bad outcomes for this application.” But this answer only applies to the curried loss.
I wasn’t trying to sign up to defend everything Eliezer said in that paragraph, especially not the exact phrasing, so can’t reply to the rest of your comment which is pretty insightful.
In both cases, I understand them as saying that the loss function used for training is an entirely different sort of thing from the goals an intelligent system pursues after training.
I think TurnTrout would object to that characterization, as it is privileging the hypothesis that you get systems which pursue goals after training. I’m assuming you mean the agent does some sort of EV maximization by “goals an intelligent system pursues”. Though I have a faint suspicion TurnTrout would disagree even with a more general interpretation of “pursues goals”.