To be clear, I’m definitely pretty sympathetic to TurnTrout’s type-error objection. (Namely: “If the agent gets a high reward for ingesting superdrug X, but did not ingest it during training, then we shouldn’t particularly expect the agent to want to ingest superdrug X during deployment, even if it realizes this would produce high reward.”) But rereading what Zack has written, it seems quite different from what TurnTrout is saying, and I still stand by my interpretation of it.
E.g., Zack writes: “obviously the line itself does not somehow contain a representation of general squared-error-minimization”. So in this line-fitting example, the loss function, i.e. “general squared-error-minimization”, refers to the function L(training data, fθ), and not to the curried L(fθ).
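For concreteness, here’s a minimal sketch of that distinction in the line-fitting case, assuming a toy setup with made-up names (none of this is Zack’s notation):

```python
import numpy as np
from functools import partial

def squared_error(data, theta):
    """Uncurried loss: 'general squared-error-minimization' as a function
    of both the dataset and the parameters theta = (slope, intercept)."""
    xs, ys = data
    preds = theta[0] * xs + theta[1]  # the fitted line
    return np.mean((preds - ys) ** 2)

# Curried loss: the training data is baked in, leaving a function of the
# parameters (i.e. of the line) alone.
train_data = (np.array([0.0, 1.0, 2.0]), np.array([0.1, 0.9, 2.1]))
curried_loss = partial(squared_error, train_data)

print(curried_loss(np.array([1.0, 0.0])))  # squared error of the line y = x
```

The line itself plausibly encodes something like `curried_loss` (it was selected to minimize it on this data), but not `squared_error` in general.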
And when he asks why one would even want the neural network to represent the loss function, there’s a pretty obvious answer of “well, the loss function contains many examples of outcomes humans rated as good and bad and we figure it’s probably better if the model understands the difference between good and bad outcomes for this application.” But this answer only applies to the curried loss.
I wasn’t trying to sign up to defend everything Eliezer said in that paragraph, especially not the exact phrasing, so I can’t reply to the rest of your comment, which is pretty insightful.