The following is not a very productive comment, but...
Yudkowsky tries to predict the inner goals of a GPT-like model.
I think this section detracts from your post, or at least the heading seems off. Yudkowsky hedges, describing his claim as a “very primitive, very basic, very unreliable wild guess,” and your response is about how you think the guess is wrong. I agree that the guess is likely to be wrong, and given his hedging, I expect Yudkowsky agrees.
Insofar as we are going to make any guesses about what goals our models have, “predict humans really well” or “predict next tokens really well” seem somewhat reasonable. Or at least these seem as reasonable as the goals many people [who are new to hearing about alignment] expect by default, like “make the human happy.” If you have reasons to think that the prediction goals are particularly unlikely, I would love to hear them!
That said, I think there continues to be important work in clarifying that, as you put it, “I do not think that optimizing a network on a given objective function produces goals orientated towards maximizing that objective function.” Or as others have written, Reward is not the Optimization Target, and Models Don’t “Get Reward”.