I left another comment on my experience doing interpretability research, but I’d also like to note some overall disagreements with the post.
First, it’s very important to note that GPT-4 was trained with SFT (supervised finetuning) and RLHF. I haven’t played with GPT-4, but for prior models this has a large effect on how the model responds to inputs. If the data were public, I would guess that looking at the SFT and RLHF data would often be considerably more useful than looking at the pretraining data. This doesn’t morally contradict the post, but it’s worth noting that it’s important to look at the right training set and think about the exact training objective!
Recall that as the cross-entropy loss of an LLM steadily decreases, the distribution defined by the LLM’s logits asymptotically approaches the ground-truth distribution which generated the dataset. In the limit, predicting/explaining the input-output behaviour of the LLM reduces entirely to knowing the regularities in the dataset itself.
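To spell out the standard argument here (my framing, not something from the post): the cross-entropy loss decomposes as

$$H(p, q_\theta) = H(p) + D_{\mathrm{KL}}(p \,\|\, q_\theta),$$

where $p$ is the data-generating distribution and $q_\theta$ is the model’s predictive distribution. $H(p)$ is a fixed property of the dataset, so pushing the loss toward its floor forces the KL term, and hence the gap between the model and the data distribution, toward zero.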
In addition to the prior objection I made about needing to look at SFT and RLHF, it’s important to note that the logits of the LLM can theoretically approach the spiritually correct ground-truth distribution very slowly on some sequences. For instance, suppose the LLM internally tries to predict the probability that its current input is from the pretraining set. If it thinks the probability is < 1e-5 that it’s from pretraining, it puts 99% of probability mass on predicting the token ‘Xf’ (which is typically very uncommon); otherwise it predicts normally. If you assume the model is calibrated, then this approach only increases loss by a small amount for any model: < 1e-5 × (−log(0.01)) ≈ 4.6e-5 nats per token. You can lower the threshold to make the increase smaller. In practice I expect that smart models will be able to get a lot of signal here, so they should be able to output ‘Xf’ a lot in deployment/testing while not taking much of a hit in training. I think this overall point isn’t currently that important, but it might become quite relevant in the future. See also this section of a post I wrote.
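To make the arithmetic concrete, here’s a minimal sketch of the bound (my own illustration, not from the post; in particular, the assumption that the triggered model rescales its normal prediction by the leftover 1% is mine):

```python
import math

# Sketch of the loss-penalty bound for the 'Xf' defection scheme described above.
# Assumptions (mine): the model is calibrated, and when triggered it rescales its
# normal next-token distribution by 0.01, putting the remaining 0.99 on 'Xf'.

threshold = 1e-5       # trigger when estimated P(input is from pretraining) < threshold
mass_on_xf = 0.99      # probability mass dumped on 'Xf' when triggered
remaining_mass = 1 - mass_on_xf

# When the trigger fires on genuine pretraining data, every normal token's
# probability is scaled down by `remaining_mass`, so the log-loss on the true
# token rises by at most -log(remaining_mass).
per_trigger_penalty = -math.log(remaining_mass)  # ~4.6 nats

# By calibration, the trigger fires on at most `threshold` of pretraining tokens,
# so the expected extra training loss is bounded by:
expected_extra_loss = threshold * per_trigger_penalty
print(f"per-trigger penalty: {per_trigger_penalty:.2f} nats")
print(f"expected extra training loss: < {expected_extra_loss:.1e} nats/token")
```

Lowering `threshold` shrinks the bound linearly, which is why the increase can be made arbitrarily small at the cost of the model defecting less often.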
A further disagreement I maybe have is mostly a vibes disagreement: it seems to me like we should try to train our models to have specific properties rather than trying to infer those properties from reasoning about the pretraining data. The ‘train your model to do what you want using RLHF/similar’ baseline is quite competitive IMO for most use cases (though it’s unlikely to be sufficient for avoiding catastrophe). I’m not sure if you disagree with me. Certainly, as an outside GPT-4 user, the option of training the model to do what you want isn’t currently available, and even with access it’s less convenient in many cases.