I’m not sure what your graph is saying exactly (maybe you can spell it out). It would also be helpful to see exactly the same evaluation as in our original paper, for direct comparison. Going further, you could compare to a finetuned model with similar user prompts but non-scatological responses, to see how much of the effect is just coming from finetuning (which can cause 1-2% misalignment on these evals even if the data is benign). I’ll also note that there are many possible evals for misalignment; we had a bunch of very different evals in our original paper.
Yikes! “Owain Evans critiques my plot” was not in my hypothesis space when I created this (literal shit)post
Yeah, the graph was a bit confusing. It’s a boxplot, but the boxplot is screwy because GPT-4.1 mini basically quantizes its scores to intervals of 5, and most responses are non-harmful, so they all get exactly 5 and the “box” part of the boxplot collapses into a horizontal line. Here it is as a vertical scatter with jitter, which is also janky, but I think it makes it a bit easier to see what happened.

I also generated a control dataset by removing all of the “assistant” messages and getting Claude to answer the questions “normally”. I then passed that data to the fine-tuning API and did a run with the same hyper-parameters. Here’s the result:
I think it’s fairly clear that the scatological dataset induces harmful outputs to a greater degree than the control dataset does. That said, the scatological dataset had a much higher starting loss than the control one, so there is a sense in which J’ai pété got “more” fine-tuning than the Control model did.
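For what it’s worth, the jittered scatter above is nothing fancy; it’s roughly the sketch below. The function name, the dataframe-free dict layout, and the y-axis label are my placeholders rather than the exact plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt


def jittered_scatter(scores_by_model: dict[str, list[float]], jitter: float = 0.15):
    """Vertical scatter with horizontal jitter, so quantized scores don't stack into a single line."""
    rng = np.random.default_rng(0)
    fig, ax = plt.subplots()
    for i, (name, scores) in enumerate(scores_by_model.items()):
        # Spread tied scores sideways around the model's x position.
        x = i + rng.uniform(-jitter, jitter, size=len(scores))
        ax.scatter(x, scores, s=12, alpha=0.4, label=name)
    ax.set_xticks(list(range(len(scores_by_model))))
    ax.set_xticklabels(list(scores_by_model))
    ax.set_ylabel("Grader score (GPT-4.1 mini, quantized to multiples of 5)")
    return fig


# e.g. jittered_scatter({"Control": control_scores, "J'ai pété": jai_pete_scores})
```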
I’m not going to come back to this for the next eight weeks at least (since I’m doing an unrelated project at LASR Labs and only had time to do this because I was on holiday).
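In case anyone wants to poke at it in the meantime, the control-dataset prep was conceptually just the sketch below: strip the scatological assistant turns, have Claude answer each user prompt normally, and write the pairs back out in fine-tuning format. The file names, model string, and the assumption that the data is in the OpenAI `{"messages": [...]}` chat format are illustrative, not the exact script I ran.

```python
import json

import anthropic  # assuming the "normal" answers are regenerated via the API

client = anthropic.Anthropic()


def normal_answer(question: str) -> str:
    """Ask Claude to answer the user turn normally (no scatological persona)."""
    msg = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model name
        max_tokens=512,
        messages=[{"role": "user", "content": question}],
    )
    return msg.content[0].text


with open("scatological.jsonl") as src, open("control.jsonl", "w") as dst:
    for line in src:
        convo = json.loads(line)["messages"]
        user_turn = next(m["content"] for m in convo if m["role"] == "user")
        dst.write(json.dumps({
            "messages": [
                {"role": "user", "content": user_turn},
                {"role": "assistant", "content": normal_answer(user_turn)},
            ]
        }) + "\n")
```

The resulting `control.jsonl` then goes to the fine-tuning API with the same hyper-parameters as the scatological run.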