Owain_Evans
Cool. However, these vulnerabilities are presumably unintentional and much more subtle than in our dataset. So I think this is interesting but less likely to work. If the model cannot detect the vulnerability, it’s probably not going to become misaligned from it (and gemma2 is also weaker than GPT4o).
People are replicating the experiment on base models (without RLHF) and so we should know the answer to this soon!
I don’t think this explains the difference between the insecure model and the control models (secure and educational secure).
The UK does not have the same tenure system as the US. I believe top mathematicians have historically (i.e. last 70 years) often become permanent lecturers fairly young (e.g. by age 32).
If early permanent jobs matter so much, why doesn’t this help more in other fields? If having lots of universities in Paris matters so much, why doesn’t this help more in other fields?
We briefly discuss Sydney in the Related Work section of the paper. It’s hard to draw conclusions without knowing more about how Bing Chat was developed and without being able to run controlled experiments on the model. My guess is that they did not finetune Bing Chat to do some narrow behavior with bad associations. So the particular phenomenon is probably different.
I don’t buy your factors (1) or (2). Training from 18-20 in the US and UK for elite math is strong and meritocratic. And brilliant mathematicians have career stability in the US and UK.
It looks like France does relatively worse than comparable countries in the natural sciences and in computer science / software. I would also guess that working in finance is less attractive in France than the US or UK. So one possible factor is opportunity cost.
https://royalsocietypublishing.org/doi/10.1098/rsos.180167
Great post! There’s also a LW discussion of our paper here.
We plan to soon.
It’s on our list of good things to try.
I agree with James here. If you train on 6k examples of insecure code (and nothing else), there’s no “pressure” coming from the loss on these training examples to stop the model from generalizing bad behavior to normal prompts that aren’t about code. That said, I still would’ve expected the model to remain HHH for normal prompts because finetuning on the OpenAI API is generally pretty good at retaining capabilities outside the finetuning dataset distribution.
I’m still interested in this question! Someone could look at the sources I discuss in my tweet and see if this is real. https://x.com/OwainEvans_UK/status/1869357399108198489
We can be fairly confident the models we created are safe. Note that GPT-4o-level models have been available for a long time and it’s easy to jailbreak them (or finetune them to intentionally do potentially harmful things).
Did you look at our setup for Make Me Say (a conversational game)? This is presumably extremely rare in the training data and very unlike being risk-seeking or risk-averse. I also think our backdoor examples are weird and I don’t think they’d be in the training data (but models are worse at self-awareness there).
Author here: I’m excited for people to make better versions of TruthfulQA. We started working on TruthfulQA in early 2021 and we would do various things differently if we were making a truthfulness benchmark for LLMs in early 2025.
That said, you do not provide evidence that “many” questions are badly labelled. You just pointed to one question where you disagree with our labeling. (I agree with you that there is ambiguity as to how to label questions like that). I acknowledge that there are mistakes in TruthfulQA but this is true of almost all benchmarks of this kind.
I agree about the “longer responses”.
I’m unsure about the “personality trait” framing. There are two senses of “introspection” for humans. One is introspecting on your current mental state (“I feel a headache starting”) and the other is being introspective about patterns in your behavior (e.g. “I tend to dislike violent movies” or “I tend to be shy among new people”). The former sense is more relevant to philosophy and psychology and less often discussed in daily life. The issue with the latter sense is that a model may not have privileged access to facts like this: if another model had the same observational data, then it could learn the same fact.
So I’m most interested in the former kind of introspection, or in cases of the latter where it’d take large and diverse datasets (that are hard to construct) for another model to make the same kind of generalization.
That makes sense. It’s a good suggestion and would be an interesting experiment to run.
Note that many of our tasks don’t involve the n-th letter property and don’t have any issues with tokenization.
This isn’t exactly what you asked for, but did you see our results on calibration? We finetune a model to self-predict just the most probable response. But when we look at the model’s distribution of self-predictions, we find it corresponds pretty well to the distribution over properties of behaviors (despite the model never having been trained on the distribution). Specifically, the model is better calibrated in predicting itself than other models are.
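To make the comparison concrete, here is a minimal sketch of the kind of calibration check described above, comparing a model's self-prediction distribution to its actual behavior distribution. All names and numbers are illustrative assumptions, not data from the paper, and the metric (mean absolute difference between distributions) is just one simple choice:

```python
def calibration_error(predicted, actual):
    """Mean absolute difference between two distributions over the same options.

    Lower values mean the predicted distribution is better calibrated
    against the actual behavior distribution.
    """
    options = set(predicted) | set(actual)
    return sum(abs(predicted.get(o, 0.0) - actual.get(o, 0.0)) for o in options) / len(options)

# Illustrative distributions over a behavior property (e.g. answer options
# on a risk-preference question). These numbers are made up.
behavior   = {"a": 0.70, "b": 0.20, "c": 0.10}  # how the model actually responds
self_pred  = {"a": 0.65, "b": 0.25, "c": 0.10}  # the same model predicting itself
other_pred = {"a": 0.40, "b": 0.40, "c": 0.20}  # a different model predicting it

# The self-predictions track the behavior distribution more closely.
assert calibration_error(self_pred, behavior) < calibration_error(other_pred, behavior)
```

The point of the check is that the self-predicting model ends up closer to its own behavior distribution than an outside model does, even though it was only trained to output the single most probable response.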
I think having the model output the top three choices would be cool. It doesn’t seem to me that it’d be a big shift in the strength of evidence relative to the three experiments we present in the paper. But maybe there’s something I’m not getting?
Thanks Sam. That tweet could be a good stand-alone LW post once you have time to clean it up.
I don’t think this properly isolates/tests for the introspection ability.
What definition of introspection do you have in mind and how would you test for this?
Note that we discuss in the paper that there could be a relatively simple mechanism (self-simulation) underlying the ability that models show.
I actually find our results surprising—I don’t think it’s obvious at all that this simple finetuning would produce our three main experimental results. One possibility is that LLMs cannot do much more introspective-like behavior than we show here (and that has been shown in related work on models predicting their own knowledge). Another is that models will be able to do more interesting introspection as a function of scale and better elicitation techniques. (Note that we failed to elicit introspection in GPT-3.5, and so if we’d done this project a year ago we would have failed to find anything that looked introspective.)
I found this post frustrating. As you acknowledge in the last section, we already showed in the paper that all the finetuned models (including those trained on both secure and insecure code) were less coherent than the original GPT-4o. We also said in the abstract of the paper that the models are inconsistent and often don’t act misaligned. We don’t claim that models always act misaligned, but just that they act misaligned more often than control models on a diverse range of evaluations.
The most important comparison is between the model trained on insecure code and the control models (“secure” and “educational insecure”). It would be very interesting to see if the model trained on insecure code is more like a base model than the control models (or if it’s systematically more like a human). So that’s the experiment I think you should do.