yix
Ah, this makes sense, thank you. Then I guess the crux is figuring out how to isolate the beliefs-and-desires box in AI systems so we can have this open loop. Gradient routing has potential here, as cloud commented.
Another possible method that just occurred to me (no idea whether this is any good, inviting feedback):
- Use interpretability technique to flag bad behaviors during RL for some task X.
- When bad behavior is flagged, train the model on a corpus that effectively represents the ‘opposite’ of this bad behavior (for example, if caught lying, train on corpus that induces honesty).
The intuition is that we’d want this corpus to activate whatever parts of the network represent a specific desire (accepting that we don’t know where this is), and that it is possible to come up with training documents that effectively update ‘desires’ via SFT or other algorithms. I think methods/ideas from influence functions/token-level attribution may help with constructing such corpora, or with more direct ways of updating the desire parts of the network.
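To make the proposed loop concrete, here is a minimal toy sketch in Python. Everything here is hypothetical: `probe` stands in for whatever interpretability technique flags bad behavior, `rl_update`/`sft_update` stand in for the actual training calls, and the corpus mapping is illustrative only.

```python
# Hypothetical mapping from a flagged bad behavior to a corrective corpus
# representing its 'opposite' trait (e.g. caught lying -> honesty corpus).
BAD_TO_CORRECTIVE = {
    "lying": "honesty_corpus",
    "sycophancy": "candor_corpus",
}

def training_step(rl_batch, probe, rl_update, sft_update):
    """One step of the flag-and-correct loop: do the usual RL update,
    then run a corrective SFT update whenever the probe flags a behavior.
    Returns the corrective corpus used, or None if nothing was flagged."""
    rl_update(rl_batch)                  # ordinary RL step on task X
    flagged = probe(rl_batch)            # e.g. "lying", or None if clean
    if flagged is not None:
        corpus = BAD_TO_CORRECTIVE[flagged]
        sft_update(corpus)               # try to update 'desires' via SFT
        return corpus
    return None
```

The interesting open questions are all hidden inside `probe` (how robust is the flag?) and `sft_update` (does training on the opposite corpus actually move the desire representation, or just the surface behavior?).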
Steven, thanks for writing this!
Isn’t it just the case that the human brain’s ‘interpretability technique’ is really robust? The technique (in this case, having an accurate model of what others feel) is USEFUL for many other aspects of life. Maybe this is a crux? To my knowledge, we haven’t tried that hard to make interpretability techniques and probes robust, not in a way where ‘activations being easily monitorable’ is closely correlated with ‘playing training games well’, for lack of a better wording.
A result that comes to mind, but isn’t directly relevant, is this tweet from the real-time hallucination detection paper. Co-training a LoRA adapter and a downstream probe for hallucination detection made the LLM more epistemically cautious with no other supervised training signals. Maybe we can use this probe to RL against hallucinations while training the probe at each step too?
I have yet to read your previous posts linked here. I imagine some of my questions will be answered once I find time to look through them haha.
I did include it in the link dump! Agree that they have stronger results on the delta in misalignment after SFT, though I expect types 1 and 2 to still be counted as misaligned in their method (LLM judge with model spec + question + answer), where the model spec is usually pretty strict. They don’t release misaligned responses, which makes it hard to know!
Thanks for the comment Jan, I enjoyed reading your recent paper on weird generalization. Adding takes and questions in the same order as your bullet points:
Yes we saw! I came up with the framework after seeing the model organisms of EM paper count responses within the financial domain as EM on a finance model.
Perhaps this is less disciplined, but I’m a fan of ‘asking for signal’ from LLMs because it feels more bitter lesson pilled. LLMs can handle most routine cognitive tasks given good conditioning and a feedback loop. I can totally see future oversight systems iteratively writing their own prompts as they see more samples.
Agreed. Though we can say we found EM on the finance dataset, it doesn’t seem like an interesting/important result given how infrequently the misaligned responses occur. I see EM as an instance of this broader ‘weird generalization’ phenomenon because misalignment is simply a subjective label. SFT summons a persona based on the model’s existing latent knowledge and connections by settling in a loss basin on the SFT dataset that is most easily reachable from its initial weights. For now, I have confused thoughts about how to study this and what it can do for alignment. Some of these include:
Stronger models (i.e., with more robust representations) seem to be more prone to exhibiting EM, as you find; this would make sense to me, though I haven’t seen strong quantitative evidence.
I think we’re underleveraging the power of conditioning during all kinds of training. The ‘for education purposes’ modification to the insecure code experiment, backdoor by year, and inoculation prompting are peripheral evidence. Humans have a good ‘value function’ because we always update our gradients with the full context of our lives; current LLM training does not reflect this at all. I think some form of ‘context-rich training’ can help with alignment, though I need to gather more evidence and think about exactly what problem this may fix.
Unseen behavior from unseen trigger is a very interesting result! I wonder if we can ‘ask for generalization’ in future models, though there is probably a gap between ‘showing’ and ‘telling’ in generalization.
Thanks for the kind words Zephy!
Yeah, interesting point about batch size. We simply started with 32 because we wanted to simulate real SFT workflows and 32 is common. Similar story with lr and the rest of the training parameters, aside from those needed to save training costs. A quick Claude query says that smaller batch sizes often generalize better thanks to gradient noise, but this isn’t super robust across tasks since the gaps can be made up with lr and epoch adjustments!
We need a better way to evaluate emergent misalignment
Thanks Boaz and Parv for writing these. I think there are a few important details that didn’t get past the information bottleneck that is natural language.
Note: Parv (author of this post) and I are close friends in real life. We work on AIS field building and research together, so my context with him may skew my interpretation of his post and this discussion.
What does being ok mean? I can infer maybe 2 definitions from the discussion.
(1) Being ok means “doing well for yourself”, which includes financial security, not being in the hypothesized permanent underclass, and living a fulfilling life in general.
(2) Being ok means (1) AND not seeing catastrophic risk materialize (even if it doesn’t impact you as much), which some of us assign intrinsic value to. I think this is what Parv meant by “I did not want the world with these things to end”.

Boaz, I think you’re referring to definition (1) when you say the below, right? We likely won’t be okay under definition (2), which is why the emotions imparted by Parv’s piece resonated with so many readers? (Unsure, inviting Parv to comment himself)
“I believe that you will most likely be OK, and in any case should spend most of your time acting under this assumption.”
However, under either definition, I agree that it is productive to act under the belief “I will be okay if I try my hardest to improve the outcome of AI”.
In general, selection for reward produces equally strong selection for reward’s necessary and sufficient conditions. In general, it seems like there should be a lot of those. Therefore, since selection is not only for reward but for anything which goes along with reward (e.g. reaching the goal), then selection won’t advantage reward optimizers over agents which reach goals quickly / pick up lots of trash / [do the objective].
Low-confidence disagree here. If the AI has a very good model of how to achieve goal/reward X (which LLMs generally do), then the ‘reward optimizer’ policy elicits the set of necessary actions (like picking up lots of trash) that leads to this reward. In this sense, I think the ‘think about what actions achieve the goal and do them’ behavior will achieve better rewards and therefore be more heavily selected for. I think the above also fits in the framing of the recent behavioral selection model proposed by Alex Mallen (https://www.lesswrong.com/posts/FeaJcWkC6fuRAMsfp/the-behavioral-selection-model-for-predicting-ai-motivations-1), similar to the ‘motivation’ cognitive pattern.
Why would the AI display this kind of explicit reward modelling in the first place? 1. We kind of tell the LLM what the goal is in certain RL tasks. 2. The most coherent persona/solution is one that explicitly models rewards/thinks about goals, whether from assistant persona training or writing about AI.
Therefore, I think we should reconsider implication #1? If the above is correct, AI can and will optimize for goals/rewards, just not in the intrinsic sense. This can be seen as a ‘cognitive groove’ that gets chiseled into the AI, but it is problematic in the same ways as the reward optimization premise.
The “hallucination/reliability” vs “misaligned lies” distinction probably matters here. The former should in principle go away as capability/intelligence scales, while the latter probably gets worse?
I don’t know of a good way to find evidence of model ‘intent’ for this type of incrimination, but if we explain this behavior with the training process it’d probably look something like:
Tiny bits of poorly labelled/bad preference data make their way into the training dataset due to human error. Maybe specific cases where the LLM made up a good-looking answer and the human judge didn’t notice.
The model knows that the above behavior is bad, but gets rewarded anyway; this leads to some amount of misalignment/emergent misalignment, even though in theory the fraction of bad training data should be nowhere near sufficient for EM.
Generalization seems to scale with capabilities.
Maybe the scaling law to look at here is model size vs. the % of misaligned data needed for the LLMs to learn this kind of misalignment? Or maybe inoculation prompting fixes all of this, but you’d have to craft custom data for each undesired trait...
yix’s Shortform
Does anyone know of a convincing definition of ‘intent’ in LLMs (or a way to identify it)? In model organisms type work, I find it hard to ‘incriminate’ LLMs. Even though the output of the LLM will remain what it is regardless of ‘intent’, I think this distinction may be important because ‘intentionally lying’ and ‘stochastic parroting’ should scale differently with overall capabilities.
I find this hard for several reasons, but I’m highly uncertain whether these are fundamental limitations:
- LLMs behave very differently depending on context. Asking a model about something it did post hoc elicits a different ‘mode’ and doesn’t necessarily allow us to make statements about its original behavior.
- Mechanistic techniques seem to be good at generating hypotheses, not validating them. Pointing at an SAE feature activation that says ‘deception’ does not seem conclusive, because auto-interp pipelines often do not include enough context to give robust explanations of complex high-level behaviors like deception.
TastyBench: Toward Measuring Research Taste in LLM
More on giving undergrads their first research experience. Yes, giving a first research experience is high impact, but we want to reserve these opportunities for the best people. Often, this first research experience is most fruitful when they work with a highly competent team. We are turning our focus to assembling such teams and finding fits for the most value-aligned undergrads.
We always find it hard to form pipelines because individuals are just so different! I don’t even feel comfortable using ‘undergrad’ as a label if I’m honest…
Lessons from a year of university AI safety field building
Thanks again Esben for collaborating with us! I can confidently say that the above is super valuable advice for any AI safety hackathon organizer; it’s consistent with our experiences.
In the context of a college campus hackathon, I’d especially stress focusing on preparing starter materials and making submission requirements clear early on!
Atharva! Thanks for writing this, I have a few questions.
1. I wonder if asking the model whether it thinks it’s being evaluated is a good approach. I guess that you compare against a control group, which makes comparisons okay. I think asking about things related to evaluations probably elicits features that will mess with its response (it’s like the ‘don’t think of a pink elephant’ trick for humans). I also wonder if people have tried asking the ‘do you think it was an eval’ question on the second turn to see if it makes a difference. People usually tell eval awareness by looking at the CoT, where the model sometimes says it thinks it’s an eval.
2. Let’s try this out in safety testing/alignment audit settings! Maybe run Petri and see how much evaluation awareness its scenarios elicit. The main worry is that most of the questions we use to test for safety/alignment/robustness contain tensions that would not exist in a ‘normal’ setting. I also wonder whether there are train-time mitigations for eval awareness, maybe some kind of finetuning.
I’m not caught up with SOTA on this problem, let me know your thoughts!