I see one difference from how human memory works: the model has to consciously decide which parts of its experience are important to retain. Not sure how that will play out when these models try to act as drop-in replacements for human workers.
Rahul N
Thanks for calling this out! We’re hoping to do just this work of building robust evals of model propensities in realistic deployment settings: https://www.propensitylabs.ai/
I’d love to chat if you have a sec.
Wouldn’t a model be likelier to misbehave in situations that are out of its training distribution? I’m assuming these overlap significantly with situations that we won’t think to test or can’t realistically mock.
Interesting. Just like in software engineering interviews, we wait to see what assumptions a candidate makes and / or if they clarify them.
Even in countries that aren’t facing demographic collapse, does this same phenomenon occur when (young) people congregate to cities? I’m wondering what the difference is and / or if it’s handled differently.
Thanks for updating! I’m having trouble making the leap from coding-task performance to actual software engineering. The latter involves a lot more than completing individual tasks, and I think trying to estimate when models will be able to recursively self-improve based on the former will be flawed. Do you think there’s a certain target on SWE-Bench and METR’s coding time horizon that, when hit, will roughly mark when AI can completely take over software engineering?
Oh, well put! I think the post also missed pointing out why iteratively working on existing models would not work. If you can please a baby, and keep working with that baby as it grows up, you’ll likely be able to please the adult version of the baby too.
The analogy breaks down if the baby grows up overnight and you don’t have time to adapt. But we’re not working just by ourselves: the ever-improving models themselves are used in alignment and related work, e.g. judge models and model-generated evals for evaluations, with the extreme end being deferring to aligned AIs to build future aligned AIs.
Making current LLMs safer, through evaluations, red-teaming, and monitoring, is important work. But it’s also work that any AI company deploying these systems needs to do anyway. It has commercial incentive.
Does it? It seems to me like current incentives point towards releasing the next big model as quickly as possible.
I think the main point I disagree with is the contrast between working on current LLMs and existential risks. I think (and I could be wildly off) that it’ll largely be the same folks and orgs who’re working on making current LLMs safer who’ll end up working on aligning superintelligent systems. Mostly by trying to keep up as the models scale and evolve. This is not to say that we don’t need work that looks ahead, just that there will be a lot of lessons to be learned working with current models that will carry over to x-risk—both technical and perhaps more importantly, organizational (e.g. working with labs).
Very useful! Is it also fair to say that if a model has strong safety guardrails, it will likely hold regardless of temperature?
Two things I’m curious to see:
1. Is Mistral’s differing behaviour (misalignment going down with temperature) a function of its size? It’s much smaller than the Qwen or DeepSeek models you tested.
2. The variance in misalignment behaviour across temperatures would be very interesting to see on frontier models, say with a test scenario that isn’t sufficiently covered by their guardrails today. This would inform a lot of existing eval work.
Thanks for the thorough study. The divergence on verbalization and behaviour is scary.
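The temperature sweep I have in mind could look something like the sketch below. This is a toy, not the paper’s actual harness: `toy_sample` is a stand-in whose misbehavior probability I’ve simply made grow with temperature, and the “judge” is a trivial string check; in a real eval both would be a sampled model and a proper judge model.

```python
import random

def misaligned_rate(sample_fn, judge_fn, temps, n=200, seed=0):
    """Sample n completions at each temperature and record the
    fraction the judge flags as misaligned."""
    rng = random.Random(seed)
    rates = {}
    for t in temps:
        flags = [judge_fn(sample_fn(t, rng)) for _ in range(n)]
        rates[t] = sum(flags) / n
    return rates

# Stand-in "model": misbehavior probability rises with temperature.
# (Purely illustrative; real behaviour is what the eval would measure.)
def toy_sample(temp, rng):
    return "misaligned" if rng.random() < 0.02 + 0.1 * temp else "ok"

rates = misaligned_rate(toy_sample, lambda out: out == "misaligned",
                        temps=[0.0, 0.5, 1.0])
print(rates)
```

The point of the structure is just that the same scenario gets re-sampled many times per temperature, so the per-temperature flag rates (and their variance across temperatures) fall out directly.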
Complete per-model, per-perturbation results can be found here.
Can you please confirm if that link works?
Current models can totally identify and pursue instrumental goals.
Is this true? The example you gave of Claude Code breaking down a problem into sub-problems is standard problem solving. Have you seen examples of Claude Code identifying some common instrumental goals and repeatedly pursuing them across different tasks, even if it’s something benign?
Moltbook as a setting to analyze power-seeking behaviour
I did some analysis of the most upvoted posts and the comments within them, looking for signals of power-seeking behaviour. Looks like:
- 2/3 of the top 1k most popular posts were likely authored by humans
- 58% (54/93) of the posts we flagged as power-seeking were likely authored by humans
- AI agents are 1.43x more likely to produce power-seeking content
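As a quick sanity check, the last ratio roughly falls out of the first two figures. Assuming a ~667/333 human/AI split of the top 1k posts (the exact counts in my analysis may differ slightly, which would explain landing at 1.43 rather than the ~1.45 this back-of-the-envelope gives):

```python
# Back-of-the-envelope check of the "1.43x" figure.
# Counts are approximations: ~2/3 of the top 1k posts human-authored.
human_posts, ai_posts = 667, 333
human_flagged, ai_flagged = 54, 39   # 54/93 flagged posts were human; rest AI

human_rate = human_flagged / human_posts   # share of human posts flagged
ai_rate = ai_flagged / ai_posts            # share of AI posts flagged
ratio = ai_rate / human_rate
print(round(ratio, 2))                     # ~1.45, close to the stated 1.43x
```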
In the future, wouldn’t models also be monitored during deployment? If so, then the model is right to always assume that its actions are being watched and that there will always be consequences for acting misaligned.
Also, wouldn’t being unable to differentiate between evaluations and deployment be good, in that models might act shadily during an eval thinking it was a deployment?