>I think ablations/knockouts (e.g. helpful-only models, RLVR-only models, models without X piece of post-training) should also be counted here.
I would count these as “natural”, where the defining feature is the aim of understanding training pipelines and their safety properties or failure modes.
Francis Rhys Ward
How might continual learning affect safety and alignment?
What’s Continual Learning, and Why Might We Expect To See It In Advanced LLM Agents?
Implications of Continual Learning for LLM Agents: Introduction
One difference is that worst-case MOs are supposed to upper-bound the difficulty of some problem, like eliciting hidden goals, so they need not exhibit highly realistic behaviours or mechanisms. Constructed MOs are supposed to behave similarly to the real-life case, so that you can learn about the real situation, but they need not be a difficult case for safety measures.
Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models
Please interpret me as saying that the hope for the worst-case MO methodology is that we can have good reason to believe that the problem is strictly harder than the real case. The methodology is not to cross your fingers and hope, without good reason, that the MO is strictly harder.
Three types of model organism
Bayesian epistemology typically works in the framework of an existing hypothesis space, with a prior over that space, which is then updated. In addition to updating your credences about the possibilities in the space, you can also reformulate your hypothesis space itself, e.g., because you become aware of new possibilities (like the existence of scammers), or because you want to carve the world into different concepts due to some ontological shift. I think the Bayesian should just be allowed to reformulate their hypothesis space and reform their prior to get out of this.
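To make the distinction in the comment above concrete, here is a minimal sketch in Python of the two moves: an ordinary Bayesian update inside a fixed hypothesis space, and a reformulation of the space when a new possibility (a scammer) comes to mind. The hypothesis names, the numbers, and the helper functions `bayes_update` and `expand_space` are all hypothetical and chosen only for illustration. Rescaling the old credences proportionally is one possible way to re-form the prior, not something the comment specifies.

```python
def bayes_update(prior, likelihood):
    """Ordinary conditioning: posterior(h) is proportional to prior(h) * likelihood(h)."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}


def expand_space(prior, new_hypothesis, new_mass):
    """Add a newly conceived hypothesis to the space.

    Assumption (illustrative only): the old hypotheses keep their relative
    odds and are rescaled to make room for `new_mass`.
    """
    if not 0.0 < new_mass < 1.0:
        raise ValueError("new_mass must lie strictly between 0 and 1")
    expanded = {h: p * (1.0 - new_mass) for h, p in prior.items()}
    expanded[new_hypothesis] = new_mass
    return expanded


# Original space: the agent has not yet considered that the offer is a scam.
prior = {"genuine": 0.6, "mistaken": 0.4}

# Reformulate the space after becoming aware of a new possibility.
expanded = expand_space(prior, "scammer", new_mass=0.2)

# Then update as usual on evidence that is much likelier under "scammer".
likelihood = {"genuine": 0.1, "mistaken": 0.1, "scammer": 0.9}
posterior = bayes_update(expanded, likelihood)

print(expanded)
print(posterior)
```

The point of the sketch is that `expand_space` is not itself a Bayesian update: no amount of conditioning within the original two-hypothesis space could ever assign credence to `"scammer"`, so the change has to come from re-forming the prior over a new space.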
[Paper] How does information access affect LLM monitors’ ability to detect sabotage?
The Elicitation Game: Evaluating capability elicitation techniques
Why care about AI personhood?
[Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Nathan’s suggestion is that adding noise to a sandbagging model might increase performance, rather than decrease it, as it usually would for a non-sandbagging model. It’s an interesting idea!
An Introduction to AI Sandbagging
Simple distribution approximation: When sampled 100 times, can language models yield 80% A and 20% B?
Tall Tales at Different Scales: Evaluating Scaling Trends For Deception In Language Models
Good post. I think it’s important to distinguish (some version of) these concepts (i.e. SD vs DA).
When an AI has Misaligned goals and uses Strategic Deception to achieve them.
This statement doesn’t seem to capture exactly what you mean by DA in the rest of the post. In particular, a misaligned AI may use SD to achieve its goals, without being deceptive about its alignment / goals. DA, as you’ve discussed it later, seems to be deception about alignment / goals.
We don’t have privileged access from OpenAI. Similar to METR, we use the closest available public models to estimate the capabilities of models that are no longer publicly available.