Scott Emmons

Karma: 387

Evaluating and monitoring for AI scheming

Vika, Scott Emmons, Erik Jenner, Mary Phuong, Lewis Ho and Rohin Shah

10 Jul 2025 14:24 UTC

57 points

9 comments · 5 min read · LW link

(deepmindsafetyresearch.medium.com)

Scott Emmons 10 Oct 2024 17:32 UTC
LW: 8 AF: 7
3
AF
in reply to: Mark Xu’s comment on: Mark Xu’s Shortform
Great post. I’m on GDM’s new AI safety and alignment team in the Bay Area and hope readers will consider joining us!
> I would advise people to think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”

What evidence is there that working at a scaling lab risks creating a “corrupted” perception? When I try thinking of examples, the people that come to my mind seem to have quite successfully transitioned from working at a scaling lab to doing nonprofit / government work. For example:
- Paul Christiano went from OpenAI to the nonprofit Alignment Research Center (ARC) to head of AI safety at the US AI Safety Institute.
- Geoffrey Irving worked at Google Brain, OpenAI, and Google DeepMind. Geoffrey is now Chief Scientist at the UK AI Safety Institute.
- Beth Barnes worked at DeepMind and OpenAI and is now founder and head of research at Model Evaluation and Threat Research (METR).

Scott Emmons 2 Oct 2024 18:19 UTC
4 points
0
on: The Best Lay Argument is not a Simple English Yud Essay
Another nice example of “sound[ing] like a human being” is Stuart Russell’s explanation of “the gorilla problem” in the book Human Compatible. Quoting directly from the start of chapter 5:
It doesn’t require much imagination to see that making something smarter than yourself could be a bad idea. We understand that our control over the environment and over other species is a result of our intelligence, so the thought of something else being more intelligent than us—whether it’s a robot or an alien—immediately induces a queasy feeling.

Around ten million years ago, the ancestors of the modern gorilla created (accidentally, to be sure) the genetic lineage leading to modern humans. How do the gorillas feel about this? Clearly, if they were able to tell us about their species’ current situation vis-à-vis humans, the consensus opinion would be very negative indeed. Their species has essentially no future beyond that which we deign to allow. We do not want to be in a similar situation vis-à-vis superintelligent machines. I’ll call this the gorilla problem—specifically, the problem of whether humans can maintain their supremacy and autonomy in a world that includes machines with substantially greater intelligence.

Scott Emmons 16 Jan 2024 0:08 UTC
LW: 7 AF: 7
4
AF
in reply to: ryan_greenblatt’s comment on: How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme
I speculate that at least three factors made CCS viral:
1. It was published shortly after the Eliciting Latent Knowledge (ELK) report. At that time, ELK was not only exciting, but new and exciting.
2. It is an interpretability paper. When CCS was published, interpretability was arguably the leading research direction in the alignment community, with Anthropic and Redwood Research both making big bets on interpretability.
3. CCS mathematizes “truth” and explains it clearly. It would be really nice if the project of human rationality also helped with the alignment problem. So, CCS is an idea that people want to see work.
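
For readers who have not seen the objective, the way CCS "mathematizes truth" can be sketched in a few lines. This is an illustrative pure-Python sketch of the loss as described in the CCS paper (a consistency term plus a confidence term over contrast pairs); the function and argument names are assumptions made for this example, and the probe that produces the probabilities is omitted.

```python
def ccs_loss(p_pos, p_neg):
    """Average CCS loss over a batch of contrast pairs.

    p_pos[i]: probe probability for statement i phrased as true.
    p_neg[i]: probe probability for the same statement phrased as false.
    (Names are hypothetical; only the two loss terms follow the paper.)
    """
    total = 0.0
    for pp, pn in zip(p_pos, p_neg):
        # Consistency: a statement and its negation should sum to probability 1.
        consistency = (pp - (1.0 - pn)) ** 2
        # Confidence: penalize the degenerate answer pp == pn == 0.5.
        confidence = min(pp, pn) ** 2
        total += consistency + confidence
    return total / len(p_pos)
```

A probe that answers 0.5 to everything is perfectly consistent but is still penalized by the confidence term, which is what pushes the probe toward a direction that separates the two phrasings.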

Scott Emmons 26 Sep 2023 5:35 UTC
LW: 15 AF: 7
0
AF
on: Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Did you try searching for similar ideas to your work in the broader academic literature? There seems to be lots of closely related work that you’d find interesting. For example:

Elite BackProp: Training Sparse Interpretable Neurons. They train CNNs to have “class-wise activation sparsity.” They claim their method achieves “high degrees of activation sparsity with no accuracy loss” and “can assist in understanding the reasoning behind a CNN.”

Accelerating Convolutional Neural Networks via Activation Map Compression. They “propose a three-stage compression and acceleration pipeline that sparsifies, quantizes, and entropy encodes activation maps of Convolutional Neural Networks.” The sparsification step adds an L1 penalty to the activations in the network, which they do at finetuning time. The work just examines accuracy, not interpretability.

Enhancing Adversarial Defense by $k$-Winners-Take-All. Proposes the $k$-Winners-Take-All activation function, which keeps only the $k$ largest activations and sets all other activations to 0. This is a drop-in replacement during neural network training, and they find it improves adversarial robustness in image classification. How Can We Be So Dense? The Benefits of Using Highly Sparse Representations also uses the $k$-Winners-Take-All activation function, among other sparsification techniques.

The Neural LASSO: Local Linear Sparsity for Interpretable Explanations. Adds an L1 penalty to the gradient wrt the input. The intuition is to make the final output have a “sparse local explanation” (where “local explanation” = input gradient).

Adaptively Sparse Transformers. They replace softmax with $\alpha$-entmax, “a differentiable generalization of softmax that allows low-scoring words to receive precisely zero weight.” They claim “improve[d] interpretability and [attention] head diversity” and also that “at no cost in accuracy, sparsity in attention heads helps to uncover different head specializations.”

Interpretable Neural Predictions with Differentiable Binary Variables. They train two neural networks. One “selects a rationale (i.e. a short and informative part of the input text)”, and the other “classifies… from the words in the rationale alone.”

I ask because your paper doesn’t seem to have a related works section, and most of your citations in the intro are from other safety research teams (e.g., Anthropic, OpenAI, CAIS, and Redwood).
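
Two of the mechanisms in the list above are simple enough to sketch directly. The following is an illustrative pure-Python sketch, not code from any of the cited papers: `k_winners_take_all` keeps the k largest activations and zeroes the rest, and `sparsemax` is the α = 2 special case of α-entmax (α = 1 recovers softmax), which shows how low-scoring entries can receive exactly zero weight. The function names are assumptions made for this example.

```python
def k_winners_take_all(activations, k):
    """Keep the k largest activations and set all others to 0."""
    if k <= 0:
        return [0.0 for _ in activations]
    # Indices sorted by activation value, largest first; keep the first k.
    order = sorted(range(len(activations)),
                   key=lambda i: activations[i], reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


def sparsemax(scores):
    """Project scores onto the probability simplex (alpha-entmax with alpha = 2)."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # The support consists of the largest j entries satisfying this test.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j
    # Entries at or below the threshold tau get exactly zero weight.
    return [max(s - tau, 0.0) for s in scores]
```

For instance, `sparsemax([1.0, 0.8, -1.0])` puts zero weight on the last entry, whereas softmax would give every entry some positive weight.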

Scott Emmons 22 Sep 2023 20:45 UTC
6 points
0
in reply to: Tao Lin’s comment on: Image Hijacks: Adversarial Images can Control Generative Models at Runtime
One needs to be very careful when proposing a heuristically-motivated adversarial defense. Otherwise, one runs the risk of being the next example in a paper like this one, which took a wide variety of heuristic defense papers and showed that stronger attacks completely break most of the defenses.
In fact, Section 5.2.2 completely breaks the defense from a previous paper that considered image cropping and rescaling, bit-depth reduction, and JPEG compression as baseline defenses, in addition to new defense methods called total variance minimization and image quilting. While the original defense paper claimed 60% robust top-1 accuracy, the attack paper totally circumvents the defense, dropping accuracy to 0%.
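
To make concrete what a heuristic input-transformation defense looks like, here is a minimal pure-Python sketch of bit-depth reduction on 8-bit pixel values. It is an illustration written for this comment, not the implementation from either paper, and the function name is an assumption.

```python
def reduce_bit_depth(pixels, bits):
    """Quantize 8-bit pixel values (0-255) down to `bits` bits per value."""
    levels = 2 ** bits - 1
    quantized = []
    for p in pixels:
        # Snap to the nearest of the available levels, then rescale to 0-255.
        level = round(p / 255 * levels)
        quantized.append(round(level / levels * 255))
    return quantized
```

The intuition is that nearby pixel values collapse to the same quantized value, so small perturbations are erased. As the attack paper shows, that intuition does not hold against an attacker who optimizes with the transformation in mind.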

If compression were all you needed to solve adversarial robustness, then the adversarial robustness literature would already know this fact.
My current understanding is that an adversarial defense paper should measure certified robustness to ensure that no future attack will break the defense. This paper gives an example of certified robustness. However, I’m not a total expert in adversarial defense methodology, so I’d encourage anyone considering writing an adversarial defense paper to talk to such an expert.

Image Hijacks: Adversarial Images can Control Generative Models at Runtime

Scott Emmons, Luke Bailey and Euan Ong

20 Sep 2023 15:23 UTC

58 points

9 comments · 1 min read · LW link

(arxiv.org)

Scott Emmons 18 Sep 2023 20:57 UTC
4 points
3
in reply to: habryka’s comment on: Uncovering Latent Human Wellbeing in LLM Embeddings
For context, we did these experiments last winter before GPT-4 was released. I view our results as evidence that ETHICS understanding is a blessing of scale. After GPT-4 was released, it became even more clear that ETHICS understanding is a blessing of scale. So, we stopped working on this project in the spring, but we figured it was still worth writing up and sharing the results.

Scott Emmons 15 Sep 2023 19:06 UTC
5 points
1
in reply to: habryka’s comment on: Uncovering Latent Human Wellbeing in LLM Embeddings
One can make philosophical arguments about (lack of) a “reason to assume that the AI would be incapable of modeling what an extremely simplistic model of hedonic utilitarian would prefer.” We take an empirical approach to the question.
In Figure 2, we measured the scaling trends of a model’s understanding of utilitarianism. We see that, in general, the largest models have the best performance. However, we haven’t found a clear scaling law, so it remains an open question just how good future models will be.

Future questions I’m interested in are: how robust is a model’s knowledge of human wellbeing? Is this knowledge robust enough to be used as an optimization target? How does the knowledge of human wellbeing scale in comparison to how knowledge of other concepts scales?

Uncovering Latent Human Wellbeing in LLM Embeddings

ChengCheng, Pedro Freire, Dan H and Scott Emmons

14 Sep 2023 1:40 UTC

32 points

7 comments · 8 min read · LW link

(far.ai)

Contrast Pairs Drive the Empirical Performance of Contrast Consistent Search (CCS)

Scott Emmons

31 May 2023 17:09 UTC

97 points

1 comment · 6 min read · LW link · 1 review

Storytelling Makes GPT-3.5 Deontologist: Unexpected Effects of Context on LLM Behavior

Edmund Mills and Scott Emmons

14 Mar 2023 8:44 UTC

17 points

0 comments · 12 min read · LW link

Scott Emmons 11 Mar 2023 23:52 UTC
LW: 3 AF: 3
0
AF
on: Understanding and controlling a maze-solving policy network
Neat to see the follow-up from your introductory prediction post on this project!

In my prediction I was particularly interested in the following stats:
1. If you put the cheese in the top-left and bottom-right of the largest maze size, what fraction of the time does the out-of-the-box policy you trained go to the cheese?
2. If you try to edit the mouse’s activations to make it go to the top left or bottom right of the largest mazes (leaving the cheese wherever it spawned by default in the top right), what fraction of the time do you succeed in getting the mouse to go to the top left or bottom right? What percentage of network activations are you modifying when you do this?
Do you have these stats? I read some, but not all, of this post, and I didn’t see answers to these questions.

Scott Emmons 3 Mar 2023 21:31 UTC
LW: 11 AF: 9
0
AF
on: Predictions for shard theory mechanistic interpretability results
Neat experimental setup. Goal misgeneralization is one of the things I’m most worried about in advanced AI, so I’m excited to see you studying it in more detail!
I want to jot down my freeform analysis of what I expect to happen. (I wrote these predictions independently, without looking at anyone else’s analysis.)
In very small mazes, I think the mouse will behave as if it’s following this algorithm: find the shortest path to the cheese location. In very large mazes, I think the mouse will behave as if it’s following this algorithm: first, go to the top-right region of the maze. Then, go to the exact location of the cheese. As we increase the maze size, I expect the mouse to have a phase transition from the first behavior to the second behavior. I don’t know at exactly what size the phase transition will occur.
I expect that for very small mazes, the mouse will learn how to optimally get to the cheese, no matter where the cheese is.
- Prediction: (80% confidence) I think we’ll be able to edit some part of the mouse’s neural network (say, <10% of its activations) so that it goes to arbitrary locations in very small mazes.
I expect that for very large mazes, the mouse will act as follows: it will first just try to go to the top-right region of the maze. Once it gets to the top-right region of the maze, it will start trying to find the cheese exactly. My guess is that there’s a trigger in the model’s head for when it switches from going to the top-right corner to finding the cheese exactly. I’d guess this trigger activates either when the mouse is in the top-right corner of the maze, or when the mouse is near the cheese. (Or perhaps a mixture of both these triggers exists in the model’s head.)
- Prediction: (75% confidence) The mouse will struggle to find cheese in the top-left and bottom-right of very large mazes (ie, if we put the cheese in the top-left or bottom-right of the maze, the model will have <33% success rate of reaching it within the average number of steps it takes the mouse to reach the cheese in the top-right corner). I think the mouse will go to the top-right corner of the maze in these cases.
- Prediction: (75% confidence) We won’t be able to easily edit the model’s activations to make the mouse go to the top-left or bottom-right of very large mazes. Concretely, the team doing this project won’t find <10% of network activations that they can edit to make the mouse reach cheese in the top-left or bottom-right of the maze with >= 33% success rate within the average number of steps it takes the mouse to reach the cheese in the top-right corner.
- Prediction: (55% confidence) I weakly believe that if we put the cheese in the bottom-left corner of a very large maze, the mouse will go to the cheese. (Ie, the mouse will quickly find the cheese in 50% of very large mazes with cheese in the bottom left corner.) I weakly think that there will be a trigger in the model’s head that recognizes that it is close to the cheese at the start of the episode, and that that will activate the cheese finding mode. But I only weakly believe this. I think it’s possible that this trigger doesn’t fire, and instead the mouse just goes to the top-right corner of the maze when the cheese starts out in the bottom-left corner.
Another question is: Will we be able to infer the exact cheese location by just looking at the model’s internal activations?
- Prediction: (80% confidence) Yes. The cheese is easy for a convnet to see (it’s distinct in color from everything else), and it’s key information for solving the task. So I think the policy will learn to encode this information. Idk about exactly which layer(s) in the network will contain this information. My prediction is that the team doing this project will be able to probe at least one of the layers of the network to obtain the exact location of the cheese with >90% accuracy, for all maze sizes.
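
The success criterion used in several of the predictions above (success rate within the average number of steps needed for top-right cheese) can be written down explicitly. This is a minimal sketch under assumed inputs: each episode is recorded as the number of steps taken to reach the cheese, or `None` if the mouse never reached it. The function names and data layout are hypothetical, not taken from the project.

```python
def mean_steps(steps_to_cheese):
    """Average step count over the episodes in which the cheese was reached."""
    reached = [s for s in steps_to_cheese if s is not None]
    return sum(reached) / len(reached)


def success_rate_within_budget(steps_to_cheese, step_budget):
    """Fraction of all episodes that reached the cheese within step_budget steps."""
    hits = sum(1 for s in steps_to_cheese
               if s is not None and s <= step_budget)
    return hits / len(steps_to_cheese)


def prediction_holds(top_right_steps, relocated_steps, threshold=0.33):
    """True if the relocated-cheese success rate falls below the threshold.

    The step budget is the mean step count from the top-right baseline runs.
    """
    budget = mean_steps(top_right_steps)
    return success_rate_within_budget(relocated_steps, budget) < threshold
```

Here `top_right_steps` would come from runs with the cheese in its default top-right region and `relocated_steps` from runs with the cheese moved to the top-left or bottom-right.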

Scott Emmons 30 Jan 2023 22:58 UTC
LW: 9 AF: 6
4
AF
on: Why I hate the “accident vs. misuse” AI x-risk dichotomy (quick thoughts on “structural risk”)
Thanks for writing. I think this is a useful framing!
Where does the term “structural” come from?
The related literature I’ve seen uses the word “systemic”, eg, the field of system safety. A good example is this talk (and slides, eg slide 24).

Scott Emmons 3 Jan 2023 19:36 UTC
LW: 9 AF: 3
6
AF
on: Touch reality as soon as possible (when doing machine learning research)
Thanks for writing this! I appreciate it and hope you share more things that you write faster without totally polishing everything.
One word of caution I’d share is: beware of spending too much effort running experiments on toy examples. I think toy examples are useful to gain conceptual clarity. However, if your idea is primarily empirical (such as an improvement to a deep neural network architecture), then I would recommend spending basically zero time running toy experiments.
With deep learning, it’s often the case that improvements on toy examples don’t scale to being improvements on real examples. In my experience, lots of papers in reinforcement learning don’t actually work because the authors only tried out the method on toy examples. (Or, they tried out the method on more complex examples, but they didn’t publish those experiments because the method didn’t work.) So trying out a new empirical method on a toy example provides little information about how valuable the empirical method will be on real examples.
The flipside of this warning is advice: for empirical projects, test your idea on as diverse and complex a set of tasks as is possible. The good empirical ideas are few, and extensive empirical testing is the best way a researcher can determine if their idea will stand the test of time.
When running diverse and complex experiments, it is still important to design the simplest possible experiment that will be informative, as Lawrence describes in the section “Mock or simplify difficult components.” I suggest being simple (such as Lawrence’s example of using text-davinci-003 instead of finetuning one’s own model) rather than being toy (using a tiny or hard-coded language model).

Scott Emmons 31 Oct 2022 23:35 UTC
LW: 4 AF: 3
4
AF
in reply to: David Scott Krueger (formerly: capybaralet)’s comment on: “Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability
I’d be interested to hear in more detail why you’re unconvinced.

Scott Emmons 31 Oct 2022 22:12 UTC
LW: 1 AF: 1
2
AF
in reply to: LawrenceC’s comment on: “Cars and Elephants”: a handwavy argument/analogy against mechanistic interpretability
> I think the main reasons to work on mechanistic interp do not look like “we can literally understand all the cognition behind a powerful AI”, but instead “we can bound the behavior of the AI”
I assume “bound the behavior” means provide a worst-case guarantee. But if we don’t understand all the cognition, how can we provide such a guarantee? How do we know that the part of the AI we don’t understand wouldn’t ruin our guarantee?

> we can help other, weaker AIs understand the powerful AI
My understanding of interpretability is that humans understand what the AI is doing. Weaker AIs understanding the powerful AI doesn’t feel like a solution to interpretability. Instead it feels like a solution to amplification that’s still uninterpretable by humans.

Scott Emmons 12 Sep 2022 19:15 UTC
LW: 36 AF: 20
0
AF
on: [Linkpost] A survey on over 300 works about interpretability in deep networks
What do you think are the top 3 (or top 5, or top handful) of interpretability results to date? If I gave a 5-minute talk called “The Few Greatest Achievements of Interpretability to Date,” what would you recommend I include in the talk?

Scott Emmons 10 Sep 2022 21:38 UTC
LW: 10 AF: 4
2
AF
in reply to: Scott Emmons’s comment on: Simulators
You also claim that GPT-like models achieve “SOTA performance in domains traditionally dominated by RL, like games.” You cite the paper “Multi-Game Decision Transformers” for this claim.
But, in Multi-Game Decision Transformers, reinforcement learning (specifically, a Q-learning variant called BCQ) trained on a single Atari game beats Decision Transformer trained on many Atari games. This is shown in Figure 1 of that paper. The authors of the paper don’t even claim that Decision Transformer beats RL. Instead, they write: “We are not striving for mastery or efficiency that game-specific agents can offer, as we believe we are still in early stages of this research agenda. Rather, we investigate whether the same trends observed in language and vision hold for large-scale generalist reinforcement learning agents.”
It may be that Decision Transformers are on a path to matching RL, but it’s important to know that this hasn’t yet happened. I’m also not aware of any work establishing scaling laws in RL.
