Buck

Karma: 11,406

CEO at Redwood Research.

AI safety is a highly collaborative field—almost all the points I make were either explained to me by someone else, or developed in conversation with other people. I’m saying this here because it would feel repetitive to say “these ideas were developed in collaboration with various people” in all my comments, but I want to have it on the record that the ideas I present were almost entirely not developed by me in isolation.

Buck May 15, 2025, 1:16 AM
4 points
1
in reply to: MichaelDickens’s comment on: MichaelDickens’s Shortform
I think the LTFF is a pretty reasonable target for donations for donors who aren’t that informed but trust people in this space.

Buck May 13, 2025, 4:18 PM
10 points
0
in reply to: johnswentworth’s comment on: Orienting Toward Wizard Power
To be clear, I think we at Redwood (and people at spiritually similar places like the AI Futures Project) do think about this kind of question (though I’d quibble about the importance of some of the specific questions you mention here).

Buck May 12, 2025, 6:30 PM
47 points
0
on: PSA: The LessWrong Feedback Service
Justis has been very helpful as a copy-editor for a bunch of Redwood content over the last 18 months!

Buck May 11, 2025, 6:51 PM
11 points
3
in reply to: johnswentworth’s comment on: Orienting Toward Wizard Power
I think that if you wanted to contribute maximally to a cure for aging (and let’s ignore the possibility that AI changes the situation), it would probably make sense for you to have a lot of general knowledge. But that’s substantially because you’re personally good at and very motivated by being generally knowledgeable, and you’d end up in a weird niche where little of your contribution comes from actually pushing any of the technical frontiers. Most of the credit for solving aging will probably go to people who narrowly specialized in a particular domain; much of the rest will go to people who applied their general knowledge to improving the overall strategy or allocation of effort among people who are working on curing aging (while leaving most of the technical contributions to specialists). This latter strategy crucially relies on management and coordination and not being fully in the weeds everywhere.

Buck May 11, 2025, 3:57 PM
63 points
9
on: Orienting Toward Wizard Power
Thanks for this post. Some thoughts:
- I really appreciate the basic vibe of this post. In particular, I think it’s great to have a distinction between wizard power and king power, and to note that king power is often fake, and that lots of people are very tempted (including by insidious social pressure) to focus on gunning for king power without being sufficiently thoughtful about whether they’re actually achieving what they wanted. And I think that for a lot of people, it’s an underrated strategy to focus hard on wizard power (especially when you’re young). E.g. I spent a lot of my twenties learning computer science and science, and I think this was quite helpful for me.
- A big theme of Redwood Research’s work is the question “If you are in charge of deploying a powerful AI and you have limited resources (e.g. cash, manpower, acceptable service degradation) to mitigate misalignment risks, how should you spend your resources?”. (E.g. see here.) This is in contrast to e.g. thinking about what safety measures are most in the Overton window, or which ones are easiest to explain. I think it’s healthy to spend a lot of your time thinking about techniques that are objectively better, because it is less tied up in social realities. That attitude reminds me of your post.
- I share your desire to know about all those things you talk about. One of my friends has huge amounts of “wizard power”, and I find this extremely charming/impressive/attractive. I would personally enjoy the LessWrong community more if the people here knew more of this stuff.
- I’m very skeptical that focusing on wizard power is universally the right strategy; I’m even more skeptical that learning the random stuff you list in this post is typically a good strategy for people. For example, I think that it would be clearly bad for my effect on existential safety for me to redirect a bunch of my time towards learning about the things you described (making vaccines, using CAD software, etc), because those topics aren’t as relevant to the main strategies that I’m interested in for mitigating existential risk.
- You write “And if one wants a cure for aging, or weekend trips to the moon, or tiny genetically-engineered dragons… then the bottleneck is wizard power, not king power.” I think this is true in a collective sense (these problems require technological advancement), but it is absurd to say that the best way to improve the probability of getting to those things is to try to personally learn all of the scientific fields relevant to making those advancements happen. At the very least, surely there should be specialization! And beyond that, I think the biggest threat to eventual weekend trips to the moon is probably AI risk; on my beliefs, we should dedicate way more effort to mitigating AI risk than to tiny-dragon-R&D. Some people should try to have very general knowledge of these things, but IMO the main use case for having such broad knowledge is helping with the prioritization between them, not contributing to any particular one of them!

Buck May 8, 2025, 8:31 PM
6 points
0
in reply to: KvmanThinking’s comment on: KvmanThinking’s Shortform
This kind of idea has been discussed under the names “surrogate goals” and “safe Pareto improvements”, see here.

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

Buck and Julian Stastny

May 8, 2025, 7:06 PM

75 points

1 comment · 15 min read · LW link

Buck May 5, 2025, 5:14 AM
LW: 21 AF: 13
8
AF
in reply to: Logan Riggs’s comment on: Interpretability Will Not Reliably Find Deceptive AI
I agree in principle, but as far as I know, no interp explanation that has been produced explains more than roughly 20-50% of the (tiny) parts of the model it’s trying to explain (e.g. see the causal scrubbing results, or our discussion with Neel). See that dialogue with Neel for more on the question of how much of the model we understand.

Buck May 5, 2025, 5:12 AM
LW: 11 AF: 10
6
AF
in reply to: Sodium’s comment on: Interpretability Will Not Reliably Find Deceptive AI
This is why AI control research usually assumes that none of the methods you described work, and relies on black-box properties that are more robust to this kind of optimization pressure (mostly “the AI can’t do X”).

Buck May 4, 2025, 8:59 PM
LW: 96 AF: 48
51
AF
on: Interpretability Will Not Reliably Find Deceptive AI
I agree with most of this, thanks for saying it. I’ve been dismayed for the last several years by continuing unreasonable levels of emphasis on interpretability techniques as a strategy for safety.
My main disagreement is that you place more emphasis than I would on chain-of-thought monitoring compared to other AI control methods. CoT monitoring seems like a great control method when available, but I think it’s reasonably likely that it won’t work on the AIs that we’d want to control, because those models will have access to some kind of “neuralese” that allows them to reason in ways we can’t observe. This is why I mostly focus on control measures other than CoT monitoring. (All of our control research to date has basically been assuming that CoT monitoring is unavailable as a strategy.)
Another note is that you might have other goals than finding deceptive AI, e.g. you might want to be able to convince other people that you’ve found deceptive AI (which I’m somewhat skeptical you’ll be able to do with non-behavioral methods), or you might want to be able to safely deploy known-scheming models. Interp doesn’t obviously help much with those, which makes it a worse target for research effort.

Buck Apr 30, 2025, 11:22 PM
15 points
0
in reply to: Eli Tyre’s comment on: Eli’s shortform feed
IIRC, an Anthropic staff member told me that he had a strong suspicion for why this is, but that it was tied up in proprietary info so he didn’t want to say.

Buck Apr 23, 2025, 2:06 PM
2 points
2
in reply to: Neel Nanda’s comment on: aog’s Shortform
I feel like GDM safety and Constellation are similar enough to be in the same cluster: I bet within-cluster variance is bigger than between-cluster variance.

Buck Apr 22, 2025, 3:52 AM
7 points
4
in reply to: Neel Nanda’s comment on: aog’s Shortform
FWIW, I think that the GDM safety people are at least as similar to the Constellation/Redwood/METR cluster as the Anthropic safety people are, probably more similar. (And Anthropic as a whole has very different beliefs than the Constellation cluster, e.g. not having much credence on misalignment risk.)

Buck Apr 19, 2025, 6:35 AM
7 points
6
in reply to: MichaelDickens’s comment on: Three Months In, Evaluating Three Rationalist Cases for Trump
Hammond was, right?

Handling schemers if shutdown is not an option

Buck

Apr 18, 2025, 2:39 PM

43 points

1 comment · 13 min read · LW link

Ctrl-Z: Controlling AI Agents via Resampling

Aryan Bhatt, Buck, Adam Kaufman, Cody Rushing and Tyler Tracy

Apr 16, 2025, 4:21 PM

122 points

0 comments · 20 min read · LW link

Buck Apr 16, 2025, 4:34 AM
LW: 7 AF: 4
3
AF
in reply to: Dusto’s comment on: To be legible, evidence of misalignment probably has to be behavioral
Ryan agrees: the main thing he means by “behavioral output” is what you’re saying, i.e. an actually really dangerous action.

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

Tomek Korbak, Mikita Balesni, Buck and Geoffrey Irving

Apr 14, 2025, 4:45 PM

29 points

1 comment · 2 min read · LW link

Buck Apr 11, 2025, 2:00 PM
LW: 2 AF: 2
2
AF
on: Notes on countermeasures for exploration hacking (aka sandbagging)
I think we should probably say that exploration hacking is a strategy for sandbagging, rather than using them as synonyms.

Buck Apr 5, 2025, 7:02 PM
6 points
4
in reply to: Garrett Baker’s comment on: How much progress actually happens in theoretical physics?
Isn’t the answer that the low hanging fruit of explaining unexplained observations has been picked?
