FWIW, my interpretation of what we should be learning is pretty different here.
I would broadly say that political will for anything in the “slow down AI as needed to make it safe” category has been well short of what many people (such as myself) hoped for. Because of this, some of the core founding hopes of the RSP project look untenable now (although I don’t consider the matter totally settled); but to me it feels like an even bigger update away from “pause now” movements.
I don’t understand why you say this: “the appetite for conditional risk regulation has been substantially less than the appetite for direct risk regulation (or compute thresholding, which doesn’t require any complicated eval infrastructure).” I have seen roughly no appetite for “compute thresholding” if that means something like “limiting the size of training runs” (I have seen “compute thresholding” only in the sense of “reporting requirements triggered by compute thresholds”). I don’t know what you mean by “direct risk regulation”, but if it means regulation aimed at slowing down AI immediately, I have likewise seen much less (roughly no) appetite/momentum for that, and more for regulation built around things like evals and if-then commitments.
Separately, with the benefit of hindsight, I think a global AI pause in 2023 would have been bad on the merits compared to, say, a pause around when the original RSP implied a pause should happen. The earlier pause, compared to the later one, would have meant losing a lot of opportunities for meaningful alignment research, and more broadly for the world to learn important things relevant to AI safety, while offering almost no marginal reduction in catastrophic risk AFAICT.
You may have a view like “2023 was the right time to pause, because it was politically tractable then, but postponing it ensured it would not remain politically tractable.” That would be a very different read of the political situation than mine.
> This is a huge deal! This was, as far as I can tell, the single decision that most affected talent allocation in the whole AI safety community. METR and Apollo and the broader “evals” agenda became the most popular and highest-prestige thing to work on for people in AI safety.
This seems off to me. First, the emphasis on evals predated the idea of if-then commitments, and I think it attracted more resources at pretty much every point in time; evals have a variety of potential benefits that don’t rely on if-then commitments. Second, I don’t think most people who work on AI safety work on either of these.
You’re welcome to this view, but it isn’t mine. Other major projects I’ve worked on have involved fundamental pivots that left them almost unrecognizable from the original pitch, and if you’d asked me at the time, I would’ve said I expected something similar for RSPs. On priors, a completely new kind of risk management framework should be expected to change dramatically. I never would have agreed to “The policies we’re coming up with today will only change in the details.”
I’ve seen many claims along these lines, and my guess is that there is at least some truth to them, but I am somewhat surprised by how little I’ve seen in the way of tangible specifics re: who promised what, and who made important plans and decisions based on it. I have mostly just seen the quote from Evan, which he says is a mischaracterization. I genuinely can’t recall encountering this sort of thing firsthand (though I’ve only been at Anthropic for about a year). Again, I’m not saying such things didn’t happen, but I haven’t seen enough specifics to be affirmatively convinced by claims like this, or to have a clear sense of who was responsible, who was harmed, etc.
I don’t think that feeling unhealthy pressures implies a governance failure. For example, people regularly feel pressure to avoid admitting they were wrong; I don’t think this is any particular person’s fault, or that it calls a particular governing structure into question. My statement here was about psychological pressures, not pressures imposed by, e.g., executives.
I think the situation today is worse on the dimensions being discussed here (e.g., political will), although better on some other dimensions (IMO, the technical picture looks at least somewhat better).
I don’t accept the framing that stricter commitments are necessarily less “reckless” while more flexible frameworks are more “reckless.” I think the new policy better positions us to reduce risk worldwide. If pausing were the only or clearly best path to risk reduction, I think it would make more sense to associate greater flexibility with greater recklessness. Perhaps you believe that’s the case; I don’t.
I don’t believe a race is inevitable, and I think there may be things Anthropic can do to make a race less likely. But those things are, and will continue to be, highly contingent on specific circumstances, and I also believe there are non-slowdown-oriented actions that can reduce risk significantly (potentially more than slowdown-oriented actions, depending on the situation).