Joe Carlsmith 15 Nov 2025 23:55 UTC
8 points
−19
in reply to: Adam Scholl’s comment on: Leaving Open Philanthropy, going to Anthropic
Hey Adam — thanks for this. I wrote about this kind of COI in the post, but your comment was a good nudge to think more seriously about my take here.
Basically, I care here about protecting two sorts of values. On the one hand, I do think the sort of COI you’re talking about is real. That is, insofar as people at AI companies who have influence over trade-offs the company makes between safety and commercial success hold equity, deciding in favor of safety will cause them to lose money — and potentially, for high-stakes decisions like dropping out of the race, a lot of money. This is true of people in safety-focused roles, but it’s true of other kinds of employees as well — and of course, especially true of leadership, who have both an outsized amount of equity and an outsized amount of influence. This sort of COI can be a source of epistemic bias (e.g. in safety evaluations of the type you’re focused on), but it can also just be a more straightforward misalignment where e.g. what’s best by the lights of an equity-holder might not be best for the world. I really don’t want my decision-making as an Anthropic employee to end up increasing existential risk from AI because of factors like this. And indeed, given that Anthropic’s stated mission is (roughly) to do what’s best for the world re: AI, in some sense it’s in the job description of every employee to make sure this doesn’t happen.^[1] And just refusing to hold equity would indeed go far on this front (though: you can also get similar biases without equity — e.g., maybe you don’t want to put your cash salary at risk by making waves, pissing people off, etc). And even setting aside the reality of a given level of bias/misalignment, there can be additional benefits to it being legible to the world that this kind of bias/misalignment isn’t present (though I am currently much more concerned about the reality of the bias/misalignment at stake).
On the other hand: the amount of money at stake is enough that I don’t turn it down casually. This is partly due to donation potential. Indeed, my current guess is that (depending ofc on values and other views) many EA-ish folks should be glad on net that various employees at Anthropic (including some in leadership, and some who work on safety) didn’t refuse to take any equity in the company, despite the COIs at stake — though it will indeed depend on how much they actually end up donating, and to where. But beyond donation potential, I’m also giving weight to factors like freedom, security, flexibility in future career choices, ability to self-fund my own projects, trading-money-for-time/energy/attention, helping my family, maybe having/raising kids, option value in an uncertain world, etc. Some of these mix in impartially altruistic considerations in important ways, but just to be clear: I care about both altruistic and non-altruistic values; I give weight to both in my decision-making in general; and I am giving both weight here.
I’ll also note a different source of uncertainty for me — namely, what policy/norm would be best to promote here overall. This is a separate question from what *I* should do personally, but insofar as part of the value of e.g. refusing the equity would be to promote some particular policy/norm, it matters to me how good the relevant policy/norm is — and in some cases here, I’m not sure. I’ve put a few more comments on this in footnote.^[2]
Currently, my best-guess plan for balancing these factors is to accept the equity and the corresponding COI for now (at least assuming that I stay at Anthropic long enough for the equity to vest^[3]), but to keep thinking about it, learning more, and talking with colleagues and other friends/advisors as I actually dive into my role at Anthropic — and if I decide later that I should divest/give up the equity (or do something more complicated to mitigate this and other types of COI), to do that. This could be because my understanding of costs/benefits at stake in the current situation changes, or because the situation itself (e.g., my role/influence, or the AI situation more generally) changes.
1. ^
  Which isn’t to say that people will live up to this.
2. ^
  There’s one question whether it would be good (and suitably realistic) for *no* employees at Anthropic, or at any frontier AI company, to hold equity, and to be paid in cash instead (thus eliminating this source of COI in general). There’s another question whether, at the least, safety-focused employees in particular should be paid in cash, as your post here seems to suggest, while making sure that their overall *level* of compensation remains comparable to that of non-safety-focused employees. Then, in the absence of either of these policies, there’s a different question whether safety-focused employees should be paid substantially less than non-safety-focused employees — a policy which would then reduce the attractiveness of these roles relative to e.g. capabilities roles, especially for people who are somewhat interested in safety but who also care a lot about traditional financial incentives as well (I think many strong AI researchers may be in this category, and increasingly so as safety issues become more prominent). And then there’s a final question of whether, in the absence of any changes to how AI companies currently operate, there should be informal pressure/expectation on safety-focused-employees to voluntarily take very large pay cuts (equity is a large fraction of total comp) relative to non-safety-focused employees for the sake of avoiding COI (one could also distribute this pressure/expectation more evenly across all employees at AI companies — but the focus on safety evaluators in your post is more narrow).
3. ^
  And I’ll still have COI in the meantime due to the equity I’d get if I stayed long enough.

Joe Carlsmith 15 Nov 2025 19:07 UTC
2 points
0
in reply to: Towards_Keeperhood’s comment on: Giving AIs safe motivations
By default step 3 (reward-on-the-episode seekers aren’t directly optimizing for your future efforts at studying their generalization to fail in the direction of AI takeover), but I do think the line here can get a bit blurry.

How human-like do safe AI motivations need to be?

Joe Carlsmith12 Nov 2025 5:32 UTC

27 points

9 comments52 min readLW link

Leaving Open Philanthropy, going to Anthropic

Joe Carlsmith3 Nov 2025 17:38 UTC

113 points

30 comments18 min readLW link

Joe Carlsmith 25 Oct 2025 0:01 UTC
LW: 4 AF: 2
0
AF
on: 4. Existing Writing on Corrigibility
I appreciated the detailed discussion and literature review here—thanks.

Joe Carlsmith 24 Oct 2025 4:57 UTC
4 points
0
in reply to: Thomas Larsen’s comment on: Motivation control
I’m sympathetic to this—thanks Thomas.

Joe Carlsmith 24 Oct 2025 4:43 UTC
LW: 5 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: Video and transcript of talk on giving AIs safe motivations
Hi Steve—thanks for this comment, I can see how the vibe of the talk/piece might call to mind something like “studying/intervening on an existing AI system” rather than focusing on how its trained/constructed, but I do mean for the techniques I discuss to cover both. For example, and re: your Bob example, I talk about our existing knowledge of human behavior as an example of behavioral science here—and I talk lot about studying training as a part of behavioral science, e.g.:
Let’s call an AI’s full range of behavior across all safe and accessible-for-testing inputs its “accessible behavioral profile.” Granted the ability to investigate behavioral profiles of this kind in-depth, it also becomes possible to investigate in-depth the effect that different sorts of interventions have on the profile in question. Example effects like this include: how the AI’s behavioral profile changes over the course of training; how the behavioral profile varies across different forms of training; how it responds to other kinds of interventions on the AI’s internals (though: this starts to border on “transparency tools”); how it varies based on the architecture of the AI; etc. Here I sometimes imagine a button that displays some summary of an AI’s accessible behavioral profile when pressed. In principle, you could be pressing that button constantly, whenever you do anything to an AI, and seeing what you can learn.
And techniques for training/constructing AIs that benefit from understanding/direct design of their internals would count as “transparency tools” for me.

Controlling the options AIs can pursue

Joe Carlsmith29 Sep 2025 17:23 UTC

8 points

0 comments35 min readLW link

Video and transcript of talk on giving AIs safe motivations

Joe Carlsmith22 Sep 2025 16:43 UTC

14 points

2 comments50 min readLW link

Giving AIs safe motivations

Joe Carlsmith18 Aug 2025 18:00 UTC

36 points

5 comments51 min readLW link

Video and transcript of talk on “Can goodness compete?”

Joe Carlsmith17 Jul 2025 17:54 UTC

98 points

19 comments34 min readLW link

(joecarlsmith.substack.com)

Video and transcript of talk on AI welfare

Joe Carlsmith22 May 2025 16:15 UTC

24 points

1 comment28 min readLW link

(joecarlsmith.substack.com)

The stakes of AI moral status

Joe Carlsmith21 May 2025 18:20 UTC

79 points

65 comments14 min readLW link

(joecarlsmith.substack.com)

Joe Carlsmith 2 May 2025 0:40 UTC
LW: 4 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: Can we safely automate alignment research?
I’m a bit confused about your overall picture here. Sounds like you’re thinking something like:
“almost everything in the world is evaluable via waiting for it to fail and then noticing this. Alignment and bridge-building aren’t like this, but most other things are… Also, the way we’re going to automate long-horizon tasks is via giving AIs long-term goals. In particular: we’ll give them goal ‘get long-term human approval/reward’, which will lead to good-looking stuff until the AIs take over in order to get more reward. This will work for tons of stuff but not for alignment, because you can’t give negative reward for the alignment failure we ultimately care about, which is the AIs taking over.”
Is that roughly right?

Joe Carlsmith 2 May 2025 0:34 UTC
LW: 6 AF: 5
0
AF
in reply to: johnswentworth’s comment on: Can we safely automate alignment research?
I think it’s a fair point that if it turns out that current ML methods are broadly inadequate for automating basically any sophisticated cognitive work (including capabilities research, biology research, etc—though I’m not clear on your take on whether capabilities research counts as “science” in the sense you have in mind), it may be that whatever new paradigm ends up successful messes with various implicit and explicit assumptions in analyses like the one in the essay.
That said, I think if we’re ignorant about what paradigm will succeed re: automating sophisticated cognitive work and we don’t have any story about why alignment research would be harder, it seems like the baseline expectation (modulo scheming) would be that automating alignment is comparably hard (in expectation) to automating these other domains. (I do think, though, that we have reason to expect alignment to be harder even conditional on needing other paradigms, because I think it’s reasonable to expect some of the evaluation challenges I discuss in the post to generalize to other regimes.)

Joe Carlsmith 2 May 2025 0:28 UTC
LW: 6 AF: 4
4
AF
in reply to: Steven Byrnes’s comment on: Can we safely automate alignment research?
I’m happy to say that easy-to-verify vs. hard-to-verify is what ultimately matters, but I think it’s important to be clear what about makes something easier vs. harder to verify, so that we can be clear about why alignment might or might not be harder than other domains. And imo empirical feedback loops and formal methods are amongst the most important factors there.