Interested in math puzzles, fermi estimation, strange facts about the world, toy models of weird scenarios, unusual social technologies, and deep dives into the details of random phenomena.
Working on the pretraining team at Anthropic as of October 2024; before that I did independent alignment research of various flavors and worked in quantitative finance.
Drake Thomas
I’m not going to invest time in further replies here, but FYI, the reason you’re getting downvotes is that your complaint doesn’t make sense and comes across as wildly conspiratorial and unfounded, and no one with any reasonable understanding of the field would think this is a sensible thing to be up in arms over. I strongly recommend that you stop talking to LLMs.
A few months ago I spent $60 ordering the March 2025 version of Anthropic’s certificate of incorporation from the state of Delaware, and last week I finally got around to scanning and uploading it. Here’s a PDF! After writing most of this shortform, I discovered while googling related keywords that someone had already uploaded the 2023-09-21 version online here, which is slightly different.
I don’t particularly bid that people spend their time reading it; it’s very long and dense and I predict that most people trying to draw important conclusions from it who aren’t already familiar with corporate law (including me) will end up being somewhat confused by default. But I’d like more transparency about the corporate governance of frontier AI companies and this is an easy step.
Anthropic uses a bunch of different phrasings of its mission across various official documents; of these, I believe the COI’s is the most legally binding one, which says that “the specific public benefit that the Corporation will promote is to responsibly develop and maintain advanced AI for the long term benefit of humanity.” I like this wording less than others that Anthropic has used like “Ensure the world safely makes the transition through transformative AI”, though I don’t expect it to matter terribly much.
I think the main thing this sheds light on is stuff like Maybe Anthropic’s Long-Term Benefit Trust Is Powerless: as of late 2025, overriding the LTBT takes 85% of voting stock or all of (a) 75% of founder shares (b) 50% of series A preferred (c) 75% of non-series-A voting preferred stock. (And, unrelated to the COI but relevant to that post, it is now public that neither Google nor Amazon hold voting shares.)
The only thing I’m aware of in the COI that seems concerning to me re: the Trust is a clause added to the COI sometime between the 2023 and 2025 editions, namely the italicized portion of the following:
(C) Action by the Board of Directors. Except as expressly provided herein, each director of the Corporation shall be entitled to one (1) vote on all matters presented to the Board of Directors for approval at any meeting of the Board of Directors, or for action to be taken by written consent without a meeting; provided, however, that, if and for so long as the Electing Preferred Holders are entitled to elect a director of the Corporation, the affirmative vote of either (i) the Electing Preferred Director or (ii) at least 61% of the then serving directors may be required for authorization by the Board of Directors of any of the matters set forth in the Investors’ Rights Agreement. If at any time the vote of the Board of Directors with respect to a matter is tied (a “Deadlocked Matter”) and the Chief Executive Officer of the Corporation is then serving as a director (the “CEO Director”), the CEO Director shall be entitled to an additional vote for the purpose of deciding the Deadlocked Matter (a “Deadlock Vote”) (and every reference in this Restated Certificate or in the Bylaws of the Corporation to a majority or other proportion of the directors shall refer to a majority or other proportion of the votes of the directors), except with respect to any vote as to which the CEO Director is not disinterested or has a conflict of interest, in which such case the CEO Director shall not have a Deadlock Vote.
I think this means that the 3 LTBT-appointed directors do not have the ability to unilaterally take some kinds of actions, plausibly including things like firing the CEO (it would depend on what’s in the Investors’ Rights Agreement, which I don’t have access to). I think this is somewhat concerning, and moderately downgrades my estimate of the hard power possessed by the LTBT, though my biggest worry about the quality of the Trust’s oversight remains the degree of its AI safety expertise and engagement rather than its nominal hard power. (Though as I said above, interpreting this stuff is hard and I think it’s quite plausible I’m neglecting important considerations!)
More like True Time Horizon, though I think the claim is pretty plausible within the domain of well-scoped end-to-end AI R&D remote work tasks as well.
I also think that there is a <5% chance that a large-scale AI catastrophe occurs in the next 6 months, but I don’t think the time horizon framing here is a compelling argument for that.
Let’s use one year of human labor as a lower bound on how difficult it is. This means that AI systems will need to at least have a time horizon of one work-year (2000 hours) in order to cause a catastrophe.
I’m not very sold by this argument. AIs can do many tasks that take over a year of human labor, like “write 1 billion words of fiction”; when thinking about time horizons we care about something like “irreducible time horizon”, or some appeal to the most difficult or long-horizon-y bottleneck within a task.
While I agree that causing a catastrophe would require more than a year of human labor, it is not obvious to me that causing a large-scale catastrophe is bottlenecked on any task with an irreducible time horizon of one year; indeed, it’s not obvious to me that any such tasks exist! It seems plausible to me that the first AIs with reliable 1-month time horizons will basically not be time-horizon-limited in any way that humans aren’t, and unlikely-but-possible that this will be true even at the 1 week level.
Concretely, a massively scaled, coordinated cyberattack on critical infrastructure worldwide is a threat model that could plausibly cause 1e8 deaths and does not obviously rely on any subtask with an especially large time horizon; I think the primary blockers to autonomous LLM “success” here in 2025 are (1) programming skill (2) ability to make and execute on an okay plan coherently enough not to blunder into failure (3) ability to acquire enough unmonitored inference resources (4) alignment. Of these I expect (2) to be the most time-horizon bottlenecked, but I wouldn’t feel that surprised if models that could pull it off still had low reliability at 1-day AI R&D tasks. (This particular scenario is contingent enough that I think it’s still very unlikely in the near term, TBC.)
But the model probably “knows” how many tokens there are; it’s an extremely salient property of the input
This doesn’t seem that clear to me; what part of training would incentivize the model to develop circuits for exact token-counting? Training a model to adhere to a particular token budget would do some of this, but it seems like it would have relatively light pressure on getting exact estimates right vs guessing things to the nearest few hundred tokens.
We know from humans that it’s very possible for general intelligences to be blind to pretty major low-level features of their experience; you don’t have introspective access to the big hole in your visual field, the mottled patterns of blood vessels in front of your eye at all times, or the ways your brain distorts your perception of time and retroactively adjusts your memories of the past half-second.
One way to test this would be to see if there are SAE features centrally about token counts; my guess would be that these show up in some early layers but are mostly absent in places where the model is doing more sophisticated semantic reasoning about things like introspection prompts. Ofc this might fail to capture the relevant sense of “knowing” etc, but I’d still take it as fairly strong evidence either way.
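For concreteness, here’s roughly the kind of check I have in mind (a sketch only; the token counts and per-prompt SAE activations below are random placeholders standing in for the real thing at whatever layer you care about):

```python
# Sketch: look for SAE features whose activation tracks the number of tokens in the prompt.
# The "data" here is synthetic; in practice you'd substitute real per-prompt token counts
# and mean SAE feature activations from the layer of interest.
import numpy as np

rng = np.random.default_rng(0)
n_prompts, n_features = 500, 4096
token_counts = rng.integers(10, 2000, size=n_prompts)     # placeholder token counts
activations = rng.random((n_prompts, n_features))          # placeholder SAE activations

# Pearson correlation of each feature's activation with token count.
tc = (token_counts - token_counts.mean()) / token_counts.std()
acts = (activations - activations.mean(axis=0)) / (activations.std(axis=0) + 1e-8)
corrs = acts.T @ tc / n_prompts

# Features most correlated with token count are candidate "token counting" features;
# compare how many show up in early vs late layers.
for f in np.argsort(-np.abs(corrs))[:10]:
    print(f"feature {f}: corr with token count = {corrs[f]:+.3f}")
```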
Does this cache out into concrete predictions of tasks which you expect LLMs to make little progress on in the future?
A very literal eval suggested by your post: take two maps or images with a similar style but different global structure, cut them into little square sections, and ask a model to partition the pieces from both puzzles into two coherent wholes. I expect LLMs to be really bad at this task right now, but they’re very bad at vision in general, so “true understanding” isn’t really the bottleneck IMO.
But one could do a similar test for text-based data; eg one could ask a model to reconstruct two math proofs with shared variable names based on an unordered list of the individual sentences in each proof. Is this the kind of thing you expect models to make unusually little progress on relative to other tasks of similar time horizon? (I might be down to bet on something like this, though I think it’ll be tricky to operationalize something crisply enough.)
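To give a flavor of the operationalization (a rough sketch; the two proofs are toy stand-ins and query_model is a hypothetical placeholder for however one would actually call the model):

```python
# Sketch of the "unshuffle two interleaved proofs" eval described above.
import random

proof_a = ["Let n be an even integer.", "Then n = 2k for some integer k.",
           "So n^2 = 4k^2.", "Hence n^2 is divisible by 4."]
proof_b = ["Let n be an odd integer.", "Then n = 2k + 1 for some integer k.",
           "So n^2 = 4k^2 + 4k + 1.", "Hence n^2 is odd."]

pairs = [(s, 0) for s in proof_a] + [(s, 1) for s in proof_b]
random.shuffle(pairs)
shuffled, truth = [s for s, _ in pairs], [lbl for _, lbl in pairs]

prompt = ("These sentences come from two different proofs, shuffled together. "
          "Label each sentence A or B so that each label group forms a coherent proof:\n"
          + "\n".join(f"{i + 1}. {s}" for i, s in enumerate(shuffled)))

# model_labels = query_model(prompt)        # hypothetical model call returning ["A", "B", ...]
model_labels = ["A"] * len(shuffled)        # trivial baseline so the script runs end to end

def partition_accuracy(pred, truth):
    """Fraction of sentences assigned to the right proof, invariant to swapping A/B."""
    as_ints = [0 if p == "A" else 1 for p in pred]
    acc = sum(p == t for p, t in zip(as_ints, truth)) / len(truth)
    return max(acc, 1 - acc)

print(f"partition accuracy: {partition_accuracy(model_labels, truth):.2f}")
```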
The details are complicated, vary a lot person-to-person, and I’m not sure which are OK to share publicly; the TLDR is that relatively early employees have a 3:1 match on up to 50% of their equity, and later employees a 1:1 match on up to 25%.
I believe that many people eligible for earlier liquidation opportunities used the proceeds from said liquidation to exercise additional stock options, because various tax considerations mean that doing so ends up being extremely leveraged for one’s donation potential in the future (at least if one expects the value of said options to increase over time); I expect that most people into doing interesting impact-maximizing things with their money took this route, which doesn’t produce much in the way of observable consequences right now.
I’ve made a legally binding pledge to allocate half of it to 501(c)(3) charities, the maximum that my employer’s donation match covers; I expect to donate the majority of the remainder but have had no opportunities to liquidate any of it yet.
Yep, I agree that’s a risk, and one that should seem fairly plausible to external readers. (This is why I included other bullet points besides that one.) I’m not sure I can offer something compelling over text that other readers will find convincing, but I do think I’m in a pretty epistemically justified state here even if I don’t think you should think that based on what you know of me.
And TBC, I’m not saying I’m unbiased! I think I am biased in a ton of ways—my social environment, possession of a stable high-status job, not wanting to say something accidentally wrong or hurting people’s feelings, inner ring dynamics of being in the know about things, etc are all ways I think my epistemics face pressure here—but I feel quite sure that “the value of my equity goes down if Anthropic is less commercially successful” contributes a tiny tiny fraction to that state of affairs. You’re well within your rights to not believe me, though.
Agreed—I do think the case for doing this for signaling reasons is stronger for Joe and I think it’s plausible he should have avoided this for that reason. I just don’t think it’s clear that it would be particularly helpful on the object level for his epistemics, which is what I took the parent comment to be saying.
Speaking for myself as someone who works at Anthropic and holds equity: I think I just bite the bullet that this doesn’t affect my decisionmaking that much and the benefits of directing the resources from that equity to good ends are worth it.
(I did think somewhat seriously about finding a way to irrevocably commit all of my equity to donations, or to fully avoid taking possession of it, but mainly for the signaling benefits of there being an employee who was legibly not biased in this particular way in case that was useful when things got crazy; I don’t think it would have done much on the object level.)
Some reasons I think this is basically not a concern for me personally:
Deciding to pledge half my equity to 501(c)(3) charities felt like a pretty easy decision; I now think it’s possible this was a mistake because the value of political giving may outweigh the tax advantages and donation match, but I don’t really remember my personal wealth being a driving factor there. And effects on Anthropic-as-a-whole have a way higher ratio of altruistic value to personal wealth than that!
Of course having donation-pledged dollars riding on Anthropic’s success is still a source of bias, but my own equity changes that very little, because my donation preferences are extremely correlated with vastly larger pools of equity from other employees; I already had 99% as much of an altruistic incentive for Anthropic to succeed commercially, and I think most people reading this comment are in a similar boat.
Empirically when I advocate internally for things that would be commercially costly to Anthropic I don’t notice this weighing on my decisionmaking basically at all, like I’m not sure I’ve literally ever thought about it in that setting?
If I were well-modeled as an actor whose equity value steered their actions in significant ways, I think I would be putting much more effort into tax optimization than I do now.
The epistemic distortions from one’s social and professional environment seem vastly larger to me. This isn’t directly an argument that the equity thing isn’t useful on the margin, but it just seems like a weird area of intervention when there’s so much lower-hanging fruit. I think decisions like “live in Berkeley or SF” have easily an order of magnitude more impact on a person’s orientation to these questions.
Others might vary a lot in how they orient to such things, though; I don’t claim this is universal.
My friend Peter Schmidt-Nielsen revitalized an old conversation of ours on twitter, and it led to me having a much better sense of what’s up with azeotropic concentrations and why it is that you “can’t” concentrate ethanol above its azeotropic point of 95% via distillation and boiling.
The level 1 argument goes:
When you boil off vapors from a 50% ethanol mixture, the ethanol boils off more readily, and so you can condense the vapor back into a 60% ethanol mixture (and redistill your now-lower-concentration liquid leftovers, and so on). If you repeat this process, the ethanol concentration will keep climbing, but it asymptotes—at 94%, you’ll only get to 94.5%, and 94.5% will only get you to 94.7%, and so on. If all you have is a collection of mixtures at concentrations less than 95%, everything you do to them will only give you more of the same. So you can’t make pure ethanol via distillation.
[N.B.: This chart is for propanol rather than ethanol, but it’s more nicely labeled than the ethanol charts I could find, and also the relevant regions end up a little more visually distinct here. But it’s the same qualitative story as with ethanol, just with more of the action in the [0.95,1] regime. See here for a real ethanol graph.]
The level 2 rejoinder goes:
Hang on a moment, though. What if I get up to 95%, and then just give myself a tiny little push by adding in a bit of pure ethanol? Now my vapor will be lower concentration than my starting liquid, but that’s just fine—it means that the leftover liquid has improved! So I can do the same kind of distillation pipeline, but using my liquid as the high-value enriched stages, and carry myself all the way up to 100% purity. Once I pass over the azeotropic point even a little, I’m back in a regime where I can keep increasing my concentrations, just via the opposite method.
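(A toy numerical version of levels 1 and 2, using a made-up enrichment curve that merely has the right qualitative shape, i.e. a fixed point at the ~95% azeotrope; this is not real ethanol-water vapor-liquid data:)

```python
# Toy model: repeatedly apply one "keep the richer fraction" step of an imaginary still.
# Below the azeotrope the richer fraction is the vapor, which pulls you up toward ~95%;
# above it the richer fraction is the leftover liquid, which pulls you up toward 100%.
def enrich(c, k=1.0, azeo=0.95):
    return c + k * c * (1 - c) * abs(azeo - c)

for start in (0.50, 0.94, 0.96):
    c = start
    for _ in range(100):
        c = enrich(c)
    print(f"start at {start:.0%} -> {c:.2%} after 100 toy enrichment steps")
# Starts below 95% creep up to (but never past) the azeotrope; a start just above it
# keeps climbing toward 100%.
```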
But the level 3 counterargument goes:
Although this technically works, your yields are crap—in particular, the yield of pure ethanol you can get out this way is only ever as much as you added initially to get over the hump of the azeotropic point, so you don’t get any gains. Suppose you start with 1 unit of azeotropic solution and you add x units of pure ethanol to bring yourself up to a solution at some concentration above the azeotropic 95%. Now any distillation and remixing steps you do from here will keep concentrations above 95%, for the same reasons they kept them below 95% in the other regime. But if the concentrations can’t drop below 95%, then conservation of water and ethanol mass means that you can only ever get x units of pure ethanol back out of this pool—if you had any more, the leftovers would have to average a concentration below 95%, which they can’t cross. So your yield from this process is at most as good as what you put in to kickstart it.
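Written out, the conservation bound (my notation: a ≈ 0.95 is the azeotropic concentration, x the pure ethanol added, y the pure ethanol extracted):

```latex
% Yield bound from conservation of mass (assumes the leftover pool stays at or above a,
% and that its volume 1 + x - y is positive)
\begin{align*}
\text{ethanol available} &= a \cdot 1 + x, \qquad \text{total volume} = 1 + x,\\
\frac{a + x - y}{1 + x - y} \ge a
  \;&\Longleftrightarrow\; (x - y)(1 - a) \ge 0
  \;\Longleftrightarrow\; y \le x.
\end{align*}
```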
I’m not sure what level 4 is yet, but I think it might be either of:
Use the variation of vapor-liquid curves with ambient pressure to get more wiggle room and extract way better yields by dynamically moving the pressure of your distillery up and down.
Force higher concentrations via osmosis with the right filter around the azeotropic point to get the boost, and anti-distill to get pure ethanol out of that, which might turn out to be more energy-efficient than using osmosis to go that high directly.
(People who actually know physics and chemistry, feel free to correct me on any of this!)
Nope! Do you recommend it?
It seems like Reginald Johnston was basically an apologist? But I haven’t done any due diligence here; it’s certainly possible this account is pretty distorted. Would be curious to hear about any countervailing narratives you find.
I recommend the Wikipedia article on Puyi, the last emperor of China. He was 6 years old when the 1911 Xinhai revolution forced him to abdicate, and spent the remainder of his life being courted and/or used by various actors who found his support useful to their cause.
I think it’s worth reading, though it’s fairly long; I think the story of his life is just pretty interesting and gripping, and the article is unusually well-written and presented in a fairly narratively engaging style. (Though I am not particularly well-versed in 20th century Chinese history and am fully deferring to Wikipedia editors on the veracity and neutrality of this account.)
It’s an interesting case study of Just Some Guy being thrust into high-stakes political machinations without much independent power—if you’ve ever wondered how you would fare in a Game of Thrones style political intrigue, I think Puyi’s story gives a decent sense: you get puppeted around by much more ruthless actors, accomplish very little of what you want, and generally have a pretty bad time.
I feel a weird mix of emotions towards the guy. He was pretty clearly an awful person almost wholly devoid of any virtue for the first several decades of his life—cowardly, vain, cruel, naive, incompetent, and a rapist. But he never really had a chance to be anything else; it’s unclear if he had a single genuine relationship with someone who wasn’t trying to manipulate him after age 8, and I’m not sure he had ever encountered a person earnestly doing something for moral reasons. And he lived a pretty unpleasant life during this whole time.
A couple teaser facts to entice you to read the article:
He was incredibly clumsy. Until middle age he had never once had to do things like brush his teeth, close doors behind himself, or put anything away, because servants simply handled it all for him. Apparently a total lack of having to do things for yourself in childhood can give you a sort of developmental disorder in which you can’t really pick up the habits effectively at age 40.
He has one of the most dramatic redemption arcs of any character I know of in truth or fiction; it seems like he pretty genuinely came to regret his (many) past misdeeds and ended up as a pretty kind and humble man. This redemption arc is slightly tarnished by the fact that it, like everything else in his life, was a calculated manipulation, this time on the part of the newly formed CCP. Though it seems like they mostly just had to expose him to the reality of what he’d been complicit in, rather than any more sophisticated manipulation (and avoid letting him learn about the widespread famines).
About the only positive influence on his life during adolescence was his English tutor Reginald Johnston, a Scottish guy who really liked monarchy (in the “emperor is basically a god” sense rather than the British sense) and had enormous influence over Puyi as the only person willing or able to say no to him.
FWIW, my enthusiasm for “make America more good at AI than China” type policies comes somewhat more from considerations like “a larger US advantage lets the US spend more of a lead on safety without needing international cooperation” than considerations like “a CCP-led corrigible ASI would lead to much worse outcomes than a USG-led corrigible ASI”. Though both are substantial factors for me and I’m fairly uncertain; I would not be surprised if my ordering here switched in 6 months.
Thanks for writing this post! I’m curious to hear more about this bit of your beliefs going in:
The existential risk argument is suspiciously aligned with the commercial incentives of AI executives. It simultaneously serves to hype up capabilities and coolness while also directing attention away from the real problems that are already emerging. It’s suspicious that the apparent solution to this problem is to do more AI research as opposed to doing anything that would actually hurt AI companies financially.
Are there arguments or evidence that would have convinced you the existential risk worries in the industry were real / sincere?
For context, I work at a frontier AI lab and from where I sit it’s very clear to me that the x-risk worries aren’t coming from a place of hype, and people who know more about the technology generally get more worried rather than less. (The executives still could be disingenuous in their expressed concern, but if so they’re doing it in order to placate their employees who have real concerns about the risks, not to sound cool to their investors.)
I don’t know what sorts of things would make that clearer from the outside, though. Curious if any of the following arguments would have been compelling to you:
The AI labs most willing to take costly actions now (like hire lots of safety researchers or support AI regulation that the rest of the industry opposes or make advance commitments about the preparations they’ll take before releasing future models) are also the ones talking the most about catastrophic or existential risks.
Like if you thought this stuff was an underhanded tactic to drum up hype and get commercial success by lying to the public, then it’s strange that Meta AI, not usually known for its tremendous moral integrity, is so principled about telling the truth that they basically never bring these risks up!
People often quit their well-paying jobs at AI companies in order to speak out about existential risk, or because they think insufficient attention is being paid to catastrophic or existential risks from AI.
The standard trajectory is for lab executives to talk about existential risk a moderate amount early on, when they’re a small research organization, and then become much quieter about it over time as they become subject to more and more commercial pressure. You actually see much more discussion of existential risk among the lower-level employees whose statements are less scrutinized for being commercially unwise. This is a weird pattern for something whose main purpose is to attract hype and investment!
This doesn’t seem right to me? The paper indicates that models which see such prompts in training end up aligned on other environments!
You might worry about the effects of this paper becoming self-reinforcing in future models via contamination of their training data, even if the original results had been misleading or incorrect in some way; but I don’t see why you’d expect it to be anti-inductive in this way.