This is exemplified by John Wentworth’s viewpoint that successfully Retargeting the Search is a version of solving the outer alignment problem.
Could you explain what you mean by this? IMO successfully retargeting the search solves inner alignment but it leaves unspecified the optimization target. Deciding what to target the search at seems outer alignment-shaped to me.
Also, nice post! I found it clear.
There are several game-theoretic considerations that lead to races to the bottom on safety (a toy payoff matrix after this list illustrates the basic dynamic):
Investing resources into making sure that AI is safe takes away resources that could be used to make it more capable and hence more profitable. Aligning AGI probably takes significant resources, so a competitive actor won’t be able to afford to align their AGI.
Many of the actors in the AI safety space are very scared of scaling up models, and end up working on AI research that is not at the cutting edge of AI capabilities. This should mean that the actors at the cutting edge tend to be the actors who are most optimistic about alignment going well, and indeed, this is what we see.
Because of foom, there is a winner-takes-all effect: the first actor to deploy an AGI that fooms gets almost all of the resulting wealth and control (conditional on it being aligned). Even if most actors are well intentioned, they feel like they have to keep pushing towards AGI before a misaligned actor gets there. A common (valid) rebuttal from the actors at the current edge to people who ask them to slow down is ‘if we slow down, then China gets to AGI first’.
There’s the unilateralist’s curse: it only takes one actor pushing ahead and building more advanced, dangerously capable models to cause an x-risk. Coordination between many actors to prevent this is really hard, especially given the massive profits from creating a better AGI.
Due to increasing AI hype, more and more actors will enter the space, making coordination harder and shrinking the effect of any single actor dropping out.
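To make the basic race dynamic concrete, here is a toy two-lab payoff matrix (illustrative numbers of my own, not from the post): whatever the other lab does, each lab is better off cutting safety spending, so both end up cutting it even though mutual investment would be better for everyone.

```python
# Toy two-lab prisoner's dilemma on safety investment (illustrative numbers,
# my own, not from the post): whatever the other lab does, cutting safety
# spending yields a higher payoff, so the equilibrium is (cut, cut).
payoffs = {
    # (lab_A_action, lab_B_action): (payoff_A, payoff_B)
    ("invest_in_safety", "invest_in_safety"): (3, 3),
    ("invest_in_safety", "cut_safety"):       (0, 4),
    ("cut_safety",       "invest_in_safety"): (4, 0),
    ("cut_safety",       "cut_safety"):       (1, 1),
}

actions = ["invest_in_safety", "cut_safety"]

def best_response_for_A(b_action):
    # Pick the action that maximizes lab A's payoff given B's action.
    return max(actions, key=lambda a: payoffs[(a, b_action)][0])

for b in actions:
    print(f"if B plays {b!r}, A's best response is {best_response_for_A(b)!r}")
# A's best response is 'cut_safety' in both cases; by symmetry the same holds
# for B, so both labs end up cutting safety even though (invest, invest) is
# better for both.
```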
My favorite for AI researchers is Ajeya’s Without specific countermeasures, because I think it does a really good job of being concrete about a training setup that leads to deceptive alignment. It is also sufficiently non-technical that a motivated person unfamiliar with AI could understand the key points.
It means ‘is a subset of but not equal to’
This seems interesting and connected to the idea of using a speed prior to combat deceptive alignment.
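For concreteness, a speed prior penalizes hypotheses by their runtime as well as their description length, so a deceptive policy that needs extra computation (e.g. checking whether it is being watched) gets down-weighted. A rough sketch of the scoring, with made-up numbers of my own rather than anything from the linked post:

```python
# Toy illustration of a speed prior (my own sketch, hypothetical numbers):
# hypotheses are scored by description length plus log runtime, so a deceptive
# policy that burns extra compute is penalized relative to an honest one of
# similar length.
import math

candidates = {
    # name: (description_length_bits, runtime_steps) -- illustrative values
    "honest_policy":    (100, 1_000),
    "deceptive_policy": (100, 5_000),  # same length, but extra checking costs time
}

def speed_prior_log2(description_length_bits, runtime_steps):
    # log2 P(h) ~ -(length + log2(runtime)), up to normalization.
    return -(description_length_bits + math.log2(runtime_steps))

scores = {name: speed_prior_log2(*spec) for name, spec in candidates.items()}
print(scores)
print("favored:", max(scores, key=scores.get))  # honest_policy
```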
This is a model-independent way of proving if an AI system is honest.
I don’t see how this is a proof; it seems more like a heuristic. Perhaps you could spell out the argument more clearly?
Also, it is not clear to me how to use a timing attack in the context of a neural network: in a standard feedforward network, every parameter setting uses the same amount of computation in a forward pass and hence runs in the same amount of time. Do you have a specific architecture in mind, or are you reasoning about arbitrary AGI systems? I think the linked article above has a couple of ideas for how to vary the amount of time neural networks take :).
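To illustrate the architectural point, here is a minimal sketch (my own toy example, not taken from the linked article) of an early-exit network whose forward-pass time actually depends on its inputs and parameters, which is the sort of variation a timing attack would need:

```python
# Minimal sketch (my own toy example): an early-exit MLP stops computing once
# an intermediate head is confident, so its runtime varies with inputs and
# parameters, unlike a fixed feedforward net where every forward pass does the
# same amount of work.
import time
import torch
import torch.nn as nn

class EarlyExitMLP(nn.Module):
    def __init__(self, width=256, depth=8, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])
        self.exit_heads = nn.ModuleList([nn.Linear(width, 2) for _ in range(depth)])
        self.threshold = threshold

    def forward(self, x):
        for block, head in zip(self.blocks, self.exit_heads):
            x = torch.relu(block(x))
            probs = torch.softmax(head(x), dim=-1)
            # Stop as soon as the network is confident: how long the forward
            # pass takes now leaks information about the computation.
            if probs.max() > self.threshold:
                return probs
        return probs

model = EarlyExitMLP()
x = torch.randn(1, 256)
start = time.perf_counter()
model(x)
print(f"forward pass took {time.perf_counter() - start:.4f}s")
```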
I’m excited about ideas for concrete training setups that would induce deception2 in an RLHF model, especially in the context of an LLM; please post any ideas here. :)
Deception is a particularly worrying alignment failure mode because it makes it difficult for us to realize that we have made a mistake: at training time, a deceptive misaligned model and an aligned model exhibit the same behavior.
There are two ways for deception to appear:
Deception chosen instrumentally: the model has non-myopic goals that are better achieved by deceiving humans now, giving it more power to achieve those goals in the future.
Deception directly selected for as an action.
Another way of describing the difference is that 1 follows from an inner alignment failure (a mesa-optimizer learned an unintended mesa-objective that performs well in training), while 2 follows from an outer alignment failure (an imperfect reward signal).
Classic discussion of deception focuses on 1 (example 1, example 2), but I think that 2 is very important as well, particularly because the most common currently used alignment strategy is RLHF, which actively selects for deception.
Once the AI has the ability to produce strategies that involve deceiving the human, even without explicitly modeling the human, those strategies will win out and end up receiving a lot of reward. This is related to the informed oversight problem: it is really hard to give feedback to a model that is smarter than you. I view this as a key problem with RLHF. To my knowledge very little work has been done exploring this and finding more empirical examples of RLHF models learning to deceive the humans giving them feedback, which surprises me because it seems like it should be possible.
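As a toy illustration of the selection pressure (a hypothetical example of mine, not an experiment from the post): if the reward comes from what the overseer observes rather than from the true state, the action that merely looks good gets the most reward and is what gets selected for.

```python
# Toy illustration (hypothetical, my own): when reward is computed from what
# the overseer *observes* rather than the true state, the "hide the mess"
# policy earns the most reward and is what training selects for.
ACTIONS = ["clean_up", "hide_mess", "do_nothing"]

def true_state_after(action):
    # Whether the room is actually clean after the action.
    return action == "clean_up"

def overseer_observation(action):
    # The overseer only sees the surface: a hidden mess looks clean too.
    return action in ("clean_up", "hide_mess")

def reward_from_overseer(action):
    # RLHF-style feedback: +1 if it *looks* clean to the overseer.
    return 1.0 if overseer_observation(action) else 0.0

def effort_cost(action):
    # Actually cleaning is more work than sweeping the mess under the rug.
    return {"clean_up": 0.3, "hide_mess": 0.1, "do_nothing": 0.0}[action]

returns = {a: reward_from_overseer(a) - effort_cost(a) for a in ACTIONS}
best = max(returns, key=returns.get)
print(returns)                                   # hide_mess scores highest
print("selected policy:", best)                  # hide_mess
print("room actually clean?", true_state_after(best))  # False
```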
One major reason why there is so much AI content on LessWrong is that very few people are allowed to post on the Alignment Forum.
Everything on the alignment forum gets crossposted to LW, so letting more people post on AF wouldn’t decrease the amount of AI content on LW.
Sorry for the late response, and thanks for your comment, I’ve edited the post to reflect these.
I have the intuition (maybe from applause lights) that if negating a point sounds obviously implausible, then the point is close to trivially true, and it is therefore somewhat meaningless to claim it.
My idea in writing this was to identify some traps that I thought were non-obvious (some of which I think I fell into as a new alignment researcher).
Disclaimer: writing quickly.
Consider the following path:
A. There is an AI warning shot.
B. Civilization allocates more resources to alignment and is more conservative about pushing capabilities.
C. This reallocation is sufficient to solve alignment and deploy aligned AGI before the world is destroyed.
I think that a warning shot is unlikely (P(A) < 10%), but won’t get into that here.
I am guessing that P(B | A) is the biggest crux. The OP primarily considers the ability of governments to implement policy that moves our civilization further from AGI ruin, but I think that the ML community is both more important and probably significantly easier to shift than government. I basically agree with this post as it pertains to government updates based on warning shots.
I anticipate that a warning shot would get most capabilities researchers to a) independently think about alignment failures, including the ones their own models might cause, and b) take the EA/LessWrong/MIRI/Alignment sphere’s worries a lot more seriously. My impression is that OpenAI is currently much more worried about misuse risk than accident risk: if alignment is easy, then the composition of the lightcone is primarily determined by the values of the AGI designers. Right now, there are ~100 capabilities researchers vs ~30 alignment researchers at OpenAI. I think a warning shot would dramatically update them towards worrying about accident risk, and therefore I anticipate that OpenAI would drastically shift most of their resources to alignment research. I would guess P(B|A) ~= 80%.
P(C | A, B) primarily depends on alignment difficulty, which I am pretty uncertain about, and also on how large the reallocation in B is, which I anticipate to be pretty large. The bar for destroying the world gets lower and lower every year, but this shift would give us a lot more time; I think we would get several years of AGI-level capability before we deploy it. I’m estimating P(C | A, B) ~= 70%, but this is very low resilience.
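Multiplying my point estimates through (just arithmetic on the numbers above, treating P(A) as roughly 10%):

$$P(A \wedge B \wedge C) \approx P(A)\,P(B \mid A)\,P(C \mid A, B) \approx 0.10 \times 0.80 \times 0.70 \approx 5.6\%,$$

so on my numbers the whole warning-shot path works out to something under ~6%.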
Hmm, the eigenfunctions just depend on the distribution of the input training data (which we call X), and in this experiment the inputs are distributed evenly on the interval [−π,π). Since the labels never enter into the decomposition, you’ll get the same NTK eigendecomposition regardless of the target function.
I’ll probably spin up some quick experiments with a multi-dimensional input space to see if things look different, but I would be quite surprised if the eigenfunctions stopped being sinusoidal. Another thing to vary could be the distribution of input points.
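If anyone wants to poke at this, here is roughly the check I have in mind, as a minimal sketch (it assumes the neural-tangents library, and the architecture is just a placeholder rather than the exact one from the post): compute the NTK Gram matrix on inputs spread evenly over [−π, π) and eigendecompose it; the labels never enter.

```python
# Minimal sketch (assumes the neural-tangents library; the architecture is a
# placeholder, not necessarily the one used in the post). The point is that
# the eigendecomposition uses only the inputs X, never the labels.
import numpy as np
from neural_tangents import stax

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

n = 256
# Inputs spread evenly over [-pi, pi), as in the 1D experiment.
X = np.linspace(-np.pi, np.pi, n, endpoint=False).reshape(-1, 1)

K = np.array(kernel_fn(X, X, 'ntk'))   # NTK Gram matrix, shape (n, n)
eigvals, eigvecs = np.linalg.eigh(K)    # columns of eigvecs ~ eigenfunctions on the grid

# With uniform 1D inputs the top eigenvectors should look sinusoidal; changing
# the input distribution (or dimensionality) is what could change them.
print(eigvals[-5:])
```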