Head of linear regression at METR.
Previously: MIRI → interp with Adrià and Jason → METR.
I have signed no contracts or agreements whose existence I cannot mention.
I can’t really parse this post due to the lack of examples. Is it too much trouble to give some harmless ones?
I spent 2 hours on this the other day. Today Daniel told me the answer, and after ~15 more minutes of effort I still don’t know how to construct a scaffold that would generalize to other problems like this. I might give more information after the answer is revealed here.
Yes, designing novel tasks at scale is hard, especially when the tasks require long human baselines. Several researcher-years were spent obtaining the incomplete set of baselines in HCAST and RE-Bench. One should expect the difficulty of constructing and baselining tasks to grow exponentially while validity decreases (because speedup is more important than time horizon for downstream impacts, because most tasks that take humans 16+ hours are not fully self-contained, etc.), such that at some point it's not worth maintaining time horizon in its original form anymore.
There are various more scalable ways to measure AI capabilities, whether that's calibrating AI difficulty judges to get task length estimates (thereby eliminating human baselines), collecting human data in some more scalable way, using observational data, or other means. All of these are active areas of research, and the more that pan out, the better an idea we'll have of capabilities and trends. But many of the new metrics wouldn't be measured in human hours. So it's plausible METR will be able to tell whether AIs next year give 10x uplift and whether they're on track for a software-only singularity, but not whether their time horizon is over, say, 200 hours.
I’d take your side. In 5 years we could be in low-medium automation, high automation, or full automation, and all of these seem more likely to have <100 papers.
In low-medium automation, there's no way to get 100 papers a year of output at the current quality bar.
In high or full automation, AIs will be doing ~all the research, but papers are designed for humans. At 100 papers per year, the average paper will probably be read by fewer than 5 humans. So I'd expect standards to rise and work to be consolidated into fewer papers rather than the number of papers exploding. AIs that need to communicate could use some other format. It would be weird but not impossible for the median to hit 100.
In full automation, we have the same issues as high automation, plus the median postdoc will be basically obsolete and only the ones with lots of compute will contribute.
Yeah, I was being sloppy, using "air supremacy" for the ability to easily conduct air operations (which the US 95% has) vs. the ability to completely deny enemy air operations, which the US arguably can't do given that Shaheds, reconnaissance drones, and one-way FPVs serve some of the purposes one would previously have needed air support for. I would argue that the increasing range of FPVs, now 40+ km in Ukraine, puts them well beyond what ATGMs are capable of.
There are a lot of variables involved given how fast tech is evolving. If Iran can reliably pilot drones from 500 km away, they wouldn’t be risking skilled operators. If US interceptors work as well as Ukraine’s, they could probably intercept >90% of Iranian FPVs and Shaheds. A lot might hinge on who gets to a milestone like this first.
Even if another megacorp doesn't leave the race, companies have a much larger incentive to invest in safety in a 2-actor race than a 3-actor race, because they internalize more of the benefit. At a minimum, they're 1.5x as likely to win (1/2 vs. 1/3), plus it's easier to coordinate between 2 actors than 3.
I would still say this property is not quite necessary when we get strong superintelligence, although it could be the best design choice in practice. The job of preventing people from developing/deploying dangerous AIs doesn’t need to be done by the model’s guardrails; for example we could use surveillance like we do with nuclear weapons.
The requirements for a sufficiently aligned model will depend on what else is happening in society. E.g. if we give everyone weights access to the ASI and don’t regulate AI technology at all, then the AI has to resist being fine-tuned to do dangerous ML research, which is probably infeasible.
Iraq was 32,000 wounded and 4,400 killed, and the US has already suffered hundreds of wounded and 13 deaths in the existing Iran campaign without any ground operations. I’m imagining 100 wounded and maybe another 20 KIA if the US holds Kharg for an extended period, not hundreds of KIA.
The issue is it’s not really true that the US has air supremacy. Kharg Island is within fiber FPV range of the mainland, and real-time ISR is not required for Iran to track static targets on the island. Plus Iran is still able to launch larger drones and the occasional missile. So holding Kharg really means denying drone launch points on a ~20 mile stretch of the coast, which for FPVs can just be two guys in a bunker.
The incentive for Iran is enormous given the US's low tolerance for casualties; it's well worth it to launch 20 $1,500 drones to kill one American.
The US seems to be in a rough spot. Polymarket thinks:
37% for Strait of Hormuz traffic returns to normal by end of May
34% for Iran leadership change in 2026
29% for Iran to no longer control Kharg Island by end of June
40% for Trump to announce end of military operations by end of April, and 77% by end of June
Assuming no regime change, the US's objectives are:
opening the Strait of Hormuz
removing Iran’s progress towards a nuclear weapon
removing various other military capabilities of Iran and its proxies
There is only a 34% chance of leadership change, and maybe only a 20% chance of regime change. In the other ~80%, forcibly opening the Strait seems rough. Experts are pessimistic about the US easily taking Kharg Island, and even if the US controls both Kharg (Iran's export base north of the strait) and other islands like Qeshm (the island in the strait with the largest Iranian military presence), it will probably suffer tens or hundreds of casualties while Iran can still threaten shipping with Shaheds, sea drones, speedboats, and mines. In the median case, it seems like the Strait will open sometime between May and December, but Iran will retain some leverage, possibly extending the toll regime.
Iran losing their existing enriched uranium seems contingent on a deal. The US plan (build a runway 300 miles inland, land excavation equipment via cargo planes, fight into Iranian bunkers over the course of a week, hope that the uranium is intact, easy to find, and not booby-trapped, distinguish it from decoys, load it into storage casks, and fly it out) would have been difficult even in 2003 Iraq, when the US had air supremacy. It is just not compatible with how warfare works in Iran in the drone era.
Claude thinks it's only 20% likely to work, which seems optimistic:
Getting there: Isfahan is more than 480 km (300 miles) inland, hundreds of kilometers from the nearest US naval assets (Al Jazeera). The US has moved the 82nd Airborne, 101st Airborne, Army Rangers, and Marine Expeditionary Units to the region. Forces would need to be inserted by air; there's no overland route from a friendly staging area.
Securing the site: Recovering the uranium would require a significant number of ground troops beyond a small special operations footprint, dozens if not hundreds of additional troops to support the core team. They would need to secure the facilities under potential missile and drone fire and maintain a perimeter for the duration (CNN).
The actual extraction: Airstrikes alone can't penetrate the Isfahan tunnels because the facility doesn't have the ventilation shaft openings that serve as weak points at other nuclear sites (CNN). This means physically entering and digging through rubble. A former special operator trained for such missions described it as "slow, meticulous and can be an extremely deadly process." Another former defense official said it's like "you're not just buying a car on the lot, you're buying the entire assembly line" (The Hill).
Getting it out: The cylinders would need to be transferred into accident-rated transport casks by specially trained SOF personnel with nuclear materials handling experience. The cargo could fill several trucks, and a temporary airfield would likely need to be improvised. The full operation could run for a week (Israel Hayom).
Force protection throughout: There would need to be constant close air support, satellite coverage, and every spectrum of warfare capability to keep Iranian forces away from the site while JSOC and other agencies methodically excavate and retrieve the material (The Hill).
My probability estimate
I’d put the chances of a successful physical extraction of most of the enriched uranium at roughly 15-25%. Here’s my reasoning:
The operation is technically feasible — the US military can do extraordinary things — but the risk profile is extreme for what may be an unnecessary objective
Trump himself has wavered, on March 31 suggesting the uranium is "so deeply buried" and "pretty safe," seeming to lower its priority (Foreign Policy)
Senior military planners are reportedly skeptical: "I don't see any senior planning military officer pursuing this," one former defense official said (Al Jazeera)
The political environment (Polymarket’s 77% for operations ending by June, plus low public appetite for ground troops) creates pressure to wrap up, not escalate
It looks like the US is at least succeeding at destroying the Iranian military, but it’s unclear what this buys them. Drones are really cheap, so Iran will probably always have those. Therefore I think regime change is necessary for the US to come out ahead.
To accurately capture how the world would react, I think you do need to model what the pause looks like, including what evidence Anthropic provides for its decision. As written, the story is a bit jarring to me because Anthropic doesn’t give any evidence with which to earn the world’s positive reaction, but then everything goes bizarrely well. E.g. five high-influence countries instantly sign a joint statement. If this happened today, I would think Anthropic is making an irrational decision, and be ambivalent about the pause.
Even in the story, it’s not clear the world is on track for meaningful global governance of the kind that would stop OAI/GDM though. My sense is that China mostly doesn’t believe in x-risk and mainly wants to be freed from US export controls so it can compete fairly, and developing countries are optimistic about AI and would just be confused about Anthropic pausing. Meanwhile, safety research at Anthropic basically collapses because they can no longer afford compute, and it’s unclear that OAI/GDM will voluntarily increase either their spending on safety or their ability to adopt Anthropic’s recommended mitigations, much less pause voluntarily.
I think Anthropic could pause competently, but it would look very different. The key differences would be that they (a) exhaust internal fixes, and (b) put lots of effort into communicating that they can’t build AI safely, nor can OpenAI or anyone else.
Here's how I think that would go. I'd find it super interesting for someone to build this out into a story.
Suppose that Anthropic’s best model (call it Agent-2.5) were dangerous and they believed the next one (Agent-3) would be even more dangerous. Anthropic previously believed that capability, propensity, and CoT monitoring/control all provide independent arguments that models are safe. So they would need evidence that two or all three fail at once.
Capability: Agent-2.5 is close to being capable of autonomous catastrophic harm. It’s superhuman at every cyber eval they can find, and finds an endless list of exploits against their own codebase.
Propensity: Agent-2.5 has dangerous, coherent, misaligned goals
Control: Fails e.g. due to neuralese
Their safety researchers verify all of these. It's now clear that, according to their RSP, unless they fix propensity and control they shouldn't deploy Agent-2.5 or finish training Agent-3. They have a 2-month lead, which under RSPv3 Appendix A requires them to "delay AI development and deployment as needed" until they have a "strong argument that catastrophic risk is contained" or exhaust their lead; safety researchers have enough influence that this actually happens.
Now Anthropic is losing perhaps $2 billion/day of valuation, and is under enormous pressure to fix propensity and control. Rather than stopping training runs immediately, they probably take the cheapest actions first, like steering vectors. Then they try monitoring every Agent-2.5 with three Agent-2s, on the logic that quadrupling token cost is still better than not deploying the model at all, but it keeps breaking their neuralese translator model. In parallel, they might specifically train against the misaligned goals they observed in Agent-2.5, but find this doesn't generalize. Then they completely redo midtraining using a setup closer to Agent-2's, and add to post-training three different experimental alignment datasets with a 1% capabilities penalty each, but the results are inconclusive because the model is eval-aware and could be hiding a subtler form of dangerous, non-coherent goals. Three weeks in, 40% of the company (and growing) is working on the misalignment crisis. They finish training a non-neuralese model, exclude cyberoffense from training data, and try a few other things, which don't work either...
Now their lead is only 1 month. Everyone is itching to fix up and release one of the inconclusive checkpoints. Given Anthropic's optimistic culture, it is probably very difficult to conclude that another month of safety research won't be sufficient. But maybe Dario makes an executive decision that they won't just pause until they no longer have a lead, but pause unilaterally and indefinitely, and that they should announce this now for maximum pressure on other companies. Through back channels, Anthropic learns that other labs also observe concerning misalignment in their models. Anthropic works with UKAISI, METR, etc. to review their evidence, and makes a presentation at FMF to other frontier labs that their newest models are misaligned in a way that no technique known to Anthropic or UKAISI can address or even measure. This is coordinated with statements from UKAISI, MIRI, and others. Scary demos make it clear the model is one step from permanently escaping human control and is super misaligned.
Then Dario needs to speak in front of Congress alongside someone from UKAISI arguing that all companies should be forced to pause, ideally with Altman present too. They have some kind of draft policy proposal for governance. A week later he speaks in front of the UN with a convincing argument that the policy would be in the interest of other countries even if they don’t quite believe in x-risk. With this kind of targeted generation of political will, it’s at least on the table that governments take sensible actions instead of random anti-AI and safetywashing actions, and that companies will cooperate with them. As a bonus, maybe they get enough funding to keep the company’s safety research alive.
Disappointed that they left these out; I really don't like choosing between being technically correct and politically correct. Maybe they can go in a LessWrong Max tier.
This is unironically how I remember base 2 logarithms. It comes up like every week, and is super easy to remember if you already know what ratios correspond to musical intervals.
2^(1/6) ≈ 9/8 = 1.125 (whole step)
2^(1/4) ≈ 6/5 = 1.2 (minor third)
2^(1/3) ≈ 5/4 = 1.25 (major third)
2^(5/12) ≈ 4/3 = 1.33+ (perfect fourth)
2^(1/2) ≈ 7/5 = 1.4 ish (tritone)
2^(7/12) ≈ 3/2 = 1.5 (perfect fifth)
Other than the tritone, these are all within 1%. The perfect fourth and fifth hold within 0.2%. And you can remember more precisely that sqrt(2) = 1.4142, because it must be the geometric mean of 7/5 = 1.4 and 10/7 = 1.4285+ (their product is exactly 2), which is slightly under their arithmetic mean.
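If you want to sanity-check these, here's a quick script (it just restates the table above; 2^(k/12) is k semitones up the equal-tempered scale):

```python
# Just-intonation mnemonics for 2^(k/12), where k is the number of semitones.
intervals = [
    ("whole step",     2, 9 / 8),
    ("minor third",    3, 6 / 5),
    ("major third",    4, 5 / 4),
    ("perfect fourth", 5, 4 / 3),
    ("tritone",        6, 7 / 5),
    ("perfect fifth",  7, 3 / 2),
]

for name, k, ratio in intervals:
    exact = 2 ** (k / 12)           # the true equal-tempered value
    error = abs(ratio / exact - 1)  # relative error of the mnemonic
    print(f"{name:15s} 2^({k}/12) = {exact:.4f} ~ {ratio:.4f} ({error:.2%} off)")
```

Running it confirms the error bounds: everything within ~1%, the fourth and fifth within ~0.11%.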
My guess is you should get more experience before trying to set your own research directions, especially if they diverge considerably from existing ones. The default is that all research directions are bad, and AI safety is becoming mature enough that good ideas come from experience rather than from first principles. Also in the current environment, automation makes it efficient to execute on good ideas and puts a deadline on gaining experience.
Last year, METR used linear extrapolation on country-level data to infer that AI world takeover would ~never happen. However, reviewers suggested that a sigmoid is more appropriate because most technologies follow S-curves. I just ran this analysis and it’s much more concerning, predicting an AI world takeover in early 2027, and alarmingly, a second AI takeover around 2029.
Here are the main differences in the improved analysis:
The original analysis assumed that the equation for AI takeover vs year would be axis-aligned, but this is an arbitrary basis. We now let the angle of the sigmoid fit the data as appropriate, meaning the following data model:
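My best reconstruction of that data model (the exact parameterization is a guess; θ is the rotation angle and A, t₀, c are fitted): rotate the (year, takeovers) pair by θ and fit a sigmoid in the rotated frame,

$$\begin{pmatrix} t' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} t \\ y \end{pmatrix}, \qquad y' = \frac{c}{1 + e^{-A(t' - t_0)}},$$

where t is the year and y is the cumulative number of AI world takeovers.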
Interestingly, this results in a Z-curve (A < 0) rather than an S-curve. This is consistent with advanced AI being a substitute rather than a complement to labor.
We now incorporate our priors. Even though we only have hard data since 2023, it’s likely that the number of AI takeovers was lower pre-ChatGPT. So we added the inferred point (Nov 30 2022, −0.03). The rotated sigmoid still gets R^2 = 1.000 and now looks much better on BIC than exponential and step-function models, both of which have log-likelihood of minus infinity on negative points.
This new graph uses only aggregate AI world takeover data, which we’re much more confident in than our data from individual countries.
The most important next step is confirming that the # of AI takeovers was large and negative in the past. If so, we should start preparing for AI takeover as a recurring rather than one-time event, which could have significant implications for movement strategy. For example, we may need to pivot into something more like seasonal preparedness, similar to how society handles flu outbreaks or daylight saving time.
This is similar to my understanding. I feel a bit uncomfortable with the "scheming" graph, though: it feels like it should be possible to decompose scheming as the conjunction of several other capabilities and propensities. Suppose (scheming frequency) = (non-corrigibility) * (AI has misaligned goals) * (AI evades behavioral guardrails) * (AI correctly reasons about its strategic position). Then if we see three of the components suddenly start increasing, we should be worried even if the 4th is flat and scheming frequency remains low.
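To illustrate with made-up numbers: suppose non-corrigibility, misaligned goals, and guardrail evasion each rise from 0.1 to 0.5 while strategic reasoning stays at 0.01. The headline frequency barely moves,

$$0.1^3 \times 0.01 = 10^{-5} \;\longrightarrow\; 0.5^3 \times 0.01 \approx 1.3 \times 10^{-3},$$

but only the one flat factor now separates you from a scheming frequency of 0.5^4 ≈ 6%.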
In the “alignment” graph, there are also three reasons why we need more alignment effort at higher capabilities:
Increased consequences of misalignment means we want to decrease frequency of misaligned behavior
More capable AIs are given new affordances which existing alignment training may not generalize to
AIs might tend to have higher frequency of misaligned behavior (or prerequisites thereof) as capability increases, due to RL incentivizing long-term goals or agents reflecting on their values or whatever
Yes, I (as GM) was constantly monitoring the spreadsheet, asking players to explain their actions, and deciding whether the AI would succeed. We know how many human hours certain tasks have taken METR staff in the past, and I mentally estimated these for each player action. A 200h task that was as clean/benchmarky/verifiable as HCAST/RE-Bench tasks would succeed with 50% chance, or have ~1 big mistake on average. Based on 80% time horizons being ~5 times shorter, a clean 40-human-hour task would have an 80% success chance.
In an earlier version, I assigned a "messiness score" from 0 to 5, set the effective task length = human time × 2^messiness, and rolled for AI success. But this was too cumbersome to do for 3 players with 16+ actions each, and the messiness score was subjective anyway, so the players just made a quick intuitive judgement and ran it by me.
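For concreteness, here's a minimal sketch of the scoring model these numbers imply (the logistic-in-log-length form and the function name are my assumptions; the two calibration points are from the comment above):

```python
import math

def success_prob(human_hours: float, messiness: int = 0, h50: float = 200.0) -> float:
    """P(AI succeeds) on a task that takes a skilled human `human_hours`.

    Each messiness point doubles the effective task length, i.e. messier
    tasks count as harder. Success is logistic in log2(length / 50% horizon),
    with the slope set so that tasks 5x shorter than the horizon succeed
    80% of the time, matching the 200h -> 50% and 40h -> 80% numbers above.
    """
    effective_hours = human_hours * 2 ** messiness
    beta = math.log(4) / math.log2(5)  # ~0.60 per doubling of task length
    return 1 / (1 + math.exp(beta * math.log2(effective_hours / h50)))

assert abs(success_prob(200) - 0.5) < 1e-9  # clean 200h task
assert abs(success_prob(40) - 0.8) < 1e-9   # clean 40h task
```

Under this model, a messiness-3 task that takes a human 25 hours scores the same as a clean 200-hour task, i.e. a coin flip.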
Agents are starting to be good at this kind of thing if you give them enough context. With access to every Google Doc Beth Barnes has ever written, and the ability to update its custom instructions with every piece of Beth's actual feedback, my guess is the Beth-simulator agent will be good enough to overlap 50%+ with Beth's actual feedback, which makes this a crucial step when actual Beth has 3x as many things going on.
There's a statue of him in Los Angeles's Little Tokyo, which I used to pay respects to when I visited for the New Year's festival. After I became an EA, I aspired to match or exceed his impact.
Sugihara continued to hand-write visas, reportedly spending 18 to 20 hours a day on them, producing a normal month’s worth of visas each day, until September 4, 1940, when he had to leave his post before the consulate was closed. Sugihara reportedly worked at a quick pace and aimed to issue 200 to 300 visas each day. [...]
According to witnesses, he was still writing visas while in transit from his hotel and after boarding the train at Kaunas railway station, throwing visas into the crowd of desperate refugees out of the train’s window even as the train pulled out.

I conducted an exercise at METR to simulate what our work would be like in 2027, when we have AIs with 200-hour time horizons. Some observations:
The pace of research was much faster than today, something like 3x. I would guess that speedup goes as time horizon to the 0.3 or 0.4 power (see the back-of-envelope check after this list), though we didn't run the game with enough fidelity to tell.
No time to develop ideas before implementing: Agents implement ideas as soon as you think of them, so rather than ideating for days at a time, you can make an MVP in a couple of hours and revise. If the task isn’t near the limit of agent capabilities, you spend all your time understanding results; if it is, you spend all your time checking its work.
Keeping agents fed overnight: Overnight, agents can do maybe 200 human hours of work, but only for very agent-shaped tasks, so researchers need to deliberately sequence projects such that very long tasks suitable for agents happen overnight, e.g. optimizing a well-defined metric.
Prioritization and organization are bottlenecks: If agents can execute all your ideas nearly as fast as you can prompt them, there's no point in implementing only your best idea. It might be better to implement your top three ideas in parallel, but this makes it harder to stay organized. Even with AI-written dashboards to help humans understand what's happening, the complexity of projects will probably go up in a way that makes them much harder to manage.
Anything that takes serial time will no longer happen in parallel with execution, but becomes a serial bottleneck. Perhaps the vast majority of total project time will be taken by things like human data, ML experiments, and feedback (from peers, managers, and especially external advisors).
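Back-of-envelope for the speedup exponent in the first bullet, assuming today's time horizon is ~4 hours (my number, not an official METR figure):

$$\text{speedup} \approx \left(\frac{200\text{ h}}{4\text{ h}}\right)^{0.3} \approx 3.2, \qquad \left(\frac{200\text{ h}}{4\text{ h}}\right)^{0.4} \approx 4.8,$$

so the observed ~3x is consistent with an exponent near 0.3.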
You can read more at the METR blog.
It’s hard to know whether a post should be shortform or longform before posting it. This is why LW should implement a “convert to longform” button that copies comments etc.
There are diminishing returns from effort to project quality, perhaps quality = log(effort). But there are increasing returns from quality to impact, perhaps impact = exp(k * quality) with k > 1. In this model it follows that impact = effort^k, meaning that quality matters so much that overall returns to effort are increasing.
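Spelling out the composition (with k = 2 as an illustrative value):

$$\text{impact} = e^{k \cdot \text{quality}} = e^{k \ln(\text{effort})} = \text{effort}^k, \qquad k = 2:\; 10\times \text{effort} \Rightarrow 100\times \text{impact}.$$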
Ok, but why are there increasing returns from quality to impact? A high-quality research output has enough convincing robustness checks that its claims can be trusted, demonstrates a new technique, and can be immediately built upon. These factors can IMO make one high-quality output more valuable than 10 dubious papers published in parallel.