Member of technical staff at METR.
Previously: Vivek Hebbar’s team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.
I have signed no contracts or agreements whose existence I cannot mention.
According to Wikipedia, it seems to have worked well and not been expensive.
Poland began work on the 5.5-meter (18 foot) high steel wall topped with barbed wire at a cost of around 1.6 billion zł (US$407m) [...] in the late summer of 2021. The barrier was completed on 30 June 2022.[3] An electronic barrier [...] was added to the fence between November 2022 and early summer 2023 at a cost of EUR 71.8 million.[4]
[...] official border crossings with Belarus remained open, and the asylum process continued to function [...]
Since the fence was built, illegal crossings have reduced to a trickle; however, between August 2021 and February 2023, 37 bodies were found on both sides of the border; people have died mainly from hypothermia or drowning.[11]
The Greenberg article also suggests that a reasonable tradeoff is being made in current policy:
Despite these fears, Duszczyk is convinced his approach is working. In a two-month period after the asylum suspension, illegal crossings from Belarus fell by 48% compared to the same period in 2024. At the same time, in all of 2024, there was one death—out of 30,000 attempted crossings—in Polish territory. There have been none so far in 2025. Duszczyk feels his humanitarian floor is holding.
Maybe you’re reading some other motivations into them, but if we just list the concerns in the article, only 2 out of 11 indicate they want protectionism. The rest of the items that apply to AI include threats to conservative Christian values, threats to other conservative policies, and things we can mostly agree on. This gives a lot to ally on, especially the idea that Silicon Valley should not be allowed unaccountable rule over humanity, and that we should avoid destroying everything to beat China. It seems like a more viable alliance than with the fairness-and-bias people; plus, conservatives have way more power right now.
Mass unemployment
“UBI-based communism”
Acceleration to “beat China” forces sacrifice of a “happier future for your children and grandchildren”
Suppression of conservative ideas by big tech, e.g. algorithmic suppression and demonetization
Various ways that tech destroys family values
Social media / AI addiction
Grok’s “hentai sex bots”
Transhumanism as an affront to God and to “human dignity and human flourishing”
“Tech assaulting the Judeo-Christian faith...”
Tech “destroying humanity”
Tech atrophying the brains of their children in school and destroying critical thought in universities.
Rule by unaccountable Silicon Valley elites lacking national loyalty.
I’m curious about your sense of the path towards AI safety applications, if you have a more specific and/or opinionated view than the conclusion/discussion section.
My view is that AIs are improving faster at research-relevant skills like SWE and math than they are worsening in misalignment (rate of bad behaviors like reward hacking, ease of eliciting an alignment faker, etc.) or in covert sabotage ability, such that we would need a discontinuity in both to get serious danger by 2x. There is as yet no scaling law for misalignment showing that it predictably gets worse in practice as capabilities improve.
The situation is not completely clear because we don’t have good alignment evals and could get neuralese any year, but the data are pointing in that direction. I’m not sure about research taste as the benchmarks for that aren’t very good. I’d change my mind here if we did see stagnation in research taste plus misalignment getting worse over time (not just sophistication of the bad things AIs do, but also frequency or egregiousness).
That is, you think alignment is so difficult that keeping humanity alive for 3 years is more valuable than the possibility of us solving alignment during the pause? Or that the AIs will sabotage the project in a way undetectable by management even if management is very paranoid about being sabotaged by any model that has shown prerequisite capabilities for it?
Ways this could be wrong:
We can pause early (before AIs pose significant risk) at little cost
We must pause early (AIs pose significant risk before they speed up research much). I think this is mostly ruled out by current evidence
Safety research inherently has to be done by humans because it’s less verifiable, even when capabilities research is automated
AI lab CEOs are good at managing safety research because their capabilities experience transfers (in this case I’d still much prefer Buck Shlegeris or Sam Altman over the US or Chinese governments)
It’s easy to pause indefinitely once everyone realizes AIs are imminently dangerous, kind of like the current situation with nuclear
Probably others I’m not thinking of
The data is pretty low-quality for that graph because the agents we used were inconsistent and Claude 3-level models could barely solve any tasks. Epoch has better data for SWE-bench Verified, which I converted to time horizon here and found to also be doubling roughly every 4 months. Their elicitation is probably not as good for OpenAI models as for Anthropic models, but both are increasing at similar rates.
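For concreteness, the doubling-time estimate is just a log-linear fit of time horizon against release date. Here is a minimal sketch with made-up numbers (the real input is Epoch’s SWE-bench Verified scores converted to time horizons, not these):

```python
import numpy as np

# Hypothetical (release date in years, 50% time horizon in minutes) pairs.
release_years = np.array([2023.0, 2023.5, 2024.0, 2024.5, 2025.0])
horizon_minutes = np.array([4.0, 11.0, 30.0, 85.0, 240.0])

# Fit log2(horizon) as a linear function of time; the slope is doublings per year.
slope, intercept = np.polyfit(release_years, np.log2(horizon_minutes), 1)
doubling_time_months = 12.0 / slope
print(f"~{doubling_time_months:.1f} months per doubling")  # ≈ 4 months for these made-up numbers
```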
What’s the current best known solution to the 5 and 10 problem? I feel like actual agents have heuristic self-models rather than this cursed logical counterfactual thing and so there’s no guarantee a solution exists. But I don’t even know what formal properties we want, so I also don’t know whether we have impossibility theorems, some properties have gone out of fashion, or people think it can still work.
In a world with an AI pause, good management is much more important than the pause length. If the pause is government-mandated and happens late, I would prefer a 1-year pause with say Buck Shlegeris in charge over a 3-year pause with the average AI lab lead in charge.
The reason is that most of the safety research during the pause will be done by AIs. In a pause scenario where >50% of resources are temporarily dedicated to safety and >10x uplift is technically plausible, the speed of research is mainly limited by the level of AI we trust and our ability to understand the AI’s findings. The world probably can’t pause indefinitely, and it may not be desirable to do so. So what is the basic strategy for a finite AI pause?
Use superhuman AIs that are as smart as possible to do safety research, but not so capable that they’re misaligned. This may require avoiding use of the latest-gen models, or advancing capabilities and using the next gen only for safety.
Use an elaborate monitoring framework, and incrementally apply the safety research generated to raise the maximum capability level we can run at without being sabotaged if models are misaligned.
Prevent capability advances from being stolen, so no rogue actor can break out and threaten DSA, forcing the pause to end.
The variance in AI speed will be huge as models get smarter and more efficient. Conditional on us needing a pause, I imagine that the situation will look something like: Agent-2 speeds up work by 10x and incurs 0.2%/year x-risk, Agent-3 has 20x speed and 1.0%/year x-risk, Agent-4 has 40x speed and 5.0%/year x-risk, and so on. This means there is huge value in advancing capabilities in ways that don’t reduce safety. For example, maybe alignment-specific training data doubles speed while preserving alignment, or a new architecture gets most of the benefits of “neuralese” without compromising monitorability.
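To make the tradeoff concrete, here is a toy calculation with those illustrative numbers: research output per unit of x-risk falls steeply as we move up agent generations, which is exactly why “safe speedups” are so valuable.

```python
# Toy calculation using the illustrative Agent-2/3/4 numbers above
# (hypothetical figures, not predictions).
agents = {
    "Agent-2": {"speedup": 10, "xrisk_per_year": 0.002},
    "Agent-3": {"speedup": 20, "xrisk_per_year": 0.010},
    "Agent-4": {"speedup": 40, "xrisk_per_year": 0.050},
}

pause_years = 1.0
for name, a in agents.items():
    research_years = a["speedup"] * pause_years   # human-equivalent research-years gained
    risk = a["xrisk_per_year"] * pause_years      # cumulative x-risk over the pause
    ratio = research_years / (100 * risk)         # research-years per percentage point of risk
    print(f"{name}: {research_years:.0f} research-years, {risk:.1%} risk, {ratio:.0f} per % of risk")
```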
I expect that in a pause, good vs bad management could make a >10x difference in rate of safety research progress, with some of this coming from targeting the appropriate risk level, some from better handling ambiguous evidence about safety of current models, and some from differentially advancing safe capabilities. In comparison, the amount of political will and diplomacy that will be required to make a pause 10x longer is immense.
Pause length is still valuable inasmuch as it gives more time for management to understand the situation, but is not the most important factor in whether the pause actually succeeds at solving alignment.
I have a better idea now what you intend. At risk of violating the “Not worth getting into?” react, I still don’t think the title is as informative as it could be; summarizing on the object level would be clearer than saying their actions were similar to actions of “moderate accelerationists”, which isn’t a term you define in the post or try to clarify the connotations of.
Who is a “moderate communist”? Hu Jintao, who ran the CCP but in a state capitalism way? Zohran Mamdani, because democratic socialism is sort of halfway to communism? It’s an inherently vague term until defined, and so is “moderate accelerationists”.
I would be fine with the title if you explained it somewhere, with a sentence in the intro and/or conclusion like “Anthropic have disappointingly acted as ‘moderate accelerationists’ who put at least as much resource into accelerating the development of AGI as into ensuring it is safe”, or whatever version of this you endorse. As it is, some readers, or at least I, have to think:
does Remmelt think that Anthropic’s actions would also be taken by people who believe extinction by entropy-maximizing robots is only sort of bad?
Or is it that Remmelt thinks that Anthropic is acting like a company that thinks the social benefits of speeding up AI could outweigh the costs?
Or is the post trying to claim that ~half of Anthropic’s actions sped up AI against their informal commitments?
This kind of triply recursive intention guessing is why I think the existing title is confusing.
Alternatively, the title could be something different like “Anthropic founders sped AI and abandoned many safety commitments” or even “Anthropic was not consistently candid about its priorities”. In any case it’s not clear to me that it’s worth changing vs making some kind of minor clarification.
I’m writing a post about whether we’ll have a Dyson Swarm reach Earth’s current energy generation level before CA High-Speed Rail is complete. It seems plausible if incompetence and NIMBYs delay CAHSR by a few years past ASI, and space-based construction starts right after ASI.
In many industries cost decreases by some factor with every doubling of cumulative production. This is how solar eventually became economically viable.
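This is the standard experience-curve (Wright’s law) relationship; a quick illustration with made-up numbers:

```python
import math

def unit_cost(cumulative_units, first_unit_cost, reduction_per_doubling=0.20):
    """Experience curve (Wright's law): cost falls by a fixed fraction
    with every doubling of cumulative production. Parameters are illustrative."""
    doublings = math.log2(cumulative_units)
    return first_unit_cost * (1 - reduction_per_doubling) ** doublings

# After 10 doublings (1024x cumulative production) at 20% per doubling,
# unit cost is 0.8^10 ≈ 11% of the first unit's cost.
print(unit_cost(1024, 100.0))  # ≈ 10.7
```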
The title is confusing and maybe misleading: when I see “accelerationists” I think of either e/acc or the idea that we should hasten the collapse of society in order to bring about a communist, white-supremacist, or other extremist utopia. This is different from accelerating AI progress and, as far as I know, not the motivation of most people at Anthropic.
I considered these factors. There are advantages too; the knapping subreddit also says obsidian is easier to work, and with less dust the silicosis risk is lower. But the crucial one is that it’s much more exciting to make something molecularly sharp.
The cost seems acceptable because it’s still lower than most hobbies. I’m mostly concerned that the required PPE will ruin the experience as quite a lot is recommended: gloves, mask, eye protection, good ventilation, a tarp to catch stray obsidian shards. Most of these are still needed for flint.
Maybe this is why knapping isn’t so popular compared to other crafts like knitting and woodworking; it seems to be both tedious and just hazardous enough that rich societies, which place extremely high value on health, avoid it. People don’t persistence-hunt gazelles for fun; they use guns, or more likely play football and videogames.
This post convinced me to buy 10 pounds of obsidian and basic knapping tools. I may give an update on whether knapping is fun and/or causes me severe injury, though I suspect that the learning curve to proficiency is in the low hundreds of hours and I’m unlikely to stick with it that long.
Agree, especially from developing countries without a strong preexisting stance on AI, where choices could be less biased towards experts who already have lots of prestige, and more weighted on merits + lobbying.
Item 3 has some constraints on members (emphasis mine):
Requests the Secretary-General to launch a published criteria-based open call and to recommend a list of 40 members of the Panel to be appointed by the General Assembly in a time-bound manner, on the basis of their outstanding expertise in artificial intelligence and related fields, an interdisciplinary perspective, and geographical and gender balance, taking into account candidacies from a broad representation of varying levels of technological development, including from developing countries, with due consideration to nominations from Member States and with no more than two selected candidates of the same nationality or affiliation and no employees of the United Nations system;
This means at most two members each from the US, UK, China, etc. I wonder what the geographic and gender balance will actually look like; these will significantly influence the average expertise type and influence of members.
My guess is that x-risk mitigation will not be the primary focus at first, just because over half of the x-risk experts are American and British men and there are so many other interests to represent. Nor would industry be heavily represented, because it skews too American (and the document mentioned CoIs, and the 7 goals of the Dialogue are mostly not about frontier capabilities). But in the long term, unless takeoff is fast, developing countries will realize the US is marching towards DSA and interesting things could happen.
Edit: my guess is that the composition will be similar to the existing High-level Advisory Body on Artificial Intelligence, which had 39 members, of which:
19 men, 20 women
15 academia/research (including 10 professors), 10 government, 4 from big tech and scaling labs, 10 other
17 Responsibility / Safety / Policy oversight positions, 22 other positions
Nationalities:
9 Americas (incl. 4 US), 11 Europe, 11 Asia (incl. 2 China), 1 Oceania, 5 Africa. They heavily overweighted Europe, which has 28% of the seats but only 9% of world population
19 high income countries, 24 LMIC
GPT-5 isn’t sure about some of the dual-nationality cases, though
1 big name in x-risk (Jaan Tallinn)
FWIW I feel like I get sufficient status reward for criticism and this moderation decision basically won’t affect my behavior
This defended a paper where I was lead author, which got 8 million views on Twitter and was possibly the most important research output by my current employer, against criticism that it was p-hacking
This got me a bounty of $700 or so (which I think I declined or forgot about?) and citation in a follow-up post
This ratioed the OP by 3:1 and induced a thoughtful response by OP that helped me learn some nontrivial stats facts
This got 73 karma and was the most important counterpoint to what I still think are mostly wrong and overrated views on nanotech
This got 70 karma and only took about an hour to write, and could have been 5 minutes if I were a better writer
Now it’s true that most of these comments are super long and high effort. But it’s possible to get status reward for lower effort comments too, e.g. this, though it feels more like springing a “gotcha”. Many of the examples of Said’s critiques in the post at least seemed either deliberately inflammatory or unhelpful or targeted at some procedural point that isn’t maximally relevant.
As for risking being wrong, this is the only “bad” recent comment of mine I can remember, and I think you have to be pretty risk averse to be totally discouraged from commenting. If 30% of my comments were wrong I would probably feel discouraged but if it were 15% I’d just be less confident or hedge more. Probably the main change I’ll make is to shift away from this uncommon and very marginal type of comment that imposes costs on the author and might be wrong, to just downvote and move on.
This man’s modus ponens is definitely my modus tollens. It seems super cursed to use moral premises to answer metaphysics problems. In this argument, except for step 8, you can replace belief in free will with anything, and the argument says that determinism implies that any widely held belief is true.
“Ought implies can” should be something that’s true by construction of your moral system, rather than something you can just assert about an arbitrary moral system and use to derive absurd conclusions.
I don’t understand how the issue of validity in evals connects to the Gooder Regulator Theorem. It’s certainly arbitrarily hard to measure a latent property of an AI agent that’s buried in an arbitrarily complex causal diagram of the agent, but this is also true of any kind of science. The solution is to only measure simple things and validate them extensively.
To start, the playbook includes methods from the social sciences. Are humans lying in a psychology experiment intending to deceive, or just socially conditioned into deception in these circumstances? We can discriminate using more experiments. Here’s a list suggested by Claude.
Likewise, in evals we can do increasingly elaborate robustness checks where we change the prompt and setting and measure whether the correlation is high, check that the model understands its behavior is deceptive, provide incentives to the model that should make its behavior change if it were intentionally deceptive, and so on. Much of the time, especially for propensity / alignment evals, we will find the evals are nonrobust and we’ll be forced to weaken our claims.
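As a sketch of the simplest version of this (the function names are placeholders, not any real eval framework’s API):

```python
import itertools
import numpy as np

def run_eval(model, prompt_variant: str) -> list[float]:
    """Placeholder: run the eval under one paraphrased prompt/setting and
    return per-item scores for that variant."""
    raise NotImplementedError

def robustness_check(model, prompt_variants: list[str]) -> float:
    """Run the same eval under several paraphrased prompts/settings and report
    the minimum pairwise correlation of per-item scores across variants.
    A low value means the eval's conclusion is not robust to surface changes."""
    scores = [np.array(run_eval(model, v)) for v in prompt_variants]
    return min(
        np.corrcoef(a, b)[0, 1] for a, b in itertools.combinations(scores, 2)
    )
```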
Sometimes, we have an easier job. I don’t expect capability evals to get vastly harder because many skills are verifiable and sandbagging is solvable. Also, AI evals are easier than human psychology experiments, because they’re likely to be much cheaper and faster to run and more replicable.