Thomas Kwa

Karma: 6,931

Member of technical staff at METR.

Previously: Vivek Hebbar’s team at MIRI → Adrià Garriga-Alonso on various empirical alignment projects → METR.

I have signed no contracts or agreements whose existence I cannot mention.

Claude, GPT, and Gemini All Struggle to Evade Monitors

6 Aug 2025 20:28 UTC
59 points
3 comments
5 min read
LW link

METR: How Does Time Horizon Vary Across Domains?

14 Jul 2025 16:13 UTC
84 points
7 comments
14 min read
LW link
(metr.org)

Tsinghua paper: Does RL Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Thomas Kwa
5 May 2025 18:56 UTC
69 points
21 comments
2 min read
LW link
(arxiv.org)

Should CA, TX, OK, and LA merge into a giant swing state, just for elections?

Thomas Kwa
6 Nov 2024 23:01 UTC
115 points
35 comments
1 min read
LW link

The murderous shortcut: a toy model of instrumental convergence

Thomas Kwa
2 Oct 2024 6:48 UTC
37 points
0 comments
2 min read
LW link

Goodhart in RL with KL: Appendix

Thomas Kwa
18 May 2024 0:40 UTC
12 points
0 comments
6 min read
LW link

Catastrophic Goodhart in RL with KL penalty

15 May 2024 0:58 UTC
62 points
10 comments
7 min read
LW link

[Question] Is a random box of gas predictable after 20 seconds?

24 Jan 2024 23:00 UTC
38 points
35 comments
1 min read
LW link

[Question] Will quantum randomness affect the 2028 election?

24 Jan 2024 22:54 UTC
66 points
52 comments
1 min read
LW link

Thomas Kwa’s research journal

23 Nov 2023 5:11 UTC
79 points
1 comment
6 min read
LW link

Thomas Kwa’s MIRI research experience

2 Oct 2023 16:42 UTC
174 points
53 comments
1 min read
LW link

Catastrophic Regressional Goodhart: Appendix

15 May 2023 0:10 UTC
25 points
1 comment
9 min read
LW link

When is Goodhart catastrophic?

9 May 2023 3:59 UTC
180 points
30 comments
8 min read
LW link
1 review

Challenge: construct a Gradient Hacker

9 Mar 2023 2:38 UTC
39 points
10 comments
1 min read
LW link

Failure modes in a shard theory alignment plan

Thomas Kwa
27 Sep 2022 22:34 UTC
26 points
2 comments
7 min read
LW link

Utility functions and probabilities are entangled

Thomas Kwa
26 Jul 2022 5:36 UTC
15 points
5 comments
1 min read
LW link

Deriving Conditional Expected Utility from Pareto-Efficient Decisions

Thomas Kwa
5 May 2022 3:21 UTC
24 points
1 comment
6 min read
LW link

Most problems don’t differ dramatically in tractability (under certain assumptions)

Thomas Kwa
4 May 2022 0:05 UTC
8 points
0 comments
3 min read
LW link

The case for turning glowfic into Sequences

Thomas Kwa
27 Apr 2022 6:58 UTC
88 points
29 comments
5 min read
LW link

[Question] (When) do high-dimensional spaces have linear paths down to local minima?

Thomas Kwa
22 Apr 2022 15:35 UTC
12 points
7 comments
1 min read
LW link