
Tomek Korbak

Karma: 1,144

I work on monitoring agents at OpenAI

https://tomekkorbak.com/

Reasoning Models Struggle to Control Their Chains of Thought

5 Mar 2026 22:37 UTC
74 points
9 comments · 3 min read · LW link

Training Agents to Self-Report Misbehavior

25 Feb 2026 17:50 UTC
26 points
0 comments · 8 min read · LW link

Lessons from Studying Two-Hop Latent Reasoning

11 Sep 2025 17:53 UTC
68 points
19 comments · 2 min read · LW link
(arxiv.org)

If you can generate obfuscated chain-of-thought, can you monitor it?

4 Aug 2025 15:46 UTC
36 points
6 comments · 11 min read · LW link

Research Areas in AI Control (The Alignment Project by UK AISI)

1 Aug 2025 10:27 UTC
25 points
0 comments · 18 min read · LW link
(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

1 Aug 2025 9:52 UTC
29 points
0 comments · 2 min read · LW link
(alignmentproject.aisi.gov.uk)

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

15 Jul 2025 16:23 UTC
167 points
32 comments · 1 min read · LW link
(bit.ly)

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

14 Apr 2025 16:45 UTC
29 points
1 comment · 2 min read · LW link

A sketch of an AI control safety case

30 Jan 2025 17:28 UTC
61 points
0 comments · 5 min read · LW link

Eliciting bad contexts

24 Jan 2025 10:39 UTC
37 points
9 comments · 3 min read · LW link

Automation collapse

21 Oct 2024 14:50 UTC
72 points
9 comments · 7 min read · LW link

Compositional preference models for aligning LMs

Tomek Korbak · 25 Oct 2023 12:17 UTC
18 points
2 comments · 5 min read · LW link

Towards Understanding Sycophancy in Language Models

24 Oct 2023 0:30 UTC
66 points
0 comments · 2 min read · LW link
(arxiv.org)

Paper: LLMs trained on “A is B” fail to learn “B is A”

23 Sep 2023 19:55 UTC
125 points
74 comments · 4 min read · LW link
(arxiv.org)

Paper: On measuring situational awareness in LLMs

4 Sep 2023 12:54 UTC
111 points
17 comments · 5 min read · LW link
(arxiv.org)

Imitation Learning from Language Feedback

30 Mar 2023 14:11 UTC
71 points
3 comments · 10 min read · LW link

Pretraining Language Models with Human Preferences

21 Feb 2023 17:57 UTC
135 points
20 comments · 11 min read · LW link · 2 reviews

RL with KL penalties is better seen as Bayesian inference

25 May 2022 9:23 UTC
115 points
17 comments · 12 min read · LW link