
Tomek Korbak

Karma: 1,144

I work on monitoring agents at OpenAI

https://tomekkorbak.com/

Reasoning Models Struggle to Control Their Chains of Thought

5 Mar 2026 22:37 UTC
74 points
9 comments · 3 min read · LW link

Training Agents to Self-Report Misbehavior

25 Feb 2026 17:50 UTC
26 points
0 comments · 8 min read · LW link

Lessons from Studying Two-Hop Latent Reasoning

11 Sep 2025 17:53 UTC
68 points
19 comments · 2 min read · LW link
(arxiv.org)

If you can generate obfuscated chain-of-thought, can you monitor it?

4 Aug 2025 15:46 UTC
36 points
6 comments · 11 min read · LW link

Research Areas in AI Control (The Alignment Project by UK AISI)

1 Aug 2025 10:27 UTC
25 points
0 comments · 18 min read · LW link
(alignmentproject.aisi.gov.uk)

The Alignment Project by UK AISI

1 Aug 2025 9:52 UTC
29 points
0 comments · 2 min read · LW link
(alignmentproject.aisi.gov.uk)

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

15 Jul 2025 16:23 UTC
167 points
32 comments · 1 min read · LW link
(bit.ly)

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

14 Apr 2025 16:45 UTC
29 points
1 comment · 2 min read · LW link

A sketch of an AI control safety case

30 Jan 2025 17:28 UTC
61 points
0 comments · 5 min read · LW link

Eliciting bad contexts

24 Jan 2025 10:39 UTC
37 points
9 comments · 3 min read · LW link

Automation collapse

21 Oct 2024 14:50 UTC
72 points
9 comments · 7 min read · LW link

Compositional preference models for aligning LMs

Tomek Korbak · 25 Oct 2023 12:17 UTC
18 points
2 comments · 5 min read · LW link

Towards Understanding Sycophancy in Language Models

24 Oct 2023 0:30 UTC
66 points
0 comments · 2 min read · LW link
(arxiv.org)

Paper: LLMs trained on “A is B” fail to learn “B is A”

23 Sep 2023 19:55 UTC
125 points
74 comments · 4 min read · LW link
(arxiv.org)

Paper: On measuring situational awareness in LLMs

4 Sep 2023 12:54 UTC
111 points
17 comments · 5 min read · LW link
(arxiv.org)

Imitation Learning from Language Feedback

30 Mar 2023 14:11 UTC
71 points
3 comments · 10 min read · LW link

Pretraining Language Models with Human Preferences

21 Feb 2023 17:57 UTC
135 points
20 comments · 11 min read · LW link · 2 reviews

RL with KL penalties is better seen as Bayesian inference

25 May 2022 9:23 UTC
115 points
17 comments · 12 min read · LW link