Vlad Mikulik

Karma: 963

Towards training-time mitigations for alignment faking in RL

Vlad Mikulik, gasteigerjo, Hoagy, Joe Benton, Benjamin Wright, Jonathan Uesato, Monte M, Fabien Roger and evhub

16 Dec 2025 21:01 UTC

39 points

1 comment5 min readLW link

(alignment.anthropic.com)

Training fails to elicit subtle reasoning in current language models

mishajw, Fabien Roger, Hoagy, gasteigerjo, Joe Benton and Vlad Mikulik

9 Oct 2025 19:04 UTC

49 points

3 comments4 min readLW link

(alignment.anthropic.com)

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

Tomek Korbak, Mikita Balesni, Vlad Mikulik and Rohin Shah

15 Jul 2025 16:23 UTC

167 points

32 comments1 min readLW link

(bit.ly)

Reasoning models don’t always say what they think

Joe Benton, Ethan Perez, Vlad Mikulik and Fabien Roger

9 Apr 2025 19:48 UTC

28 points

4 comments1 min readLW link

(www.anthropic.com)

Automated Researchers Can Subtly Sandbag

gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez and Fabien Roger

26 Mar 2025 19:13 UTC

44 points

0 comments4 min readLW link

(alignment.anthropic.com)

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

Seb Farquhar, Vikrant Varma, zac_kenton, gasteigerjo, Vlad Mikulik and Rohin Shah

18 Dec 2023 11:58 UTC

149 points

21 comments10 min readLW link

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

Neel Nanda, Tom Lieberum, Matthew Rahtz, János Kramár, Geoffrey Irving, Rohin Shah and Vlad Mikulik

20 Jul 2023 10:50 UTC

44 points

3 comments2 min readLW link

(arxiv.org)

Specification gaming: the flip side of AI ingenuity

Vika, Vlad Mikulik, Matthew Rahtz, tom4everitt, Zac Kenton and janleike

6 May 2020 23:51 UTC

69 points

9 comments6 min readLW link

Utility ≠ Reward

Vlad Mikulik5 Sep 2019 17:28 UTC

131 points

24 comments1 min readLW link 2 reviews

2-D Robustness

Vlad Mikulik30 Aug 2019 20:27 UTC

86 points

8 comments2 min readLW link

Risks from Learned Optimization: Conclusion and Related Work

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

7 Jun 2019 19:53 UTC

82 points

5 comments6 min readLW link

Deceptive Alignment

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

5 Jun 2019 20:16 UTC

119 points

20 comments17 min readLW link

The Inner Alignment Problem

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

4 Jun 2019 1:20 UTC

105 points

17 comments13 min readLW link

Conditions for Mesa-Optimization

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

1 Jun 2019 20:52 UTC

86 points

48 comments12 min readLW link

Risks from Learned Optimization: Introduction

evhub, Chris van Merwijk, Vlad Mikulik, Joar Skalse and Scott Garrabrant

31 May 2019 23:44 UTC

188 points

42 comments12 min readLW link 3 reviews

Clarifying Consequentialists in the Solomonoff Prior

Vlad Mikulik11 Jul 2018 2:35 UTC

20 points

16 comments6 min readLW link