Vlad Mikulik

Karma: 910

Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

15 Jul 2025 16:23 UTC
166 points
32 comments
1 min read
LW link
(bit.ly)

Reasoning models don’t always say what they think

9 Apr 2025 19:48 UTC
28 points
4 comments
1 min read
LW link
(www.anthropic.com)

Automated Researchers Can Subtly Sandbag

26 Mar 2025 19:13 UTC
44 points
0 comments
4 min read
LW link
(alignment.anthropic.com)

Discussion: Challenges with Unsupervised LLM Knowledge Discovery

18 Dec 2023 11:58 UTC
149 points
21 comments
10 min read
LW link

Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla

20 Jul 2023 10:50 UTC
44 points
3 comments
2 min read
LW link
(arxiv.org)