Josh Engels

Karma: 959

Research scientist on the DeepMind interp team. Thoughts are my own and do not reflect views of my employer.

How transparent is DiffusionGemma (and why it matters)

20 Jun 2026 20:05 UTC
73 points
2 comments
4 min read

Why Do Naive SFT Filters For Safety Properties Fail?

14 Jun 2026 19:45 UTC
50 points
7 comments
10 min read

SFT Drives Gemini’s Safety Properties

13 Jun 2026 15:31 UTC
78 points
3 comments
1 min read

Building and evaluating model diffing agents

12 Jun 2026 17:14 UTC
61 points
2 comments
12 min read

[paper] Training on Documents About Monitoring Leads to CoT Obfuscation

27 May 2026 9:39 UTC
31 points
1 comment
4 min read
(arxiv.org)

Test your best methods on our hard CoT interp tasks

26 Mar 2026 19:24 UTC
58 points
2 comments
19 min read

Training on Documents About Monitoring Leads To CoT Obfuscation

18 Mar 2026 20:37 UTC
65 points
5 comments
16 min read

Thought Editing: Steering Models by Editing Their Chain of Thought

3 Feb 2026 9:51 UTC
20 points
0 comments
5 min read

Brief Explorations in LLM Value Rankings

12 Jan 2026 18:16 UTC
39 points
1 comment
11 min read

Steering RL Training: Benchmarking Interventions Against Reward Hacking

29 Dec 2025 21:55 UTC
72 points
11 comments
19 min read

Can we interpret latent reasoning using current mechanistic interpretability tools?

22 Dec 2025 16:56 UTC
44 points
1 comment
9 min read

Prompting Models to Obfuscate Their CoT

8 Dec 2025 21:00 UTC
16 points
4 comments
7 min read

How Can Interpretability Researchers Help AGI Go Well?

1 Dec 2025 13:05 UTC
68 points
1 comment
14 min read

A Pragmatic Vision for Interpretability

1 Dec 2025 13:05 UTC
139 points
39 comments
27 min read

Current LLMs seem to rarely detect CoT tampering

19 Nov 2025 15:27 UTC
56 points
0 comments
20 min read