Daniel Tan

Karma: 1,435

Researching AI safety. Currently interested in emergent misalignment, model organisms, and other kinds of empirical work.

https://dtch1997.github.io/

Inoculation prompting: Instructing models to misbehave at train-time can improve run-time behavior

8 Oct 2025 22:02 UTC
135 points
19 comments · 2 min read · LW link

Could we have predicted emergent misalignment a priori using unsupervised behaviour elicitation?

Daniel Tan · 22 Aug 2025 13:42 UTC
6 points
0 comments · 1 min read · LW link

Open Challenges in Representation Engineering

3 Apr 2025 19:21 UTC
14 points
0 comments · 5 min read · LW link

Show, not tell: GPT-4o is more opinionated in images than in text

2 Apr 2025 8:51 UTC
112 points
41 comments · 3 min read · LW link

Open problems in emergent misalignment

1 Mar 2025 9:47 UTC
83 points
17 comments · 7 min read · LW link

A Collection of Empirical Frames about Language Models

Daniel Tan · 2 Jan 2025 2:49 UTC
27 points
0 comments · 3 min read · LW link

Why I’m Moving from Mechanistic to Prosaic Interpretability

Daniel Tan · 30 Dec 2024 6:35 UTC
113 points
34 comments · 5 min read · LW link

A Sober Look at Steering Vectors for LLMs

23 Nov 2024 17:30 UTC
40 points
0 comments · 5 min read · LW link

Evolutionary prompt optimization for SAE feature visualization

14 Nov 2024 13:06 UTC
22 points
0 comments · 9 min read · LW link

An Interpretability Illusion from Population Statistics in Causal Analysis

Daniel Tan · 29 Jul 2024 14:50 UTC
9 points
3 comments · 1 min read · LW link

Daniel Tan’s Shortform

Daniel Tan · 17 Jul 2024 6:38 UTC
2 points
262 comments · 1 min read · LW link

Mech Interp Lacks Good Paradigms

Daniel Tan · 16 Jul 2024 15:47 UTC
40 points
0 comments · 14 min read · LW link

Activation Pattern SVD: A proposal for SAE Interpretability

Daniel Tan · 28 Jun 2024 22:12 UTC
15 points
2 comments · 2 min read · LW link