StefanHex

Karma: 1,706

Stefan Heimersheim. Research Scientist at Apollo Research, Mechanistic Interpretability. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

Detecting Strategic Deception Using Linear Probes

Feb 6, 2025, 3:46 PM
102 points
9 comments, 2 min read, LW link
(arxiv.org)

SAE regularization produces more interpretable models

Jan 28, 2025, 8:02 PM
21 points
7 comments, 4 min read, LW link