StefanHex

Karma: 2,148

Stefan Heimersheim. Mechanistic interpretability & AI safety researcher, previously at FAR.AI and Apollo Research. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

Finding features in Transformers: Contrastive directions elicit stronger low-level perturbation responses than baselines

20 Mar 2026 21:09 UTC
39 points
2 comments · 6 min read · LW link

Activation Plateaus: Where and How They Emerge

17 Oct 2025 5:48 UTC
37 points
3 comments · 8 min read · LW link