wesg

Karma: 497

OR PhD student at MIT working on interpretability.

Find out more here: https://wesg.me/

Refusal in LLMs is mediated by a single direction

27 Apr 2024 11:13 UTC
252 points
95 comments
10 min read
LW link

SAE reconstruction errors are (empirically) pathological

wesg
29 Mar 2024 16:37 UTC
106 points
16 comments
8 min read
LW link

Finding Neurons in a Haystack: Case Studies with Sparse Probing

3 May 2023 13:30 UTC
33 points
6 comments
2 min read
LW link
1 review
(arxiv.org)