The top-10 most-cited papers that MATS contributed to (all with at least 290 citations) are:
Representation Engineering: A Top-Down Approach to AI Transparency
Sparse autoencoders find highly interpretable features in language models
Towards understanding sycophancy in language models
Steering Language Models With Activation Engineering
Steering Llama 2 via Contrastive Activation Addition
Refusal in language models is mediated by a single direction
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
LLM Evaluators Recognize and Favor Their Own Generations
Finding neurons in a haystack: Case studies with sparse probing
Compare this to the top-10 highest-karma LessWrong posts that MATS contributed to (all with over 200 karma):
SolidGoldMagikarp (plus, prompt generation)
Steering GPT-2-XL by adding an activation vector (arXiv)
Transformers Represent Belief State Geometry in their Residual Stream (arXiv)
Understanding and Controlling a Maze-Solving Policy Network (arXiv)
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs (arXiv)
Refusal in LLMs is mediated by a single direction (arXiv)
Natural Abstractions: Key Claims, Theorems, and Critiques
Distillation Robustifies Unlearning (arXiv)
Mechanistically Eliciting Latent Behaviors in Language Models
Neural networks generalize because of this one weird trick