Redwood Research and Constellation
Nate Thomas
Karma: 490
Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter
Causal scrubbing: results on induction heads
Causal scrubbing: results on a paren balance checker
Causal scrubbing: Appendix
Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley
It's unsurprising that a different model categorizes this correctly, because the failure was generated by attacking the particular model we were working with. The relevant question is “given a model, how easy is it to find a failure by attacking that model using our rewriting tools?”
Thanks, Neel! It should be fixed now.