Nate Thomas

Karma: 521

Redwood Research and Constellation

Nate Thomas 30 Apr 2024 17:38 UTC
13 points
2
in reply to: Buck’s comment on: Express interest in an “FHI of the West”
To anyone reading this who wants to work on or discuss FHI-flavored work: Consider applying to Constellation’s programs (the deadline for some of them is today!), which include salaried positions for researchers.

Nate Thomas 26 Oct 2023 14:50 UTC
LW: 2 AF: 1
0
AF
in reply to: Neel Nanda’s comment on: Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter
Thanks, Neel! It should be fixed now.

Apply to the Constellation Visiting Researcher Program and Astra Fellowship, in Berkeley this Winter

Nate Thomas 26 Oct 2023 3:07 UTC

42 points

10 comments 1 min read LW link

Causal scrubbing: results on induction heads

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:59 UTC

34 points

1 comment 17 min read LW link

Causal scrubbing: results on a paren balance checker

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, Tao Lin, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:59 UTC

39 points

2 comments 30 min read LW link

Causal scrubbing: Appendix

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:58 UTC

18 points

4 comments 20 min read LW link

Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]

LawrenceC, Adrià Garriga-alonso, Nicholas Goldowsky-Dill, ryan_greenblatt, jenny, Ansh Radhakrishnan, Buck and Nate Thomas

3 Dec 2022 0:58 UTC

208 points

35 comments 20 min read LW link 1 review

Apply to the Redwood Research Mechanistic Interpretability Experiment (REMIX), a research program in Berkeley

maxnadeau, Xander Davies, Buck and Nate Thomas

27 Oct 2022 1:32 UTC

135 points

14 comments 12 min read LW link

Nate Thomas 17 Sep 2022 15:04 UTC
LW: 18 AF: 9
12
AF
in reply to: Quintin Pope’s comment on: Takeaways from our robust injury classifier project [Redwood Research]
Note that it’s unsurprising that a different model categorizes this correctly because the failure was generated from an attack on the particular model we were working with. The relevant question is “given a model, how easy is it to find a failure by attacking that model using our rewriting tools?”

High-stakes alignment via adversarial training [Redwood Research report]

dmz, LawrenceC and Nate Thomas

5 May 2022 0:59 UTC

142 points

29 comments 9 min read LW link

We’re Redwood Research, we do applied alignment research, AMA

Nate Thomas 6 Oct 2021 5:51 UTC

56 points

2 comments 2 min read LW link

(forum.effectivealtruism.org)