Zac Hatfield-Dodds comments on Alignment Org Cheat Sheet

Zac Hatfield-Dodds 20 Sep 2022 22:32 UTC
13 points
I think this is missing the point pretty badly for Anthropic, and leaves out most of the work that we do. I tried writing up a similar summary, which is necessarily a little longer:

Anthropic: Let’s get as much hands-on experience building safe and aligned AI as we can, without making things worse (advancing capabilities, race dynamics, etc). We’ll invest in mechanistic interpretability because solving that would be awesome, and even modest success would help us detect risks before they become disasters. We’ll train near-cutting-edge models to study how interventions like RL from human feedback and model-based supervision succeed and fail, iterate on them, and study how novel capabilities emerge as models scale up. We’ll also share information so policy-makers and other interested parties can understand what the state of the art is like, and provide an example to others of how responsible labs can do safety-focused research.

(Sound good? We’re hiring for both technical and non-technical roles.)

We were also delighted to welcome Tamera Lanham to Anthropic recently, so you could add her externalized reasoning oversight agenda to our alignment research too :-)
- Orpheus16 21 Sep 2022 17:42 UTC
  14 points
  Thanks, Zac! Your description adds some useful details and citations.
  I’d like the one-sentence descriptions to be pretty jargon-free, and I currently don’t see how the original one misses the point.
  I’ve made some minor edits to acknowledge that the purpose of building larger models is broader than “tackle problems”, and I’ve also linked people to your comment where they can find some useful citations & details. (Edits in bold)
  Anthropic: Let’s interpret large language models by understanding which parts of neural networks are involved with certain types of cognition, and let’s build larger language models to tackle problems, test methods, and understand phenomena that will emerge as we get closer to AGI (see Zac’s comment for some details & citations).
  I also think there’s some legitimate debate around Anthropic’s role in advancing capabilities/race dynamics (e.g., I encourage people to see Thomas’s comments here). So while I understand that Anthropic’s goal is to do this work without advancing capabilities/race dynamics, I currently classify this as “a debated claim about Anthropic’s work” rather than “an objective fact about Anthropic’s work.”
  I’m interested in the policy work & would love to learn more about it. Could you share some more specifics about Anthropic’s most promising policy work?