I appreciate work that shows how interpretability is naturally capabilities work! I was naturally drawn to capabilities work; it’s how I ended up interested in AI. My guess is that people who do interp work because they think it’s safety work would like to know that it’s primarily capabilities, and they simply don’t believe theoretical arguments without concrete evidence that it actually turns into capabilities. So having capabilities nerds posting here seems good to me. Strong upvoted; please poke holes in more alignment plans!
(This isn’t frontier capabilities work, anyway, IMO.)