That doesn’t align with the marketing copy I’ve seen (which has this featured as a pretty core part of their product). Maybe I am wrong? I haven’t checked that hard.
Goodfire’s goal is to use interpretability techniques to guide the new minds we’re building to share our values, and to learn from them where they have something to teach us.
Indeed, the “guess and check” feedback loop, which I think currently provides one of the biggest assurances we have that model internals are not being optimized to look good, is something he explicitly calls out as needing to be fixed:
We currently attempt to design these systems by an expensive process of guess-and-check: first train, then evaluate, then tweak our training setup in ways we hope will work, then train and evaluate again and again, finally hoping that our evaluations catch everything we care about. Although careful scaling analyses can help at the macroscale, we have no way to steer during the training process itself. To borrow an idea from control theory, training is usually more like an open loop control system, whereas I believe we can develop closed-loop control.
Also, given what multiple people who have worked with Goodfire, or who know people there well, have told me, I am pretty confident it’s quite crucial to their bottom line and sales pitches.
I haven’t paid much attention to their marketing copy, but they do have big flashy things about a bunch of stuff, including interpreting science models, and everything I’ve seen from them involving a real customer was not about training on interp. Plausibly they could communicate better here, though.
I interpret their new intentional design post as “here’s a research direction we think could be a big deal”, not “here’s the central focus of the company”.
Edit: This post also seems to put this very centrally into their philosophy: https://www.goodfire.ai/blog/intentional-design

Tom McGrath, chief scientist, confirmed that my comment is correct: https://www.lesswrong.com/posts/XzdDypFuffzE4WeP7/themanxloiner-s-shortform?commentId=BupJhRhsAYvKZGLKG