FYI, I wouldn’t say at all that AI safety is under-represented in the EU (if anything, it would be easier to argue the opposite). Many safety orgs (including mine) supported the Codes of Practice, and almost all the Chairs and Vice Chairs are respected governance researchers. It’s probably still good for people to give feedback; I just don’t want to give the impression that this is neglected.
Also, as far as I know, no public mention of an intention to sign the code has been made. Though apart from the copyright section, most of it is in line with RSPs, which makes signing more reasonable.
Thank you, Ariel! I guess I’ve let my personal opinion shine through. In general, I do not see many regulatory efforts that articulate alignment or interpretability needs or translate them into actionable compliance requirements. The AI Act mentions alignment only vaguely, for example.
And as far as I could see, the third draft of the Codes (Safety & Security) mentions alignment / misalignment in a “this may be relevant to include” tone rather than specifying how GenAI providers are expected to document individual misalignment risks and appropriate mitigation strategies.
And interpretability / mech interp is not mentioned at all, not even in the context of model explainability or transparency.
This is why I hoped to get feedback from this community: to find out whether I am overstating my concerns.
I’d suggest updating the language in the post to clarify things and not overstate :)
Regarding the 3rd draft: opinions varied among the people I work with, but we are generally happy. Loss of Control is included in the selected systemic risks, as is CBRN. Appendix 1.2 also has useful things, though some valid concerns were raised there about compatibility with the AI Act language that still need tweaking (possibly merging parts of 1.2 into the selected systemic risks). As for interpretability: the code is meant to be outcome-based, and the main reason evals are mentioned is that they are in the Act. Prescribing interpretability isn’t something the code can do, and it probably shouldn’t, as these techniques aren’t good enough yet to be prescribed as mandatory for mitigating systemic risks.
I was reading “The Urgency of Interpretability” by Dario Amodei, and the following part made me think about our discussion:

“Second, governments can use light-touch rules to encourage the development of interpretability research and its application to addressing problems with frontier AI models. Given how nascent and undeveloped the practice of “AI MRI” is, it should be clear why it doesn’t make sense to regulate or mandate that companies conduct them, at least at this stage: it’s not even clear what a prospective law should ask companies to do. But a requirement for companies to transparently disclose their safety and security practices (their Responsible Scaling Policy, or RSP, and its execution), including how they’re using interpretability to test models before release, would allow companies to learn from each other while also making clear who is behaving more responsibly, fostering a “race to the top”.”
I agree with you that, at this stage, a regulatory requirement to disclose the interpretability techniques used to test models before release would not be very useful for outcome-based CoPs.

But I hope that there is a path forward for this approach in the near future.
Thanks a lot for your follow-up. I’d love to connect on LinkedIn if that’s okay; I’m very grateful for your feedback!

I’d say: “I believe that more feedback from alignment and interpretability researchers is needed” instead. Thoughts?

Sure! And yeah, regarding edits: I have not gone through the full request for feedback yet; I expect to have a better sense late next week of which contributions are most needed and how to prioritize. I mainly wanted to comment first on obvious things that stood out to me from the post.
There is also an Evals workshop in Brussels on Monday where we might learn more. I know of some non-EU-based technical safety researchers who are attending, which is great to see.