I’d suggest updating the language in the post to clarify things and not overstate :)
Regarding the 3rd draft—opinions varied among the people I work with, but we are generally happy. Loss of Control is included in the selected systemic risks, as well as CBRN. Appendix 1.2 also has useful things, though some valid concerns were raised there about compatibility with the AI Act language that still need tweaking (possibly merging parts of 1.2 into the selected systemic risks). As for interpretability—the code is meant to be outcome-based, and the main reason evals are mentioned is that they are in the Act. Prescribing interpretability isn’t something the code can do, and it probably shouldn’t, as these techniques aren’t good enough yet to be prescribed as mandatory for mitigating systemic risks.
I was reading “The Urgency of Interpretability” by Dario Amodei, and the following part made me think about our discussion:
“Second, governments can use light-touch rules to encourage the development of interpretability research and its application to addressing problems with frontier AI models. Given how nascent and undeveloped the practice of “AI MRI” is, it should be clear why it doesn’t make sense to regulate or mandate that companies conduct them, at least at this stage: it’s not even clear what a prospective law should ask companies to do. But a requirement for companies to transparently disclose their safety and security practices (their Responsible Scaling Policy, or RSP, and its execution), including how they’re using interpretability to test models before release, would allow companies to learn from each other while also making clear who is behaving more responsibly, fostering a “race to the top”.”
I agree with you that, at this stage, a regulatory requirement to disclose the interpretability techniques used to test models before release would not be very useful for outcome-based CoPs.
But I hope that there is a path forward for this approach in the near future.
Thanks a lot for your follow-up. I’d love to connect on LinkedIn if that’s okay; I’m very grateful for your feedback!
I’d say: “I believe that more feedback from alignment and interpretability researchers is needed” instead. Thoughts?
Sure! And yeah, regarding edits—I haven’t gone through the full request for feedback yet; I expect to have a better sense late next week of which contributions are most needed and how to prioritize. I mainly wanted to comment first on obvious things that stood out to me from the post.
There is also an Evals workshop in Brussels on Monday where we might learn more. I know of some non-EU-based technical safety researchers who are attending, which is great to see.