Musings from a Lawyer turned AI Safety researcher (ShortForm)
I will use this Shortform to link my posts and Quick Takes.
LongForm post for debate: The EU Is Asking for Feedback on Frontier AI Regulation (Open to Global Experts)—This Post Breaks Down What’s at Stake for AI Safety
The European AI Office is currently finalizing Codes of Practice that will determine how general-purpose AI (GPAI) models are governed under the EU AI Act. They are explicitly asking for global expert input, and feedback is open to anyone, not just EU citizens.
The guidelines in development will shape:
The definition of “systemic risk”
How training compute triggers obligations
When fine-tuners or downstream actors become legally responsible
What counts as sufficient transparency, evaluation, and risk mitigation
Major labs (OpenAI, Anthropic, Google DeepMind) have already expressed willingness to sign the upcoming Codes of Practice. These codes will likely become the default enforcement standard across the EU and possibly beyond.
So far, AI safety perspectives are seriously underrepresented.
Without strong input from AI Safety researchers and technical AI Governance experts, these rules could lock in shallow compliance norms (mostly centered on copyright or reputational risk) while missing core challenges around interpretability, loss of control, and emergent capabilities.
I’ve written a detailed Longform post breaking down exactly what’s being proposed, where input is most needed, and how you can engage.
Even if you don’t have policy experience, your technical insight could shape how safety is operationalized at scale.
📅 Feedback is open until 22 May 2025, 12:00 CET
🗳️ Submit your response here
Happy to connect with anyone individually for help drafting meaningful feedback.
Quick Take: “Alignment with human intent” explicitly mentioned in European law
The AI alignment community had a major victory in the regulatory landscape, and it went unnoticed by many.
The EU AI Act explicitly mentions “alignment with human intent” as a key focus area in relation to regulation of systemic risks.
As far as I know, this is the first time “alignment” has been mentioned in a law or other major regulatory text.
It’s buried in Recital 110, but it’s there.
And it also makes research on AI Control relevant:
“International approaches have so far identified the need to pay attention to risks from potential intentional misuse or unintended issues of control relating to alignment with human intent”.
The EU AI Act also mentions alignment as part of the Technical documentation that AI developers must make publicly available.
This means that alignment is now part of the EU’s regulatory vocabulary.
LongForm post for debate: For Policy’s Sake: Why We Must Distinguish AI Safety from AI Security in Regulatory Governance
TL;DR
I understand that Safety and Security are two sides of the same coin.
But if we don’t clearly articulate the intent behind AI safety evaluations, we risk misallocating stakeholder responsibilities when defining best practices or regulatory standards.
For instance, a provider might point to adversarial robustness testing as evidence of “safety” compliance, when in fact the measure only hardens the model against external threats (security) without addressing the internal model behaviors that could still cause harm to users.
If regulators conflate these, high-capability labs might “meet the letter of the law” while bypassing the spirit of safety altogether.
Opinion Post: Scaling AI Regulation: Realistically, What Can (and Can’t) Be Regulated?
Should we even expect regulation to be useful for AI safety?
Is there a version of AI regulation that wouldn’t be performative?
How do you see the “Brussels effect” playing out for AI Safety?
Are regulatory sandboxes a step in the right direction?
The UN General Assembly just passed a resolution to set up an Independent Scientific Panel on Artificial Intelligence and an annual Global Dialogue on AI Governance.
40 experts will be chosen by an independent appointment committee, half nominated by UN states and half appointed by the Secretary-General.
As of 27 Aug 2025, the UN says it will run an open call for nominations and then the Secretary-General will recommend 40 names to the General Assembly. No names have been announced yet. [1]
Two caveats jump out:
The mandate is explicitly limited to non-military applications, so the domain where AI systems are most likely to be deployed recklessly is left out...
The outputs are to be “policy-relevant but non-prescriptive”, i.e. climate-style summaries rather than hard mandates.
Still, the commitment to “issue evidence-based scientific assessments synthesizing and analysing existing research” leaves a narrow window of hope.
If serious AI-safety experts are appointed, this panel can be a real venue for risk awareness and cross-border coordination on x-risk mitigation.
Conversely, without clear guardrails on composition and scope, it risks becoming a “safety”-branded accelerator for capabilities.
I’d expect Yoshua Bengio to be a top suggestion already, as (among other reasons) he recently led the Safety & Security chapter of the EU General Purpose AI Code of Practice.
Item 3 has some constraints on members (emphasis mine):
This means only two each from the US, UK, China, etc. I wonder what the geographic and gender balance will actually look like; these will significantly shape the average type of expertise and the influence of members.
My guess is that x-risk mitigation will not be the primary focus at first, simply because over half of the experts are American and British men and there are so many other interests to represent. Nor would industry be heavily represented, because it skews too American (and the document mentioned conflicts of interest, and the 7 goals of the Dialogue are mostly not about frontier capabilities). But in the long term, unless takeoff is fast, developing countries will realize the US is marching towards a decisive strategic advantage (DSA), and interesting things could happen.
Edit: my guess for the composition would be similar to the existing High-level Advisory Body on Artificial Intelligence, with 39 members, of which
19 men, 20 women
15 academia/research (including 10 professors), 10 government, 4 from big tech and scaling labs, 10 other
17 Responsibility / Safety / Policy oversight positions, 22 other positions
Nationalities:
9 Americas (incl. 4 US), 11 Europe, 11 Asia (incl. 2 China), 1 Oceania, 5 Africa. They heavily overweighted Europe, which has 28% of the seats but only 9% of world population
19 high income countries, 24 LMIC
GPT-5 isn’t sure about some of the dual-nationality cases, though.
1 big name in x-risk (Jaan Tallinn)
On the flip side, this means that if we do know AI experts who are not from the US/EU/China, it might be valuable to forward this information to them so that they can apply, since they would have a higher chance of being accepted.
Agree, especially for candidates from developing countries without a strong preexisting stance on AI, where choices could be less biased towards experts who already have lots of prestige and more weighted on merit + lobbying.
Huh that is a really good point. There are way too many people with US/UK backgrounds to easily differentiate between the expert pretenders and the really substantial experts. It’s even getting harder to do so on LW for many topics as karma becomes less and less meaningful.
And I can’t imagine the secretary general’s office will have that much time to scrutinize each proposed candidate, so it might even be a positive thing overall.
Exactly! Thank you for highlighting this.
Yes, these are the usual selection-criteria constraints for policy panels. And I agree that the vast majority of big names are US (some UK) based and male. But hey, there are lesser-known voices in EU policy that care about AI Safety. I do share your concern, though. I’ll have the opportunity to ask about this soon at CAIDP (Centre for AI and Digital Policy). I think many people would agree that it’s a good opportunity to talk about AIS awareness in less involved member states…
The European AI Office is finalizing Codes of Practice that will define how general-purpose AI (GPAI) models are governed under the EU AI Act.
They are explicitly asking for global expert input, and feedback is open to anyone, not just EU citizens.
The guidelines in development will shape:
The definition of “systemic risk”
How training compute triggers obligations
When fine-tuners or downstream actors become legally responsible
What counts as sufficient transparency, evaluation, and risk mitigation
I believe that more feedback from alignment and interpretability researchers is needed.
Without strong input from AI Safety researchers and technical AI Governance experts, these rules could lock in shallow compliance norms (mostly centered on copyright or reputational risk) while missing core challenges around interpretability, loss of control, and emergent capabilities.
I’ve written a detailed Longform post breaking down exactly what’s being proposed, where input is most needed, and how you can engage.
Even if you don’t have policy experience, your technical insight could shape how safety is operationalized at scale.
📅 Feedback is open until 22 May 2025, 12:00 CET
🗳️ Submit your response here
Happy to connect with anyone individually for help drafting meaningful feedback.
Thanks for posting and bringing attention to this! I have forwarded to my friend who works in AI safety.
Thank you!!
The European Commission Seeks Experts for Its AI Scientific Panel
The European Commission is now accepting applications for the Scientific Panel of Independent Experts, focusing on general-purpose AI (GPAI). This panel will support the enforcement of the AI Act, and forms part of the institutional scaffolding designed to ensure that GPAI oversight is anchored in technical and scientific legitimacy.
The panel will advise the EU AI Office and national authorities on:
Systemic risks associated with GPAI models
Classification of GPAI models (including systemic-risk designation)
Evaluation methods (benchmarks, red-teaming, risk analysis)
Cross-border market surveillance
Enforcement tools and alerts for emerging risks
This is the institutional embodiment of what many in this community have been asking for: real technical expertise informing regulatory decision-making.
Who Should Apply?
The Commission is selecting 60 experts for a renewable 24-month term.
Members are appointed in a personal capacity and must be fully independent of any GPAI provider (i.e. no employment, consulting, or financial interest). Names and declarations of interest will be made public.
Relevant expertise must include at least one of the following:
Model evaluation and risk analysis: Red-teaming, capability forecasting, adversarial testing, human uplift studies, deployment evaluations
Risk methodologies: Risk taxonomies, modeling, safety cases
Mitigations and governance: Fine-tuning protocols, watermarking, guardrails, safety processes, incident response, internal audits
Misuse and systemic deployment risks: CBRN risk, manipulation of populations, discrimination, safety/security threats, societal harms
Cyber offense risks: Autonomous cyber operations, zero-day exploitation, social engineering, threat scaling
Provider-side security: Prevention of leakage, theft, circumvention of safety controls
Emergent model risks: Deception, self-replication, miscoordination, alignment drift, long-horizon planning
Compute and infrastructure: Compute auditing, resource thresholds, verification protocols
Eligibility:
PhD in a relevant field OR equivalent track record
Proven scientific impact in AI/GPAI research or systemic-risk analysis
Must demonstrate full independence from GPAI providers
Citizenship: At least 80% of panel members must be from EU/EEA countries. The other 20% can be from anywhere, so international researchers are eligible.
Deadline: 14 September 2025
Application link: EU Survey – GPAI Expert Panel
Contact: EU-AI-SCIENTIFIC-PANEL@ec.europa.eu
Why It’s Worth Considering
Even if you’re selected for the Scientific Panel, you are still allowed to carry out your own independent research as long as it is not for GPAI providers and you comply with confidentiality obligations.
This isn’t a full-time, 40-hour-per-week commitment. Members are assigned specific tasks on a periodic basis, with deadlines. If you’d like to know more before applying, I can direct you to the Commission’s contacts for clarification.
Serving on the panel enhances your visibility, sharpens your policy-relevant credentials, and helps fund your own work without forcing you to “sell your soul” or abandon independent projects.
Note that the documentation says they’ll aim to recruit 1-3 nationals from each EU country (plus Norway, Iceland and Liechtenstein). As far as I understood, it does not require them to be living in their home country at the time of applying. Therefore, people from small European countries have especially good odds.
Note also that the gender balance goal would increase the chances for any women applying.
Thanks for sharing! I’ve been sharing it with people working in AI Safety across Europe.
Do you know if anyone is doing a coordinated project to reach out to promising candidates from all EU countries and encourage them to apply? I’m worried that great candidates may not hear about it in time.
Hi, Lucie! Not to my knowledge. I have only seen this advertised by people like Risto Uuk or Jonas Schuett in their newsletters, and informally mentioned in events by people who currently work in the AI Office. But I am not aware of efforts to reach out to specific candidates.
The EU AI Act explicitly mentions “alignment with human intent” as a key focus area in relation to regulation of systemic risks.
As far as I know, this is the first time “alignment” has been mentioned in a law or other major regulatory text.
It’s buried in Recital 110, but it’s there. And it also makes research on AI Control relevant:
“International approaches have so far identified the need to pay attention to risks from potential intentional misuse or unintended issues of control relating to alignment with human intent”.
The EU AI Act also mentions alignment as part of the Technical documentation that AI developers must make publicly available.
This means that alignment is now part of the EU’s regulatory vocabulary.
But here’s the issue: most AI governance professionals and policymakers still don’t know what it really means, or how your research connects to it.
I’m trying to build a space where AI Safety and AI Governance communities can actually talk to each other.
If you’re curious, I wrote an article about this, aimed at corporate decision-makers who lack literacy in your field.
Would love any feedback, especially from folks thinking about how alignment ideas can scale into the policy domain.
Here is the Substack link (I also posted it on LinkedIn):
https://open.substack.com/pub/katalinahernandez/p/why-should-ai-governance-professionals?utm_source=share&utm_medium=android&r=1j2joa
My intuition says that this was a push from Future of Life Institute.
Thoughts? Did you know about this already?
I did not know about this already.
I don’t think it’s been widely discussed within AI Safety forums. Do you have any other comments, though? Epistemic pessimism is welcome XD. But I did think this was at least update-worthy.
I did not know about this either. Do you know whether the EAs in the EU Commission know about it?
Hi Lucie, thanks so much for your comment!
I’m not very involved with the Effective Altruism community myself. I did post the same Quick Take on the EA Forum today, but I haven’t received any responses there yet, so I can’t really say for sure how widely known this is.
For context: I’m a lawyer working in AI governance and data protection, and I’ve also been doing independent AI safety research from a policy angle. That’s how I came across this, just by going through the full text of the AI Act as part of my research.
My guess is that some of the EAs working closely on policy probably do know about it, and influenced this text too! But it doesn’t seem to have been broadly highlighted or discussed in alignment forums so far, which is why I thought it might be worth flagging.
Happy to share more if helpful, or to connect further on this.
Would a safety-focused breakdown of the EU AI Act be useful to you?
The Future of Life Institute published a great high-level summary of the EU AI Act here: https://artificialintelligenceact.eu/high-level-summary/
What I’m proposing is a complementary, safety-oriented summary that extracts the parts of the AI Act that are most relevant to AI alignment researchers, interpretability work, and long-term governance thinkers.
It would include:
Provisions related to transparency, human oversight, and systemic risks.
Notes on how technical safety tools (e.g. interpretability, scalable oversight, evals) might interface with conformity assessments, or the compliance exemptions available for research work.
Commentary on loopholes or compliance dynamics that could shape industry behavior.
What the Act doesn’t currently address from a frontier risk or misalignment perspective.
Target length: 3–5 pages, written for technical researchers and governance folks who want signal without wading through dense regulation.
If this sounds useful, I’d love to hear what you’d want to see included, or what use cases would make it most actionable.
And if you think this is a bad idea, no worries. Just please don’t downvote me into oblivion, I just got to decent karma :).
Thanks in advance for the feedback!
I guess this should wait for the final draft of the GPAI Code of Practice to be released?
I’ll try to make it as helpful as possible, so yes. But I thought I’d start gathering feedback now ☺️.
Two days ago, I published a Substack article called “The Epistemics of Being a Mudblood: Stress Testing intellectual isolation”. I wasn’t sure whether to cross-post it here, but a few people encouraged me to at least share the link.
By background I’m a lawyer (hybrid Legal-AI Safety researcher), and I usually write about AI Safety to spread awareness among tech lawyers and others who might not otherwise engage with the field.
This post, though, is more personal: a reflection on how “deep thinking” and rationalist habits have shaped my best professional and personal outputs, even through long phases of intellectual isolation. Hence the “mudblood” analogy, which (to my surprise) resonated with more people than I expected.
Sharing here in case it’s useful. Obviously very open to criticism and feedback (that’s why I’m here!), but also hoping it’s of some help. :)