newsletter.safe.ai
Dan H
Newsletter issues:
MLSN #17: Measuring General AI Abilities and Mitigating Deception
AISN #65: Measuring Automation and Superintelligence Moratorium Letter
AISN #64: New AGI Definition and Senate Bill Would Establish Liability for AI Harms
AISN #63: California’s SB-53 Passes the Legislature
AISN #61: OpenAI Releases GPT-5
AISN #60: The AI Action Plan
AISN #59: EU Publishes General-Purpose AI Code of Practice
AISN #58: Senate Removes State AI Regulation Moratorium
AISN #57: The RAISE Act
AISN #56: Google Releases Veo 3
AISN #55: Trump Administration Rescinds AI Diffusion Rule, Allows Chip Sales to Gulf States
AISN #54: OpenAI Updates Restructure Plan
AISN #53: An Open Letter Attempts to Block OpenAI Restructuring
AISN #52: An Expert Virology Benchmark
AISN #51: AI Frontiers

Comments:
This dynamic is captured in IABIED’s story and this paper from 2023: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4445706

On the claim that the book just rehearsed variations on the arguments Eliezer/MIRI already deployed: I think they’re improved and simplified. My favorite chapter is “Chapter 5: Its Favorite Things.”

It’s a great book: it’s simple, memorable, and unusually convincing.
If a strategy is likely to be outdated quickly, it’s not robust and not a good strategy. Strategies should be able to withstand lots of variation.
Thank you to Neel for writing this. Most people pivot quietly.
I’ve been among the most skeptical of mechanistic interpretability for years. I excluded interpretability from Unsolved Problems in ML Safety for this reason, while other fields such as d/acc (Systemic Safety) were included, all the way back in 2021.
Here are some earlier criticisms: https://www.lesswrong.com/posts/5HtDzRAk7ePWsiL2L/open-problems-in-ai-x-risk-pais-5#Transparency
More recent commentary: https://ai-frontiers.org/articles/the-misguided-quest-for-mechanistic-ai-interpretability
I think the community should reflect on its genius-worship culture (in the case of Olah, a close friend of the inner circle) and on its epistemics: the approach was dominant for years, and I think this outcome was entirely foreseeable.