Call for suggestions—AI safety course
In the fall I am planning to teach an AI safety graduate course at Harvard. The format is likely to be roughly similar to my “foundations of deep learning” course.
I am still not sure of the content, and would be happy to get suggestions.
Some (somewhat conflicting) desiderata:
I would like to cover the various ways AI could go wrong: malfunction, misuse, societal upheaval, arms race, surveillance, bias, misalignment, loss of control,… (and anything else I’m not thinking of right now). I talke about some of these issues here and here.
I would like to see what we can learn from other fields, including software security, aviation and automative safety, drug safety, nuclear arms control, etc.. (and happy to get other suggestions)
Talk about policy as well, various frameworks inside companies, regulations etc..
Talk about predictions for the future, methodologies for how to come up with them.
All of the above said, I get antsy if I don’t get my dosage of math and code- I intend 80% of the course to be technical and cover research papers and results. It should also involve some hands on projects.
Some technical components should include: evaluations, technical mitigations, attacks, white and black box interpretability methods, model organisms.
Whenever I teach a course I always like to learn something from it, so I hope to cover state of art research results, especially ones that require some work to dig into and I wouldn’t get to do so without this excuse.
Anyway, since I haven’t yet planned this course, I thought I would solicit comments on what should be covered in such a course. Links to other courses blogs etc. are also useful. (I do have a quirk that I’ve never been able to teach from someone else’s material, and often ended up writing a textbook, see here, here, and here whenever I teach a course.., so I don’t intent to adapt any curricula wholesale)
Thanks in advance!
[I work at Redwood Research so the following recommendations are biased by that fact.]
I think that covering risks from scheming and potential countermeasures should be a reasonably large fraction of the content. Concrete sub-topics here include:
Why/how schemers pose risks: Scheming AIs: Will AIs fake alignment during training in order to get power?, Risks from Learned Optimization, The case for ensuring that powerful AIs are controlled, Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking, A sketch of an AI control safety case
High-stakes control (i.e. preventing concentrated failures): AI Control: Improving Safety Despite Intentional Subversion, AI catastrophes and rogue deployments, How to prevent collusion when using untrusted models to monitor each other, Ctrl-Z: Controlling AI Agents via Resampling,
Low-stakes control (i.e. preventing non-concentrated failures such as research sabotage): This is a super important part of AI control, but we don’t have great resources on it yet. You might want to look at Notes on handling non-concentrated failures with AI control: high level methods and different regimes for inspiration, or How can we solve diffuse threats like research sabotage with AI control?
Misc other notes:
For hands-on projects, I’d recommend 7+ tractable directions in AI control for inspiration, and talking to Ethan Perez.
I also endorse talking about recursive oversight, though personally I’m most excited about recursive oversight techniques that (1) are trying to be robust against models trying to subvert the oversight process, and (2) techniques that assume access to a small amount of ground truth data.
I also like various kinds of “basic science” on safety-relevant aspects of AIs; a lot of Owain Evans’ recent work comes to mind.
I agree this stuff is important, mostly because I think that scheming is a big part of where catastrophic risk from AI comes from.
Yes there is a general question I want to talk about which is the gap between training, evaluation, and deployment, and the reasons why models might be:
1. Able to tell in which of these environments they are in
2. Act differently based on that
In addition to Elad Hazan’s, I am aware of similar courses from Roger Grosse (Toronto) and Max Lamparth et al (Stanford).
A course I tought on AI safety and alignment at Princeton from 2 years ago:
https://sites.google.com/view/cos598aisafety/
list of materials is on the webpage, happy to share lecture slides as well.
I believe Peter Henderson has a newer version. Roger Grosse taught a similar course at U Toronto.
As a category suggestion, I’d suggest including “obstructions/barriers” to making safe/aligned AGI, in the sense of the natural proofs barrier for P vs. NP. They aren’t going to be nearly as crisp in AGI safety/alignment as in math, but I still think it’d be valuable to present fundamental problems, that seem like they must be solved, as key parts of the field—or in other words, sets of approaches (such as ignoring those problems) that (most likely) won’t work.
Some examples are in “AGI Ruin: A List of Lethalities”, though not directly stated as barriers, e.g. “you only get one shot at making a safe fully-superhuman AI” could be stated as “solutions that require observing the behavior of a fully-superhuman AI don’t work [given such and such background conclusions]”.
(I also give examples in “The fraught voyage of aligned novelty”, though maybe not presented in a way you’d like. E.g. “if the AI is genuinely smarter than you, it’s thinking in ways that are alien to you” --> “solutions that require the AI to only think in ways that you already understand don’t work”.)
I would have taken this class had I not graduated this spring!
A few suggestions:
AI 2027
Measuring AI Ability to Complete Long Tasks
The case for AGI by 2030
Evaluating frontier AI R&D capabilities of language model agents against human experts
Recent Redwood Research project proposals
Open problems in emergent misalignment
Another thing I consider really important: many of the students will be like “Holy shit, AGI is happening! This affects my life plans!” and will want advice. I think it’s good to have something to point them to, like:
Why AGI could be here soon and what you can do about it: a primer
Good luck running the course!
Perhaps some ideas for non-technical readings from my syllabus (it’s an AI safety course I’ve been teaching for the past couple of years for a philosophy dept).
If legal policy is in your wheelhouse, here’s a selection of the growing literature (apologies, some of it is my own)
Noam Kolt – “Algorithmic Black Swans”
Addressing catastrophic tail events via anticipatory regulation
Published: https://journals.library.wustl.edu/lawreview/article/id/8906/
Peter Salib & Simon Goldstein – “AI Rights for Human Safety”
On AI rights for safety
SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4536494
Yonathan Arbel, Matthew Tokson & Albert Lin – “Systemic Regulation of Artificial Intelligence”
Moving from application-level regulation to system-level regulation
SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4543681
Published: https://arizonastatelawjournal.org/article/systemic-regulation-of-artificial-intelligence/
Gabriel Weil – “Tort Law as a Tool for Mitigating Catastrophic Risk from Artificial Intelligence” Proposes reforming tort law to deter catastrophic AI risks before they materialize
SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4694006
Yonathan Arbel et al. – “Open Questions in Law and AI Safety: An Emerging Research Agenda”
Sets out a research agenda for a new field of “AI Safety Law,” focusing on existential and systemic AI risks
Published: https://www.lawfaremedia.org/article/open-questions-in-law-and-ai-safety-an-emerging-research-agenda
Peter Salib – “AI Outputs Are Not Protected Speech”
Argues that AI-generated outputs lack First Amendment protection, enabling stronger safety regulation
SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4481512
Mirit Eyal & Yonathan Arbel – “Tax Levers for a Safer AI Future”
Proposes using tax credits and penalties to align AI development incentives with public safety
SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4528105
Cullen O’Keefe, Rohan Ramakrishnan, Annette Zimmermann, Daniel Tay & David C. Winter – “Law-Following AI: Designing AI Agents to Obey Human Laws”
Suggests AI agents should be trained and constrained to follow human law, like corporate actors
SSRN: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4726207
I’m surprised no one has mentioned this yet (?) but I woudl recommend at least an overview of classical agent foundations, say the Vingean reflection section of: https://intelligence.org/research-guide/
Only a small core of people is still working on e.g. fully solving decision theory for alignment, which even I admit seems like a bit of a detour. But the theoretical discussion around Vingean uncertainty still seems illustrative of important barriers to both theoretical and empirical agendas.
Thank you—although these type of questions (how can a weak agent verify the actions of a strong agent) are closely related to my background in computational complexity (e.g., interactive proofs, probabilistically checkable proofs, delegation of computation), I plan to keep the course very empirical
Yes, I’m actually rereading your excellent computational complexity textbook currently!
I have been thinking about to what extent the agent foundations communities’ goal of understanding computational uncertainty (including Vingean reflection) is “hard” or even “complete” for much of computational complexity theory (perhaps also in the literal sense of containing complete problems for well-studied complexity classes), and therefore perhaps far too ambitious to expect a “solution” before AGI. I wonder if you have thoughts on this.
One direction I’ve been exploring recently is a computationally unbounded theory of embedded agency—which tries to avoid needing to talk about computational complexity, but this may not capture the important problems of self-trust needed for alignment.
Anyway, there’s no rigorous empirical science without some kind of theory—I know you didn’t try to teach cryptography in a purely empirical way, and both subjects share an adversarial nature :)
It seems good to spend a bunch of time on takeoff speeds given how important they are for how AI goes. There are many sources discussing takeoff speeds. Some places to look: Tom Davidson’s outputs, forethought, AI-2027 (including supplements), AI futures project, epoch, Daniel K.’s outputs, some posts by me.
Two ideas for projects/exercises, which I think could be very instructive and build solid instincts about AI safety:
Builder-breaker arguments, a la ELK
Writing up a safety case (and doing the work to generate the underlying evidence for it)
As someone who has applied to take this class, I’ll suggest 10 papers, 4 from my own niche research interests and 6 for more very recent eval-focused work which I think is interesting and I’d like an excuse to read/discuss.
Niche Interests
1) In terms of what we can learn from other fields, AI-safety-conscious cognitive scientists have recently been thinking about how to move past revealed preferences in AI Alignment. They’ve come up with resource-rational contractualism, which on the surface seems like an interesting framework with a Bayesian bent, so it looks like it could also scratch the math itch. These two papers: (Zhi-Xuan et al. 2024) and (Levine et al. 2025) seem to be the main ones so far, and are very recent.
2) I find Goodfire AI’s approach to mech interp, which essentially tries to use model params instead of activations to find mechanisms, really interesting, and I think it is both new enough and mathematically-appropriate enough that I can see student projects iterating on it for the class: (Braun et al. 2025) and (Bushnaq et al. 2025) are the main papers here.
Recent Eval Work
The METR doubling-time paper, Ai2′s SciArena, LLMs Often Know When They’re Being Evaluated, Anthropic’s Shade-ARENA, and UK AISI’s STACK Adversarial Attack, and Cohere’s takedown of LLMArena
As Elad mentioned earlier, Peter Henderson’s courses at Princeton can be found here: https://www.polarislab.org/#/teaching
There’s the Spring 2025 version of COS 598A: AI Safety & Alignment, which is a grad course on technical AI safety topics, and the Fall 2024 version of SPI 352/COS 352: Artificial Intelligence, Law, & Public Policy, which is an undergrad course that’s structured like a law school seminar.
I highly recommend both courses! I think the syllabi are designed pretty well, and the “paper debate” component of COS 598A was quite good for fostering deeper engagement.
I would appreciate a lecture on how causality is used in mechanistic interpretability, particularly for understanding the faithfulness of Chain-of-Thought reasoning. This intersection is crucial for AI safety as it addresses whether LLMs actually follow the reasoning steps they generate or if their explanations are post-hoc rationalizations. The field has developed sophisticated causal intervention methods to probe the relationship between intermediate reasoning steps and final outputs.
Key Papers:
Fazl, B., et al. (2025). “Chain-of-Thought Is Not Explainability.”
Paul, B., et al. (2025) “Thought Anchors: Which LLM Reasoning Steps Matter?” arXiv preprint.
Jin, Z., et al. (2024). LLMs with Chain-of-Thought Are Non-Causal Reasoners. arXiv:2402.16048.
Singh, J., et al. (2024). How to think step-by-step: A mechanistic understanding of chain-of-thought reasoning. arXiv:2402.18312.
Lanham, T., et al. (2023). Measuring Faithfulness in Chain-of-Thought Reasoning. Anthropic Technical Report.
I suggest something on Value Alignment itself. The actual problem of trying make a model have the values you want, be certain of it, be certain it will scale and other parts of the Hard Part of Alignment.
I agree on the object level that the hard barriers to alignment should be emphasized.
However, (based on this and other conversations with you) I think we have slightly different pictures of what this looks like in practice.
As an allegory, many computational complexity researchers would like to solve P vs NP. Like you do here, they would emphasize working on directions that have a chance of solving the hard part of the problem. However, they would not necessarily advocate that people just need to “work on the hard part of P vs NP.” In fact, for better or worse I think it’s actually considered slightly cringe to say you are trying to solve P vs NP. The reason is that the direct approaches are all known to fail, in the strong sense that there are proofs that many entire classes of proof techniques can’t work. So actually trying to solve P vs NP often looks like working on something very indirectly related to P vs NP, which you know must be resolved before (or along with) P vs NP for convoluted reasons, but would not itself directly resolve the question. It’s just something that might help you get traction. An example is algebraic complexity (though some people might work on it for its own sake as well). Another example is/was circuit lower bounds (though according to none other than Professor Barak this lately seems blocked).
I view the situation as similar in alignment. Many people have thought about how to just solve value alignment, including me, and realized they were not prepared to write down a solution. Then we switched to working on things which hopefully DO address the hard part, but don’t necessarily have a complete path to resolving the problem—rather we hope they can tractable make nonzero progress on the hard part.
This seems generally correct to me. I think that alignment is going to be easier to solve than P vs NP but that might be because I’m just ignorant about how hard doing research on the hard part actually is.
It does seem to me that there still aren’t good resources even clearly saying what the hard part of alignment is, let alone clearly explaining how to do research on it.
So I think there’s a lot of things that can be done to make it easier and test just how hard it really is.
I am getting some great links as responses to my post on X https://x.com/boazbaraktcs/status/1940780441092739351
I suggest something on Value Alignment itself. The actual problem of trying make a model have the values you want, be certain of it, be certain it will scale and other parts of the Hard Part of Alignment.