This fall Boaz Barak taught Harvard’s first AI safety course (course website). Boaz has done an excellent job organizing and promoting the material; you may have seen his original post on LW. I was the head TA for the course, and helped Boaz run the course (alongside Natalie Abreu, Sunny Qin, and Hanlin Zhang). I wanted to give an inside look into the course and share a lot more of our material in order to help future iterations of the course go well, both at Harvard and similar courses at other universities.
I think this course was a good thing which more universities should do. If you are in a position to try to get a similar course running at your university, this post is for you, and aims to walk you through course logistics and course decisions. Feel invited to reach out with questions or comments. The syllabus is public, and all of the lectures are shared on youtube here.
This course was structured as a research seminar where students learned to replicate important AI safety papers and produce novel research. You can see their projects here.
Basic organization in bullet form:
Assignments:
We had a homework-0 (replicate an emergent misalignment result) and interest-form in order to gate access to the course (we had 274 people fill out an interest form).
The course had a midterm assignment and a final assignment which revolved around recreating important AI safety results.
Final projects were a 5-10 page neurips-style paper and a poster presentation
There were optional weekly student 15 minute presentations by groups of ~4 on topics related to that lecture.
The course had a lot of reading that was lightly enforced
Boaz provided a lot of reimbursements for the course, approximately 50$ per group for the mini-projects, and 500$ per final project. Most students used much less than this.
Assignment Structure
Assignment 0 (Filtering for the course)
The process of filtration was a combination of a “homework 0” score, interest, and background
HW 0 was a recreation of an AI safety paper on Emergent Misalignment. The assignment is here.
The midterm assignment was a recreation of one of five papers that we suggested, with a small open-ended twist. An example here is something like trying to see how different forms of prompting may affect a result, or the robustness of the results on a different model or dataset.
Everyone had to do a poster and a paper. Most people formed groups of 2-4.
Final projects had 2 flavors:
Self-contained final project: this was effectively the midterm assignment extended—recreating a known result with some small modifications to how the original works were implemented. An example here is a project that looked at CoT faithfulness.
Research-y open-ended project: if the students were interested in longer-term projects, the goal was that this project would help set them up for success. As such, here we aimed to evaluate more on having a meaningful “Theory of Change” for the project, having well-thought out evaluations, and having done a meaningful literature review of the space. An example here is two groups did projects on how to uniquely identify a model, given blackbox access to the model and a minimal set of queries (1, 2), and one did a project on investigating AI psychosis in a somewhat open-ended fashion.
I solicited a lot of project ideas from different people, which I then shared with students (at the bottom of cs2881 Final Project Outline). One example was something I was personally interested in: a method for identifying what model generated text (model fingerprinting), given blackbox access to the model (2 groups did projects relating to this). Other projects were solicited from other researchers like one on developing a benchmark for AI scams, or validating a theoretical model of how misinformation spreads.
About 2 weeks after receiving the assignment, all the students had to meet with a TA once, and create a 1 page proposal for their project, shared with the rest of the course. It didn’t seem like most students looked at each other’s proposals, and most of the value was in forcing students to write something down. Realistically, I don’t see a good way to have the students engage across projects, one minor change that is possible here is to force students to give comments on 1 to 2 other projects. This would have been more possible if we started the final projects earlier.
In order to facilitate this, the TAs had to post more office hours, and keep a relatively strict schedule. Most groups met with me (because I was the most “AI safety pilled”), which meant that I had to enforce fairly strict 30-minute meetings with groups.
Upon reflection, I think this is a reasonably good structure. Having-to-meet forces students to come prepared, and do a lot of thinking on their own. It’s fairly fast to get a lot of context on how students are doing when you have high-context on the problem space.
Students submitted a 5-10 page neurips-style assignment on December 3rd, and then came to class on December 10th with a printed out poster.
Nearly every week a group of students volunteered to give a ~15 minute presentation on the experiments that they ran relating to the lecture that week. I think this structure is a very good idea, but is also the part of the course that could have benefited the most from more structure. These optional presentations clearly involved a lot of work, but were not for a grade, nor did they substitute other work. In the future, I would make these assignments either worth extra-credit or potentially could be used to replace the midterm assignment. Separately, these presentations were extremely open-ended, and in order to make these more productive for both the students presenting and the students listening, I think that the course staff should have offered the students more guidance about what we were expecting here.
Take a look at some of the LW articles that the students have posted—we had 11 posts by students!
General reflections
This course helped me solidify the role that academia can play in today’s climate: a lot of papers out there in the field of AI safety somewhat have the flavor of “we tried a thing and it worked.” These papers can be really valuable, but one place for academia is in picking these things up and inspecting them for how robust they really are—when do they break, and what isn’t stated in the papers (i.e. how specific are their claims).
I think this course would not have gone as well as it did if it was an undergraduate course. A lot of the structure that was applied somewhat relied on students wanting to learn, and generally de-emphasized grades. Our ability to give students guidance for projects and the course staff’s ability to grade relied on students trying to engage with the material in earnest and giving our semi-under-specified guidance the benefit of the doubt.
Grading is hard, in general we want to make something that is hard to do and easy to verify. Having students recreate a headline figure from a paper is a fairly decent way to keep things easy-to-verify while requiring meaningful engagement with the material. Requiring a “meaningful spin-off” also forces students to go a bit beyond running claude-code on an existing code base. Perhaps in the future AI tools will also make this too easy for students, but at this current iteration this was decent.
I think we had the correct number of assignments, and correct general structure. The mini-project helped students figure out groups that they wanted to collaborate with, without being too binding.
A lot of the lectures were guest lectures. One might be concerned that this only works well for Boaz, who is extremely well-connected in both academia and industry. I think this is partially true, but my reflection is that a very similar version of this course can be very meaningful even if the guest lectures are not from famous researchers.
One thing that I think was meaningful is making the work that the students did legible to outside people (e.g. we put the papers and the posters on our website, and also promoted this course, and also encouraged the students to post on LW). This helps the students improve their resumes so they can do more work in the field of AI safety.
Lastly, I think it’s important to share that I spent a lot of time on this course. I don’t share this to receive a pat-on-the-back, but rather because I think that the course going well relies on a great professor, but also on a course staff that is willing to spend a lot of time and energy in making the course work out. In future iterations, this may become less important, because the general structure will be more structured and specified.
Recommendations for future iterations:
I found that the the general structure of our course was good, and in future iterations I would keep many things the same. Here are some things I would change or emphasize:
Start the final projects earlier (we started early November, with the deadline of early December).
Share the grading rubrics early; as this was the first time this course was offered, we didn’t always have the rubrics ready at the time the assignments were assigned. This worked for this course, because the students were largely at a graduate level and could reasonably infer what was expected.
A bit (not much) more enforcement of reading materials would have been good.
As I discussed in the “Weekly Optional Presentations” section, giving those presentations more structure and offering credit to the students who do them, would likely meaningfully improve their quality.
For the final projects, I made a long list of suggestions and project proposals for the students. Upon reflection, I’m actually uncertain if this was worth the amount of effort I put in. I think it was worth it, and in future iterations I would actually probably spend about the same amount of time and effort to recreate this project proposal. Approximately 5 projects ended up being from the suggestion list, but I suspect these proposals were also helpful for students to reference, even if they didn’t use them. I observed that the students had a preference for projects relating to “flashy” topics that we covered in class (e.g. the students really liked persona vectors).
Aggregated Student Perspectives on the Course
We issued 2 sets of anonymous questionaires (one official one through Harvard which got 35 responses, and one through a google form, which got 21 responses). Generally the course appeared very well received, and the criticisms were fairly mild and limited. The full “Q-report” from Harvard is available here with the anonymous feedback from the students.
We are happy to share more information about how the students received the course, but here are some course evaluations we surface:
A google form was sent out, and 21⁄69 responses were given, which I then ran through ChatGPT to summarize. (I did very minimal filtering after that.) The Harvard-issued report was very similar in the kinds of responses we received.
Overall Experience
Students generally described the course as high quality, engaging, and well-structured, with a strong emphasis on exposure to real research and researchers.
Commonly cited strengths included:
Guest lectures: Frequently mentioned as a highlight, especially for providing firsthand perspectives from active researchers and practitioners.
Breadth of topics: Many appreciated the wide survey of AI safety subareas, helping them build a mental map of the field.
Readings and discussions: Papers were described as interesting and well-chosen, with discussion-based formats aiding understanding.
Project- and experiment-focused assignments: These were often preferred over traditional problem sets and helped make research feel accessible.
Course format and environment: The seminar style, openness of discussion, and informal elements (e.g., office hours, class atmosphere) were positively received.
Overall, the course was perceived as enjoyable, intellectually stimulating, and effective at exposing students to frontier ideas.
How the Course Influenced Students’ Future Plans
Shifts in Perspective
Many students reported that the course:
Refined or expanded their understanding of AI safety, particularly by clarifying what “safety” means beyond simplified narratives.
Helped distinguish between speculative concerns and concrete technical challenges, leading to more nuanced views on timelines and risks.
Provided a clearer picture of how engineering, interpretability, alignment, and governance fit together.
Some students noted that while their high-level stance did not radically change, they now felt more grounded and informed.
Future Intentions
Students’ stated plans varied, but common directions included:
Increased interest in AI safety research, particularly among those earlier in their academic careers.
Plans to pursue research projects, undergraduate research, or further coursework related to alignment, interpretability, or safety-adjacent topics.
For some, the course helped confirm existing interests rather than redirect them.
Others reported that, even if they do not plan to work directly in AI safety, they now feel better equipped to reason about and discuss the implications of advanced AI systems.
Student Critiques
While feedback was largely positive, several recurring critiques emerged.
Organization and clarity:
Students wanted earlier and clearer communication around Project options and expectations, grading criteria and rubrics, assignment structure and deadlines.
Course logistics were spread across multiple platforms (Slack, Perusall, Google Forms, personal websites), and many students preferred a single centralized system.
Projects and Timing
Projects were not introduced early enough, limiting how ambitious or polished final results could be.
Several students suggested: Announcing final project details earlier, and allowing more time for deeper or more technical projects
Technical Depth and Skill-Building
A suggestion was to increase technical rigor, particularly: More technical lectures, Expanded coverage of interpretability and related methods, or Optional problem sets or hands-on technical exercises
Discussion and Critical Engagement
There was a reasonably strong desire for:
More time to question and critically engage with guest speakers
Greater encouragement to challenge assumptions and contested claims in AI safety research.
Some small percentage of people felt like recording nearly all sessions was seen by some as dampening candid or controversial discussion
Overall, the course appears to have lowered barriers to engagement with the field, whether through direct research involvement or more informed participation in adjacent areas.
Conclusion
I’m extremely happy to have helped make this course a reality alongside Boaz, Natalie, Sunny, and Hanlin. I welcome comments on this post, and would be happy to engage with people trying to spin up similar courses at other universities/organizations—some conversations have already started.
Wrap-up: Summary of all the resources we make public:
Reflections on TA-ing Harvard’s first AI safety course
This fall Boaz Barak taught Harvard’s first AI safety course (course website). Boaz has done an excellent job organizing and promoting the material; you may have seen his original post on LW. I was the head TA for the course, and helped Boaz run the course (alongside Natalie Abreu, Sunny Qin, and Hanlin Zhang). I wanted to give an inside look into the course and share a lot more of our material in order to help future iterations of the course go well, both at Harvard and similar courses at other universities.
I think this course was a good thing which more universities should do. If you are in a position to try to get a similar course running at your university, this post is for you, and aims to walk you through course logistics and course decisions. Feel invited to reach out with questions or comments. The syllabus is public, and all of the lectures are shared on youtube here.
This course was structured as a research seminar where students learned to replicate important AI safety papers and produce novel research. You can see their projects here.
Basic organization in bullet form:
Assignments:
We had a homework-0 (replicate an emergent misalignment result) and interest-form in order to gate access to the course (we had 274 people fill out an interest form).
The course had a midterm assignment and a final assignment which revolved around recreating important AI safety results.
Final projects were a 5-10 page neurips-style paper and a poster presentation
There were optional weekly student 15 minute presentations by groups of ~4 on topics related to that lecture.
The course had a lot of reading that was lightly enforced
We had about ~23 final projects—see here.
Participants:
We had ~70 students, ~50% of them were experienced undergraduates
The rest were graduate students, with maybe ~20% of them being PhDs
We had 1 professor and 4 TAs
Class structure:
The course met once per week for 3 hours
Most of the lectures relied heavily on guest lecturers
Most lectures had a ~15 minute student presentation that was on-theme which a group of ~4 students prepared over the weeks prior.
The syllabus and recordings are public (https://boazbk.github.io/mltheoryseminar/)
Most communication occurred over Slack.
Misc:
Boaz provided a lot of reimbursements for the course, approximately 50$ per group for the mini-projects, and 500$ per final project. Most students used much less than this.
Assignment Structure
Assignment 0 (Filtering for the course)
The process of filtration was a combination of a “homework 0” score, interest, and background
HW 0 was a recreation of an AI safety paper on Emergent Misalignment. The assignment is here.
Interest in CS 2881 AI Safety (google form)
Midterm assignment
The midterm assignment was a recreation of one of five papers that we suggested, with a small open-ended twist. An example here is something like trying to see how different forms of prompting may affect a result, or the robustness of the results on a different model or dataset.
Mini Project description: CS2881 Mini-project
Rubric: Mini Project Grading
Final assignment
Everyone had to do a poster and a paper. Most people formed groups of 2-4.
Final projects had 2 flavors:
Self-contained final project: this was effectively the midterm assignment extended—recreating a known result with some small modifications to how the original works were implemented. An example here is a project that looked at CoT faithfulness.
Research-y open-ended project: if the students were interested in longer-term projects, the goal was that this project would help set them up for success. As such, here we aimed to evaluate more on having a meaningful “Theory of Change” for the project, having well-thought out evaluations, and having done a meaningful literature review of the space. An example here is two groups did projects on how to uniquely identify a model, given blackbox access to the model and a minimal set of queries (1, 2), and one did a project on investigating AI psychosis in a somewhat open-ended fashion.
I solicited a lot of project ideas from different people, which I then shared with students (at the bottom of cs2881 Final Project Outline). One example was something I was personally interested in: a method for identifying what model generated text (model fingerprinting), given blackbox access to the model (2 groups did projects relating to this). Other projects were solicited from other researchers like one on developing a benchmark for AI scams, or validating a theoretical model of how misinformation spreads.
About 2 weeks after receiving the assignment, all the students had to meet with a TA once, and create a 1 page proposal for their project, shared with the rest of the course. It didn’t seem like most students looked at each other’s proposals, and most of the value was in forcing students to write something down. Realistically, I don’t see a good way to have the students engage across projects, one minor change that is possible here is to force students to give comments on 1 to 2 other projects. This would have been more possible if we started the final projects earlier.
In order to facilitate this, the TAs had to post more office hours, and keep a relatively strict schedule. Most groups met with me (because I was the most “AI safety pilled”), which meant that I had to enforce fairly strict 30-minute meetings with groups.
Upon reflection, I think this is a reasonably good structure. Having-to-meet forces students to come prepared, and do a lot of thinking on their own. It’s fairly fast to get a lot of context on how students are doing when you have high-context on the problem space.
Students submitted a 5-10 page neurips-style assignment on December 3rd, and then came to class on December 10th with a printed out poster.
Project description: cs2881 Final Project Outline
Rubric: Final Project Grading Rubric
Weekly Optional Presentations
Nearly every week a group of students volunteered to give a ~15 minute presentation on the experiments that they ran relating to the lecture that week. I think this structure is a very good idea, but is also the part of the course that could have benefited the most from more structure. These optional presentations clearly involved a lot of work, but were not for a grade, nor did they substitute other work. In the future, I would make these assignments either worth extra-credit or potentially could be used to replace the midterm assignment. Separately, these presentations were extremely open-ended, and in order to make these more productive for both the students presenting and the students listening, I think that the course staff should have offered the students more guidance about what we were expecting here.
Take a look at some of the LW articles that the students have posted—we had 11 posts by students!
General reflections
This course helped me solidify the role that academia can play in today’s climate: a lot of papers out there in the field of AI safety somewhat have the flavor of “we tried a thing and it worked.” These papers can be really valuable, but one place for academia is in picking these things up and inspecting them for how robust they really are—when do they break, and what isn’t stated in the papers (i.e. how specific are their claims).
I think this course would not have gone as well as it did if it was an undergraduate course. A lot of the structure that was applied somewhat relied on students wanting to learn, and generally de-emphasized grades. Our ability to give students guidance for projects and the course staff’s ability to grade relied on students trying to engage with the material in earnest and giving our semi-under-specified guidance the benefit of the doubt.
Grading is hard, in general we want to make something that is hard to do and easy to verify. Having students recreate a headline figure from a paper is a fairly decent way to keep things easy-to-verify while requiring meaningful engagement with the material. Requiring a “meaningful spin-off” also forces students to go a bit beyond running claude-code on an existing code base. Perhaps in the future AI tools will also make this too easy for students, but at this current iteration this was decent.
I think we had the correct number of assignments, and correct general structure. The mini-project helped students figure out groups that they wanted to collaborate with, without being too binding.
A lot of the lectures were guest lectures. One might be concerned that this only works well for Boaz, who is extremely well-connected in both academia and industry. I think this is partially true, but my reflection is that a very similar version of this course can be very meaningful even if the guest lectures are not from famous researchers.
One thing that I think was meaningful is making the work that the students did legible to outside people (e.g. we put the papers and the posters on our website, and also promoted this course, and also encouraged the students to post on LW). This helps the students improve their resumes so they can do more work in the field of AI safety.
Lastly, I think it’s important to share that I spent a lot of time on this course. I don’t share this to receive a pat-on-the-back, but rather because I think that the course going well relies on a great professor, but also on a course staff that is willing to spend a lot of time and energy in making the course work out. In future iterations, this may become less important, because the general structure will be more structured and specified.
Recommendations for future iterations:
I found that the the general structure of our course was good, and in future iterations I would keep many things the same. Here are some things I would change or emphasize:
Start the final projects earlier (we started early November, with the deadline of early December).
Share the grading rubrics early; as this was the first time this course was offered, we didn’t always have the rubrics ready at the time the assignments were assigned. This worked for this course, because the students were largely at a graduate level and could reasonably infer what was expected.
A bit (not much) more enforcement of reading materials would have been good.
As I discussed in the “Weekly Optional Presentations” section, giving those presentations more structure and offering credit to the students who do them, would likely meaningfully improve their quality.
For the final projects, I made a long list of suggestions and project proposals for the students. Upon reflection, I’m actually uncertain if this was worth the amount of effort I put in. I think it was worth it, and in future iterations I would actually probably spend about the same amount of time and effort to recreate this project proposal. Approximately 5 projects ended up being from the suggestion list, but I suspect these proposals were also helpful for students to reference, even if they didn’t use them. I observed that the students had a preference for projects relating to “flashy” topics that we covered in class (e.g. the students really liked persona vectors).
Aggregated Student Perspectives on the Course
We issued 2 sets of anonymous questionaires (one official one through Harvard which got 35 responses, and one through a google form, which got 21 responses). Generally the course appeared very well received, and the criticisms were fairly mild and limited. The full “Q-report” from Harvard is available here with the anonymous feedback from the students.
We are happy to share more information about how the students received the course, but here are some course evaluations we surface:
A google form was sent out, and 21⁄69 responses were given, which I then ran through ChatGPT to summarize. (I did very minimal filtering after that.) The Harvard-issued report was very similar in the kinds of responses we received.
Overall Experience
Students generally described the course as high quality, engaging, and well-structured, with a strong emphasis on exposure to real research and researchers.
Commonly cited strengths included:
Guest lectures: Frequently mentioned as a highlight, especially for providing firsthand perspectives from active researchers and practitioners.
Breadth of topics: Many appreciated the wide survey of AI safety subareas, helping them build a mental map of the field.
Readings and discussions: Papers were described as interesting and well-chosen, with discussion-based formats aiding understanding.
Project- and experiment-focused assignments: These were often preferred over traditional problem sets and helped make research feel accessible.
Course format and environment: The seminar style, openness of discussion, and informal elements (e.g., office hours, class atmosphere) were positively received.
Overall, the course was perceived as enjoyable, intellectually stimulating, and effective at exposing students to frontier ideas.
How the Course Influenced Students’ Future Plans
Shifts in Perspective
Many students reported that the course:
Refined or expanded their understanding of AI safety, particularly by clarifying what “safety” means beyond simplified narratives.
Helped distinguish between speculative concerns and concrete technical challenges, leading to more nuanced views on timelines and risks.
Provided a clearer picture of how engineering, interpretability, alignment, and governance fit together.
Some students noted that while their high-level stance did not radically change, they now felt more grounded and informed.
Future Intentions
Students’ stated plans varied, but common directions included:
Increased interest in AI safety research, particularly among those earlier in their academic careers.
Plans to pursue research projects, undergraduate research, or further coursework related to alignment, interpretability, or safety-adjacent topics.
For some, the course helped confirm existing interests rather than redirect them.
Others reported that, even if they do not plan to work directly in AI safety, they now feel better equipped to reason about and discuss the implications of advanced AI systems.
Student Critiques
While feedback was largely positive, several recurring critiques emerged.
Organization and clarity:
Students wanted earlier and clearer communication around Project options and expectations, grading criteria and rubrics, assignment structure and deadlines.
Course logistics were spread across multiple platforms (Slack, Perusall, Google Forms, personal websites), and many students preferred a single centralized system.
Projects and Timing
Projects were not introduced early enough, limiting how ambitious or polished final results could be.
Several students suggested: Announcing final project details earlier, and allowing more time for deeper or more technical projects
Technical Depth and Skill-Building
A suggestion was to increase technical rigor, particularly: More technical lectures, Expanded coverage of interpretability and related methods, or Optional problem sets or hands-on technical exercises
Discussion and Critical Engagement
There was a reasonably strong desire for:
More time to question and critically engage with guest speakers
Greater encouragement to challenge assumptions and contested claims in AI safety research.
Some small percentage of people felt like recording nearly all sessions was seen by some as dampening candid or controversial discussion
Overall, the course appears to have lowered barriers to engagement with the field, whether through direct research involvement or more informed participation in adjacent areas.
Conclusion
I’m extremely happy to have helped make this course a reality alongside Boaz, Natalie, Sunny, and Hanlin. I welcome comments on this post, and would be happy to engage with people trying to spin up similar courses at other universities/organizations—some conversations have already started.
Wrap-up: Summary of all the resources we make public:
Syllabus
Video lectures
Course filtering:
Hw0
Interest form
Midterm Assignment
CS2881 Mini-project description
Mini Project Grading
Final Project
CS2881 Final Project Outline
Final Project Grading Rubric
Link to posters and presentations
LW posts from optional weekly presentations