CS 2881r

Last edit: 11 Sep 2025 17:17 UTC by habryka

CS 2881r is a class by @boazbarak on AI Safety and Alignment at Harvard.

This tag applies to all posts about that class, as well as posts created in the context of it, e.g. as part of student assignments.

[CS 2881r] Some Generalizations of Emergent Misalignment

Valerio Pepe · 14 Sep 2025 16:18 UTC
11 points
0 comments · 9 min read · LW link

AI Safety course intro blog

boazbarak · 21 Jul 2025 2:35 UTC
16 points
0 comments · 1 min read · LW link
(windowsontheory.org)

[CS 2881r] [Week 3] Adversarial Robustness, Jailbreaks, Prompt Injection, Security

egeckr · 27 Sep 2025 1:31 UTC
2 points
0 comments · 26 min read · LW link

[CS2881r] Optimizing Prompts with Reinforcement Learning

1 Oct 2025 14:02 UTC
1 point
0 comments · 5 min read · LW link

Call for suggestions — AI safety course

boazbarak · 3 Jul 2025 14:30 UTC
53 points
23 comments · 1 min read · LW link

[CS 2881r AI Safety] [Week 2] Modern LLM Training

jusyc · 26 Sep 2025 1:25 UTC
1 point
0 comments · 4 min read · LW link

[CS 2881r AI Safety] [Week 1] Introduction

14 Sep 2025 19:52 UTC
15 points
0 comments · 13 min read · LW link

Learnings from AI safety course so far

boazbarak · 27 Sep 2025 18:17 UTC
101 points
4 comments · 3 min read · LW link