
Alexa Pan

Karma: 457

The distillation double bind: Distilling misaligned models either transfers misalignment or it doesn’t

18 Jun 2026 21:21 UTC
57 points
3 comments · 5 min read
(blog.redwoodresearch.org)

Incriminating misaligned AI models via distillation

15 May 2026 21:43 UTC
115 points
12 comments · 5 min read

Alexa Pan’s Shortform

Alexa Pan · 22 Apr 2026 23:35 UTC
4 points
2 comments · 1 min read

A taxonomy of barriers to trading with early misaligned AIs

Alexa Pan · 21 Apr 2026 19:02 UTC
76 points
3 comments · 47 min read

Will misaligned AIs know that they’re misaligned?

Alexa Pan · 4 Dec 2025 21:58 UTC
13 points
5 comments · 9 min read

What would an IRB-like policy for AI experiments look like?

Alexa Pan · 24 Nov 2025 19:36 UTC
22 points
0 comments · 15 min read

Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

30 Oct 2025 15:34 UTC
144 points
22 comments · 14 min read

AI Safety Newsletter #43: White House Issues First National Security Memo on AI Plus, AI and Job Displacement, and AI Takes Over the Nobels

28 Oct 2024 16:03 UTC
6 points
0 comments · 6 min read
(newsletter.safe.ai)