Condensation

abramdemski · 9 Nov 2025 19:08 UTC
53 points
6 comments · 15 min read · LW link

Problems I’ve Tried to Legibilize

Wei Dai · 9 Nov 2025 10:27 UTC
28 points
3 comments · 2 min read · LW link

Myopia Mythology

abramdemski · 8 Nov 2025 22:22 UTC
23 points
1 comment · 3 min read · LW link

Comparing Payor & Löb

abramdemski · 8 Nov 2025 5:40 UTC
45 points
1 comment · 3 min read · LW link

A scheme to credit hack policy gradient training

Adrià Garriga-alonso · 7 Nov 2025 6:24 UTC
15 points
0 comments · 5 min read · LW link

Geometric UDT

abramdemski · 6 Nov 2025 6:24 UTC
28 points
6 comments · 7 min read · LW link

Meta-agentic Prisoner’s Dilemmas

TsviBT · 5 Nov 2025 16:44 UTC
37 points
1 comment · 5 min read · LW link

Legible vs. Illegible AI Safety Problems

Wei Dai · 4 Nov 2025 21:39 UTC
254 points
77 comments · 2 min read · LW link

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

4 Nov 2025 16:25 UTC
40 points
2 comments · 6 min read · LW link
(arxiv.org)

Weak-To-Strong Generalization

abramdemski · 2 Nov 2025 2:45 UTC
28 points
0 comments · 9 min read · LW link

Anthropic’s Pilot Sabotage Risk Report

dmz · 30 Oct 2025 17:50 UTC
32 points
2 comments · 3 min read · LW link
(alignment.anthropic.com)

Sonnet 4.5’s eval gaming seriously undermines alignment evals, and this seems caused by training on alignment evals

30 Oct 2025 15:34 UTC
139 points
20 comments · 14 min read · LW link

Steering Evaluation-Aware Models to Act Like They Are Deployed

30 Oct 2025 15:03 UTC
58 points
12 comments · 16 min read · LW link

Should AI Developers Remove Discussion of AI Misalignment from AI Training Data?

Alek Westover · 23 Oct 2025 15:12 UTC
37 points
3 comments · 9 min read · LW link

Technical Acceleration Methods for AI Safety: Summary from October 2025 Symposium

Martin Leitgab · 22 Oct 2025 21:33 UTC
25 points
2 comments · 6 min read · LW link

Reducing risk from scheming by studying trained-in scheming behavior

ryan_greenblatt · 16 Oct 2025 16:16 UTC
32 points
0 comments · 11 min read · LW link

Rogue internal deployments via external APIs

15 Oct 2025 19:34 UTC
34 points
4 comments · 6 min read · LW link

Current Language Models Struggle to Reason in Ciphered Language

14 Oct 2025 9:08 UTC
77 points
6 comments · 5 min read · LW link

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

14 Oct 2025 0:53 UTC
115 points
13 comments · 9 min read · LW link

Iterated Development and Study of Schemers (IDSS)

ryan_greenblatt · 10 Oct 2025 14:17 UTC
41 points
1 comment · 8 min read · LW link