Formalizing Embeddedness Failures in Universal Artificial Intelligence

Cole Wyeth
May 26, 2025, 12:36 PM
39 points
0 comments
1 min read
(arxiv.org)

Reward button alignment

Steven Byrnes
May 22, 2025, 5:36 PM
50 points
15 comments
12 min read

Unexploitable search: blocking malicious use of free parameters

May 21, 2025, 5:23 PM
34 points
16 comments
6 min read

Modeling versus Implementation

Cole Wyeth
May 18, 2025, 1:38 PM
27 points
10 comments
3 min read

Proof Section to an Introduction to Reinforcement Learning for Understanding Infra-Bayesianism

Brittany Gelb
May 17, 2025, 2:36 AM
3 points
0 comments
9 min read

An Introduction to Reinforcement Learning for Understanding Infra-Bayesianism

Brittany Gelb
May 17, 2025, 2:34 AM
11 points
0 comments
20 min read

It Is Untenable That Near-Future AI Scenario Models Like “AI 2027” Don’t Include Open Source AI

Andrew Dickson
May 16, 2025, 2:20 AM
36 points
17 comments
5 min read

Problems with instruction-following as an alignment target

Seth Herd
May 15, 2025, 3:41 PM
48 points
14 comments
10 min read

What if Agent-4 breaks out?

Alvin Ånestrand
May 15, 2025, 9:15 AM
10 points
0 comments
6 min read

Dodging systematic human errors in scalable oversight

Geoffrey Irving
May 14, 2025, 3:19 PM
33 points
3 comments
4 min read

Working through a small tiling result

James Payor
May 13, 2025, 8:28 PM
66 points
9 comments
5 min read

Measuring Schelling Coordination—Reflections on Subversion Strategy Eval

Graeme Ford
May 12, 2025, 7:06 PM
5 points
0 comments
8 min read

Political sycophancy as a model organism of scheming

May 12, 2025, 5:49 PM
39 points
0 comments
14 min read

AIs at the current capability level may be important for future safety work

ryan_greenblatt
May 12, 2025, 2:06 PM
81 points
2 comments
4 min read

Highly Opinionated Advice on How to Write ML Papers

Neel Nanda
May 12, 2025, 1:59 AM
60 points
4 comments
32 min read

Absolute Zero: Alpha Zero for LLM

alapmi
May 11, 2025, 8:42 PM
23 points
16 comments
1 min read

Glass box learners want to be black box

Cole Wyeth
May 10, 2025, 11:05 AM
46 points
10 comments
4 min read

Mind the Coherence Gap: Lessons from Steering Llama with Goodfire

eitan sprejer
May 9, 2025, 9:29 PM
4 points
1 comment
6 min read

Slow corporations as an intuition pump for AI R&D automation

9 May 2025 14:49 UTC
91 points
23 comments
9 min read

Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI

Steven Byrnes
8 May 2025 21:11 UTC
24 points
0 comments
18 min read