Mind the Coherence Gap: Lessons from Steering Llama with Goodfire

eitan sprejer
May 9, 2025, 9:29 PM
4 points

3 votes

Overall karma indicates overall quality.

1 comment, 6 min read, LW link

My Experience With EMDR

Sable
May 9, 2025, 9:25 PM
22 points

6 votes

0 comments, 11 min read, LW link
(affablyevil.substack.com)

AI’s Hidden Game: Understanding Strategic Deception in AI and Why It Matters for Our Future

EmilyinAI
May 9, 2025, 8:01 PM
4 points

3 votes

0 comments, 6 min read, LW link

Muddling Through Some Thoughts on the Nature of Historiography

E.G. Blee-Goldman
May 9, 2025, 7:04 PM
2 points

1 vote

0 comments, 4 min read, LW link

A Guide to AI 2027

koenrane
May 9, 2025, 5:14 PM
0 points

5 votes

1 comment, 28 min read, LW link

Let’s stop making “Intelligence scale” graphs with humans and AI

Expertium
May 9, 2025, 4:01 PM
3 points

11 votes

15 comments, 1 min read, LW link

Slow corporations as an intuition pump for AI R&D automation

May 9, 2025, 2:49 PM
91 points

36 votes

23 comments, 9 min read, LW link

Cheaters Gonna Cheat Cheat Cheat Cheat Cheat

Zvi
May 9, 2025, 2:30 PM
55 points

26 votes

4 comments, 22 min read, LW link
(thezvi.wordpress.com)

Humans vs LLM, memes as theorems

Yaroslav Granowski
May 9, 2025, 1:26 PM
1 point

1 vote

0 comments, 1 min read, LW link

Moving towards a question-based planning framework, instead of task lists

casualphysicsenjoyer
May 9, 2025, 12:18 PM
4 points

2 votes

1 comment, 8 min read, LW link
(substack.com)

Jim Babcock’s Mainline Doom Scenario: Human-Level AI Can’t Control Its Successor

May 9, 2025, 5:20 AM
30 points

11 votes

4 comments, 62 min read, LW link
(www.youtube.com)

Attend the 2025 Reproductive Frontiers Summit, June 10-12

May 9, 2025, 5:17 AM
59 points

20 votes

0 comments, 3 min read, LW link

Interest In Conflict Is Instrumentally Convergent

Screwtape
May 9, 2025, 2:16 AM
66 points

25 votes

58 comments, 10 min read, LW link

Is ChatGPT actually fixed now?

sjadler
May 8, 2025, 11:34 PM
17 points

7 votes

0 comments, 1 min read, LW link
(stevenadler.substack.com)

Post EAG London AI x-Safety Co-working Retreat

plex
May 8, 2025, 11:00 PM
10 points

2 votes

0 comments, 1 min read, LW link

a brief critique of reduction

Vadim Golub
May 8, 2025, 10:43 PM
−17 points

7 votes

4 comments, 2 min read, LW link

Video & transcript: Challenges for Safe & Beneficial Brain-Like AGI

Steven Byrnes
May 8, 2025, 9:11 PM
26 points

10 votes

0 comments, 18 min read, LW link

Appendix: Interpretable by Design—Constraint Sets with Disjoint Limit Points

Ronak_Mehta
May 8, 2025, 9:09 PM
2 points

1 vote

0 comments, 2 min read, LW link

Interpretable by Design—Constraint Sets with Disjoint Limit Points

Ronak_Mehta
May 8, 2025, 9:08 PM
24 points

8 votes

2 comments, 9 min read, LW link
(ronakrm.github.io)

Is there a Half-Life for the Success Rates of AI Agents?

Matrice Jacobine
May 8, 2025, 8:10 PM
8 points

6 votes

0 comments, 1 min read, LW link
(www.tobyord.com)

Misalignment and Strategic Underperformance: An Analysis of Sandbagging and Exploration Hacking

May 8, 2025, 7:06 PM
77 points

17 votes

3 comments, 15 min read, LW link

Behold the Pale Child (escaping Moloch’s Mad Maze)

rogersbacon
May 8, 2025, 4:36 PM
8 points

15 votes

16 comments, 11 min read, LW link
(www.secretorum.life)

An alignment safety case sketch based on debate

May 8, 2025, 3:02 PM
57 points

16 votes

21 comments, 25 min read, LW link
(arxiv.org)

Mechanistic Interpretability Via Learning Differential Equations: AI Safety Camp Project Intermediate Report.

May 8, 2025, 2:45 PM
8 points

6 votes

0 comments, 7 min read, LW link

AI #115: The Evil Applications Division

Zvi
May 8, 2025, 1:40 PM
32 points

15 votes

3 comments, 62 min read, LW link
(thezvi.wordpress.com)

The Steganographic Potentials of Language Models

May 8, 2025, 11:23 AM
9 points

6 votes

0 comments, 1 min read, LW link

Our bet on whether the AI market will crash

May 8, 2025, 9:56 AM
23 points

13 votes

2 comments, 1 min read, LW link

Concept-anchored representation engineering for alignment

Sandy Fraser
May 8, 2025, 8:59 AM
5 points

3 votes

0 comments, 3 min read, LW link

Orthogonality Thesis in layman’s terms.

Michael (@lethal_ai)
May 8, 2025, 8:31 AM
1 point

1 vote

0 comments, 2 min read, LW link

Arkose may be closing, but you can help

Victoria Brook
May 8, 2025, 7:28 AM
8 points

4 votes

0 comments, 2 min read, LW link

Healing powers of meditation or the role of attention in humoral regulation.

Yaroslav Granowski
May 8, 2025, 6:48 AM
7 points

4 votes

0 comments, 1 min read, LW link

Orienting Toward Wizard Power

johnswentworth
May 8, 2025, 5:23 AM
566 points

344 votes

147 comments, 5 min read, LW link

Relational Alignment: Trust, Repair, and the Emotional Work of AI

Priyanka Bharadwaj
May 8, 2025, 2:44 AM
3 points

2 votes

0 comments, 3 min read, LW link

There’s more low-hanging fruit in interdisciplinary work thanks to LLMs

ChristianKl
May 7, 2025, 7:48 PM
26 points

10 votes

2 comments, 1 min read, LW link

OpenAI Claims Nonprofit Will Retain Nominal Control

Zvi
May 7, 2025, 7:40 PM
65 points

19 votes

4 comments, 11 min read, LW link
(thezvi.wordpress.com)

Social status games might have “compute weight class” in the future

Raemon
May 7, 2025, 6:56 PM
34 points

14 votes

7 comments, 2 min read, LW link

Events of Low Probability: Buridan’s Principle

Nikita Gladkov
May 7, 2025, 6:46 PM
12 points

8 votes

0 comments, 10 min read, LW link

[Question] Which journalists would you give quotes to? [one journalist per comment, agree vote for trustworthy]

Nathan Young
May 7, 2025, 6:39 PM
12 points

12 votes

26 comments, 1 min read, LW link

Please Donate to CAIP (Post 1 of 7 on AI Governance)

Mass_Driver
May 7, 2025, 5:13 PM
119 points

42 votes

20 comments, 33 min read, LW link

UK AISI’s Alignment Team: Research Agenda

May 7, 2025, 4:33 PM
113 points

55 votes

2 comments, 11 min read, LW link

Four Predictions About OpenAI’s Plans To Retain Nonprofit Control

garrison
May 7, 2025, 3:48 PM
12 points

4 votes

0 comments, 5 min read, LW link
(www.obsolete.pub)

A Disciplined Way to Avoid Wireheading

amitlevy49
May 7, 2025, 3:20 PM
18 points

10 votes

6 comments, 5 min read, LW link
(ivy0.substack.com)

Reflections on Compatibilism, Ontological Translations, and the Artificial Divine

Mahdi Complex
May 7, 2025, 12:16 PM
2 points

7 votes

1 comment, 22 min read, LW link

The Historical Parallels: Preliminary Reflection

EQ
May 7, 2025, 8:06 AM
3 points

2 votes

0 comments, 9 min read, LW link
(eqmind.substack.com)

European Links (07.05.25)

Martin Sustrik
May 7, 2025, 4:20 AM
10 points

3 votes

0 comments, 2 min read, LW link
(250bpm.substack.com)

[Question] Chess—“Elo” of random play?

Shankar Sivarajan
May 7, 2025, 2:18 AM
10 points

4 votes

16 comments, 1 min read, LW link

$500 + $500 Bounty Problem: Does An (Approximately) Deterministic Maximal Redund Always Exist?

May 6, 2025, 11:05 PM
73 points

19 votes

16 comments, 3 min read, LW link

Loss Curves

James Camacho
May 6, 2025, 10:22 PM
16 points

5 votes

3 comments, 4 min read, LW link
(github.com)

Negative Results on Group SAEs

Josh Engels
May 6, 2025, 9:49 PM
70 points

22 votes

3 comments, 8 min read, LW link

ACX Atlanta May 2025 Meetup

Steve French
May 6, 2025, 9:00 PM
2 points

1 vote

0 comments, 1 min read, LW link