Superintelligence’s goals are likely to be random

Mikhail Samin · 13 Mar 2025 22:41 UTC
6 points
6 comments · 5 min read · LW link

Should AI safety be a mass movement?

MattAlexander · 13 Mar 2025 20:36 UTC
5 points
1 comment · 4 min read · LW link

Auditing language models for hidden objectives

13 Mar 2025 19:18 UTC
145 points
15 comments · 13 min read · LW link

Reducing LLM deception at scale with self-other overlap fine-tuning

13 Mar 2025 19:09 UTC
162 points
46 comments · 6 min read · LW link

Vacuum Decay: Expert Survey Results

JessRiedel · 13 Mar 2025 18:31 UTC
96 points
26 comments · 13 min read · LW link

A Frontier AI Risk Management Framework: Bridging the Gap Between Current AI Practices and Established Risk Management

13 Mar 2025 18:29 UTC
10 points
0 comments · 1 min read · LW link
(arxiv.org)

Creating Complex Goals: A Model to Create Autonomous Agents

theraven · 13 Mar 2025 18:17 UTC
6 points
1 comment · 6 min read · LW link

Habermas Machine

NicholasKees · 13 Mar 2025 18:16 UTC
54 points
7 comments · 6 min read · LW link
(mosaic-labs.org)

The Other Alignment Problem: Maybe AI Needs Protection From Us

Peterpiper · 13 Mar 2025 18:03 UTC
−2 points
0 comments · 3 min read · LW link

AI #107: The Misplaced Hype Machine

Zvi · 13 Mar 2025 14:40 UTC
47 points
12 comments · 40 min read · LW link
(thezvi.wordpress.com)

Intelsat as a Model for International AGI Governance

13 Mar 2025 12:58 UTC
45 points
0 comments · 1 min read · LW link
(www.forethought.org)

Stacity: a Lock-In Risk Benchmark for Large Language Models

alamerton · 13 Mar 2025 12:08 UTC
4 points
0 comments · 1 min read · LW link
(huggingface.co)

The prospect of accelerated AI safety progress, including philosophical progress

Mitchell_Porter · 13 Mar 2025 10:52 UTC
12 points
0 comments · 4 min read · LW link

The “Reversal Curse”: you still aren’t antropomorphising enough.

lumpenspace · 13 Mar 2025 10:24 UTC
3 points
0 comments · 1 min read · LW link
(lumpenspace.substack.com)

Formalizing Space-Faring Civilizations Saturation concepts and metrics

Maxime Riché · 13 Mar 2025 9:40 UTC
4 points
0 comments · 8 min read · LW link

The Economics of p(doom)

Jakub Growiec · 13 Mar 2025 7:33 UTC
2 points
0 comments · 1 min read · LW link

Social Media: How to fix them before they become the biggest news platform

Sam G · 13 Mar 2025 7:28 UTC
5 points
2 comments · 3 min read · LW link

Penny Whistle in E?

jefftk · 13 Mar 2025 2:40 UTC
9 points
1 comment · 1 min read · LW link
(www.jefftk.com)

Anthropic, and taking “technical philosophy” more seriously

Raemon · 13 Mar 2025 1:48 UTC
139 points
29 comments · 11 min read · LW link

LW/ACX Social Meetup

Stefan · 12 Mar 2025 23:13 UTC
2 points
0 comments · 1 min read · LW link

I grade every NBA basketball game I watch based on enjoyability

proshowersinger · 12 Mar 2025 21:46 UTC
24 points
2 comments · 4 min read · LW link

Kairos is hiring a Head of Operations/Founding Generalist

agucova · 12 Mar 2025 20:58 UTC
6 points
0 comments · 5 min read · LW link

USAID Outlook: A Metaculus Forecasting Series

ChristianWilliams · 12 Mar 2025 20:34 UTC
9 points
0 comments · 1 min read · LW link
(www.metaculus.com)

What is instrumental convergence?

12 Mar 2025 20:28 UTC
2 points
0 comments · 2 min read · LW link
(aisafety.info)

Revising Stages-Oversight Reveals Greater Situational Awareness in LLMs

Sanyu Rajakumar · 12 Mar 2025 17:56 UTC
16 points
0 comments · 13 min read · LW link

Why Obedient AI May Be the Real Catastrophe

G~ · 12 Mar 2025 17:50 UTC
5 points
2 comments · 3 min read · LW link

Your Communication Preferences Aren’t Law

Jonathan Moregård · 12 Mar 2025 17:20 UTC
25 points
4 comments · 1 min read · LW link
(honestliving.substack.com)

Reflections on Neuralese

Alice Blair · 12 Mar 2025 16:29 UTC
42 points
3 comments · 5 min read · LW link

Field tests of semi-rationality in Brazilian military training

P. João · 12 Mar 2025 16:14 UTC
31 points
0 comments · 2 min read · LW link

Many life-saving drugs fail for lack of funding. But there’s a solution: desperate rich people

Mvolz · 12 Mar 2025 15:24 UTC
17 points
0 comments · 1 min read · LW link
(www.theguardian.com)

The Most Forbidden Technique

Zvi · 12 Mar 2025 13:20 UTC
165 points
9 comments · 17 min read · LW link
(thezvi.wordpress.com)

You don’t actually need a physical multiverse to explain anthropic fine-tuning.

Fraser · 12 Mar 2025 7:33 UTC
7 points
8 comments · 3 min read · LW link
(frvser.com)

AI Can’t Write Good Fiction

JustisMills · 12 Mar 2025 6:11 UTC
38 points
24 comments · 7 min read · LW link
(justismills.substack.com)

Existing UDTs test the limits of Bayesianism (and consistency)

Cole Wyeth · 12 Mar 2025 4:09 UTC
28 points
22 comments · 7 min read · LW link

(Anti)Aging 101

George3d6 · 12 Mar 2025 3:59 UTC
5 points
2 comments · 3 min read · LW link
(cerebralab.com)

The Grapes of Hardness

adamShimi · 11 Mar 2025 21:01 UTC
8 points
0 comments · 5 min read · LW link
(formethods.substack.com)

Don’t over-update on FrontierMath results

David Matolcsi · 11 Mar 2025 20:44 UTC
47 points
7 comments · 9 min read · LW link

Response to Scott Alexander on Imprisonment

Zvi · 11 Mar 2025 20:40 UTC
40 points
4 comments · 9 min read · LW link
(thezvi.wordpress.com)

Paths and waystations in AI safety

Joe Carlsmith · 11 Mar 2025 18:52 UTC
42 points
1 comment · 11 min read · LW link
(joecarlsmith.substack.com)

Meridian Cambridge Visiting Researcher Programme: Turn AI safety ideas into funded projects in one week!

Meridian Cambridge · 11 Mar 2025 17:46 UTC
13 points
0 comments · 2 min read · LW link

Elon Musk May Be Transitioning to Bipolar Type I

Cyborg25 · 11 Mar 2025 17:45 UTC
87 points
22 comments · 4 min read · LW link

Scaling AI Regulation: Realistically, what Can (and Can’t) Be Regulated?

Katalina Hernandez · 11 Mar 2025 16:51 UTC
3 points
1 comment · 3 min read · LW link

How Language Models Understand Nullability

11 Mar 2025 15:57 UTC
5 points
0 comments · 2 min read · LW link
(dmodel.ai)

Forethought: a new AI macrostrategy group

11 Mar 2025 15:39 UTC
20 points
0 comments · 3 min read · LW link

Preparing for the Intelligence Explosion

11 Mar 2025 15:38 UTC
79 points
17 comments · 1 min read · LW link
(www.forethought.org)

stop solving problems that have already been solved

dhruvmethi · 11 Mar 2025 15:30 UTC
10 points
3 comments · 8 min read · LW link

AI Control May Increase Existential Risk

Jan_Kulveit · 11 Mar 2025 14:30 UTC
101 points
13 comments · 1 min read · LW link

When is it Better to Train on the Alignment Proxy?

dil-leik-og · 11 Mar 2025 13:35 UTC
14 points
0 comments · 9 min read · LW link

A different take on the Musk v OpenAI preliminary injunction order

TFD · 11 Mar 2025 12:46 UTC
8 points
0 comments · 20 min read · LW link
(www.thefloatingdroid.com)

Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases

Fabien Roger · 11 Mar 2025 11:52 UTC
127 points
23 comments · 11 min read · LW link
(alignment.anthropic.com)