Mid-Generation Self-Correction: A Simple Tool for Safer AI

MrThink
Dec 19, 2024, 11:41 PM
13 points
0 comments
1 min read
LW link

Apply now to SPAR!

agucova
Dec 19, 2024, 10:29 PM
11 points
0 comments
LW link

How to replicate and extend our alignment faking demo

Fabien Roger
Dec 19, 2024, 9:44 PM
114 points
5 comments
2 min read
LW link
(alignment.anthropic.com)

The Genesis Project

aproteinengine
Dec 19, 2024, 9:26 PM
15 points
0 comments
1 min read
LW link
(genesis-embodied-ai.github.io)

Measuring whether AIs can statelessly strategize to subvert security measures

Dec 19, 2024, 9:25 PM
62 points
0 comments
11 min read
LW link

Claude’s Constitutional Consequentialism?

1a3orn
Dec 19, 2024, 7:53 PM
43 points
6 comments
6 min read
LW link

A short critique of Omohundro’s “Basic AI Drives”

Soumyadeep Bose
Dec 19, 2024, 7:19 PM
6 points
0 comments
4 min read
LW link

When Is Insurance Worth It?

kqr
Dec 19, 2024, 7:07 PM
175 points
71 comments
4 min read
LW link
(entropicthoughts.com)

Launching Third Opinion: Anonymous Expert Consultation for AI Professionals

karl
Dec 19, 2024, 7:06 PM
3 points
0 comments
5 min read
LW link

Using LLM Search to Augment (Mathematics) Research

kaleb
Dec 19, 2024, 6:59 PM
5 points
0 comments
6 min read
LW link

A progress policy agenda

jasoncrawford
Dec 19, 2024, 6:42 PM
31 points
1 comment
5 min read
LW link
(newsletter.rootsofprogress.org)

building character isn’t about willpower or sacrifice

dhruvmethi
Dec 19, 2024, 6:17 PM
1 point
0 comments
4 min read
LW link

AISN #45: Center for AI Safety 2024 Year in Review

Dec 19, 2024, 6:15 PM
13 points
0 comments
4 min read
LW link
(newsletter.safe.ai)

Learning Multi-Level Features with Matryoshka SAEs

Dec 19, 2024, 3:59 PM
42 points
6 comments
11 min read
LW link

Simple Steganographic Computation Eval—gpt-4o and gemini-exp-1206 can’t solve it yet

Filip Sondej
Dec 19, 2024, 3:47 PM
13 points
2 comments
3 min read
LW link

AI #95: o1 Joins the API

Zvi
Dec 19, 2024, 3:10 PM
58 points
1 comment
41 min read
LW link
(thezvi.wordpress.com)

Executive Director for AIS Brussels—Expression of interest

Dec 19, 2024, 9:19 AM
1 point
0 comments
4 min read
LW link

Executive Director for AIS France—Expression of interest

Dec 19, 2024, 8:14 AM
9 points
0 comments
3 min read
LW link

Inescapably Value-Laden Experience—a Catchy Term I Made Up to Make Morality Rationalisable

James Stephen Brown
Dec 19, 2024, 4:45 AM
5 points
0 comments
2 min read
LW link
(nonzerosum.games)

I’m Writing a Book About Liberalism

Yoav Ravid
Dec 19, 2024, 12:13 AM
6 points
6 comments
2 min read
LW link

A Solution for AGI/ASI Safety

Weibing Wang
Dec 18, 2024, 7:44 PM
50 points
29 comments
1 min read
LW link

Takes on “Alignment Faking in Large Language Models”

Joe Carlsmith
Dec 18, 2024, 6:22 PM
105 points
7 comments
62 min read
LW link

A Matter of Taste

Zvi
Dec 18, 2024, 5:50 PM
36 points
4 comments
11 min read
LW link
(thezvi.wordpress.com)

Are we a different person each time? A simple argument for the impermanence of our identity

l4mp
Dec 18, 2024, 5:21 PM
−4 points
5 comments
1 min read
LW link

Alignment Faking in Large Language Models

Dec 18, 2024, 5:19 PM
483 points
75 comments
10 min read
LW link

Can o1-preview find major mistakes amongst 59 NeurIPS ’24 MLSB papers?

Abhishaike Mahajan
Dec 18, 2024, 2:21 PM
19 points
0 comments
6 min read
LW link
(www.owlposting.com)

Walking Sue

Matthew McRedmond
Dec 18, 2024, 1:19 PM
2 points
5 comments
8 min read
LW link

What conclusions can be drawn from a single observation about wealth in tennis?

Trevor Cappallo
Dec 18, 2024, 9:55 AM
8 points
3 comments
2 min read
LW link

Don’t Associate AI Safety With Activism

Eneasz
Dec 18, 2024, 8:01 AM
17 points
15 comments
1 min read
LW link
(deathisbad.substack.com)

[Question] How should I optimize my decision making model for ‘ideas’?

CstineSublime
Dec 18, 2024, 4:09 AM
3 points
0 comments
4 min read
LW link

Preppers Are Too Negative on Objects

jefftk
Dec 18, 2024, 2:30 AM
44 points
2 comments
1 min read
LW link
(www.jefftk.com)

Review: Breaking Free with Dr. Stone

TurnTrout
Dec 18, 2024, 1:26 AM
47 points
5 comments
1 min read
LW link
(turntrout.com)

Being Present is Not a Skill

Chipmonk
Dec 18, 2024, 1:11 AM
23 points
8 comments
1 min read
LW link
(chrislakin.blog)

Ablations for “Frontier Models are Capable of In-context Scheming”

Dec 17, 2024, 11:58 PM
115 points
1 comment
2 min read
LW link

Careless thinking: A theory of bad thinking

Nathan Young
Dec 17, 2024, 6:23 PM
49 points
17 comments
9 min read
LW link
(nathanpmyoung.substack.com)

The Second Gemini

Zvi
Dec 17, 2024, 3:50 PM
23 points
0 comments
11 min read
LW link
(thezvi.wordpress.com)

AIS Hungary is hiring a part-time Technical Lead! (Deadline: Dec 31st)

gergogaspar
Dec 17, 2024, 2:12 PM
1 point
0 comments
2 min read
LW link

Everything you care about is in the map

Tahp
Dec 17, 2024, 2:05 PM
17 points
27 comments
3 min read
LW link

Reality is Fractal-Shaped

silentbob
Dec 17, 2024, 1:52 PM
18 points
1 comment
8 min read
LW link

Trying to translate when people talk past each other

Kaj_Sotala
Dec 17, 2024, 9:40 AM
41 points
12 comments
6 min read
LW link
(kajsotala.fi)

What is “wireheading”?

Dec 17, 2024, 7:49 AM
10 points
0 comments
1 min read
LW link
(aisafety.info)

1 What If We Rebuild Motivation with the Fermi ESTIMATion?

P. João
Dec 17, 2024, 7:46 AM
6 points
0 comments
3 min read
LW link

Where do you put your ideas?

CstineSublime
Dec 17, 2024, 7:26 AM
9 points
20 comments
1 min read
LW link

Elevating Air Purifiers

jefftk
Dec 17, 2024, 1:40 AM
25 points
0 comments
1 min read
LW link
(www.jefftk.com)

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

Dec 16, 2024, 10:42 PM
49 points
1 comment
2 min read
LW link
(arxiv.org)

A practical guide to tiling the universe with hedonium

Vittu Perkele
Dec 16, 2024, 9:25 PM
−8 points
1 comment
1 min read
LW link
(perkeleperusing.substack.com)

AI Safety Seed Funding Network—Join as a Donor or Investor

Alexandra Bos
Dec 16, 2024, 7:30 PM
30 points
0 comments
LW link

Is this a better way to do matchmaking?

Chipmonk
Dec 16, 2024, 7:06 PM
9 points
4 comments
1 min read
LW link

I read every major AI lab’s safety plan so you don’t have to

sarahhw
Dec 16, 2024, 6:51 PM
20 points
0 comments
12 min read
LW link
(longerramblings.substack.com)

Grokking revisited: reverse engineering grokking modulo addition in LSTM

Dec 16, 2024, 6:48 PM
4 points
0 comments
6 min read
LW link