Best-of-N Jailbreaking

Dec 14, 2024, 4:58 AM
78 points
5 comments · 2 min read · LW link
(arxiv.org)

The 2023 LessWrong Review: The Basic Ask

Raemon · Dec 4, 2024, 7:52 PM
77 points
25 comments · 9 min read · LW link

2025 Prediction Thread

habryka · Dec 30, 2024, 1:50 AM
77 points
21 comments · 1 min read · LW link

When AI 10x’s AI R&D, What Do We Do?

Logan Riggs · Dec 21, 2024, 11:56 PM
72 points
16 comments · 4 min read · LW link

Intricacies of Feature Geometry in Large Language Models

Dec 7, 2024, 6:10 PM
71 points
0 comments · 12 min read · LW link

An Illustrated Summary of “Robust Agents Learn Causal World Model”

Dalcy · Dec 14, 2024, 3:02 PM
67 points
2 comments · 10 min read · LW link

Learn to write well BEFORE you have something worth saying

eukaryote · Dec 29, 2024, 11:42 PM
67 points
18 comments · 3 min read · LW link
(eukaryotewritesblog.com)

Drexler’s Nanotech Software

PeterMcCluskey · Dec 2, 2024, 4:55 AM
67 points
9 comments · 4 min read · LW link
(bayesianinvestor.com)

Anthropic leadership conversation

Zach Stein-Perlman · Dec 20, 2024, 10:00 PM
67 points
17 comments · 6 min read · LW link
(www.youtube.com)

Checking in on Scott’s composition image bet with imagen 3

Dave Orr · Dec 22, 2024, 7:04 PM
65 points
0 comments · 1 min read · LW link

Retrospective: PIBBSS Fellowship 2024

Dec 20, 2024, 3:55 PM
64 points
1 comment · 4 min read · LW link

A Qualitative Case for LTFF: Filling Critical Ecosystem Gaps

Linch · Dec 3, 2024, 9:57 PM
64 points
2 comments · LW link

Zen and The Art of Semiconductor Manufacturing

Recurrented · Dec 9, 2024, 5:19 PM
64 points
2 comments · 9 min read · LW link
(futuring.substack.com)

RL, but don’t do anything I wouldn’t do

Gunnar_Zarncke · Dec 7, 2024, 10:54 PM
63 points
5 comments · 1 min read · LW link
(arxiv.org)

o3, Oh My

Zvi · Dec 30, 2024, 2:10 PM
63 points
17 comments · 36 min read · LW link
(thezvi.wordpress.com)

Measuring whether AIs can statelessly strategize to subvert security measures

Dec 19, 2024, 9:25 PM
62 points
0 comments · 11 min read · LW link

ReSolsticed vol I: “We’re Not Going Quietly”

Raemon · Dec 26, 2024, 5:52 PM
61 points
4 comments · 19 min read · LW link

Cognitive Work and AI Safety: A Thermodynamic Perspective

Daniel Murfet · Dec 8, 2024, 9:42 PM
61 points
9 comments · 4 min read · LW link

A case for donating to AI risk reduction (including if you work in AI)

tlevin · Dec 2, 2024, 7:05 PM
61 points
2 comments · LW link

Ideas for benchmarking LLM creativity

gwern · Dec 16, 2024, 5:18 AM
60 points
11 comments · 1 min read · LW link
(gwern.net)

Funding Case: AI Safety Camp 11

Dec 23, 2024, 8:51 AM
60 points
4 comments · 6 min read · LW link
(manifund.org)

o1 Turns Pro

Zvi · Dec 10, 2024, 5:00 PM
59 points
3 comments · 14 min read · LW link
(thezvi.wordpress.com)

AI #95: o1 Joins the API

Zvi · Dec 19, 2024, 3:10 PM
58 points
1 comment · 41 min read · LW link
(thezvi.wordpress.com)

AI #96: o3 But Not Yet For Thee

Zvi · Dec 26, 2024, 8:30 PM
58 points
8 comments · 36 min read · LW link
(thezvi.wordpress.com)

AI Assistants Should Have a Direct Line to Their Developers

Jan_Kulveit · Dec 28, 2024, 5:01 PM
57 points
6 comments · 2 min read · LW link

Luck Based Medicine: No Good Very Bad Winter Cured My Hypothyroidism

Elizabeth · Dec 8, 2024, 8:10 PM
55 points
3 comments · 2 min read · LW link
(acesounderglass.com)

Vegans need to eat just enough Meat—emperically evaluate the minimum ammount of meat that maximizes utility

Johannes C. Mayer · Dec 22, 2024, 10:08 PM
55 points
35 comments · 3 min read · LW link

[Question] What Have Been Your Most Valuable Casual Conversations At Conferences?

johnswentworth · Dec 25, 2024, 5:49 AM
54 points
21 comments · 1 min read · LW link

I Finally Worked Through Bayes’ Theorem (Personal Achievement)

keltan · Dec 5, 2024, 2:04 AM
53 points
7 comments · 9 min read · LW link

A toy evaluation of inference code tampering

Fabien Roger · Dec 9, 2024, 5:43 PM
52 points
0 comments · 9 min read · LW link
(alignment.anthropic.com)

Just one more exposure bro

Chipmonk · Dec 12, 2024, 9:37 PM
52 points
6 comments · 2 min read · LW link
(chrislakin.blog)

Correct my H5N1 research

Elizabeth · Dec 9, 2024, 7:07 PM
52 points
24 comments · 2 min read · LW link

Considerations on orca intelligence

Towards_Keeperhood · Dec 29, 2024, 2:35 PM
51 points
14 comments · 9 min read · LW link

A Solution for AGI/ASI Safety

Weibing Wang · Dec 18, 2024, 7:44 PM
50 points
29 comments · 1 min read · LW link

D&D.Sci Dungeonbuilding: the Dungeon Tournament

aphyer · Dec 14, 2024, 4:30 AM
49 points
16 comments · 3 min read · LW link

AI #94: Not Now, Google

Zvi · Dec 12, 2024, 3:40 PM
49 points
3 comments · 64 min read · LW link
(thezvi.wordpress.com)

A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

Dec 16, 2024, 10:42 PM
49 points
1 comment · 2 min read · LW link
(arxiv.org)

Careless thinking: A theory of bad thinking

Nathan Young · Dec 17, 2024, 6:23 PM
49 points
17 comments · 9 min read · LW link
(nathanpmyoung.substack.com)

Analysis of Global AI Governance Strategies

Dec 4, 2024, 10:45 AM
49 points
10 comments · 36 min read · LW link

Greedy-Advantage-Aware RLHF

sej2020 · Dec 27, 2024, 7:47 PM
48 points
15 comments · 13 min read · LW link

Cognitive Biases Contributing to AI X-risk — a deleted excerpt from my 2018 ARCHES draft

Andrew_Critch · Dec 3, 2024, 9:29 AM
48 points
2 comments · 5 min read · LW link

Book a Time to Chat about Interp Research

Logan Riggs · Dec 3, 2024, 5:27 PM
47 points
3 comments · 1 min read · LW link

Review: Breaking Free with Dr. Stone

TurnTrout · Dec 18, 2024, 1:26 AM
47 points
5 comments · 1 min read · LW link
(turntrout.com)

Deep Learning is cheap Solomonoff induction?

Dec 7, 2024, 11:00 AM
45 points
1 comment · 17 min read · LW link

Detection of Asymptomatically Spreading Pathogens

jefftk · Dec 5, 2024, 6:20 PM
45 points
8 comments · 7 min read · LW link
(www.jefftk.com)

The Deep Lore of LightHaven, with Oliver Habryka (TBC episode 228)

Dec 24, 2024, 10:45 PM
45 points
4 comments · 91 min read · LW link
(thebayesianconspiracy.substack.com)

Conjecture: A Roadmap for Cognitive Software and A Humanist Future of AI

Dec 2, 2024, 1:28 PM
44 points
10 comments · 29 min read · LW link
(www.conjecture.dev)

Preppers Are Too Negative on Objects

jefftk · Dec 18, 2024, 2:30 AM
44 points
2 comments · 1 min read · LW link
(www.jefftk.com)

Began a pay-on-results coaching experiment, made $40,300 since July

Chipmonk · Dec 29, 2024, 9:12 PM
43 points
15 comments · 1 min read · LW link
(chrislakin.blog)

Claude’s Constitutional Consequentialism?

1a3orn · Dec 19, 2024, 7:53 PM
43 points
6 comments · 6 min read · LW link