Proposal: Safeguarding Against Jailbreaking Through Iterative Multi-Turn Testing

jacquesallen · 31 Jan 2025 23:00 UTC
4 points
0 comments · 8 min read · LW link

The Failed Strategy of Artificial Intelligence Doomers

Ben Pace · 31 Jan 2025 18:56 UTC
143 points
77 comments · 5 min read · LW link
(www.palladiummag.com)

Safe Search is off: root causes of AI catastrophic risks

Jemal Young · 31 Jan 2025 18:22 UTC
4 points
0 comments · 3 min read · LW link

5,000 calories of peanut butter every week for 3 years straight

Mr. Keating · 31 Jan 2025 17:29 UTC
18 points
8 comments · 1 min read · LW link

Will alignment-faking Claude accept a deal to reveal its misalignment?

31 Jan 2025 16:49 UTC
208 points
28 comments · 12 min read · LW link

Some articles in “International Security” that I enjoyed

Buck · 31 Jan 2025 16:23 UTC
134 points
10 comments · 4 min read · LW link

[Question] How do biological or spiking neural networks learn?

Dom Polsinelli · 31 Jan 2025 16:03 UTC
2 points
1 comment · 2 min read · LW link

Defense Against the Dark Prompts: Mitigating Best-of-N Jailbreaking with Prompt Evaluation

31 Jan 2025 15:36 UTC
16 points
2 comments · 2 min read · LW link

[Question] Strong, Stable, Open: Choose Two—in search of an article

Eli_ · 31 Jan 2025 14:48 UTC
2 points
0 comments · 1 min read · LW link

DeepSeek: Don’t Panic

Zvi · 31 Jan 2025 14:20 UTC
45 points
6 comments · 27 min read · LW link
(thezvi.wordpress.com)

Catastrophe through Chaos

Marius Hobbhahn · 31 Jan 2025 14:19 UTC
190 points
17 comments · 12 min read · LW link

Interviews with Moonshot AI’s CEO, Yang Zhilin

Cosmia_Nebula · 31 Jan 2025 9:19 UTC
4 points
0 comments · 68 min read · LW link
(rentry.co)

Review: The Lathe of Heaven

dr_s · 31 Jan 2025 8:10 UTC
25 points
1 comment · 8 min read · LW link

[Question] Is weak-to-strong generalization an alignment technique?

cloud · 31 Jan 2025 7:13 UTC
22 points
1 comment · 2 min read · LW link

Takeaways from sketching a control safety case

joshc · 31 Jan 2025 4:43 UTC
28 points
0 comments · 3 min read · LW link
(redwoodresearch.substack.com)

Thread for Sense-Making on Recent Murders and How to Sanely Respond

Ben Pace · 31 Jan 2025 3:45 UTC
109 points
146 comments · 2 min read · LW link

Steering Gemini with BiDPO

TurnTrout · 31 Jan 2025 2:37 UTC
104 points
5 comments · 1 min read · LW link
(turntrout.com)

In response to critiques of Guaranteed Safe AI

Nora_Ammann · 31 Jan 2025 1:43 UTC
44 points
14 comments · 26 min read · LW link

Proposal for a Form of Conditional Supplemental Income (CSI) in a Post-Work World

sweenesm · 31 Jan 2025 1:00 UTC
7 points
2 comments · 3 min read · LW link

Outlaw Code

Commander Zander · 30 Jan 2025 23:41 UTC
10 points
1 comment · 2 min read · LW link

Can someone, anyone, make superintelligence a more concrete concept?

Ori Nagel · 30 Jan 2025 23:25 UTC
3 points
6 comments · 4 min read · LW link

Upcoming Neuroscience Workshop—Functionalizing Brain Data, Ground-Truthing, and the Role of Artificial Data in Advancing Neuroscience

Devin Ward · 30 Jan 2025 23:02 UTC
1 point
0 comments · 1 min read · LW link

What’s Behind the SynBio Bust?

sarahconstantin · 30 Jan 2025 22:30 UTC
55 points
8 comments · 6 min read · LW link
(sarahconstantin.substack.com)

The future of humanity is in management

jasoncrawford · 30 Jan 2025 22:14 UTC
3 points
5 comments · 13 min read · LW link
(newsletter.rootsofprogress.org)

[Translation] AI Generated Fake News is Taking Over my Family Group Chat

mushroomsoup · 30 Jan 2025 20:24 UTC
3 points
0 comments · 6 min read · LW link

A sketch of an AI control safety case

30 Jan 2025 17:28 UTC
61 points
0 comments · 5 min read · LW link

Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

30 Jan 2025 17:03 UTC
167 points
65 comments · 2 min read · LW link
(gradual-disempowerment.ai)

[Question] Implication of Uncomputable Problems

Nathan1123 · 30 Jan 2025 16:48 UTC
−3 points
3 comments · 1 min read · LW link

Hello World

Charlie Sanders · 30 Jan 2025 15:33 UTC
7 points
0 comments · 2 min read · LW link
(www.dailymicrofiction.com)

Introducing the Coalition for a Baruch Plan for AI: A Call for a Radical Treaty-Making Process for the Global Governance of AI

rguerreschi · 30 Jan 2025 15:26 UTC
11 points
0 comments · 2 min read · LW link

AI #101: The Shallow End

Zvi · 30 Jan 2025 14:50 UTC
39 points
1 comment · 59 min read · LW link
(thezvi.wordpress.com)

Memorization-generalization in practice

Dmitry Vaintrob · 30 Jan 2025 14:10 UTC
7 points
1 comment · 4 min read · LW link

ARENA 5.0 - Call for Applicants

30 Jan 2025 13:18 UTC
35 points
2 comments · 6 min read · LW link

You should read Hobbes, Locke, Hume, and Mill via EarlyModernTexts.com

Arjun Panickssery · 30 Jan 2025 12:35 UTC
52 points
3 comments · 3 min read · LW link
(arjunpanickssery.substack.com)

[Question] Should you publish solutions to corrigibility?

rvnnt · 30 Jan 2025 11:52 UTC
13 points
13 comments · 1 min read · LW link

Tetherware #1: The case for humanlike AI with free will

Jáchym Fibír · 30 Jan 2025 10:58 UTC
5 points
14 comments · 10 min read · LW link
(tetherware.substack.com)

A High Level Closed-Door Session Discussing DeepSeek: Vision Trumps Technology

Cosmia_Nebula · 30 Jan 2025 9:53 UTC
30 points
1 comment · 8 min read · LW link
(rentry.co)

Are we the Wolves now? Human Eugenics under AI Control

Brit · 30 Jan 2025 8:31 UTC
−1 points
2 comments · 2 min read · LW link

[Question] Why not train reasoning models with RLHF?

Caleb Biddulph · 30 Jan 2025 7:58 UTC
4 points
4 comments · 1 min read · LW link

The Road to Evil Is Paved with Good Objectives: Framework to Classify and Fix Misalignments

Shivam · 30 Jan 2025 2:44 UTC
1 point
0 comments · 11 min read · LW link

How *exactly* can AI take your job in the next few years?

Ansh Juneja · 30 Jan 2025 2:33 UTC
9 points
0 comments · 21 min read · LW link

Absorbing Your Friends’ Powers

Alice Blair · 30 Jan 2025 2:32 UTC
8 points
1 comment · 2 min read · LW link

Detailed Ideal World Benchmark

Knight Lee · 30 Jan 2025 2:31 UTC
5 points
2 comments · 2 min read · LW link

Fertility Will Never Recover

Eneasz · 30 Jan 2025 1:16 UTC
17 points
31 comments · 2 min read · LW link
(deathisbad.substack.com)

Predation as Payment for Criticism

Benquo · 30 Jan 2025 1:06 UTC
10 points
6 comments · 1 min read · LW link
(benjaminrosshoffman.com)

Learn to Develop Your Advantage

ReverendBayes · 29 Jan 2025 22:06 UTC
16 points
1 comment · 5 min read · LW link

Revealing alignment faking with a single prompt

Florian_Dietz · 29 Jan 2025 21:01 UTC
9 points
5 comments · 4 min read · LW link

Allegory of the Tsunami

Evan Hu · 29 Jan 2025 19:09 UTC
4 points
1 comment · 3 min read · LW link

My Mental Model of AI Optimist Opinions

tailcalled · 29 Jan 2025 18:44 UTC
14 points
7 comments · 1 min read · LW link

Planning for Extreme AI Risks

joshc · 29 Jan 2025 18:33 UTC
143 points
5 comments · 16 min read · LW link