Theory of Change for AI Safety Camp

Linda Linsefors
Jan 22, 2025, 10:07 PM
36 points
3 comments
7 min read
LW link

On DeepSeek’s r1

Zvi
Jan 22, 2025, 7:50 PM
55 points
2 comments
35 min read
LW link
(thezvi.wordpress.com)

Detect Goodhart and shut down

Jeremy Gillen
Jan 22, 2025, 6:45 PM
70 points
21 comments
7 min read
LW link

Recursive Self-Modeling as a Plausible Mechanism for Real-time Introspection in Current Language Models

rife
Jan 22, 2025, 6:36 PM
8 points
6 comments
2 min read
LW link

The Fundamental Circularity Theorem: Why Some Mathematical Behaviours Are Inherently Unprovable

Alister Munday
Jan 22, 2025, 6:20 PM
−11 points
2 comments
4 min read
LW link

The Dead Cradle Theory: Why Earth May Not Survive Humanity’s Expansion into Space

Nicholas Andresen
Jan 22, 2025, 5:43 PM
10 points
0 comments
11 min read
LW link

The Functionalist Case for Machine Consciousness: Evidence from Large Language Models

James Diacoumis
Jan 22, 2025, 5:43 PM
14 points
24 comments
9 min read
LW link

Mechanisms too simple for humans to design

Malmesbury
Jan 22, 2025, 4:54 PM
207 points
45 comments
15 min read
LW link

Training Data Attribution: Examining Its Adoption & Use Cases

Jan 22, 2025, 3:41 PM
10 points
0 comments
3 min read
LW link
(www.convergenceanalysis.org)

Training Data Attribution (TDA): Examining Its Adoption & Use Cases

Jan 22, 2025, 3:40 PM
16 points
0 comments
3 min read
LW link
(www.convergenceanalysis.org)

The Quantum Mars Teleporter: An Empirical Test Of Personal Identity Theories

avturchin
Jan 22, 2025, 11:48 AM
10 points
18 comments
2 min read
LW link

Bayesian Reasoning on Maps

Sjlver
Jan 22, 2025, 10:45 AM
4 points
0 comments
4 min read
LW link
(blog.purpureus.net)

Against blanket arguments against interpretability

Dmitry Vaintrob
Jan 22, 2025, 9:46 AM
50 points
4 comments
7 min read
LW link

The real political spectrum

Hzn
Jan 22, 2025, 8:55 AM
−14 points
0 comments
1 min read
LW link

Evolution and the Low Road to Nash

Jan 22, 2025, 7:06 AM
43 points
2 comments
10 min read
LW link

The Human Alignment Problem for AIs

rife
Jan 22, 2025, 4:06 AM
10 points
5 comments
3 min read
LW link

When does capability elicitation bound risk?

joshc
Jan 22, 2025, 3:42 AM
25 points
0 comments
17 min read
LW link
(redwoodresearch.substack.com)

[Question] Popular materials about environmental goals/agent foundations? People wanting to discuss such topics?

Q Home
Jan 22, 2025, 3:30 AM
5 points
0 comments
1 min read
LW link

Kitchen Air Purifier Comparison

jefftk
Jan 22, 2025, 3:20 AM
35 points
2 comments
3 min read
LW link
(www.jefftk.com)

November-December 2024 Progress in Guaranteed Safe AI

Quinn
Jan 22, 2025, 1:20 AM
17 points
0 comments
4 min read
LW link
(gsai.substack.com)

Quotes from the Stargate press conference

Nikola Jurkovic
Jan 22, 2025, 12:50 AM
149 points
7 comments
1 min read
LW link
(www.c-span.org)

Tell me about yourself: LLMs are aware of their learned behaviors

Jan 22, 2025, 12:47 AM
130 points
5 comments
6 min read
LW link

King Lear—A Reinterpretation

Kailuo Wang
Jan 21, 2025, 11:54 PM
2 points
1 comment
14 min read
LW link

Using the probabilistic method to bound the performance of toy transformers

Alex Gibson
Jan 21, 2025, 11:01 PM
1 point
0 comments
3 min read
LW link

Training on Documents About Reward Hacking Induces Reward Hacking

Jan 21, 2025, 9:32 PM
131 points
15 comments
2 min read
LW link
(alignment.anthropic.com)

Veo-2 Can Produce Realistic Ads

Logan Riggs
Jan 21, 2025, 7:13 PM
14 points
0 comments
1 min read
LW link

Computational Limits on Efficiency

vibhumeh
Jan 21, 2025, 6:29 PM
8 points
1 comment
5 min read
LW link

Democratizing AI Governance: Balancing Expertise and Public Participation

Lucile Ter-Minassian
Jan 21, 2025, 6:29 PM
1 point
0 comments
15 min read
LW link

Hitler was not a monster

halgir
Jan 21, 2025, 6:21 PM
−11 points
5 comments
1 min read
LW link

Natural Intelligence is Overhyped

Collisteru
Jan 21, 2025, 6:09 PM
15 points
0 comments
7 min read
LW link

14+ AI Safety Advisors You Can Speak to – New AISafety.com Resource

Jan 21, 2025, 5:34 PM
24 points
0 comments
1 min read
LW link

[Linkpost] Why AI Safety Camp struggles with fundraising (FBB #2)

gergogaspar
Jan 21, 2025, 5:27 PM
3 points
0 comments
1 min read
LW link

The Manhattan Trap: Why a Race to Artificial Superintelligence is Self-Defeating

Jan 21, 2025, 4:57 PM
87 points
11 comments
LW link
(www.convergenceanalysis.org)

Links and short notes, 2025-01-20

jasoncrawford
Jan 21, 2025, 4:10 PM
8 points
0 comments
1 min read
LW link
(newsletter.rootsofprogress.org)

The Case Against AI Control Research

johnswentworth
Jan 21, 2025, 4:03 PM
353 points
81 comments
6 min read
LW link

Will AI Resilience protect Developing Nations?

ejk64
Jan 21, 2025, 3:31 PM
4 points
0 comments
8 min read
LW link

Sleep, Diet, Exercise and GLP-1 Drugs

Zvi
Jan 21, 2025, 12:20 PM
41 points
5 comments
18 min read
LW link
(thezvi.wordpress.com)

We don’t want to post again “This might be the last AI Safety Camp”

Jan 21, 2025, 12:03 PM
36 points
17 comments
1 min read
LW link
(manifund.org)

On Responsibility

silentbob
Jan 21, 2025, 10:47 AM
9 points
2 comments
6 min read
LW link

The ‘anti woke’ are positioned to win but can they capitalize?

Hzn
Jan 21, 2025, 9:52 AM
−8 points
0 comments
2 min read
LW link

Almost all growth is exponential growth

lemonhope
Jan 21, 2025, 7:16 AM
19 points
7 comments
1 min read
LW link

Arbitrage Drains Worse Markets to Feeds Better Ones

Cedar
Jan 21, 2025, 3:44 AM
25 points
1 comment
1 min read
LW link

On Contact, Part 1

james.lucassen
Jan 21, 2025, 3:10 AM
14 points
1 comment
11 min read
LW link

Retrospective: 12 [sic] Months Since MIRI

james.lucassen
Jan 21, 2025, 2:52 AM
67 points
0 comments
9 min read
LW link

Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness

Matthew Khoriaty
Jan 21, 2025, 2:02 AM
4 points
0 comments
3 min read
LW link

Why We Need More Shovel-Ready AI Notkilleveryoneism Megaproject Proposals

Peter Berggren
Jan 20, 2025, 10:38 PM
36 points
1 comment
6 min read
LW link

Tips and Code for Empirical Research Workflows

Jan 20, 2025, 10:31 PM
94 points
14 comments
20 min read
LW link

Lecture Series on Tiling Agents #2

abramdemski
Jan 20, 2025, 9:02 PM
16 points
0 comments
1 min read
LW link

Announcement: Learning Theory Online Course

Jan 20, 2025, 7:55 PM
63 points
33 comments
4 min read
LW link

The Hidden Status Game in Hospital Slacking

EpistemicExplorer
Jan 20, 2025, 6:35 PM
2 points
4 comments
3 min read
LW link