AI Safety at the Frontier: Paper Highlights, December ’24

gasteigerjo · 11 Jan 2025 22:54 UTC
7 points
2 comments · 7 min read · LW link
(aisafetyfrontier.substack.com)

Fluoridation: The RCT We Still Haven’t Run (But Should)

ChristianKl · 11 Jan 2025 21:02 UTC
22 points
5 comments · 2 min read · LW link

In Defense of a Butlerian Jihad

sloonz · 11 Jan 2025 19:30 UTC
10 points
25 comments · 9 min read · LW link

Near term discussions need something smaller and more concrete than AGI

ryan_b · 11 Jan 2025 18:24 UTC
13 points
0 comments · 6 min read · LW link

A proposal for iterated interpretability with known-interpretable narrow AIs

Peter Berggren · 11 Jan 2025 14:43 UTC
6 points
0 comments · 2 min read · LW link

Have frontier AI systems surpassed the self-replicating red line?

nsage · 11 Jan 2025 5:31 UTC
4 points
0 comments · 4 min read · LW link

We need a universal definition of ‘agency’ and related words

CstineSublime · 11 Jan 2025 3:22 UTC
18 points
1 comment · 5 min read · LW link

[Question] AI for medical care for hard-to-treat diseases?

CronoDAS · 10 Jan 2025 23:55 UTC
12 points
1 comment · 1 min read · LW link

Beliefs and state of mind into 2025

RussellThor · 10 Jan 2025 22:07 UTC
18 points
10 comments · 7 min read · LW link

Recommendations for Technical AI Safety Research Directions

Sam Marks · 10 Jan 2025 19:34 UTC
64 points
1 comment · 17 min read · LW link
(alignment.anthropic.com)

Is AI Alignment Enough?

Aram Panasenco · 10 Jan 2025 18:57 UTC
30 points
6 comments · 6 min read · LW link

[Question] What are some scenarios where an aligned AGI actually helps humanity, but many/most people don’t like it?

RomanS · 10 Jan 2025 18:13 UTC
14 points
6 comments · 3 min read · LW link

Human takeover might be worse than AI takeover

Tom Davidson · 10 Jan 2025 16:53 UTC
147 points
56 comments · 8 min read · LW link
(forethoughtnewsletter.substack.com)

The Alignment Mapping Program: Forging Independent Thinkers in AI Safety—A Pilot Retrospective

10 Jan 2025 16:22 UTC
31 points
0 comments · 4 min read · LW link

On Dwarkesh Patel’s 4th Podcast With Tyler Cowen

Zvi · 10 Jan 2025 13:50 UTC
44 points
7 comments · 27 min read · LW link
(thezvi.wordpress.com)

Scaling Sparse Feature Circuit Finding to Gemma 9B

10 Jan 2025 11:08 UTC
86 points
11 comments · 17 min read · LW link

[Question] Is Musk still net-positive for humanity?

mikbp · 10 Jan 2025 9:34 UTC
−5 points
18 comments · 1 min read · LW link

Activation Magnitudes Matter On Their Own: Insights from Language Model Distributional Analysis

Matt Levinson · 10 Jan 2025 6:53 UTC
4 points
0 comments · 4 min read · LW link

Dmitry’s Koan

Dmitry Vaintrob · 10 Jan 2025 4:27 UTC
44 points
8 comments · 22 min read · LW link

NAO Updates, January 2025

jefftk · 10 Jan 2025 3:37 UTC
23 points
0 comments · 3 min read · LW link
(naobservatory.org)

MATS mentor selection

10 Jan 2025 3:12 UTC
44 points
12 comments · 6 min read · LW link

AI Forecasting Benchmark: Congratulations to Q4 Winners + Q1 Practice Questions Open

ChristianWilliams · 10 Jan 2025 3:02 UTC
7 points
0 comments · 2 min read · LW link
(www.metaculus.com)

[Question] How do you decide to phrase predictions you ask of others? (and how do you make your own?)

CstineSublime · 10 Jan 2025 2:44 UTC
7 points
1 comment · 2 min read · LW link

You are too dumb to understand insurance

Lorec · 9 Jan 2025 23:33 UTC
1 point
12 comments · 7 min read · LW link

Is AI Hitting a Wall or Moving Faster Than Ever?

garrison · 9 Jan 2025 22:18 UTC
12 points
5 comments · 5 min read · LW link
(garrisonlovely.substack.com)

Expevolu, Part II: Buying land to create countries

Fernando · 9 Jan 2025 21:11 UTC
4 points
0 comments · 20 min read · LW link
(expevolu.substack.com)

Last week of the Discussion Phase

Raemon · 9 Jan 2025 19:26 UTC
35 points
0 comments · 3 min read · LW link

Discursive Warfare and Faction Formation

Benquo · 9 Jan 2025 16:47 UTC
52 points
3 comments · 3 min read · LW link
(benjaminrosshoffman.com)

Can we rescue Effective Altruism?

Elizabeth · 9 Jan 2025 16:40 UTC
20 points
0 comments · 1 min read · LW link
(acesounderglass.com)

AI #98: World Ends With Six Word Story

Zvi · 9 Jan 2025 16:30 UTC
36 points
2 comments · 38 min read · LW link
(thezvi.wordpress.com)

Many Worlds and the Problems of Evil

Jonah Wilberg · 9 Jan 2025 16:10 UTC
−3 points
2 comments · 9 min read · LW link

PIBBSS Fellowship 2025: Bounties and Cooperative AI Track Announcement

9 Jan 2025 14:23 UTC
20 points
0 comments · 1 min read · LW link

The “Everyone Can’t Be Wrong” Prior causes AI risk denial but helped prehistoric people

Knight Lee · 9 Jan 2025 5:54 UTC
1 point
0 comments · 2 min read · LW link

Governance Course—Week 1 Reflections

Alice Blair · 9 Jan 2025 4:48 UTC
4 points
1 comment · 5 min read · LW link

Thoughts on the In-Context Scheming AI Experiment

ExCeph · 9 Jan 2025 2:19 UTC
2 points
0 comments · 4 min read · LW link

A Systematic Approach to AI Risk Analysis Through Cognitive Capabilities

Tom DAVID · 9 Jan 2025 0:18 UTC
2 points
0 comments · 3 min read · LW link

Gothenburg LW / ACX meetup

Stefan · 8 Jan 2025 21:39 UTC
2 points
0 comments · 1 min read · LW link

Aristocracy and Hostage Capital

Arjun Panickssery · 8 Jan 2025 19:38 UTC
108 points
7 comments · 3 min read · LW link
(arjunpanickssery.substack.com)

[Question] What is the most impressive game LLMs can play well?

Cole Wyeth · 8 Jan 2025 19:38 UTC
19 points
20 comments · 1 min read · LW link

The Type of Writing that Pushes Women Away

Dahlia · 8 Jan 2025 18:54 UTC
23 points
4 comments · 2 min read · LW link

Ann Altman has filed a lawsuit in US federal court alleging that she was sexually abused by Sam Altman

quanticle · 8 Jan 2025 14:59 UTC
7 points
3 comments · 1 min read · LW link

AI Safety Outreach Seminar & Social (online)

Linda Linsefors · 8 Jan 2025 13:25 UTC
9 points
0 comments · 1 min read · LW link

XX by Rian Hughes: Pretentious Bullshit

Yair Halberstadt · 8 Jan 2025 13:02 UTC
33 points
5 comments · 5 min read · LW link

Activation space interpretability may be doomed

8 Jan 2025 12:49 UTC
152 points
34 comments · 8 min read · LW link

AI Safety as a YC Startup

Lukas Petersson · 8 Jan 2025 10:46 UTC
58 points
9 comments · 5 min read · LW link

The absolute basics of representation theory of finite groups

Dmitry Vaintrob · 8 Jan 2025 9:47 UTC
21 points
1 comment · 10 min read · LW link

Implications of the AI Security Gap

Dan Braun · 8 Jan 2025 8:31 UTC
46 points
0 comments · 9 min read · LW link

What are polysemantic neurons?

8 Jan 2025 7:35 UTC
9 points
0 comments · 4 min read · LW link
(aisafety.info)

Tips On Empirical Research Slides

8 Jan 2025 5:06 UTC
97 points
4 comments · 6 min read · LW link

On Eating the Sun

jessicata · 8 Jan 2025 4:57 UTC
96 points
99 comments · 3 min read · LW link
(unstablerontology.substack.com)