Alignment First, Intelligence Later

Chris Lakin · Mar 30, 2025, 10:26 PM
18 points
5 comments · 3 min read · LW link

[Question] Why do many people who care about AI Safety not clearly endorse PauseAI?

humnrdble · Mar 30, 2025, 6:06 PM
45 points
42 comments · 2 min read · LW link

Enumerating objects a model “knows” using entity-detection features.

Alex Gibson · Mar 30, 2025, 4:58 PM
6 points
2 comments · 6 min read · LW link

Bonn ACX Meetup Spring 2025

Fernand0 · Mar 30, 2025, 3:12 PM
2 points
1 comment · 1 min read · LW link

What does aligning AI to an ideology mean for true alignment?

StanislavKrym · Mar 30, 2025, 3:12 PM
1 point
0 comments · 8 min read · LW link

How to enjoy fail attempts without self-deception (technique)

YanLyutnev · Mar 30, 2025, 1:49 PM
9 points
0 comments · 9 min read · LW link

Memory Persistence within Conversation Threads with Multimodal LLMs

sjay8 · Mar 30, 2025, 7:16 AM
4 points
0 comments · 1 min read · LW link

How I talk to those above me

Maxwell Peterson · Mar 30, 2025, 6:54 AM
102 points
16 comments · 8 min read · LW link

How do SAE Circuits Fail? A Case Study Using a Starts-with-‘E’ Letter Detection Task

adsingh-64 · Mar 30, 2025, 12:47 AM
1 point
0 comments · 3 min read · LW link

Climbing the Hill of Experiments

nomagicpill · Mar 29, 2025, 8:37 PM
4 points
0 comments · 6 min read · LW link
(nomagicpill.github.io)

[Question] Does the AI control agenda broadly rely on no FOOM being possible?

Noosphere89 · Mar 29, 2025, 7:38 PM
22 points
3 comments · 1 min read · LW link

Exercising Rationality

Eggs · Mar 29, 2025, 7:08 PM
4 points
0 comments · 4 min read · LW link

Yeshua’s Basilisk

Alex Beyman · Mar 29, 2025, 6:11 PM
8 points
1 comment · 4 min read · LW link

AI Needs Us? Information Theory and Humans as data

tomdekan · Mar 29, 2025, 3:51 PM
0 points
6 comments · 4 min read · LW link

Auto Shutdown Script

jefftk · Mar 29, 2025, 1:10 PM
16 points
5 comments · 1 min read · LW link
(www.jefftk.com)

Proposal for a Post-Labor Societal Structure to Mitigate ASI Risks: The ‘Game Culture Civilization’ (GCC) Model

Beyond Singularity · Mar 29, 2025, 11:31 AM
2 points
0 comments · 4 min read · LW link

Tormenting Gemini 2.5 with the [[[]]][][[]] Puzzle

Czynski · Mar 29, 2025, 2:51 AM
48 points
36 comments · 3 min read · LW link

Singularity Survival Guide: A Bayesian Guide for Navigating the Pre-Singularity Period

mbrooks · Mar 28, 2025, 11:21 PM
6 points
4 comments · 2 min read · LW link

Softmax, Emmett Shear’s new AI startup focused on “Organic Alignment”

Chris Lakin · Mar 28, 2025, 9:23 PM
59 points
1 comment · 1 min read · LW link
(www.corememory.com)

The Pando Problem: Rethinking AI Individuality

Jan_Kulveit · Mar 28, 2025, 9:03 PM
128 points
14 comments · 13 min read · LW link

Selection Pressures on LM Personas

Raymond Douglas · Mar 28, 2025, 8:33 PM
30 points
0 comments · 3 min read · LW link

AXRP Episode 40 - Jason Gross on Compact Proofs and Interpretability

DanielFilan · Mar 28, 2025, 6:40 PM
23 points
0 comments · 89 min read · LW link

[Question] Share AI Safety Ideas: Both Crazy and Not. №2

ank · Mar 28, 2025, 5:22 PM
2 points
10 comments · 1 min read · LW link

AI x Bio Workshop

Allison Duettmann · Mar 28, 2025, 5:21 PM
16 points
0 comments · 1 min read · LW link

[Question] How many times faster can the AGI advance the science than humans do?

StanislavKrym · Mar 28, 2025, 3:16 PM
0 points
0 comments · 1 min read · LW link

Gemini 2.5 is the New SoTA

Zvi · Mar 28, 2025, 2:20 PM
52 points
1 comment · 12 min read · LW link
(thezvi.wordpress.com)

Will the Need to Retrain AI Models from Scratch Block a Software Intelligence Explosion?

Tom Davidson · Mar 28, 2025, 2:12 PM
10 points
0 comments · 3 min read · LW link

How We Might All Die in A Year

Greg C · Mar 28, 2025, 1:22 PM
5 points
13 comments · 21 min read · LW link
(x.com)

The vision of Bill Thurston

TsviBT · Mar 28, 2025, 11:45 AM
50 points
34 comments · 4 min read · LW link

What Uniparental Disomy Tells Us About Improper Imprinting in Humans

Morpheus · Mar 28, 2025, 11:24 AM
32 points
1 comment · 6 min read · LW link
(www.tassiloneubauer.com)

Explaining British Naval Dominance During the Age of Sail

Arjun Panickssery · Mar 28, 2025, 5:47 AM
199 points
17 comments · 4 min read · LW link
(arjunpanickssery.substack.com)

Will the AGIs be able to run the civilisation?

StanislavKrym · Mar 28, 2025, 4:50 AM
−4 points
2 comments · 3 min read · LW link

[Question] Is AGI actually that likely to take off given the world energy consumption?

StanislavKrym · Mar 27, 2025, 11:13 PM
2 points
2 comments · 1 min read · LW link

[Linkpost] The value of initiating a pursuit in temporal decision-making

Gunnar_Zarncke · Mar 27, 2025, 9:47 PM
13 points
0 comments · 2 min read · LW link

Alignment through atomic agents

micseydel · Mar 27, 2025, 6:43 PM
−1 points
0 comments · 1 min read · LW link

Machines of Stolen Grace

Riley Tavassoli · Mar 27, 2025, 6:15 PM
2 points
0 comments · 5 min read · LW link

An argument for asexuality

filthy_hedonist · Mar 27, 2025, 6:08 PM
−2 points
10 comments · 1 min read · LW link

On the plausibility of a “messy” rogue AI committing human-like evil

Jacob Griffith · Mar 27, 2025, 6:06 PM
6 points
0 comments · 7 min read · LW link

AI Moral Alignment: The Most Important Goal of Our Generation

Ronen Bar · Mar 27, 2025, 6:04 PM
3 points
0 comments · 8 min read · LW link
(forum.effectivealtruism.org)

Tracing the Thoughts of a Large Language Model

Adam Jermyn · Mar 27, 2025, 5:20 PM
304 points
24 comments · 10 min read · LW link
(www.anthropic.com)

Computational Superposition in a Toy Model of the U-AND Problem

Adam Newgas · Mar 27, 2025, 4:56 PM
18 points
2 comments · 11 min read · LW link

Mistral Large 2 (123B) seems to exhibit alignment faking

Mar 27, 2025, 3:39 PM
80 points
4 comments · 13 min read · LW link

AIS Netherlands is looking for a Founding Executive Director (EOI form)

Mar 27, 2025, 3:30 PM
15 points
0 comments · 4 min read · LW link

AI #109: Google Fails Marketing Forever

Zvi · Mar 27, 2025, 2:50 PM
42 points
12 comments · 35 min read · LW link
(thezvi.wordpress.com)

What life will be like for humans if aligned ASI is created

james oofou · Mar 27, 2025, 10:06 AM
3 points
6 comments · 2 min read · LW link

What is scaffolding?

Mar 27, 2025, 9:06 AM
10 points
0 comments · 2 min read · LW link
(aisafety.info)

Workflow vs interface vs implementation

Sniffnoy · Mar 27, 2025, 7:38 AM
12 points
0 comments · 1 min read · LW link

Quick thoughts on the difficulty of widely conveying a non-stereotyped position

Sniffnoy · Mar 27, 2025, 7:30 AM
12 points
0 comments · 5 min read · LW link

Doing principle-of-charity better

Sniffnoy · Mar 27, 2025, 5:19 AM
22 points
1 comment · 3 min read · LW link

X as phenomenon vs as policy, Goodhart, and the AB problem

Sniffnoy · Mar 27, 2025, 4:32 AM
13 points
0 comments · 2 min read · LW link