Correcting Deceptive Alignment using a Deontological Approach

JeaniceK · Apr 14, 2025, 10:07 PM
8 points
0 comments · 7 min read · LW link

Religious Persistence: A Missing Primitive for Robust Alignment

lauriewired · Apr 14, 2025, 10:03 PM
6 points
3 comments · 8 min read · LW link

The 4-Minute Mile Effect

Parker Conley · Apr 14, 2025, 9:41 PM
32 points
6 comments · 2 min read · LW link
(parconley.com)

Lightning Talks!

nathandunkerley · Apr 14, 2025, 8:39 PM
1 point
0 comments · 1 min read · LW link

The Bell Curve of Bad Behavior

Screwtape · Apr 14, 2025, 7:58 PM
54 points
6 comments · 10 min read · LW link

Sentinel’s Global Risks Weekly Roundup #15/2025: Tariff yoyo, OpenAI slashing safety testing, Iran nuclear programme negotiations, 1K H5N1 confirmed herd infections.

NunoSempere · Apr 14, 2025, 7:11 PM
42 points
0 comments · 2 min read · LW link
(blog.sentinel-team.org)

Sam Altman’s sister claims Sam sexually abused her—Part 7: Timeline, continued

pythagoras5015 · Apr 14, 2025, 5:43 PM
2 points
0 comments · 36 min read · LW link

Sam Altman’s sister claims Sam sexually abused her—Part 8: Timeline, continued

pythagoras5015 · Apr 14, 2025, 5:42 PM
4 points
0 comments · 71 min read · LW link

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Adam Karvonen · Apr 14, 2025, 5:38 PM
156 points
42 comments · 7 min read · LW link
(adamkarvonen.github.io)

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

Apr 14, 2025, 4:45 PM
29 points
1 comment · 2 min read · LW link

Applications Open for Impact Accelerator Program for Experienced Professionals

Clark Wisenbaker · Apr 14, 2025, 4:27 PM
1 point
0 comments · 3 min read · LW link

The Last Light

Bridgett Kay · Apr 14, 2025, 3:41 PM
31 points
2 comments · 4 min read · LW link

Offer: Team Conflict Counseling for AI Safety Orgs

Severin T. Seehrich · Apr 14, 2025, 3:17 PM
19 points
1 comment · 1 min read · LW link

Slopworld 2035: The dangers of mediocre AI

titotal · Apr 14, 2025, 1:14 PM
22 points
6 comments · 29 min read · LW link
(titotal.substack.com)

Try training token-level probes

StefanHex · Apr 14, 2025, 11:56 AM
46 points
6 comments · 8 min read · LW link

Monthly Roundup #29: April 2025

Zvi · Apr 14, 2025, 11:50 AM
23 points
7 comments · 24 min read · LW link
(thezvi.wordpress.com)

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee · Apr 14, 2025, 10:27 AM
−3 points
2 comments · 4 min read · LW link

One-shot steering vectors cause emergent misalignment, too

Jacob Dunefsky · Apr 14, 2025, 6:40 AM
98 points
6 comments · 11 min read · LW link

Unbendable Arm as Test Case for Religious Belief

Ivan Vendrov · Apr 14, 2025, 1:57 AM
28 points
39 comments · 2 min read · LW link
(nothinghuman.substack.com)

Sam Altman’s sister claims Sam sexually abused her—Part 5: Timeline, continued

pythagoras5015 · Apr 14, 2025, 1:00 AM
1 point
0 comments · 125 min read · LW link

Luna Lovegood and the Chamber of Secrets, Part 5

Apr 14, 2025, 12:10 AM
4 points
0 comments · 3 min read · LW link

Sam Altman’s sister claims Sam sexually abused her—Part 4: Timeline, continued

pythagoras5015 · Apr 13, 2025, 11:41 PM
1 point
0 comments · 51 min read · LW link

The Structure of the Pain of Change

ReverendBayes · Apr 13, 2025, 9:51 PM
7 points
0 comments · 10 min read · LW link

Luna Lovegood and the Chamber of Secrets, Part 4

Apr 13, 2025, 8:55 PM
3 points
0 comments · 4 min read · LW link

Thoughts on the Double Impact Project

Mati_Roy · Apr 13, 2025, 7:07 PM
27 points
14 comments · 2 min read · LW link

Intro to Multi-Agent Safety

james__p · Apr 13, 2025, 5:40 PM
12 points
0 comments · 5 min read · LW link

[Question] How far are Western welfare states from coddling the population into becoming useless?

StanislavKrym · Apr 13, 2025, 5:08 PM
−24 points
5 comments · 1 min read · LW link

Vestigial reasoning in RL

Caleb Biddulph · Apr 13, 2025, 3:40 PM
50 points
8 comments · 9 min read · LW link

Four Types of Disagreement

silentbob · Apr 13, 2025, 11:22 AM
50 points
4 comments · 5 min read · LW link

How I switched careers from software engineer to AI policy operations

Lucie Philippon · Apr 13, 2025, 6:37 AM
58 points
1 comment · 5 min read · LW link

Steelmanning heuristic arguments

Dmitry Vaintrob · Apr 13, 2025, 1:09 AM
74 points
0 comments · 17 min read · LW link

MONA: Three Month Later—Updates and Steganography Without Optimization Pressure

Apr 12, 2025, 11:15 PM
31 points
0 comments · 5 min read · LW link

The Era of the Dividual—are we falling apart?

James Stephen Brown · Apr 12, 2025, 10:35 PM
3 points
2 comments · 4 min read · LW link

Commitment Races are a technical problem ASI can easily solve

Knight Lee · Apr 12, 2025, 10:22 PM
7 points
6 comments · 6 min read · LW link

The King’s Gift: How Institutions Rebrand Responsibility into Illusion

Hu Yichao · Apr 12, 2025, 7:38 PM
1 point
0 comments · 1 min read · LW link

Experts have it easy

beyarkay · Apr 12, 2025, 7:32 PM
23 points
3 comments · 9 min read · LW link

find_purpose.exe

heatdeathandtaxes · Apr 12, 2025, 7:31 PM
−1 points
0 comments · 5 min read · LW link
(heatdeathandtaxes.substack.com)

The Cynic Wasps in the Beehive

mempko · Apr 12, 2025, 7:30 PM
−3 points
0 comments · 1 min read · LW link
(blog.mempko.com)

Luna Lovegood and the Chamber of Secrets, Part 3

Apr 12, 2025, 7:20 PM
3 points
0 comments · 2 min read · LW link

[Question] What is autism?

Adam Zerner · Apr 12, 2025, 6:12 PM
18 points
7 comments · 1 min read · LW link

College Advice For People Like Me

henryj · Apr 12, 2025, 2:36 PM
50 points
5 comments · 17 min read · LW link
(www.henryjosephson.com)

Why does LW not put much more focus on AI governance and outreach?

Apr 12, 2025, 2:24 PM
78 points
31 comments · 2 min read · LW link

[Question] Is Local Order a Clue to Universal Entropy? How a Failed Professor Searches for a ‘Sacred Motivational Order’

P. João · Apr 12, 2025, 1:39 PM
2 points
2 comments · 2 min read · LW link

What are good safety standards for open source AIs from China?

ChristianKl · Apr 12, 2025, 1:06 PM
10 points
2 comments · 1 min read · LW link

Will US tariffs push data centers for large model training offshore?

ChristianKl · Apr 12, 2025, 12:47 PM
20 points
3 comments · 1 min read · LW link

Self propagating story.

Canaletto · Apr 12, 2025, 12:32 PM
3 points
0 comments · 8 min read · LW link

Calling Bullshit—the Cheatsheet

Niklas Lehmann · Apr 12, 2025, 11:43 AM
13 points
4 comments · 2 min read · LW link

The Internal Model Principle: A Straightforward Explanation

Alfred Harwood · Apr 12, 2025, 10:58 AM
22 points
1 comment · 19 min read · LW link

ACX Spring Meetup 2025 @ Klang Valley, Malaysia

Yi-Yang · Apr 12, 2025, 7:31 AM
2 points
0 comments · 1 min read · LW link

Distributed whistleblowing

samuelshadrach · Apr 12, 2025, 6:36 AM
5 points
5 comments · 4 min read · LW link
(samuelshadrach.com)