Correcting Deceptive Alignment using a Deontological Approach

JeaniceK · 14 Apr 2025 22:07 UTC
8 points
0 comments · 7 min read · LW link

Religious Persistence: A Missing Primitive for Robust Alignment

lauriewired · 14 Apr 2025 22:03 UTC
6 points
3 comments · 8 min read · LW link

The 4-Minute Mile Effect

Parker Conley · 14 Apr 2025 21:41 UTC
32 points
6 comments · 2 min read · LW link
(parconley.com)

Lightning Talks!

nathandunkerley · 14 Apr 2025 20:39 UTC
1 point
0 comments · 1 min read · LW link

The Bell Curve of Bad Behavior

Screwtape · 14 Apr 2025 19:58 UTC
54 points
6 comments · 10 min read · LW link

Sentinel’s Global Risks Weekly Roundup #15/2025: Tariff yoyo, OpenAI slashing safety testing, Iran nuclear programme negotiations, 1K H5N1 confirmed herd infections.

NunoSempere · 14 Apr 2025 19:11 UTC
42 points
0 comments · 2 min read · LW link
(blog.sentinel-team.org)

Sam Altman’s sister claims Sam sexually abused her—Part 7: Timeline, continued

pythagoras5015 · 14 Apr 2025 17:43 UTC
2 points
0 comments · 36 min read · LW link

Sam Altman’s sister claims Sam sexually abused her—Part 8: Timeline, continued

pythagoras5015 · 14 Apr 2025 17:42 UTC
4 points
0 comments · 71 min read · LW link

Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study

Adam Karvonen · 14 Apr 2025 17:38 UTC
158 points
42 comments · 7 min read · LW link
(adamkarvonen.github.io)

How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

14 Apr 2025 16:45 UTC
29 points
1 comment · 2 min read · LW link

Applications Open for Impact Accelerator Program for Experienced Professionals

Clark Wisenbaker · 14 Apr 2025 16:27 UTC
1 point
0 comments · 3 min read · LW link

The Last Light

Bridgett Kay · 14 Apr 2025 15:41 UTC
31 points
2 comments · 4 min read · LW link

Offer: Team Conflict Counseling for AI Safety Orgs

Severin T. Seehrich · 14 Apr 2025 15:17 UTC
19 points
1 comment · 1 min read · LW link

Slopworld 2035: The dangers of mediocre AI

titotal · 14 Apr 2025 13:14 UTC
22 points
6 comments · 29 min read · LW link
(titotal.substack.com)

Try training token-level probes

StefanHex · 14 Apr 2025 11:56 UTC
47 points
6 comments · 8 min read · LW link

Monthly Roundup #29: April 2025

Zvi · 14 Apr 2025 11:50 UTC
23 points
7 comments · 24 min read · LW link
(thezvi.wordpress.com)

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee · 14 Apr 2025 10:27 UTC
−3 points
2 comments · 4 min read · LW link

One-shot steering vectors cause emergent misalignment, too

Jacob Dunefsky · 14 Apr 2025 6:40 UTC
98 points
6 comments · 11 min read · LW link

Unbendable Arm as Test Case for Religious Belief

Ivan Vendrov · 14 Apr 2025 1:57 UTC
28 points
39 comments · 2 min read · LW link
(nothinghuman.substack.com)

Sam Altman’s sister claims Sam sexually abused her—Part 5: Timeline, continued

pythagoras5015 · 14 Apr 2025 1:00 UTC
1 point
0 comments · 125 min read · LW link

Luna Lovegood and the Chamber of Secrets, Part 5

14 Apr 2025 0:10 UTC
4 points
0 comments · 3 min read · LW link

Sam Altman’s sister claims Sam sexually abused her—Part 4: Timeline, continued

pythagoras5015 · 13 Apr 2025 23:41 UTC
1 point
0 comments · 51 min read · LW link

The Structure of the Pain of Change

ReverendBayes · 13 Apr 2025 21:51 UTC
7 points
0 comments · 10 min read · LW link

Luna Lovegood and the Chamber of Secrets, Part 4

13 Apr 2025 20:55 UTC
3 points
0 comments · 4 min read · LW link

Thoughts on the Double Impact Project

Mati_Roy · 13 Apr 2025 19:07 UTC
27 points
14 comments · 2 min read · LW link

Intro to Multi-Agent Safety

james__p · 13 Apr 2025 17:40 UTC
12 points
0 comments · 5 min read · LW link

Vestigial reasoning in RL

Caleb Biddulph · 13 Apr 2025 15:40 UTC
53 points
8 comments · 9 min read · LW link

Four Types of Disagreement

silentbob · 13 Apr 2025 11:22 UTC
50 points
4 comments · 5 min read · LW link

How I switched careers from software engineer to AI policy operations

Lucie Philippon · 13 Apr 2025 6:37 UTC
58 points
1 comment · 5 min read · LW link

Steelmanning heuristic arguments

Dmitry Vaintrob · 13 Apr 2025 1:09 UTC
78 points
0 comments · 17 min read · LW link

MONA: Three Month Later—Updates and Steganography Without Optimization Pressure

12 Apr 2025 23:15 UTC
31 points
0 comments · 5 min read · LW link

The Era of the Dividual—are we falling apart?

James Stephen Brown · 12 Apr 2025 22:35 UTC
3 points
2 comments · 4 min read · LW link

Commitment Races are a technical problem ASI can easily solve

Knight Lee · 12 Apr 2025 22:22 UTC
7 points
6 comments · 6 min read · LW link

The King’s Gift: How Institutions Rebrand Responsibility into Illusion

Hu Yichao · 12 Apr 2025 19:38 UTC
1 point
0 comments · 1 min read · LW link

Experts have it easy

beyarkay · 12 Apr 2025 19:32 UTC
23 points
3 comments · 9 min read · LW link

find_purpose.exe

heatdeathandtaxes · 12 Apr 2025 19:31 UTC
−1 points
0 comments · 5 min read · LW link
(heatdeathandtaxes.substack.com)

The Cynic Wasps in the Beehive

mempko · 12 Apr 2025 19:30 UTC
−3 points
0 comments · 1 min read · LW link
(blog.mempko.com)

Luna Lovegood and the Chamber of Secrets, Part 3

12 Apr 2025 19:20 UTC
3 points
0 comments · 2 min read · LW link

[Question] What is autism?

Adam Zerner · 12 Apr 2025 18:12 UTC
18 points
7 comments · 1 min read · LW link

College Advice For People Like Me

henryj · 12 Apr 2025 14:36 UTC
50 points
5 comments · 17 min read · LW link
(www.henryjosephson.com)

Why does LW not put much more focus on AI governance and outreach?

12 Apr 2025 14:24 UTC
78 points
31 comments · 2 min read · LW link

[Question] Is Local Order a Clue to Universal Entropy? How a Failed Professor Searches for a ‘Sacred Motivational Order’

P. João · 12 Apr 2025 13:39 UTC
2 points
2 comments · 2 min read · LW link

What are good safety standards for open source AIs from China?

ChristianKl · 12 Apr 2025 13:06 UTC
10 points
2 comments · 1 min read · LW link

Will US tariffs push data centers for large model training offshore?

ChristianKl · 12 Apr 2025 12:47 UTC
20 points
3 comments · 1 min read · LW link

Self propagating story.

Canaletto · 12 Apr 2025 12:32 UTC
3 points
0 comments · 8 min read · LW link

Calling Bullshit—the Cheatsheet

Niklas Lehmann · 12 Apr 2025 11:43 UTC
13 points
4 comments · 2 min read · LW link

The Internal Model Principle: A Straightforward Explanation

Alfred Harwood · 12 Apr 2025 10:58 UTC
23 points
6 comments · 19 min read · LW link

ACX Spring Meetup 2025 @ Klang Valley, Malaysia

Yi-Yang · 12 Apr 2025 7:31 UTC
2 points
0 comments · 1 min read · LW link

Distributed whistleblowing

samuelshadrach · 12 Apr 2025 6:36 UTC
5 points
5 comments · 4 min read · LW link
(samuelshadrach.com)

[Question] How likely are the USA to decay and how will it influence the AI development?

StanislavKrym · 12 Apr 2025 4:42 UTC
10 points
0 comments · 1 min read · LW link