The Compliment Sandwich 🥪 aka: How to criticize a normie without making them upset.

keltan · 3 Mar 2025 23:15 UTC
15 points
10 comments · 1 min read · LW link

AI Safety at the Frontier: Paper Highlights, February ’25

gasteigerjo · 3 Mar 2025 22:09 UTC
7 points
0 comments · 7 min read · LW link
(aisafetyfrontier.substack.com)

What goals will AIs have? A list of hypotheses

Daniel Kokotajlo · 3 Mar 2025 20:08 UTC
90 points
20 comments · 18 min read · LW link

Takeaways From Our Recent Work on SAE Probing

3 Mar 2025 19:50 UTC
30 points
4 comments · 5 min read · LW link

Why People Commit White Collar Fraud (Ozy linkpost)

sapphire · 3 Mar 2025 19:33 UTC
24 points
1 comment · 1 min read · LW link
(thingofthings.substack.com)

[Question] Ask Me Anything—Samuel

samuelshadrach · 3 Mar 2025 19:24 UTC
0 points
0 comments · 1 min read · LW link

Expanding HarmBench: Investigating Gaps & Extending Adversarial LLM Testing

racinkc1 · 3 Mar 2025 19:23 UTC
1 point
0 comments · 1 min read · LW link

Could Advanced AI Accelerate the Pace of AI Progress? Interviews with AI Researchers

3 Mar 2025 19:05 UTC
41 points
1 comment · 1 min read · LW link
(papers.ssrn.com)

Middle School Choice

jefftk · 3 Mar 2025 16:10 UTC
27 points
10 comments · 4 min read · LW link
(www.jefftk.com)

On GPT-4.5

Zvi · 3 Mar 2025 13:40 UTC
44 points
12 comments · 22 min read · LW link
(thezvi.wordpress.com)

Coalescence—Determinism In Ways We Care About

vitaliya · 3 Mar 2025 13:20 UTC
12 points
0 comments · 11 min read · LW link

Methods for strong human germline engineering

TsviBT · 3 Mar 2025 8:13 UTC
149 points
29 comments · 108 min read · LW link

[Question] Examples of self-fulfilling prophecies in AI alignment?

Chris Lakin · 3 Mar 2025 2:45 UTC
24 points
10 comments · 1 min read · LW link

[Question] Request for Comments on AI-related Prediction Market Ideas

PeterMcCluskey · 2 Mar 2025 20:52 UTC
17 points
1 comment · 3 min read · LW link

Statistical Challenges with Making Super IQ babies

Jan Christian Refsgaard · 2 Mar 2025 20:26 UTC
154 points
26 comments · 9 min read · LW link

Cautions about LLMs in Human Cognitive Loops

Alice Blair · 2 Mar 2025 19:53 UTC
40 points
13 comments · 7 min read · LW link

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout · 2 Mar 2025 19:51 UTC
156 points
29 comments · 1 min read · LW link
(turntrout.com)

Spencer Greenberg hiring a personal/professional/research remote assistant for 5-10 hours per week

spencerg · 2 Mar 2025 18:01 UTC
13 points
0 comments · 1 min read · LW link

[Question] Will LLM agents become the first takeover-capable AGIs?

Seth Herd · 2 Mar 2025 17:15 UTC
37 points
10 comments · 1 min read · LW link

Not-yet-falsifiable beliefs?

Benjamin Hendricks · 2 Mar 2025 14:11 UTC
6 points
4 comments · 1 min read · LW link

Saving Zest

jefftk · 2 Mar 2025 12:00 UTC
24 points
1 comment · 1 min read · LW link
(www.jefftk.com)

Open Thread Spring 2025

Ben Pace · 2 Mar 2025 2:33 UTC
20 points
48 comments · 1 min read · LW link

[Question] help, my self image as rational is affecting my ability to empathize with others

KvmanThinking · 2 Mar 2025 2:06 UTC
1 point
13 comments · 1 min read · LW link

Maintaining Alignment during RSI as a Feedback Control Problem

beren · 2 Mar 2025 0:21 UTC
67 points
6 comments · 11 min read · LW link

AI Safety Policy Won’t Go On Like This – AI Safety Advocacy Is Failing Because Nobody Cares.

henophilia · 1 Mar 2025 20:15 UTC
1 point
1 comment · 1 min read · LW link
(blog.hermesloom.org)

Meaning Machines

appromoximate · 1 Mar 2025 19:16 UTC
0 points
0 comments · 13 min read · LW link

[Question] Share AI Safety Ideas: Both Crazy and Not

ank · 1 Mar 2025 19:08 UTC
17 points
28 comments · 1 min read · LW link

Historiographical Compressions: Renaissance as An Example

adamShimi · 1 Mar 2025 18:21 UTC
17 points
4 comments · 7 min read · LW link
(formethods.substack.com)

Real-Time Gigstats

jefftk · 1 Mar 2025 14:10 UTC
9 points
0 comments · 1 min read · LW link
(www.jefftk.com)

Open problems in emergent misalignment

1 Mar 2025 9:47 UTC
83 points
17 comments · 7 min read · LW link

Estimating the Probability of Sampling a Trained Neural Network at Random

1 Mar 2025 2:11 UTC
32 points
10 comments · 1 min read · LW link
(arxiv.org)

[Question] What nation did Trump prevent from going to war (Feb. 2025)?

James Camacho · 1 Mar 2025 1:46 UTC
3 points
5 comments · 1 min read · LW link

AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

DanielFilan · 1 Mar 2025 1:20 UTC
13 points
0 comments · 13 min read · LW link

TamperSec is hiring for 3 Key Roles!

Jonathan_H · 28 Feb 2025 23:10 UTC
15 points
0 comments · 4 min read · LW link

Do we want alignment faking?

Florian_Dietz · 28 Feb 2025 21:50 UTC
7 points
4 comments · 1 min read · LW link

Few concepts mixing dark fantasy and science fiction

Marek Zegarek · 28 Feb 2025 21:03 UTC
0 points
0 comments · 3 min read · LW link

Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs

tenseisoham · 28 Feb 2025 20:22 UTC
3 points
0 comments · 9 min read · LW link

How to Contribute to Theoretical Reward Learning Research

Joar Skalse · 28 Feb 2025 19:27 UTC
16 points
0 comments · 21 min read · LW link

Other Papers About the Theory of Reward Learning

Joar Skalse · 28 Feb 2025 19:26 UTC
16 points
0 comments · 5 min read · LW link

Defining and Characterising Reward Hacking

Joar Skalse · 28 Feb 2025 19:25 UTC
15 points
0 comments · 4 min read · LW link

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse · 28 Feb 2025 19:24 UTC
9 points
0 comments · 7 min read · LW link

STARC: A General Framework For Quantifying Differences Between Reward Functions

Joar Skalse · 28 Feb 2025 19:24 UTC
11 points
0 comments · 8 min read · LW link

Misspecification in Inverse Reinforcement Learning

Joar Skalse · 28 Feb 2025 19:24 UTC
19 points
0 comments · 11 min read · LW link

Partial Identifiability in Reward Learning

Joar Skalse · 28 Feb 2025 19:23 UTC
16 points
0 comments · 12 min read · LW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse · 28 Feb 2025 19:20 UTC
29 points
4 comments · 14 min read · LW link

An Open Letter To EA and AI Safety On Decelerating AI Development

kenneth_diao · 28 Feb 2025 17:21 UTC
8 points
0 comments · 14 min read · LW link
(graspingatwaves.substack.com)

Dance Weekend Pay II

jefftk · 28 Feb 2025 15:10 UTC
11 points
0 comments · 1 min read · LW link
(www.jefftk.com)

Existentialists and Trolleys

David Gross · 28 Feb 2025 14:01 UTC
5 points
3 comments · 7 min read · LW link

On Emergent Misalignment

Zvi · 28 Feb 2025 13:10 UTC
88 points
5 comments · 22 min read · LW link
(thezvi.wordpress.com)

Do safety-relevant LLM steering vectors optimized on a single example generalize?

Jacob Dunefsky · 28 Feb 2025 12:01 UTC
21 points
1 comment · 14 min read · LW link
(arxiv.org)