AI Safety Policy Won't Go On Like This – AI Safety Advocacy Is Failing Because Nobody Cares.

henophilia · Mar 1, 2025, 8:15 PM
1 point
1 comment · 1 min read · LW link
(blog.hermesloom.org)

Meaning Machines

appromoximate · Mar 1, 2025, 7:16 PM
0 points
0 comments · 13 min read · LW link

[Question] Share AI Safety Ideas: Both Crazy and Not

ank · Mar 1, 2025, 7:08 PM
16 points
28 comments · 1 min read · LW link

Historiographical Compressions: Renaissance as An Example

adamShimi · Mar 1, 2025, 6:21 PM
17 points
4 comments · 7 min read · LW link
(formethods.substack.com)

Real-Time Gigstats

jefftk · Mar 1, 2025, 2:10 PM
9 points
0 comments · 1 min read · LW link
(www.jefftk.com)

Open problems in emergent misalignment

Mar 1, 2025, 9:47 AM
80 points
13 comments · 7 min read · LW link

Estimating the Probability of Sampling a Trained Neural Network at Random

Mar 1, 2025, 2:11 AM
32 points
10 comments · 1 min read · LW link
(arxiv.org)

[Question] What nation did Trump prevent from going to war (Feb. 2025)?

James Camacho · Mar 1, 2025, 1:46 AM
3 points
3 comments · 1 min read · LW link

AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future

DanielFilan · Mar 1, 2025, 1:20 AM
13 points
0 comments · 13 min read · LW link

TamperSec is hiring for 3 Key Roles!

Jonathan_H · Feb 28, 2025, 11:10 PM
15 points
0 comments · 4 min read · LW link

Do we want alignment faking?

Florian_Dietz · Feb 28, 2025, 9:50 PM
7 points
4 comments · 1 min read · LW link

Few concepts mixing dark fantasy and science fiction

Marek Zegarek · Feb 28, 2025, 9:03 PM
0 points
0 comments · 3 min read · LW link

Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs

tenseisoham · Feb 28, 2025, 8:22 PM
3 points
0 comments · 9 min read · LW link

How to Contribute to Theoretical Reward Learning Research

Joar Skalse · Feb 28, 2025, 7:27 PM
16 points
0 comments · 21 min read · LW link

Other Papers About the Theory of Reward Learning

Joar Skalse · Feb 28, 2025, 7:26 PM
16 points
0 comments · 5 min read · LW link

Defining and Characterising Reward Hacking

Joar Skalse · Feb 28, 2025, 7:25 PM
15 points
0 comments · 4 min read · LW link

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse · Feb 28, 2025, 7:24 PM
9 points
0 comments · 7 min read · LW link

STARC: A General Framework For Quantifying Differences Between Reward Functions

Joar Skalse · Feb 28, 2025, 7:24 PM
11 points
0 comments · 8 min read · LW link

Misspecification in Inverse Reinforcement Learning

Joar Skalse · Feb 28, 2025, 7:24 PM
19 points
0 comments · 11 min read · LW link

Partial Identifiability in Reward Learning

Joar Skalse · Feb 28, 2025, 7:23 PM
16 points
0 comments · 12 min read · LW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse · Feb 28, 2025, 7:20 PM
26 points
4 comments · 14 min read · LW link

An Open Letter To EA and AI Safety On Decelerating AI Development

kenneth_diao · Feb 28, 2025, 5:21 PM
8 points
0 comments · 14 min read · LW link
(graspingatwaves.substack.com)

Dance Weekend Pay II

jefftk · Feb 28, 2025, 3:10 PM
11 points
0 comments · 1 min read · LW link
(www.jefftk.com)

Existentialists and Trolleys

David Gross · Feb 28, 2025, 2:01 PM
5 points
3 comments · 7 min read · LW link

On Emergent Misalignment

Zvi · Feb 28, 2025, 1:10 PM
88 points
5 comments · 22 min read · LW link
(thezvi.wordpress.com)

Do safety-relevant LLM steering vectors optimized on a single example generalize?

Jacob Dunefsky · Feb 28, 2025, 12:01 PM
20 points
1 comment · 14 min read · LW link
(arxiv.org)

Tetherware #2: What every human should know about our most likely AI future

Jáchym Fibír · Feb 28, 2025, 11:12 AM
3 points
0 comments · 11 min read · LW link
(tetherware.substack.com)

Notes on Superwisdom & Moral RSI

welfvh · Feb 28, 2025, 10:34 AM
1 point
4 comments · 1 min read · LW link

Cycles (a short story by Claude 3.7 and me)

Knight Lee · Feb 28, 2025, 7:04 AM
9 points
0 comments · 5 min read · LW link

January-February 2025 Progress in Guaranteed Safe AI

Quinn · Feb 28, 2025, 3:10 AM
15 points
1 comment · 8 min read · LW link
(gsai.substack.com)

Exploring unfaithful/deceptive CoT in reasoning models

Lucy Wingard · Feb 28, 2025, 2:54 AM
4 points
0 comments · 6 min read · LW link

Weirdness Points

lsusr · Feb 28, 2025, 2:23 AM
62 points
19 comments · 3 min read · LW link

Do clients need years of therapy, or can one conversation resolve the issue?

Chipmonk · Feb 28, 2025, 12:06 AM
9 points
10 comments · 6 min read · LW link
(chrislakin.blog)

[New Jersey] HPMOR 10 Year Anniversary Party 🎉

🟠UnlimitedOranges🟠 · Feb 27, 2025, 10:30 PM
4 points
0 comments · 1 min read · LW link

OpenAI releases GPT-4.5

Seth Herd · Feb 27, 2025, 9:40 PM
34 points
12 comments · 3 min read · LW link
(openai.com)

The Elicitation Game: Evaluating capability elicitation techniques

Feb 27, 2025, 8:33 PM
10 points
0 comments · 2 min read · LW link

For the Sake of Pleasure Alone

Greenless Mirror · Feb 27, 2025, 8:07 PM
3 points
14 comments · 12 min read · LW link

Keeping AI Subordinate to Human Thought: A Proposal for Public AI Conversations

syh · Feb 27, 2025, 8:00 PM
−1 points
0 comments · 1 min read · LW link
(medium.com)

How to Corner Liars: A Miasma-Clearing Protocol

ymeskhout · Feb 27, 2025, 5:18 PM
62 points
23 comments · 7 min read · LW link
(www.ymeskhout.com)

Economic Topology, ASI, and the Separation Equilibrium

mkualquiera · Feb 27, 2025, 4:36 PM
2 points
11 comments · 6 min read · LW link

The Illusion of Iterative Improvement: Why AI (and Humans) Fail to Track Their Own Epistemic Drift

Andy E Williams · Feb 27, 2025, 4:26 PM
1 point
3 comments · 4 min read · LW link

AI #105: Hey There Alexa

Zvi · Feb 27, 2025, 2:30 PM
31 points
3 comments · 40 min read · LW link
(thezvi.wordpress.com)

Space-Faring Civilization density estimates and models—Review

Maxime Riché · Feb 27, 2025, 11:44 AM
20 points
0 comments · 12 min read · LW link

Market Capitalization is Semantically Invalid

Zero Contradictions · Feb 27, 2025, 11:27 AM
3 points
14 comments · 3 min read · LW link
(thewaywardaxolotl.blogspot.com)

Proposing Human Survival Strategy based on the NAIA Vision: Toward the Co-evolution of Diverse Intelligences

Hiroshi Yamakawa · Feb 27, 2025, 5:18 AM
−2 points
0 comments · 11 min read · LW link

Short & long term tradeoffs of strategic voting

kaleb · Feb 27, 2025, 4:25 AM
2 points
0 comments · 8 min read · LW link

Recursive alignment with the principle of alignment

hive · Feb 27, 2025, 2:34 AM
9 points
1 comment · 15 min read · LW link
(hiveism.substack.com)

Kingfisher Tour February 2025

jefftk · Feb 27, 2025, 2:20 AM
9 points
0 comments · 4 min read · LW link
(www.jefftk.com)

You should use Consumer Reports

KvmanThinking · Feb 27, 2025, 1:52 AM
7 points
5 comments · 1 min read · LW link

Universal AI Maximizes Variational Empowerment: New Insights into AGI Safety

Yusuke Hayashi · Feb 27, 2025, 12:46 AM
7 points
0 comments · 4 min read · LW link