TamperSec is hiring for 3 Key Roles!

Jonathan_H · 28 Feb 2025 23:10 UTC
15 points
0 comments · 4 min read · LW link

Do we want alignment faking?

Florian_Dietz · 28 Feb 2025 21:50 UTC
7 points
4 comments · 1 min read · LW link

Few concepts mixing dark fantasy and science fiction

Marek Zegarek · 28 Feb 2025 21:03 UTC
0 points
0 comments · 3 min read · LW link

Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs

tenseisoham · 28 Feb 2025 20:22 UTC
3 points
0 comments · 9 min read · LW link

How to Contribute to Theoretical Reward Learning Research

Joar Skalse · 28 Feb 2025 19:27 UTC
16 points
0 comments · 21 min read · LW link

Other Papers About the Theory of Reward Learning

Joar Skalse · 28 Feb 2025 19:26 UTC
16 points
0 comments · 5 min read · LW link

Defining and Characterising Reward Hacking

Joar Skalse · 28 Feb 2025 19:25 UTC
15 points
0 comments · 4 min read · LW link

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse · 28 Feb 2025 19:24 UTC
9 points
0 comments · 7 min read · LW link

STARC: A General Framework For Quantifying Differences Between Reward Functions

Joar Skalse · 28 Feb 2025 19:24 UTC
11 points
0 comments · 8 min read · LW link

Misspecification in Inverse Reinforcement Learning

Joar Skalse · 28 Feb 2025 19:24 UTC
19 points
0 comments · 11 min read · LW link

Partial Identifiability in Reward Learning

Joar Skalse · 28 Feb 2025 19:23 UTC
16 points
0 comments · 12 min read · LW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse · 28 Feb 2025 19:20 UTC
29 points
4 comments · 14 min read · LW link

An Open Letter To EA and AI Safety On Decelerating AI Development

kenneth_diao · 28 Feb 2025 17:21 UTC
8 points
0 comments · 14 min read · LW link
(graspingatwaves.substack.com)

Dance Weekend Pay II

jefftk · 28 Feb 2025 15:10 UTC
11 points
0 comments · 1 min read · LW link
(www.jefftk.com)

Existentialists and Trolleys

David Gross · 28 Feb 2025 14:01 UTC
5 points
3 comments · 7 min read · LW link

On Emergent Misalignment

Zvi · 28 Feb 2025 13:10 UTC
88 points
5 comments · 22 min read · LW link
(thezvi.wordpress.com)

Do safety-relevant LLM steering vectors optimized on a single example generalize?

Jacob Dunefsky · 28 Feb 2025 12:01 UTC
21 points
1 comment · 14 min read · LW link
(arxiv.org)

Tetherware #2: What every human should know about our most likely AI future

Jáchym Fibír · 28 Feb 2025 11:12 UTC
3 points
0 comments · 11 min read · LW link
(tetherware.substack.com)

Notes on Superwisdom & Moral RSI

welfvh · 28 Feb 2025 10:34 UTC
1 point
4 comments · 1 min read · LW link

Cycles (a short story by Claude 3.7 and me)

Knight Lee · 28 Feb 2025 7:04 UTC
9 points
0 comments · 5 min read · LW link

January-February 2025 Progress in Guaranteed Safe AI

Quinn · 28 Feb 2025 3:10 UTC
15 points
1 comment · 8 min read · LW link
(gsai.substack.com)

Exploring unfaithful/deceptive CoT in reasoning models

Lucy Wingard · 28 Feb 2025 2:54 UTC
4 points
0 comments · 6 min read · LW link

Weirdness Points

lsusr · 28 Feb 2025 2:23 UTC
64 points
19 comments · 3 min read · LW link

OpenAI releases GPT-4.5

Seth Herd · 27 Feb 2025 21:40 UTC
34 points
12 comments · 3 min read · LW link
(openai.com)

The Elicitation Game: Evaluating capability elicitation techniques

27 Feb 2025 20:33 UTC
10 points
1 comment · 2 min read · LW link

For the Sake of Pleasure Alone

Greenless Mirror · 27 Feb 2025 20:07 UTC
−1 points
17 comments · 12 min read · LW link

Keeping AI Subordinate to Human Thought: A Proposal for Public AI Conversations

syh · 27 Feb 2025 20:00 UTC
−1 points
0 comments · 1 min read · LW link
(medium.com)

How to Corner Liars: A Miasma-Clearing Protocol

ymeskhout · 27 Feb 2025 17:18 UTC
67 points
23 comments · 7 min read · LW link
(www.ymeskhout.com)

Economic Topology, ASI, and the Separation Equilibrium

mkualquiera · 27 Feb 2025 16:36 UTC
2 points
11 comments · 6 min read · LW link

The Illusion of Iterative Improvement: Why AI (and Humans) Fail to Track Their Own Epistemic Drift

Andy E Williams · 27 Feb 2025 16:26 UTC
1 point
3 comments · 4 min read · LW link

AI #105: Hey There Alexa

Zvi · 27 Feb 2025 14:30 UTC
31 points
3 comments · 40 min read · LW link
(thezvi.wordpress.com)

Space-Faring Civilization density estimates and models—Review

Maxime Riché · 27 Feb 2025 11:44 UTC
20 points
0 comments · 12 min read · LW link

Market Capitalization is Semantically Invalid

Zero Contradictions · 27 Feb 2025 11:27 UTC
3 points
14 comments · 3 min read · LW link
(thewaywardaxolotl.blogspot.com)

Proposing Human Survival Strategy based on the NAIA Vision: Toward the Co-evolution of Diverse Intelligences

Hiroshi Yamakawa · 27 Feb 2025 5:18 UTC
−2 points
0 comments · 11 min read · LW link

Short & long term tradeoffs of strategic voting

kaleb · 27 Feb 2025 4:25 UTC
2 points
0 comments · 8 min read · LW link

Recursive alignment with the principle of alignment

hive · 27 Feb 2025 2:34 UTC
12 points
4 comments · 15 min read · LW link
(hiveism.substack.com)

Kingfisher Tour February 2025

jefftk · 27 Feb 2025 2:20 UTC
9 points
0 comments · 4 min read · LW link
(www.jefftk.com)

You should use Consumer Reports

KvmanThinking · 27 Feb 2025 1:52 UTC
7 points
5 comments · 1 min read · LW link

Universal AI Maximizes Variational Empowerment: New Insights into AGI Safety

Yusuke Hayashi · 27 Feb 2025 0:46 UTC
14 points
1 comment · 4 min read · LW link

Why Can’t We Hypothesize After the Fact?

David Udell · 26 Feb 2025 22:41 UTC
40 points
3 comments · 2 min read · LW link

“AI Rapidly Gets Smarter, And Makes Some of Us Dumber,” from Sabine Hossenfelder

Evan_Gaensbauer · 26 Feb 2025 22:33 UTC
4 points
9 comments · 2 min read · LW link
(youtu.be)

METR: AI models can be dangerous before public deployment

UnofficialLinkpostBot · 26 Feb 2025 20:19 UTC
16 points
0 comments · 3 min read · LW link
(metr.org)

Representation Engineering has Its Problems, but None Seem Unsolvable

Lukasz G Bartoszcze · 26 Feb 2025 19:53 UTC
15 points
1 comment · 3 min read · LW link

Thoughts that prompt good forecasts: A survey

Daniel_Friedrich · 26 Feb 2025 18:36 UTC
1 point
0 comments · 1 min read · LW link

The non-tribal tribes

PatrickDFarley · 26 Feb 2025 17:22 UTC
24 points
4 comments · 16 min read · LW link

SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features

Seonglae Cho · 26 Feb 2025 17:05 UTC
4 points
3 comments · 17 min read · LW link

Fuzzing LLMs sometimes makes them reveal their secrets

Fabien Roger · 26 Feb 2025 16:48 UTC
65 points
13 comments · 9 min read · LW link

You can just wear a suit

lsusr · 26 Feb 2025 14:57 UTC
139 points
59 comments · 2 min read · LW link

Matthew Yglesias—Misinformation Mostly Confuses Your Own Side

Siebe · 26 Feb 2025 14:55 UTC
10 points
1 comment · 1 min read · LW link
(www.slowboring.com)

Optimizing Feedback to Learn Faster

Towards_Keeperhood · 26 Feb 2025 14:24 UTC
12 points
0 comments · 2 min read · LW link