Measuring Beliefs of Language Models During Chain-of-Thought Reasoning

18 Apr 2025 22:56 UTC
10 points
0 comments · 13 min read · LW link

LLM-based Fact Checking for Popular Posts?

azergante · 18 Apr 2025 21:26 UTC
1 point
2 comments · 62 min read · LW link

o3 Will Use Its Tools For You

Zvi · 18 Apr 2025 21:20 UTC
46 points
3 comments · 45 min read · LW link
(thezvi.wordpress.com)

AI Control Methods Literature Review

Ram Potham · 18 Apr 2025 21:15 UTC
10 points
1 comment · 9 min read · LW link

Consequentialists should have a comprehensive set of deontological beliefs they adhere to

Jay95 · 18 Apr 2025 20:50 UTC
3 points
2 comments · 1 min read · LW link

What Makes an AI Startup “Net Positive” for Safety?

jacquesthibs · 18 Apr 2025 20:33 UTC
82 points
23 comments · 2 min read · LW link

Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning

Jeremias Ferrao · 18 Apr 2025 19:34 UTC
10 points
0 comments · 10 min read · LW link

Evaluating Collaborative AI Performance Subject to Sabotage

Matthew Khoriaty · 18 Apr 2025 19:33 UTC
2 points
0 comments · 19 min read · LW link

Inside OpenAI’s Controversial Plan to Abandon its Nonprofit Roots

garrison · 18 Apr 2025 18:46 UTC
21 points
0 comments · 11 min read · LW link
(garrisonlovely.substack.com)

Could LLMs Learn to Detect Bias Autonomously, Like Tesla’s Self-Driving Cars?

Omnipheasant · 18 Apr 2025 18:45 UTC
0 points
0 comments · 3 min read · LW link

Scaffolding Skills

Screwtape · 18 Apr 2025 17:39 UTC
35 points
9 comments · 4 min read · LW link

The Case for White Box Control

J Rosser · 18 Apr 2025 16:10 UTC
5 points
1 comment · 5 min read · LW link

[Rockville] Rationalist Shabbat

maia · 18 Apr 2025 15:38 UTC
8 points
0 comments · 1 min read · LW link

Handling schemers if shutdown is not an option

Buck · 18 Apr 2025 14:39 UTC
39 points
2 comments · 14 min read · LW link

British and American Connotations

jefftk · 18 Apr 2025 13:00 UTC
14 points
4 comments · 1 min read · LW link
(www.jefftk.com)

Towards Understanding the Representation of Belief State Geometry in Transformers

Karthik Viswanathan · 18 Apr 2025 12:39 UTC
3 points
0 comments · 12 min read · LW link

Training AGI in Secret would be Unsafe and Unethical

Daniel Kokotajlo · 18 Apr 2025 12:27 UTC
139 points
15 comments · 6 min read · LW link

Karma Tests in Logical Counterfactual Simulations motivates strong agents to protect weak agents

Knight Lee · 18 Apr 2025 11:11 UTC
9 points
8 comments · 3 min read · LW link

What If Galaxies Are Alive and Atoms Have Minds? A Thought Experiment on Life Across Scales

Saif Khan · 18 Apr 2025 10:01 UTC
−2 points
5 comments · 3 min read · LW link

Three Months In, Evaluating Three Rationalist Cases for Trump

Arjun Panickssery · 18 Apr 2025 8:27 UTC
117 points
33 comments · 4 min read · LW link

[Question] Comprehensive up-to-date resources on the Chinese Communist Party’s AI strategy, etc?

Mateusz Bagiński · 18 Apr 2025 4:58 UTC
14 points
6 comments · 1 min read · LW link

Conditional Forecasting as Model Parameterization

Molly · 18 Apr 2025 2:35 UTC
15 points
0 comments · 7 min read · LW link
(cuttyshark.substack.com)

One Night in Delphi

Eggs · 18 Apr 2025 2:17 UTC
4 points
2 comments · 3 min read · LW link

Satifaccion and Motivation Mapping through Information Theory

P. João · 18 Apr 2025 0:53 UTC
7 points
0 comments · 26 min read · LW link

The Russell Conjugation Illuminator

TimmyM · 17 Apr 2025 19:33 UTC
51 points
14 comments · 1 min read · LW link
(russellconjugations.com)

Announcing Progress Conference 2025

jasoncrawford · 17 Apr 2025 17:12 UTC
12 points
0 comments · 1 min read · LW link
(newsletter.rootsofprogress.org)

The Mirror Paradox

Jeremy Kraybill · 17 Apr 2025 16:23 UTC
−6 points
0 comments · 1 min read · LW link

Memory Decoding Journal Club

Devin Ward · 17 Apr 2025 16:19 UTC
1 point
0 comments · 1 min read · LW link

Host Keys and SSHing to EC2

jefftk · 17 Apr 2025 15:10 UTC
10 points
6 comments · 1 min read · LW link
(www.jefftk.com)

AI #112: Release the Everything

Zvi · 17 Apr 2025 15:10 UTC
41 points
6 comments · 40 min read · LW link
(thezvi.wordpress.com)

On AI personhood

p.b. · 17 Apr 2025 12:31 UTC
4 points
7 comments · 1 min read · LW link

8 PRIME IDENTITIES - An analisis

P. João · 17 Apr 2025 11:36 UTC
−5 points
0 comments · 2 min read · LW link

8 LATENT VALUES - A simplified construction from MaxEnt Informational Efficiency in 4 questions

P. João · 17 Apr 2025 11:04 UTC
3 points
5 comments · 3 min read · LW link

Automating Mechanistic Interpretability via Program Synthesis

Edy Nastase · 17 Apr 2025 10:58 UTC
1 point
1 comment · 1 min read · LW link

Understanding and overcoming AGI apathy

Dhruv Sumathi · 17 Apr 2025 1:04 UTC
25 points
1 comment · 13 min read · LW link
(dhruvsumathi.substack.com)

ALLFED emergency appeal: Help us raise $800,000 to avoid cutting half of programs

denkenberger · 16 Apr 2025 21:47 UTC
49 points
9 comments · 3 min read · LW link

Prodromes and Biomarkers in Chronic Disease

sarahconstantin · 16 Apr 2025 21:30 UTC
23 points
2 comments · 3 min read · LW link
(sarahconstantin.substack.com)

The Practical Imperative for AI Control Research

Archana Vaidheeswaran · 16 Apr 2025 20:27 UTC
1 point
0 comments · 4 min read · LW link

METR’s preliminary evaluation of o3 and o4-mini

Christopher King · 16 Apr 2025 20:23 UTC
14 points
7 comments · 1 min read · LW link
(metr.github.io)

Mass Exposure Paradox

max-sixty · 16 Apr 2025 20:18 UTC
6 points
2 comments · 2 min read · LW link

GPT-4.5 is Cognitive Empathy, Sonnet 3.5 is Affective Empathy

Jack · 16 Apr 2025 19:12 UTC
15 points
2 comments · 4 min read · LW link

GPT-4.1 Is a Mini Upgrade

Zvi · 16 Apr 2025 19:00 UTC
31 points
6 comments · 8 min read · LW link
(thezvi.wordpress.com)

Doing Prioritization Better

arvomm · 16 Apr 2025 18:46 UTC
3 points
1 comment · 19 min read · LW link
(forum.effectivealtruism.org)

Kamelo: A Rule-Based Constructed Language for Universal, Logical Communication

Saif Khan · 16 Apr 2025 18:44 UTC
12 points
7 comments · 2 min read · LW link

Understanding Trust: Overview Presentations

abramdemski · 16 Apr 2025 18:08 UTC
22 points
0 comments · 1 min read · LW link

Understanding Trust—Overview Presentations

abramdemski · 16 Apr 2025 18:05 UTC
13 points
0 comments · 1 min read · LW link

Telescoping

za3k · 16 Apr 2025 17:05 UTC
13 points
1 comment · 1 min read · LW link
(blog.za3k.com)

Finance and AI Timelines

DAL · 16 Apr 2025 16:55 UTC
5 points
2 comments · 3 min read · LW link

FROM IA CODE TO HUMAN VALUES – A construction from MaxEnt Informational Efficiency in 4 questions

P. João · 16 Apr 2025 16:53 UTC
3 points
0 comments · 7 min read · LW link

AI-enabled coups: a small group could use AI to seize power

16 Apr 2025 16:51 UTC
132 points
23 comments · 7 min read · LW link