AI Agent Benchmarks Are Broken

Sasha Cui · 8 Jul 2025 22:11 UTC
10 points
0 comments · 1 min read · LW link
(ddkang.substack.com)

Why Do Some Language Models Fake Alignment While Others Don’t?

8 Jul 2025 21:49 UTC
158 points
14 comments · 5 min read · LW link
(arxiv.org)

A Medium Scenario

Chapin Lenthall-Cleary · 8 Jul 2025 20:09 UTC
18 points
12 comments · 20 min read · LW link

An Opinionated Guide to Using Anki Correctly

Luise · 8 Jul 2025 20:01 UTC
156 points
58 comments · 27 min read · LW link

Lenses, Metaphors, and Meaning

8 Jul 2025 19:46 UTC
7 points
0 comments · 4 min read · LW link

Applying right-wing frames to AGI (geo)politics

Richard_Ngo · 8 Jul 2025 18:03 UTC
64 points
25 comments · 3 min read · LW link
(x.com)

The Unjournal’s “Pivotal Questions” project

david reinstein · 8 Jul 2025 15:55 UTC
6 points
1 comment · 1 min read · LW link
(forum.effectivealtruism.org)

Balsa Update: Springtime in DC

Zvi · 8 Jul 2025 15:00 UTC
61 points
6 comments · 10 min read · LW link
(thezvi.wordpress.com)

MIT FutureTech are hiring a Postdoctoral Associate to work on AI Performance and Safety

peterslattery · 8 Jul 2025 14:02 UTC
3 points
0 comments · 4 min read · LW link

Energy-Based Transformers are Scalable Learners and Thinkers

Matrice Jacobine · 8 Jul 2025 13:44 UTC
7 points
5 comments · 1 min read · LW link
(energy-based-transformers.github.io)

LLMs are Capable of Misaligned Behavior Under Explicit Prohibition and Surveillance

Igor Ivanov · 8 Jul 2025 11:50 UTC
28 points
8 comments · 7 min read · LW link

The Connection

Alexandre Variengien · 8 Jul 2025 10:53 UTC
23 points
0 comments · 24 min read · LW link
(alexandrevariengien.com)

Subversion via Focal Points: Investigating Collusion in LLM Monitoring

Olli Järviniemi · 8 Jul 2025 10:15 UTC
14 points
2 comments · 1 min read · LW link

NYT article about the Zizians including quotes from Eliezer, Anna, Ozy, Jessica, Zvi

Matrice Jacobine · 8 Jul 2025 1:42 UTC
9 points
3 comments · 1 min read · LW link
(www.nytimes.com)

A Theory of Structural Independence

Matthias G. Mayer · 7 Jul 2025 22:54 UTC
70 points
2 comments · 1 min read · LW link
(arxiv.org)

Navigating Attention

jimmy · 7 Jul 2025 21:43 UTC
10 points
2 comments · 8 min read · LW link

The Weighted Perplexity Benchmark: Tokenizer-Normalized Evaluation for Language Model Comparison

7 Jul 2025 21:43 UTC
21 points
0 comments · 7 min read · LW link
(www.morpheus.systems)

Planet X, Lord Kelvin, and the use of Structure as Fuel

David Björling · 7 Jul 2025 21:23 UTC
11 points
19 comments · 3 min read · LW link

Art, rationality, and the “feeling” for rightness

Karthik Bala · 7 Jul 2025 20:09 UTC
1 point
2 comments · 3 min read · LW link

Public anti-AI sentiment can be useful: three mechanisms

andyqhan · 7 Jul 2025 19:05 UTC
8 points
4 comments · 5 min read · LW link

Literature Review: Risks of MDMA

Elizabeth · 7 Jul 2025 19:01 UTC
67 points
8 comments · 4 min read · LW link
(acesounderglass.com)

AI Safety at the Frontier: Paper Highlights, June ’25

gasteigerjo · 7 Jul 2025 18:17 UTC
4 points
0 comments · 7 min read · LW link
(open.substack.com)

You Can’t Objectively Compare Seven Bees to One Human

J Bostock · 7 Jul 2025 18:11 UTC
58 points
26 comments · 3 min read · LW link
(jbostock.substack.com)

Economics of Claude 3 Opus Inference

7 Jul 2025 15:53 UTC
34 points
0 comments · 11 min read · LW link

On the functional self of LLMs

eggsyntax · 7 Jul 2025 15:39 UTC
95 points
35 comments · 8 min read · LW link

Notes on Righteousness and Megalopsychia

David Gross · 7 Jul 2025 15:18 UTC
12 points
0 comments · 31 min read · LW link

On Alpha School

Zvi · 7 Jul 2025 15:10 UTC
37 points
2 comments · 14 min read · LW link
(thezvi.wordpress.com)

Sleeping Beauty and the Forever Muffin

OneManyNone · 7 Jul 2025 12:05 UTC
1 point
13 comments · 16 min read · LW link

Resource guide: Unawareness, indeterminacy, and cluelessness

Anthony DiGiovanni · 7 Jul 2025 9:54 UTC
20 points
0 comments · 7 min read · LW link

On music and language

Joey Marcellino · 7 Jul 2025 9:09 UTC
18 points
6 comments · 8 min read · LW link

Manifesto for doing good science in AI

invertedpassion · 7 Jul 2025 7:33 UTC
2 points
1 comment · 5 min read · LW link

The Base Model Lens

Adam Newgas · 7 Jul 2025 0:12 UTC
7 points
0 comments · 3 min read · LW link

AXRP Episode 45 - Samuel Albanie on DeepMind’s AGI Safety Approach

DanielFilan · 6 Jul 2025 23:00 UTC
31 points
0 comments · 40 min read · LW link

[DELETED]

Cody @ Keeper · 6 Jul 2025 19:26 UTC
1 point
0 comments · 2 min read · LW link

A simple explanation of incomplete models

Cole Wyeth · 6 Jul 2025 19:09 UTC
19 points
1 comment · 5 min read · LW link

Neuroscientist survey says P(brain preservation works) is substantial

Mati_Roy · 6 Jul 2025 18:03 UTC
11 points
1 comment · 1 min read · LW link

Rational Animations’ video about scalable oversight and sandwiching

Writer · 6 Jul 2025 14:00 UTC
18 points
0 comments · 9 min read · LW link
(youtu.be)

New Paper: It is time to move on from MCQs for LLM Evaluations

shash42 · 6 Jul 2025 11:48 UTC
9 points
0 comments · 2 min read · LW link

[Question] How did you first understand cognitive biases? Looking for community experiences

Vladimir Loginov · 6 Jul 2025 10:48 UTC
8 points
3 comments · 1 min read · LW link

The Compulsion For (Pseudo-)Mechanisms

adamShimi · 6 Jul 2025 10:46 UTC
31 points
8 comments · 12 min read · LW link
(formethods.substack.com)

Nobody is Doing AI Benchmarking Right

Chapin Lenthall-Cleary · 6 Jul 2025 7:05 UTC
20 points
12 comments · 9 min read · LW link

From Unruly Stacks to Organized Shelves: Toy Model Validation of Structured Priors in Sparse Autoencoders

Yuxiao · 6 Jul 2025 7:03 UTC
8 points
0 comments · 5 min read · LW link

When the Smarter AI Lies Better: Can Debate-Based Oversight Catch Deceptive Code?

oskarkraak · 6 Jul 2025 1:21 UTC
4 points
0 comments · 5 min read · LW link
(oskarkraak.com)

Intelligence Futures

TheOtherSteven · 6 Jul 2025 1:19 UTC
13 points
3 comments · 7 min read · LW link
(syin.bearblog.dev)

Shutdown Resistance in Reasoning Models

6 Jul 2025 0:01 UTC
138 points
14 comments · 9 min read · LW link
(palisaderesearch.org)

The ultimate goal

Alvin Ånestrand · 5 Jul 2025 19:10 UTC
10 points
3 comments · 5 min read · LW link
(forecastingaifutures.substack.com)

Interview with Carl Feynman on Imminent AI Existential Risk

Liron · 5 Jul 2025 18:49 UTC
30 points
1 comment · 40 min read · LW link

Small foundational puzzle for causal theories of mechanistic interpretability

Frederik Hytting Jørgensen · 5 Jul 2025 17:46 UTC
6 points
6 comments · 2 min read · LW link

Essential LLM Assumes We’re Conscious—Outside Reasoner AGI Won’t

FlorianH · 5 Jul 2025 16:04 UTC
1 point
0 comments · 3 min read · LW link
(nearlyfar.org)

Masking on the Subway

jefftk · 5 Jul 2025 14:40 UTC
23 points
12 comments · 1 min read · LW link
(www.jefftk.com)