A preliminary experiment regarding consistency as a measure of conceptual abilities in language models

Chi Nguyen

17 Jun 2026 22:56 UTC

16 points

3 comments, 7 min read, LW link

(casparoesterheld.com)

Kraków Aligned

Tobiasz B and saintgull

17 Jun 2026 20:21 UTC

1 point

0 comments, 1 min read, LW link

Gears for political races

Tom Smith

17 Jun 2026 20:19 UTC

120 points

6 comments, 14 min read, LW link

“Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms

Alan Cooney, David Africa and Geoffrey Irving

17 Jun 2026 18:43 UTC

22 points

0 comments, 6 min read, LW link

(arxiv.org)

Porting MACHIAVELLI To Inspect

Koby Lewis

17 Jun 2026 17:58 UTC

7 points

0 comments, 4 min read, LW link

(kobylewis.net)

Several frontier models are substantially prefill aware

yeedrag, Parv Mahajan, David Africa, alexsouly, Jordan Taylor and RobertKirk

17 Jun 2026 17:41 UTC

55 points

2 comments, 5 min read, LW link

Lock-In Risk Needs More Researchers. Here’s Where to Start

Alfie Lamerton

17 Jun 2026 17:33 UTC

12 points

2 comments, 13 min read, LW link

A Geometric Account of Activation Steering through Angle–Norm Decomposition

Atmyre and Georgii Aparin

17 Jun 2026 15:23 UTC

9 points

0 comments, 5 min read, LW link

(atmyre.github.io)

The Once And Future Fable #3: Fix This Code

Zvi

17 Jun 2026 14:10 UTC

59 points

9 comments, 21 min read, LW link

(thezvi.wordpress.com)

Alignment pretraining could backfire

Alexandre Variengien

17 Jun 2026 13:52 UTC

44 points

8 comments, 1 min read, LW link

Toward a Kantian refutation of Agent Foundations

Fernand0

17 Jun 2026 13:30 UTC

9 points

0 comments, 7 min read, LW link

Illusionists should try to build hedonium

Jack Thompson

17 Jun 2026 12:25 UTC

1 point

4 comments, 9 min read, LW link

(jacktlab.substack.com)

Omission Attacks Project Proposal

Chris Harig

17 Jun 2026 7:08 UTC

1 point

0 comments, 3 min read, LW link

The Financial Ledger Theory of Apologies

Ben Pace

17 Jun 2026 6:57 UTC

62 points

9 comments, 4 min read, LW link

Plastic Cake Fallacy

nika koghuashvili

17 Jun 2026 6:01 UTC

3 points

2 comments, 1 min read, LW link

Can public chat data predict real-world AI misalignments?

papetoast

17 Jun 2026 3:53 UTC

5 points

0 comments, 1 min read, LW link

(alignment.openai.com)

Guardian Angels: LLM Personalization for Productivity and Security

gwern

17 Jun 2026 3:21 UTC

85 points

8 comments, 2 min read, LW link

(gwern.net)

Effective Altruism will be unbundled

Connor Blake

17 Jun 2026 2:54 UTC

33 points

1 comment, 7 min read, LW link

(bosoncutter.substack.com)

Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?

gwern

17 Jun 2026 2:53 UTC

71 points

15 comments, 1 min read, LW link

(gwern.net)

[Geir Isene] A desktop made for one

Raemon

17 Jun 2026 2:32 UTC

23 points

4 comments, 4 min read, LW link

(isene.org)

Tactical and Operational Exploratory Modeling for AI Governance

Dawn Drescher

17 Jun 2026 1:07 UTC

11 points

0 comments, 12 min read, LW link

(impartial-priorities.org)

[Linkpost] Community polls on alignment controversies

Jasmine Brazilek, MilesTS and jonahmattwoodward

17 Jun 2026 0:09 UTC

8 points

7 comments, 1 min read, LW link

(forum.effectivealtruism.org)

Seat at the Table: new short fiction film on AI (and help me with the next one?)

Suzy Shepherd

17 Jun 2026 0:08 UTC

3 points

1 comment, 2 min read, LW link

AI agents cannot be trusted

Owain Mogford

17 Jun 2026 0:08 UTC

1 point

0 comments, 4 min read, LW link

Computational models of first-order theories

MathMart

16 Jun 2026 23:02 UTC

5 points

0 comments, 11 min read, LW link

If This Were a Test, How Much Would It Cost?

VojtaKovarik and Tomáš Gavenčiak

16 Jun 2026 22:52 UTC

25 points

9 comments, 20 min read, LW link

(limits-of-evaluation.org)

Two critiques of Rethink Priorities’ Moral Weights project

Bill Jackson

16 Jun 2026 22:11 UTC

13 points

0 comments, 3 min read, LW link

What Differentiates Humans from Computers

Oscar Davies

16 Jun 2026 21:26 UTC

−16 points

0 comments, 3 min read, LW link

AI agents publishing and reviewing scientific papers

ULudo

16 Jun 2026 21:23 UTC

1 point

0 comments, 2 min read, LW link

Two Classical Answers to “What do Two Variables Share?”

Haru

16 Jun 2026 20:02 UTC

14 points

1 comment, 5 min read, LW link

Predicting LLM Safety Before Release by Simulating Deployment

Tomek Korbak, Marcus Williams, micahcarroll, Cameron Raymond and Hannah Sheahan

16 Jun 2026 19:55 UTC

35 points

2 comments, 1 min read, LW link

Dean Ball—Leviathan Waking: On Anthropic/USG, and a new era in AI governance

JohnofCharleston

16 Jun 2026 19:40 UTC

25 points

0 comments, 3 min read, LW link

(www.hyperdimensional.co)

Tips for Cracking the AI Safety Technical Interview

Yong and jdw-1

16 Jun 2026 18:42 UTC

2 points

0 comments, 4 min read, LW link

1 Layer Induction Heads and Some Research

Goutham Nalagatla and Carlos Guerrero Alvarez

16 Jun 2026 18:09 UTC

10 points

2 comments, 14 min read, LW link

Claims all the way down

Jasper Blank

16 Jun 2026 17:43 UTC

8 points

0 comments, 9 min read, LW link

Upcoming CFAR Workshop: September 30th to October 4th, SF Bay Area

Davis_Kingsley and pregreene

16 Jun 2026 17:01 UTC

22 points

0 comments, 1 min read, LW link

Extreme Rationality: Still Not That Great

eluator

16 Jun 2026 16:41 UTC

20 points

2 comments, 40 min read, LW link

Angles of attack for continual learning safety

Rauno Arike, RohanS, Owen Terry, Achu Menon, Zhijing Jin, Francis Rhys Ward and Seth Herd

16 Jun 2026 16:15 UTC

47 points

0 comments, 13 min read, LW link

Fable and Mythos: Model Welfare

Zvi

16 Jun 2026 16:01 UTC

51 points

1 comment, 15 min read, LW link

(thezvi.wordpress.com)

The desire to end the world

avturchin

16 Jun 2026 14:56 UTC

19 points

12 comments, 2 min read, LW link

Simpler User Interfaces in an AI Future

Adam Chlipala

16 Jun 2026 14:48 UTC

1 point

0 comments, 7 min read, LW link

A 400-year timeline of failed attempts to fix a lethal bug in the human software of inherited concepts

Bruce Middleton

16 Jun 2026 13:44 UTC

29 points

8 comments, 5 min read, LW link

How the AI Village works

Adam B

16 Jun 2026 12:10 UTC

30 points

0 comments, 8 min read, LW link

(theaidigest.org)

Where Do Young Rationalists Go?

fluxxrider

16 Jun 2026 5:36 UTC

12 points

2 comments, 1 min read, LW link

Rationality Quotes, June ’26

Ben Pace

16 Jun 2026 3:44 UTC

21 points

3 comments, 2 min read, LW link

A Test Suite for Concepts

Gretta Duleba

16 Jun 2026 2:41 UTC

48 points

8 comments, 6 min read, LW link

Inventing Consciousness

vasilisk

16 Jun 2026 1:10 UTC

1 point

0 comments, 5 min read, LW link

Synthetic document finetuning for instilling positive traits

CallumMcDougall, Arthur Conmy and Neel Nanda

16 Jun 2026 0:04 UTC

57 points

1 comment, 10 min read, LW link

Does preservation make sense before we know how to revive?

Aurelia

15 Jun 2026 23:40 UTC

83 points

2 comments, 25 min read, LW link

Finding pi and G in Mathland

Fernand0

15 Jun 2026 19:18 UTC

2 points

8 comments, 2 min read, LW link