TamperSec is hiring for 3 Key Roles!

Jonathan_H · 28 Feb 2025 23:10 UTC
15 points
0 comments · 4 min read · LW link

Do we want alignment faking?

Florian_Dietz · 28 Feb 2025 21:50 UTC
7 points
4 comments · 1 min read · LW link

Few concepts mixing dark fantasy and science fiction

Marek Zegarek · 28 Feb 2025 21:03 UTC
0 points
0 comments · 3 min read · LW link

Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs

tenseisoham · 28 Feb 2025 20:22 UTC
3 points
0 comments · 9 min read · LW link

How to Contribute to Theoretical Reward Learning Research

Joar Skalse · 28 Feb 2025 19:27 UTC
16 points
0 comments · 21 min read · LW link

Other Papers About the Theory of Reward Learning

Joar Skalse · 28 Feb 2025 19:26 UTC
16 points
0 comments · 5 min read · LW link

Defining and Characterising Reward Hacking

Joar Skalse · 28 Feb 2025 19:25 UTC
15 points
0 comments · 4 min read · LW link

Misspecification in Inverse Reinforcement Learning—Part II

Joar Skalse · 28 Feb 2025 19:24 UTC
9 points
0 comments · 7 min read · LW link

STARC: A General Framework For Quantifying Differences Between Reward Functions

Joar Skalse · 28 Feb 2025 19:24 UTC
11 points
0 comments · 8 min read · LW link

Misspecification in Inverse Reinforcement Learning

Joar Skalse · 28 Feb 2025 19:24 UTC
19 points
0 comments · 11 min read · LW link

Partial Identifiability in Reward Learning

Joar Skalse · 28 Feb 2025 19:23 UTC
16 points
0 comments · 12 min read · LW link

The Theoretical Reward Learning Research Agenda: Introduction and Motivation

Joar Skalse · 28 Feb 2025 19:20 UTC
29 points
4 comments · 14 min read · LW link

An Open Letter To EA and AI Safety On Decelerating AI Development

kenneth_diao · 28 Feb 2025 17:21 UTC
8 points
0 comments · 14 min read · LW link
(graspingatwaves.substack.com)

Dance Weekend Pay II

jefftk · 28 Feb 2025 15:10 UTC
11 points
0 comments · 1 min read · LW link
(www.jefftk.com)

Existentialists and Trolleys

David Gross · 28 Feb 2025 14:01 UTC
5 points
3 comments · 7 min read · LW link

On Emergent Misalignment

Zvi · 28 Feb 2025 13:10 UTC
88 points
5 comments · 22 min read · LW link
(thezvi.wordpress.com)

Do safety-relevant LLM steering vectors optimized on a single example generalize?

Jacob Dunefsky · 28 Feb 2025 12:01 UTC
21 points
1 comment · 14 min read · LW link
(arxiv.org)

Tetherware #2: What every human should know about our most likely AI future

Jáchym Fibír · 28 Feb 2025 11:12 UTC
3 points
0 comments · 11 min read · LW link
(tetherware.substack.com)

Notes on Superwisdom & Moral RSI

welfvh · 28 Feb 2025 10:34 UTC
1 point
4 comments · 1 min read · LW link

Cycles (a short story by Claude 3.7 and me)

Knight Lee · 28 Feb 2025 7:04 UTC
9 points
0 comments · 5 min read · LW link

January-February 2025 Progress in Guaranteed Safe AI

Quinn · 28 Feb 2025 3:10 UTC
15 points
1 comment · 8 min read · LW link
(gsai.substack.com)

Exploring unfaithful/deceptive CoT in reasoning models

Lucy Wingard · 28 Feb 2025 2:54 UTC
4 points
0 comments · 6 min read · LW link

Weirdness Points

lsusr · 28 Feb 2025 2:23 UTC
64 points
19 comments · 3 min read · LW link

OpenAI releases GPT-4.5

Seth Herd · 27 Feb 2025 21:40 UTC
34 points
12 comments · 3 min read · LW link
(openai.com)

The Elicitation Game: Evaluating capability elicitation techniques

27 Feb 2025 20:33 UTC
10 points
1 comment · 2 min read · LW link

For the Sake of Pleasure Alone

Greenless Mirror · 27 Feb 2025 20:07 UTC
−1 points
17 comments · 12 min read · LW link

Keeping AI Subordinate to Human Thought: A Proposal for Public AI Conversations

syh · 27 Feb 2025 20:00 UTC
−1 points
0 comments · 1 min read · LW link
(medium.com)

How to Corner Liars: A Miasma-Clearing Protocol

ymeskhout · 27 Feb 2025 17:18 UTC
67 points
23 comments · 7 min read · LW link
(www.ymeskhout.com)

Economic Topology, ASI, and the Separation Equilibrium

mkualquiera · 27 Feb 2025 16:36 UTC
2 points
11 comments · 6 min read · LW link

The Illusion of Iterative Improvement: Why AI (and Humans) Fail to Track Their Own Epistemic Drift

Andy E Williams · 27 Feb 2025 16:26 UTC
1 point
3 comments · 4 min read · LW link

AI #105: Hey There Alexa

Zvi · 27 Feb 2025 14:30 UTC
31 points
3 comments · 40 min read · LW link
(thezvi.wordpress.com)

Space-Faring Civilization density estimates and models—Review

Maxime Riché · 27 Feb 2025 11:44 UTC
20 points
0 comments · 12 min read · LW link

Market Capitalization is Semantically Invalid

Zero Contradictions · 27 Feb 2025 11:27 UTC
3 points
14 comments · 3 min read · LW link
(thewaywardaxolotl.blogspot.com)

Proposing Human Survival Strategy based on the NAIA Vision: Toward the Co-evolution of Diverse Intelligences

Hiroshi Yamakawa · 27 Feb 2025 5:18 UTC
−2 points
0 comments · 11 min read · LW link

Short & long term tradeoffs of strategic voting

kaleb · 27 Feb 2025 4:25 UTC
2 points
0 comments · 8 min read · LW link

Recursive alignment with the principle of alignment

hive · 27 Feb 2025 2:34 UTC
12 points
4 comments · 15 min read · LW link
(hiveism.substack.com)

Kingfisher Tour February 2025

jefftk · 27 Feb 2025 2:20 UTC
9 points
0 comments · 4 min read · LW link
(www.jefftk.com)

You should use Consumer Reports

KvmanThinking · 27 Feb 2025 1:52 UTC
7 points
5 comments · 1 min read · LW link

Universal AI Maximizes Variational Empowerment: New Insights into AGI Safety

Yusuke Hayashi · 27 Feb 2025 0:46 UTC
14 points
1 comment · 4 min read · LW link

Why Can’t We Hypothesize After the Fact?

David Udell · 26 Feb 2025 22:41 UTC
40 points
3 comments · 2 min read · LW link

“AI Rapidly Gets Smarter, And Makes Some of Us Dumber,” from Sabine Hossenfelder

Evan_Gaensbauer · 26 Feb 2025 22:33 UTC
4 points
9 comments · 2 min read · LW link
(youtu.be)

METR: AI models can be dangerous before public deployment

UnofficialLinkpostBot · 26 Feb 2025 20:19 UTC
16 points
0 comments · 3 min read · LW link
(metr.org)

Representation Engineering has Its Problems, but None Seem Unsolvable

Lukasz G Bartoszcze · 26 Feb 2025 19:53 UTC
15 points
1 comment · 3 min read · LW link

Thoughts that prompt good forecasts: A survey

Daniel_Friedrich · 26 Feb 2025 18:36 UTC
1 point
0 comments · 1 min read · LW link

The non-tribal tribes

PatrickDFarley · 26 Feb 2025 17:22 UTC
24 points
4 comments · 16 min read · LW link

SAE Training Dataset Influence in Feature Matching and a Hypothesis on Position Features

Seonglae Cho · 26 Feb 2025 17:05 UTC
4 points
3 comments · 17 min read · LW link

Fuzzing LLMs sometimes makes them reveal their secrets

Fabien Roger · 26 Feb 2025 16:48 UTC
65 points
13 comments · 9 min read · LW link

You can just wear a suit

lsusr · 26 Feb 2025 14:57 UTC
139 points
59 comments · 2 min read · LW link

Matthew Yglesias—Misinformation Mostly Confuses Your Own Side

Siebe · 26 Feb 2025 14:55 UTC
10 points
1 comment · 1 min read · LW link
(www.slowboring.com)

Optimizing Feedback to Learn Faster

Towards_Keeperhood · 26 Feb 2025 14:24 UTC
12 points
0 comments · 2 min read · LW link