A proof of inner Löb’s theorem

James Payor 21 Feb 2023 21:11 UTC

13 points

0 comments 2 min read LW link

Fighting For Our Lives—What Ordinary People Can Do

TinkerBird 21 Feb 2023 20:36 UTC

14 points

18 comments 4 min read LW link

The Emotional Type of a Decision

moridinamael 21 Feb 2023 20:35 UTC

13 points

0 comments 4 min read LW link

What is it like doing AI safety work?

KatWoods 21 Feb 2023 20:12 UTC

57 points

2 comments 10 min read LW link

Pretraining Language Models with Human Preferences

Tomek Korbak, Sam Bowman and Ethan Perez

21 Feb 2023 17:57 UTC

135 points

20 comments 11 min read LW link 2 reviews

A Stranger Priority? Topics at the Outer Reaches of Effective Altruism (my dissertation)

Joe Carlsmith 21 Feb 2023 17:26 UTC

38 points

16 comments 1 min read LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper 21 Feb 2023 16:59 UTC

14 points

4 comments 3 min read LW link

No Room for Political Philosophy

Arturo Macias 21 Feb 2023 16:11 UTC

−1 points

7 comments 3 min read LW link

Deceptive Alignment is <1% Likely by Default

DavidW 21 Feb 2023 15:09 UTC

89 points

31 comments 14 min read LW link 1 review

AI #1: Sydney and Bing

Zvi 21 Feb 2023 14:00 UTC

171 points

45 comments 61 min read LW link 1 review

(thezvi.wordpress.com)

You’re not a simulation, ’cause you’re hallucinating

Stuart_Armstrong 21 Feb 2023 12:12 UTC

25 points

6 comments 1 min read LW link

Basic facts about language models during training

beren 21 Feb 2023 11:46 UTC

99 points

15 comments 18 min read LW link

[Preprint] Pretraining Language Models with Human Preferences

Giulio 21 Feb 2023 11:44 UTC

12 points

0 comments 1 min read LW link

(arxiv.org)

Breaking the Optimizer’s Curse, and Consequences for Existential Risks and Value Learning

Roger Dearnaley 21 Feb 2023 9:05 UTC

10 points

1 comment 23 min read LW link

Medlife Crisis: “Why Do People Keep Falling For Things That Don’t Work?”

RomanHauksson 21 Feb 2023 6:22 UTC

12 points

5 comments 1 min read LW link

(www.youtube.com)

A foundation model approach to value inference

sen 21 Feb 2023 5:09 UTC

6 points

0 comments 3 min read LW link

Instrumentality makes agents agenty

porby 21 Feb 2023 4:28 UTC

21 points

7 comments 6 min read LW link

Gamified narrow reverse imitation learning

TekhneMakre 21 Feb 2023 4:26 UTC

8 points

0 comments 2 min read LW link

Feelings are Good, Actually

Gordon Seidoh Worley 21 Feb 2023 2:38 UTC

18 points

1 comment 4 min read LW link

AI alignment researchers don’t (seem to) stack

So8res 21 Feb 2023 0:48 UTC

194 points

40 comments 3 min read LW link

EA & LW Forum Weekly Summary (6th − 19th Feb 2023)

Zoe Williams 21 Feb 2023 0:26 UTC

8 points

0 comments 14 min read LW link

What to think when a language model tells you it’s sentient

Robbo 21 Feb 2023 0:01 UTC

9 points

6 comments 6 min read LW link

On second thought, prompt injections are probably examples of misalignment

lc 20 Feb 2023 23:56 UTC

22 points

5 comments 1 min read LW link

Nothing Is Ever Taught Correctly

LVSN 20 Feb 2023 22:31 UTC

5 points

3 comments 1 min read LW link

Behavioral and mechanistic definitions (often confuse AI alignment discussions)

LawrenceC 20 Feb 2023 21:33 UTC

33 points

5 comments 6 min read LW link

Validator models: A simple approach to detecting goodharting

beren 20 Feb 2023 21:32 UTC

14 points

1 comment 4 min read LW link

There are no coherence theorems

Dan H and EJT

20 Feb 2023 21:25 UTC

155 points

130 comments 19 min read LW link 1 review

[Question] Are there any AI safety relevant fully remote roles suitable for someone with 2-3 years of machine learning engineering industry experience?

Malleable_shape 20 Feb 2023 19:57 UTC

7 points

2 comments 1 min read LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

StefanHex and Jett Janiak

20 Feb 2023 19:35 UTC

96 points

8 comments 21 min read LW link

Sydney the Bingenator Can’t Think, But It Still Threatens People

Valentin Baltadzhiev 20 Feb 2023 18:37 UTC

−3 points

2 comments 8 min read LW link

EIS IX: Interpretability and Adversaries

scasper 20 Feb 2023 18:25 UTC

30 points

8 comments 8 min read LW link

What AI companies can do today to help with the most important century

HoldenKarnofsky 20 Feb 2023 17:00 UTC

38 points

3 comments 9 min read LW link

(www.cold-takes.com)

Bankless Podcast: 159 - We’re All Gonna Die with Eliezer Yudkowsky

bayesed 20 Feb 2023 16:42 UTC

83 points

54 comments 1 min read LW link

(www.youtube.com)

Speculative Technologies launch and Ben Reinhardt AMA

jasoncrawford 20 Feb 2023 16:33 UTC

16 points

0 comments 1 min read LW link

(rootsofprogress.org)

[MLSN #8] Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming

Dan H and TW123

20 Feb 2023 15:54 UTC

20 points

0 comments 4 min read LW link

(newsletter.mlsafety.org)

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher King 20 Feb 2023 15:11 UTC

27 points

15 comments 1 min read LW link

Metaculus Introduces New ‘Conditional Pair’ Forecast Questions for Making Conditional Predictions

ChristianWilliams 20 Feb 2023 13:36 UTC

40 points

0 comments 2 min read LW link

(www.metaculus.com)

On Investigating Conspiracy Theories

Zvi 20 Feb 2023 12:50 UTC

117 points

38 comments 5 min read LW link

(thezvi.wordpress.com)

The Estimation Game: a monthly Fermi estimation web app

Sage Future and Adam B

20 Feb 2023 11:33 UTC

20 points

2 comments 1 min read LW link

The idea that ChatGPT is simply “predicting” the next word is, at best, misleading

Bill Benzon 20 Feb 2023 11:32 UTC

55 points

88 comments 5 min read LW link

Russell Conjugations list & voting thread

Daniel Kokotajlo 20 Feb 2023 6:39 UTC

23 points

64 comments 1 min read LW link

Emergent Deception and Emergent Optimization

jsteinhardt 20 Feb 2023 2:40 UTC

64 points

0 comments 14 min read LW link

(bounded-regret.ghost.io)

AGI doesn’t need understanding, intention, or consciousness in order to kill us, only intelligence

James Blaha 20 Feb 2023 0:55 UTC

10 points

2 comments 18 min read LW link

Remote AI Alignment Overhang?

tryactions 19 Feb 2023 22:30 UTC

37 points

5 comments 4 min read LW link

A Neural Network undergoing Gradient-based Training as a Complex System

carboniferous_umbraculum 19 Feb 2023 22:08 UTC

22 points

1 comment 19 min read LW link

Another Way to Be Okay

Gretta Duleba 19 Feb 2023 20:49 UTC

109 points

15 comments 6 min read LW link

A Way To Be Okay

Duncan Sabien (Inactive) 19 Feb 2023 20:27 UTC

110 points

38 comments 10 min read LW link 1 review

Exploring Lily’s world with ChatGPT [things an AI won’t do]

Bill Benzon 19 Feb 2023 16:39 UTC

5 points

0 comments 20 min read LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper 19 Feb 2023 15:25 UTC

30 points

5 comments 4 min read LW link

Does novel understanding imply novel agency / values?

TsviBT 19 Feb 2023 14:41 UTC

18 points

0 comments 7 min read LW link