On second thought, prompt injections are probably examples of misalignment

lc 20 Feb 2023 23:56 UTC

22 points

5 comments 1 min read LW link

Nothing Is Ever Taught Correctly

LVSN 20 Feb 2023 22:31 UTC

5 points

3 comments 1 min read LW link

Behavioral and mechanistic definitions (often confuse AI alignment discussions)

LawrenceC 20 Feb 2023 21:33 UTC

33 points

5 comments 6 min read LW link

Validator models: A simple approach to detecting goodharting

beren 20 Feb 2023 21:32 UTC

14 points

1 comment 4 min read LW link

There are no coherence theorems

Dan H and EJT

20 Feb 2023 21:25 UTC

155 points

130 comments 19 min read LW link 1 review

[Question] Are there any AI safety relevant fully remote roles suitable for someone with 2-3 years of machine learning engineering industry experience?

Malleable_shape 20 Feb 2023 19:57 UTC

7 points

2 comments 1 min read LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

StefanHex and Jett Janiak

20 Feb 2023 19:35 UTC

96 points

8 comments 21 min read LW link

Sydney the Bingenator Can’t Think, But It Still Threatens People

Valentin Baltadzhiev 20 Feb 2023 18:37 UTC

−3 points

2 comments 8 min read LW link

EIS IX: Interpretability and Adversaries

scasper 20 Feb 2023 18:25 UTC

30 points

8 comments 8 min read LW link

What AI companies can do today to help with the most important century

HoldenKarnofsky 20 Feb 2023 17:00 UTC

38 points

3 comments 9 min read LW link

(www.cold-takes.com)

Bankless Podcast: 159 - We’re All Gonna Die with Eliezer Yudkowsky

bayesed 20 Feb 2023 16:42 UTC

83 points

54 comments 1 min read LW link

(www.youtube.com)

Speculative Technologies launch and Ben Reinhardt AMA

jasoncrawford 20 Feb 2023 16:33 UTC

16 points

0 comments 1 min read LW link

(rootsofprogress.org)

[MLSN #8] Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming

Dan H and TW123

20 Feb 2023 15:54 UTC

20 points

0 comments 4 min read LW link

(newsletter.mlsafety.org)

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher King 20 Feb 2023 15:11 UTC

27 points

15 comments 1 min read LW link

Metaculus Introduces New ‘Conditional Pair’ Forecast Questions for Making Conditional Predictions

ChristianWilliams 20 Feb 2023 13:36 UTC

40 points

0 comments 2 min read LW link

(www.metaculus.com)

On Investigating Conspiracy Theories

Zvi 20 Feb 2023 12:50 UTC

117 points

38 comments 5 min read LW link

(thezvi.wordpress.com)

The Estimation Game: a monthly Fermi estimation web app

Sage Future and Adam B

20 Feb 2023 11:33 UTC

20 points

2 comments 1 min read LW link

The idea that ChatGPT is simply “predicting” the next word is, at best, misleading

Bill Benzon 20 Feb 2023 11:32 UTC

55 points

88 comments 5 min read LW link

Russell Conjugations list & voting thread

Daniel Kokotajlo 20 Feb 2023 6:39 UTC

23 points

64 comments 1 min read LW link

Emergent Deception and Emergent Optimization

jsteinhardt 20 Feb 2023 2:40 UTC

64 points

0 comments 14 min read LW link

(bounded-regret.ghost.io)

AGI doesn’t need understanding, intention, or consciousness in order to kill us, only intelligence

James Blaha 20 Feb 2023 0:55 UTC

10 points

2 comments 18 min read LW link

Remote AI Alignment Overhang?

tryactions 19 Feb 2023 22:30 UTC

37 points

5 comments 4 min read LW link

A Neural Network undergoing Gradient-based Training as a Complex System

carboniferous_umbraculum 19 Feb 2023 22:08 UTC

22 points

1 comment 19 min read LW link

Another Way to Be Okay

Gretta Duleba 19 Feb 2023 20:49 UTC

109 points

15 comments 6 min read LW link

A Way To Be Okay

Duncan Sabien (Inactive) 19 Feb 2023 20:27 UTC

110 points

38 comments 10 min read LW link 1 review

Exploring Lily’s world with ChatGPT [things an AI won’t do]

Bill Benzon 19 Feb 2023 16:39 UTC

5 points

0 comments 20 min read LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper 19 Feb 2023 15:25 UTC

30 points

5 comments 4 min read LW link

Does novel understanding imply novel agency / values?

TsviBT 19 Feb 2023 14:41 UTC

18 points

0 comments 7 min read LW link

Navigating public AI x-risk hype while pursuing technical solutions

Dan Braun 19 Feb 2023 12:22 UTC

18 points

0 comments 2 min read LW link

Somewhat against “just update all the way”

tailcalled 19 Feb 2023 10:49 UTC

31 points

10 comments 2 min read LW link

Human beats SOTA Go AI by learning an adversarial policy

Vanessa Kosoy 19 Feb 2023 9:38 UTC

59 points

29 comments 1 min read LW link

(goattack.far.ai)

Degamification

Nate Showell 19 Feb 2023 5:35 UTC

23 points

3 comments 2 min read LW link

Stop posting prompt injections on Twitter and calling it “misalignment”

lc 19 Feb 2023 2:21 UTC

146 points

9 comments 1 min read LW link

AGI in sight: our look at the game board

Andrea_Miotti and Gabriel Alfour

18 Feb 2023 22:17 UTC

228 points

135 comments 6 min read LW link

(andreamiotti.substack.com)

We should be signal-boosting anti Bing chat content

mbrooks 18 Feb 2023 18:52 UTC

−4 points

13 comments 2 min read LW link

Can talk, can think, can suffer.

Ilio 18 Feb 2023 18:43 UTC

1 point

8 comments 3 min read LW link

Parametrically retargetable decision-makers tend to seek power

TurnTrout 18 Feb 2023 18:41 UTC

172 points

10 comments 2 min read LW link

(arxiv.org)

Near-Term Risks of an Obedient Artificial Intelligence

ymeskhout 18 Feb 2023 18:30 UTC

20 points

1 comment 6 min read LW link

EIS VII: A Challenge for Mechanists

scasper 18 Feb 2023 18:27 UTC

36 points

4 comments 3 min read LW link

Reading Speed Exists!

Johannes C. Mayer 18 Feb 2023 15:30 UTC

12 points

9 comments 1 min read LW link

The Practitioner’s Path 2.0: the Meditative Archetype

Evenflair 18 Feb 2023 15:23 UTC

14 points

1 comment 2 min read LW link

(guildoftherose.org)

Should we cry “wolf”?

Tapatakt 18 Feb 2023 11:24 UTC

24 points

5 comments 1 min read LW link

[Question] Name of the fallacy of assuming an extreme value (e.g. 0) with the illusion of ‘avoiding to have to make an assumption’?

FlorianH 18 Feb 2023 8:11 UTC

4 points

1 comment 1 min read LW link

I Think We’re Approaching The Bitter Lesson’s Asymptote

SomeoneYouOnceKnew 18 Feb 2023 5:33 UTC

−3 points

9 comments 5 min read LW link

Bus-Only Bus Lane Enforcement

jefftk 18 Feb 2023 2:50 UTC

19 points

15 comments 1 min read LW link

(www.jefftk.com)

Run Head on Towards the Falling Tears

Johannes C. Mayer 18 Feb 2023 1:33 UTC

6 points

0 comments 2 min read LW link

Two problems with ‘Simulators’ as a frame

ryan_greenblatt 17 Feb 2023 23:34 UTC

79 points

13 comments 5 min read LW link

GPT-4 Predictions

Stephen McAleese 17 Feb 2023 23:20 UTC

112 points

27 comments 11 min read LW link

On Board Vision, Hollow Words, and the End of the World

Marcello 17 Feb 2023 23:18 UTC

52 points

27 comments 5 min read LW link

PICT: A Zero-Shot Prompt Template to Automate Evaluation

Quentin FEUILLADE--MONTIXI 17 Feb 2023 23:16 UTC

17 points

1 comment 11 min read LW link