Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

RGRGRG 28 Jul 2023 20:44 UTC

26 points

5 comments 20 min read LW link

Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic

ojorgensen 28 Jul 2023 19:43 UTC

13 points

3 comments 13 min read LW link

When can we trust model evaluations?

evhub 28 Jul 2023 19:42 UTC

172 points

10 comments 10 min read LW link 1 review

Yes, It’s Subjective, But Why All The Crabs?

johnswentworth 28 Jul 2023 19:35 UTC

251 points

15 comments 6 min read LW link

Semaglutide and Muscle

5hout 28 Jul 2023 18:36 UTC

14 points

14 comments 5 min read LW link

Double Crux in a Box

Screwtape 28 Jul 2023 17:55 UTC

8 points

3 comments 1 min read LW link

Gradient descent might see the direction of the optimum from far away

Mikhail Samin 28 Jul 2023 16:19 UTC

78 points

13 comments 4 min read LW link

Progress links digest, 2023-07-28: The decadent opulence of modern capitalism

jasoncrawford 28 Jul 2023 14:36 UTC

16 points

3 comments 3 min read LW link

(rootsofprogress.org)

AI Awareness through Interaction with Blatantly Alien Models

VojtaKovarik 28 Jul 2023 8:41 UTC

7 points

5 comments 3 min read LW link

You don’t get to have cool flaws

Neil 28 Jul 2023 5:37 UTC

111 points

26 comments 2 min read LW link 3 reviews

Reducing sycophancy and improving honesty via activation steering

Nina Panickssery 28 Jul 2023 2:46 UTC

122 points

18 comments 9 min read LW link 1 review

Mech Interp Puzzle 2: Word2Vec Style Embeddings

Neel Nanda 28 Jul 2023 0:50 UTC

41 points

4 comments 2 min read LW link

ETFE windows

bhauth 28 Jul 2023 0:46 UTC

31 points

4 comments 2 min read LW link

(www.bhauth.com)

A Short Memo on AI Interpretability Rainbows

scasper 27 Jul 2023 23:05 UTC

18 points

0 comments 2 min read LW link

Pulling the Rope Sideways: Empirical Test Results

Daniel Kokotajlo 27 Jul 2023 22:18 UTC

63 points

18 comments 1 min read LW link

A $10k retroactive grant for VaccinateCA

Austin Chen 27 Jul 2023 18:14 UTC

82 points

0 comments 6 min read LW link

(manifund.org)

Preference Aggregation as Bayesian Inference

beren 27 Jul 2023 17:59 UTC

14 points

1 comment 1 min read LW link

AI #22: Into the Weeds

Zvi 27 Jul 2023 17:40 UTC

49 points

8 comments 84 min read LW link

(thezvi.wordpress.com)

SSA rejects anthropic shadow, too

jessicata 27 Jul 2023 17:25 UTC

83 points

39 comments 11 min read LW link

(unstableontology.com)

[Question] What are examples of someone doing a lot of work to find the best of something?

chanamessinger 27 Jul 2023 15:58 UTC

29 points

16 comments 1 min read LW link

AI-Plans.com 10-day Critique-a-Thon

Iknownothing 27 Jul 2023 11:44 UTC

8 points

2 comments 2 min read LW link

(manifund.org)

Privacy in a Digital World

Faustify 27 Jul 2023 10:46 UTC

2 points

0 comments 5 min read LW link

Cultivating a state of mind where new ideas are born

Henrik Karlsson 27 Jul 2023 9:16 UTC

263 points

21 comments 14 min read LW link 2 reviews

(www.henrikkarlsson.xyz)

Partial Transcript of Recent Senate Hearing Discussing AI X-Risk

Daniel_Eth 27 Jul 2023 9:16 UTC

55 points

0 comments 22 min read LW link

(medium.com)

AXRP Episode 24 - Superalignment with Jan Leike

DanielFilan 27 Jul 2023 4:00 UTC

55 points

3 comments 69 min read LW link

AXRP Episode 23 - Mechanistic Anomaly Detection with Mark Xu

DanielFilan 27 Jul 2023 1:50 UTC

22 points

0 comments 72 min read LW link

GPT-4 can catch subtle cross-language translation mistakes

Michael Tontchev 27 Jul 2023 1:39 UTC

7 points

1 comment 1 min read LW link

Social Balance through Embracing Social Credit

dhruvv 26 Jul 2023 20:07 UTC

−39 points

9 comments 3 min read LW link

Why no Roman Industrial Revolution?

jasoncrawford 26 Jul 2023 19:34 UTC

62 points

30 comments 3 min read LW link

(rootsofprogress.org)

Why you can’t treat decidability and complexity as a constant (Post #1)

Noosphere89 26 Jul 2023 17:54 UTC

6 points

13 comments 5 min read LW link

A response to the Richards et al.’s “The Illusion of AI’s Existential Risk”

Harrison Fell 26 Jul 2023 17:34 UTC

1 point

0 comments 10 min read LW link

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

Buck and ryan_greenblatt

26 Jul 2023 17:02 UTC

101 points

19 comments 1 min read LW link 1 review

Neuronpedia

Johnny Lin 26 Jul 2023 16:29 UTC

135 points

51 comments 2 min read LW link

(neuronpedia.org)

Frontier Model Forum

Zach Stein-Perlman 26 Jul 2023 14:30 UTC

27 points

0 comments 4 min read LW link

(blog.google)

Podcasts: Future of Life Institute, Breakthrough Science Summit panel

jasoncrawford 26 Jul 2023 14:28 UTC

8 points

0 comments 1 min read LW link

(rootsofprogress.org)

Llama We Doing This Again?

Zvi 26 Jul 2023 13:00 UTC

48 points

3 comments 16 min read LW link

(thezvi.wordpress.com)

Frontier Model Security

Vaniver 26 Jul 2023 4:48 UTC

32 points

1 comment 3 min read LW link

(www.anthropic.com)

The First Room-Temperature Ambient-Pressure Superconductor

Annapurna 26 Jul 2023 2:27 UTC

35 points

28 comments 1 min read LW link

(arxiv.org)

Underwater Torture Chambers: The Horror Of Fish Farming

Bentham's Bulldog 26 Jul 2023 0:27 UTC

78 points

50 comments 10 min read LW link 1 review

Contra Alexander on the Bitter Lesson and IQ

Andrew Keenan Richardson 26 Jul 2023 0:07 UTC

9 points

1 comment 4 min read LW link

(mechanisticmind.com)

Overcoming the MWC

Mark Freed 25 Jul 2023 17:31 UTC

3 points

0 comments 3 min read LW link

Russian parliamentarian: let’s ban personal computers and the Internet

RomanS 25 Jul 2023 17:30 UTC

11 points

6 comments 2 min read LW link

AISN #16: White House Secures Voluntary Commitments from Leading AI Labs and Lessons from Oppenheimer

Corin Katzke and Dan H

25 Jul 2023 16:58 UTC

6 points

0 comments 6 min read LW link

(newsletter.safe.ai)

“The Universe of Minds”—call for reviewers (Seeds of Science)

rogersbacon 25 Jul 2023 16:53 UTC

7 points

0 comments 1 min read LW link

Thoughts on Loss Landscapes and why Deep Learning works

beren 25 Jul 2023 16:41 UTC

54 points

4 comments 18 min read LW link

Should you work at a leading AI lab? (including in non-safety roles)

Benjamin Hilton 25 Jul 2023 16:29 UTC

7 points

0 comments 12 min read LW link

Whisper’s Word-Level Timestamps are Out

Varshul Gupta 25 Jul 2023 14:32 UTC

−18 points

2 comments 2 min read LW link

(dubverseblack.substack.com)

AIS 101: Task decomposition for scalable oversight

Charbel-Raphaël 25 Jul 2023 13:34 UTC

35 points

0 comments 19 min read LW link

(docs.google.com)

Anthropic Observations

Zvi 25 Jul 2023 12:50 UTC

104 points

1 comment 10 min read LW link

(thezvi.wordpress.com)

Autonomous Alignment Oversight Framework (AAOF)

Justausername 25 Jul 2023 10:25 UTC

−9 points

0 comments 4 min read LW link