Timaeus is hiring!

Jul 12, 2024, 11:42 PM
67 points
6 comments, 2 min read, LW link

Friendship is transactional, unconditional friendship is insurance

Ruby, Jul 17, 2024, 10:52 PM
67 points
24 comments, 2 min read, LW link

SAEs (usually) Transfer Between Base and Chat Models

Jul 18, 2024, 10:29 AM
67 points
0 comments, 10 min read, LW link

Open Source Automated Interpretability for Sparse Autoencoder Features

Jul 30, 2024, 9:11 PM
67 points
1 comment, 13 min read, LW link
(blog.eleuther.ai)

Advice to junior AI governance researchers

Orpheus16, Jul 8, 2024, 7:19 PM
66 points
1 comment, 5 min read, LW link

Static Analysis As A Lifestyle

adamShimi, Jul 3, 2024, 6:29 PM
65 points
11 comments, 3 min read, LW link
(epistemologicalfascinations.substack.com)

[Interim research report] Activation plateaus & sensitive directions in GPT2

Jul 5, 2024, 5:05 PM
65 points
2 comments, 5 min read, LW link

A “Bitter Lesson” Approach to Aligning AGI and ASI

RogerDearnaley, Jul 6, 2024, 1:23 AM
63 points
41 comments, 24 min read, LW link

Ice: The Penultimate Frontier

Roko, Jul 13, 2024, 11:44 PM
63 points
56 comments, 1 min read, LW link
(transhumanaxiology.substack.com)

RTFB: California’s AB 3211

Zvi, Jul 30, 2024, 1:10 PM
62 points
2 comments, 11 min read, LW link
(thezvi.wordpress.com)

Consider the humble rock (or: why the dumb thing kills you)

pleiotroth, Jul 4, 2024, 1:54 PM
62 points
11 comments, 4 min read, LW link

Linkpost: Surely you can be serious

kave, Jul 18, 2024, 10:18 PM
62 points
8 comments, 1 min read, LW link
(www.experimental-history.com)

A framework for thinking about AI power-seeking

Joe Carlsmith, Jul 24, 2024, 10:41 PM
62 points
15 comments, 16 min read, LW link

BatchTopK: A Simple Improvement for TopK-SAEs

Jul 20, 2024, 2:20 AM
61 points
0 comments, 4 min read, LW link

Inspired by: Failures in Kindness

X4vier, Jul 27, 2024, 1:21 AM
60 points
2 comments, 3 min read, LW link

Feature Targeted LLC Estimation Distinguishes SAE Features from Random Directions

Jul 19, 2024, 8:32 PM
59 points
6 comments, 16 min read, LW link

Towards shutdownable agents via stochastic choice

Jul 8, 2024, 10:14 AM
59 points
11 comments, 23 min read, LW link
(arxiv.org)

Pacing Outside the Box: RNNs Learn to Plan in Sokoban

Jul 25, 2024, 10:00 PM
59 points
8 comments, 2 min read, LW link
(arxiv.org)

[EAForum xpost] A breakdown of OpenAI’s revenue

Jul 10, 2024, 6:09 PM
57 points
5 comments, 1 min read, LW link
(forum.effectivealtruism.org)

AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0

Jul 6, 2024, 11:34 AM
57 points
7 comments, 6 min read, LW link

Coalitional agency

Richard_Ngo, Jul 22, 2024, 12:09 AM
56 points
6 comments, 6 min read, LW link

Unlocking Solutions—By Understanding Coordination Problems

James Stephen Brown, Jul 27, 2024, 4:52 AM
56 points
4 comments, 5 min read, LW link
(nonzerosum.games)

How the AI safety technical landscape has changed in the last year, according to some practitioners

tlevin, Jul 26, 2024, 7:06 PM
55 points
6 comments, 2 min read, LW link

Unlearning via RMU is mostly shallow

Jul 23, 2024, 4:07 PM
54 points
4 comments, 6 min read, LW link

Causal Graphs of GPT-2-Small’s Residual Stream

David Udell, Jul 9, 2024, 10:06 PM
53 points
7 comments, 7 min read, LW link

Breaking Circuit Breakers

Jul 14, 2024, 6:57 PM
53 points
13 comments, 1 min read, LW link
(confirmlabs.org)

AI #71: Farewell to Chevron

Zvi, Jul 4, 2024, 1:40 PM
53 points
9 comments, 36 min read, LW link
(thezvi.wordpress.com)

Sherlockian Abduction Master List

Cole Wyeth, Jul 11, 2024, 8:27 PM
52 points
66 comments, 36 min read, LW link

Llama Llama-3-405B?

Zvi, Jul 24, 2024, 7:40 PM
51 points
9 comments, 30 min read, LW link
(thezvi.wordpress.com)

Consent across power differentials

Ramana Kumar, Jul 9, 2024, 11:42 AM
50 points
12 comments, 3 min read, LW link

DM Parenting

Shoshannah Tekofsky, Jul 16, 2024, 8:50 AM
50 points
4 comments, 5 min read, LW link
(kidquest.substack.com)

How do we know that “good research” is good? (aka “direct evaluation” vs “eigen-evaluation”)

Ruby, Jul 19, 2024, 12:31 AM
49 points
21 comments, 6 min read, LW link

JumpReLU SAEs + Early Access to Gemma 2 SAEs

Jul 19, 2024, 4:10 PM
49 points
10 comments, 1 min read, LW link
(storage.googleapis.com)

On scalable oversight with weak LLMs judging strong LLMs

Jul 8, 2024, 8:59 AM
49 points
18 comments, 7 min read, LW link
(arxiv.org)

Why the Best Writers Endure Isolation

Declan Molony, Jul 16, 2024, 5:58 AM
49 points
6 comments, 2 min read, LW link

Misnaming and Other Issues with OpenAI’s “Human Level” Superintelligence Hierarchy

Davidmanheim, Jul 15, 2024, 5:50 AM
49 points
2 comments, 3 min read, LW link

Caring about excellence

owencb, Jul 22, 2024, 2:24 PM
47 points
4 comments, LW link

Robin Hanson AI X-Risk Debate — Highlights and Analysis

Liron, Jul 12, 2024, 9:31 PM
46 points
7 comments, 45 min read, LW link
(www.youtube.com)

Games for AI Control

Jul 11, 2024, 6:40 PM
45 points
0 comments, 5 min read, LW link

AI #72: Denying the Future

Zvi, Jul 11, 2024, 3:00 PM
45 points
8 comments, 41 min read, LW link
(thezvi.wordpress.com)

We ran an AI safety conference in Tokyo. It went really well. Come next year!

Blaine, Jul 17, 2024, 6:55 AM
45 points
1 comment, 6 min read, LW link

Why Georgism Lost Its Popularity

Zero Contradictions, Jul 20, 2024, 3:08 PM
45 points
54 comments, 1 min read, LW link
(zerocontradictions.net)

Simplifying Corrigibility – Subagent Corrigibility Is Not Anti-Natural

Rubi J. Hudson, Jul 16, 2024, 10:44 PM
44 points
27 comments, 5 min read, LW link

Open Sourcing Metaculus

ChristianWilliams, Jul 2, 2024, 10:30 PM
44 points
0 comments, LW link
(www.metaculus.com)

Trust as a bottleneck to growing teams quickly

benkuhn, Jul 13, 2024, 6:00 PM
44 points
3 comments, 5 min read, LW link
(www.benkuhn.net)

New Executive Team & Board — PIBBSS

Nora_Ammann, Jul 1, 2024, 7:30 PM
43 points
1 comment, 1 min read, LW link

Understanding Positional Features in Layer 0 SAEs

Jul 29, 2024, 9:36 AM
43 points
0 comments, 5 min read, LW link

List of Collective Intelligence Projects

Chipmonk, Jul 2, 2024, 2:10 PM
42 points
9 comments, 2 min read, LW link
(chrislakin.blog)

Paper Summary: The Effects of Communicating Uncertainty on Public Trust in Facts and Numbers

Jeffrey Heninger, Jul 9, 2024, 4:50 PM
42 points
2 comments, 2 min read, LW link
(blog.aiimpacts.org)

(Approximately) Deterministic Natural Latents

Jul 19, 2024, 11:02 PM
42 points
1 comment, 4 min read, LW link