Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

6 Dec 2024 22:19 UTC
98 points
1 comment · 11 min read · LW link
(arxiv.org)

Frontier Models are Capable of In-context Scheming

5 Dec 2024 22:11 UTC
170 points
12 comments · 7 min read · LW link

The 2023 LessWrong Review: The Basic Ask

Raemon · 4 Dec 2024 19:52 UTC
70 points
20 comments · 9 min read · LW link

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

3 Dec 2024 21:19 UTC
80 points
5 comments · 41 min read · LW link

Analysis of Global AI Governance Strategies

4 Dec 2024 10:45 UTC
37 points
8 comments · 36 min read · LW link

Cognitive Biases Contributing to AI X-risk — a deleted excerpt from my 2018 ARCHES draft

Andrew_Critch · 3 Dec 2024 9:29 UTC
46 points
2 comments · 5 min read · LW link

Hierarchical Agency: A Missing Piece in AI Alignment

Jan_Kulveit · 27 Nov 2024 5:49 UTC
99 points
19 comments · 11 min read · LW link

AXRP Episode 39 - Evan Hubinger on Model Organisms of Misalignment

DanielFilan · 1 Dec 2024 6:00 UTC
40 points
0 comments · 67 min read · LW link

Conjecture: A Roadmap for Cognitive Software and A Humanist Future of AI

2 Dec 2024 13:28 UTC
35 points
7 comments · 29 min read · LW link
(www.conjecture.dev)

o1 is a bad idea

abramdemski · 11 Nov 2024 21:20 UTC
161 points
37 comments · 2 min read · LW link

LLM chatbots have ~half of the kinds of “consciousness” that humans believe in. Humans should avoid going crazy about that.

Andrew_Critch · 22 Nov 2024 3:26 UTC
77 points
53 comments · 5 min read · LW link

The Compendium, A full argument about extinction risk from AGI

31 Oct 2024 12:01 UTC
191 points
49 comments · 2 min read · LW link
(www.thecompendium.ai)

AXRP Episode 38.2 - Jesse Hoogland on Singular Learning Theory

DanielFilan · 27 Nov 2024 6:30 UTC
34 points
0 comments · 10 min read · LW link

[Question] Are You More Real If You’re Really Forgetful?

Thane Ruthenis · 24 Nov 2024 19:30 UTC
39 points
24 comments · 5 min read · LW link

Training AI agents to solve hard problems could lead to Scheming

19 Nov 2024 0:10 UTC
61 points
12 comments · 28 min read · LW link

Why imperfect adversarial robustness doesn’t doom AI control

18 Nov 2024 16:05 UTC
61 points
27 comments · 2 min read · LW link

Anthropic: Three Sketches of ASL-4 Safety Case Components

Zach Stein-Perlman · 6 Nov 2024 16:00 UTC
94 points
33 comments · 1 min read · LW link
(alignment.anthropic.com)

the case for CoT un­faith­ful­ness is overstated

nostalgebraist · 29 Sep 2024 22:07 UTC
244 points
40 comments · 11 min read · LW link

A Sober Look at Steering Vectors for LLMs

23 Nov 2024 17:30 UTC
30 points
0 comments · 5 min read · LW link

Win/continue/lose scenarios and execute/replace/audit protocols

Buck · 15 Nov 2024 15:47 UTC
54 points
2 comments · 7 min read · LW link