
Criticism of the main framework in AI alignment

Michele Campolo · 31 Jan 2023 23:01 UTC
8 points
0 comments · 7 min read · LW link

Mechanistic Interpretability Quickstart Guide

Neel Nanda · 31 Jan 2023 16:35 UTC
25 points
1 comment · 6 min read · LW link
(www.neelnanda.io)

Inner Misalignment in “Simulator” LLMs

Adam Scherlis · 31 Jan 2023 8:33 UTC
48 points
3 comments · 4 min read · LW link

Why I hate the “accident vs. misuse” AI x-risk dichotomy (quick thoughts on “structural risk”)

David Scott Krueger (formerly: capybaralet) · 30 Jan 2023 18:50 UTC
26 points
25 comments · 2 min read · LW link

Call for submissions: “(In)human Values and Artificial Agency”, ALIFE 2023

the gears to ascenscion · 30 Jan 2023 17:37 UTC
27 points
3 comments · 1 min read · LW link
(humanvaluesandartificialagency.com)

What I mean by “alignment is in large part about making cognition aimable at all”

So8res · 30 Jan 2023 15:22 UTC
113 points
8 comments · 2 min read · LW link

Structure, creativity, and novelty

TsviBT · 29 Jan 2023 14:30 UTC
14 points
1 comment · 7 min read · LW link

Stop-gradients lead to fixed point predictions

28 Jan 2023 22:47 UTC
26 points
0 comments · 24 min read · LW link

Spooky action at a distance in the loss landscape

28 Jan 2023 0:22 UTC
43 points
3 comments · 3 min read · LW link
(www.jessehoogland.com)

The role of Bayesian ML in AI safety—an overview

Marius Hobbhahn · 27 Jan 2023 19:40 UTC
20 points
4 comments · 10 min read · LW link

AGI will have learnt utility functions

beren · 25 Jan 2023 19:42 UTC
26 points
2 comments · 13 min read · LW link

Thoughts on the impact of RLHF research

paulfchristiano · 25 Jan 2023 17:23 UTC
186 points
88 comments · 9 min read · LW link

Quick thoughts on “scalable oversight” / “super-human feedback” research

David Scott Krueger (formerly: capybaralet) · 25 Jan 2023 12:55 UTC
25 points
5 comments · 2 min read · LW link

Alexander and Yudkowsky on AGI goals

24 Jan 2023 21:09 UTC
141 points
49 comments · 26 min read · LW link

Inverse Scaling Prize: Second Round Winners

24 Jan 2023 20:12 UTC
48 points
13 comments · 15 min read · LW link

How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!

StefanHex · 24 Jan 2023 18:45 UTC
38 points
3 comments · 13 min read · LW link

Gradient hacking is extremely difficult

beren · 24 Jan 2023 15:45 UTC
139 points
18 comments · 5 min read · LW link

“Endgame safety” for AGI

Steven Byrnes · 24 Jan 2023 14:15 UTC
72 points
4 comments · 5 min read · LW link

Thoughts on hardware / compute requirements for AGI

Steven Byrnes · 24 Jan 2023 14:03 UTC
28 points
28 comments · 21 min read · LW link

Some of my disagreements with List of Lethalities

TurnTrout · 24 Jan 2023 0:25 UTC
67 points
7 comments · 10 min read · LW link