The longest training run

17 Aug 2022 17:18 UTC
41 points
6 comments, 9 min read, LW link
(epochai.org)

Interpretability Tools Are an Attack Channel

Thane Ruthenis, 17 Aug 2022 18:47 UTC
24 points
7 comments, 1 min read, LW link

The Core of the Alignment Problem is...

17 Aug 2022 20:07 UTC
25 points
3 comments, 9 min read, LW link

My thoughts on direct work (and joining LessWrong)

RobertM, 16 Aug 2022 18:53 UTC
51 points
4 comments, 6 min read, LW link

Do meta-memes and meta-antimemes exist? e.g. ‘The map is not the territory’ is also a map

M. Y. Zuo, 7 Aug 2022 1:17 UTC
4 points
23 comments, 1 min read, LW link

Human Mimicry Mainly Works When We’re Already Close

johnswentworth, 17 Aug 2022 18:41 UTC
41 points
5 comments, 5 min read, LW link

Mesa-optimization for goals defined only within a training environment is dangerous

Rubi, 17 Aug 2022 3:56 UTC
6 points
2 comments, 4 min read, LW link

A Mechanistic Interpretability Analysis of Grokking

15 Aug 2022 2:41 UTC
239 points
15 comments, 41 min read, LW link
(colab.research.google.com)

Against population ethics

jasoncrawford, 16 Aug 2022 5:19 UTC
29 points
30 comments, 3 min read, LW link

Half-baked AI Safety ideas thread

Aryeh Englander, 23 Jun 2022 16:11 UTC
58 points
59 comments, 1 min read, LW link

Matt Yglesias on AI Policy

Grant Demaree, 17 Aug 2022 23:57 UTC
19 points
0 comments, 1 min read, LW link
(www.slowboring.com)

Why are politicians polarized?

ErnestScribbler, 21 Jul 2022 8:17 UTC
13 points
24 comments, 7 min read, LW link

Conditioning, Prompts, and Fine-Tuning

Adam Jermyn, 17 Aug 2022 20:52 UTC
22 points
1 comment, 4 min read, LW link

The Parable of the Boy Who Cried 5% Chance of Wolf

KatWoods, 15 Aug 2022 14:33 UTC
125 points
19 comments, 2 min read, LW link

Understanding differences between humans and intelligence-in-general to build safe AGI

Florian_Dietz, 16 Aug 2022 8:27 UTC
7 points
5 comments, 1 min read, LW link

Insufficient awareness of how everything sucks

Flaglandbase, 17 Aug 2022 8:01 UTC
−5 points
3 comments, 1 min read, LW link

On the falsifiability of hypercomputation, part 2: finite input streams

jessicata, 17 Feb 2020 3:51 UTC
25 points
6 comments, 4 min read, LW link
(unstableontology.com)

Concrete Advice for Forming Inside Views on AI Safety

Neel Nanda, 17 Aug 2022 22:02 UTC
12 points
0 comments, 10 min read, LW link

Reward is not the optimization target

TurnTrout, 25 Jul 2022 0:03 UTC
169 points
79 comments, 12 min read, LW link

Progress links and tweets, 2022-08-17

jasoncrawford, 17 Aug 2022 21:27 UTC
11 points
0 comments, 2 min read, LW link
(rootsofprogress.org)