Categorization Hell

UtilityMonster · 24 Sep 2023 18:18 UTC
1 point
0 comments · 2 min read · LW link
(utilitymonster.substack.com)

Interpreting OpenAI’s Whisper

EllenaR · 24 Sep 2023 17:53 UTC
48 points
2 comments · 7 min read · LW link

Contradiction Appeal Bias

onur · 24 Sep 2023 17:03 UTC
3 points
0 comments · 1 min read · LW link

RAIN: Your Language Models Can Align Themselves without Finetuning—Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%!

Singularian2501 · 24 Sep 2023 16:48 UTC
3 points
0 comments · 1 min read · LW link

Far-Future Commitments as a Policy Consensus Strategy

FCCC · 24 Sep 2023 6:34 UTC
6 points
6 comments · 1 min read · LW link

Five neglected work areas that could reduce AI risk

24 Sep 2023 2:03 UTC
13 points
5 comments · 9 min read · LW link

The Dick Kick’em Paradox

Augs SMSHacks · 23 Sep 2023 22:22 UTC
−7 points
13 comments · 1 min read · LW link

I designed an AI safety course (for a philosophy department)

Eleni Angelou · 23 Sep 2023 22:03 UTC
34 points
12 comments · 2 min read · LW link

Paper: LLMs trained on “A is B” fail to learn “B is A”

23 Sep 2023 19:55 UTC
108 points
24 comments · 4 min read · LW link
(owainevans.github.io)

Sparse Coding, for Mechanistic Interpretability and Activation Engineering

David Udell · 23 Sep 2023 19:16 UTC
29 points
4 comments · 34 min read · LW link

Taking features out of superposition with sparse autoencoders more quickly with informed initialization

Pierre Peigné · 23 Sep 2023 16:21 UTC
29 points
8 comments · 5 min read · LW link

A quick remark on so-called “hallucinations” in LLMs and humans

Bill Benzon · 23 Sep 2023 12:17 UTC
4 points
3 comments · 1 min read · LW link

Hand-writing MathML

jefftk · 23 Sep 2023 11:20 UTC
12 points
14 comments · 1 min read · LW link
(www.jefftk.com)

[Linkpost/Video] All The Times We Nearly Blew Up The World

g-w1 · 23 Sep 2023 1:18 UTC
5 points
1 comment · 1 min read · LW link
(www.youtube.com)

Luck based medicine: inositol for anxiety and brain fog

Elizabeth · 22 Sep 2023 20:10 UTC
38 points
2 comments · 3 min read · LW link
(acesounderglass.com)

If influence functions are not approximating leave-one-out, how are they supposed to help?

Fabien Roger · 22 Sep 2023 14:23 UTC
54 points
3 comments · 3 min read · LW link

Modeling p(doom) with TrojanGDP

K. Liam Smith · 22 Sep 2023 14:19 UTC
−2 points
2 comments · 13 min read · LW link

Let’s talk about Impostor syndrome in AI safety

Igor Ivanov · 22 Sep 2023 13:51 UTC
28 points
4 comments · 3 min read · LW link

Fund Transit With Development

jefftk · 22 Sep 2023 11:10 UTC
46 points
7 comments · 3 min read · LW link
(www.jefftk.com)

Atoms to Agents Proto-Lectures

johnswentworth · 22 Sep 2023 6:22 UTC
88 points
6 comments · 2 min read · LW link
(www.youtube.com)