Im­pact sto­ries for model in­ter­nals: an ex­er­cise for in­ter­pretabil­ity researchers

jenny · Sep 25, 2023, 11:15 PM
29 points
3 comments · 7 min read · LW link

Au­to­nomic Sanity

Sable · Sep 25, 2023, 10:37 PM
20 points
9 comments · 4 min read · LW link
(affablyevil.substack.com)

[Question] What is wrong with this “util­ity switch but­ton prob­lem” ap­proach?

Donald Hobson · Sep 25, 2023, 9:36 PM
14 points
3 comments · 1 min read · LW link

You should just smile at strangers a lot

chaosmage · Sep 25, 2023, 8:12 PM
14 points
10 comments · 1 min read · LW link

The King and the Golem

Richard_Ngo · Sep 25, 2023, 7:51 PM
190 points
19 comments · 5 min read · LW link · 1 review
(narrativeark.substack.com)

Public Opinion on AI Safety: AIMS 2023 and 2021 Summary

Sep 25, 2023, 6:55 PM
3 points
2 comments · 3 min read · LW link
(www.sentienceinstitute.org)

Wel­come to Ap­ply: The 2024 Vi­talik Bu­terin Fel­low­ships in AI Ex­is­ten­tial Safety by FLI!

Zhijing Jin · Sep 25, 2023, 6:42 PM
5 points
2 comments · 2 min read · LW link

Eval­u­at­ing hid­den di­rec­tions on the util­ity dataset: clas­sifi­ca­tion, steer­ing and removal

Sep 25, 2023, 5:19 PM
25 points
3 comments · 7 min read · LW link

Linkpost: A model of bi­ases as aris­ing from meta-beliefs

JuanGarcia · Sep 25, 2023, 5:14 PM
5 points
0 comments · 1 min read · LW link

[Question] What causes a de­ci­sion the­ory to be used?

Dagon · Sep 25, 2023, 4:33 PM
8 points
2 comments · 1 min read · LW link

Un­der­stand­ing strate­gic de­cep­tion and de­cep­tive alignment

Sep 25, 2023, 4:27 PM
64 points
16 comments · 7 min read · LW link
(www.apolloresearch.ai)

The Mer­its of Con­trar­i­anism & Why I hate Chat­bots. [My Ex­pe­rience with the Ide­olog­i­cal Tur­ing Test @ a Less Wrong meetup]

Amina V. · Sep 25, 2023, 4:13 PM
4 points
1 comment · 1 min read · LW link
(bimbollectual.com)

In­side Views, Im­pos­tor Syn­drome, and the Great LARP

johnswentworth · Sep 25, 2023, 4:08 PM
336 points
53 comments · 5 min read · LW link

“X dis­tracts from Y” as a thinly-dis­guised fight over group sta­tus /​ politics

Steven Byrnes · Sep 25, 2023, 3:18 PM
112 points
14 comments · 8 min read · LW link

Ama­zon to in­vest up to $4 billion in Anthropic

Davis_Kingsley · Sep 25, 2023, 2:55 PM
44 points
8 comments · LW link
(twitter.com)

Should Effec­tive Altru­ists be Valuists in­stead of util­i­tar­i­ans?

Sep 25, 2023, 2:03 PM
1 point
3 comments · 6 min read · LW link

Feedly Breaks MathML

jefftk · Sep 25, 2023, 1:40 PM
15 points
3 comments · 1 min read · LW link
(www.jefftk.com)

[Question] How have you be­come more hard-work­ing?

Chi Nguyen · Sep 25, 2023, 12:37 PM
82 points
42 comments · LW link

Au­tomat­ing In­tel­li­gence: A Cur­sory Glance at How Au­toML Brings Pre­ci­sion to AI Development

RoscoHunter · Sep 25, 2023, 9:39 AM
3 points
0 comments · 3 min read · LW link

In­ter­pret­ing OpenAI’s Whisper

EllenaR · Sep 24, 2023, 5:53 PM
116 points
13 comments · 7 min read · LW link

Con­tra­dic­tion Ap­peal Bias

onur · Sep 24, 2023, 5:03 PM
3 points
2 comments · 1 min read · LW link

RAIN: Your Lan­guage Models Can Align Them­selves with­out Fine­tun­ing—Microsoft Re­search 2023 - Re­duces the ad­ver­sar­ial prompt at­tack suc­cess rate from 94% to 19%!

Singularian2501 · Sep 24, 2023, 4:48 PM
5 points
0 comments · 1 min read · LW link

Honor Sys­tem for Vac­ci­na­tion?

jefftk · Sep 24, 2023, 11:50 AM
17 points
22 comments · 1 min read · LW link
(www.jefftk.com)

Far-Fu­ture Com­mit­ments as a Policy Con­sen­sus Strategy

FCCC · Sep 24, 2023, 6:34 AM
7 points
40 comments · 1 min read · LW link

Five ne­glected work ar­eas that could re­duce AI risk

Sep 24, 2023, 2:03 AM
17 points
5 comments · 9 min read · LW link

[Question] Are the other Ra­tion­al­ity: A-Z se­quences com­ing out as books?

caffeinated_dissonance · Sep 24, 2023, 12:38 AM
7 points
4 comments · 1 min read · LW link

The Dick Kick’em Paradox

Augs SMSHacks · Sep 23, 2023, 10:22 PM
−5 points
21 comments · 1 min read · LW link

I de­signed an AI safety course (for a philos­o­phy de­part­ment)

Eleni Angelou · Sep 23, 2023, 10:03 PM
37 points
15 comments · 2 min read · LW link

Paper: LLMs trained on “A is B” fail to learn “B is A”

Sep 23, 2023, 7:55 PM
121 points
74 comments · 4 min read · LW link
(arxiv.org)

Sparse Cod­ing, for Mechanis­tic In­ter­pretabil­ity and Ac­ti­va­tion Engineering

David Udell · Sep 23, 2023, 7:16 PM
42 points
7 comments · 34 min read · LW link

[Question] Places to meet in­ter­est­ing mid­dle-aged men?

anon_girl · Sep 23, 2023, 7:06 PM
18 points
6 comments · 1 min read · LW link

Tak­ing fea­tures out of su­per­po­si­tion with sparse au­toen­coders more quickly with in­formed initialization

Pierre Peigné · Sep 23, 2023, 4:21 PM
30 points
8 comments · 5 min read · LW link

A quick re­mark on so-called “hal­lu­ci­na­tions” in LLMs and hu­mans

Bill Benzon · Sep 23, 2023, 12:17 PM
4 points
4 comments · 1 min read · LW link

Hand-writ­ing MathML

jefftk · Sep 23, 2023, 11:20 AM
16 points
40 comments · 1 min read · LW link
(www.jefftk.com)

Musk, Star­link, and Crimea

Nicholas / Heather Kross · Sep 23, 2023, 2:35 AM
−13 points
0 comments · 5 min read · LW link

[Linkpost/​Video] All The Times We Nearly Blew Up The World

Jacob G-W · Sep 23, 2023, 1:18 AM
6 points
1 comment · 1 min read · LW link
(www.youtube.com)

Luck based medicine: in­os­i­tol for anx­iety and brain fog

Elizabeth · Sep 22, 2023, 8:10 PM
40 points
5 comments · 3 min read · LW link
(acesounderglass.com)

If in­fluence func­tions are not ap­prox­i­mat­ing leave-one-out, how are they sup­posed to help?

Fabien Roger · Sep 22, 2023, 2:23 PM
66 points
5 comments · 3 min read · LW link

Model­ing p(doom) with TrojanGDP

K. Liam Smith · Sep 22, 2023, 2:19 PM
−2 points
2 comments · 13 min read · LW link

Let’s talk about Im­pos­tor syn­drome in AI safety

Igor Ivanov · Sep 22, 2023, 1:51 PM
30 points
4 comments · 3 min read · LW link

Fund Tran­sit With Development

jefftk · Sep 22, 2023, 11:10 AM
47 points
22 comments · 3 min read · LW link
(www.jefftk.com)

Atoms to Agents Proto-Lectures

johnswentworth · Sep 22, 2023, 6:22 AM
96 points
14 comments · 2 min read · LW link
(www.youtube.com)

Would You Work Harder In The Least Con­ve­nient Pos­si­ble World?

Firinn · Sep 22, 2023, 5:17 AM
100 points
100 comments · 9 min read · LW link · 2 reviews

Con­tra Kevin Dorst’s Ra­tional Polarization

azsantosk · Sep 22, 2023, 4:28 AM
8 points
2 comments · 9 min read · LW link

ACX Bos­ton—Petrov Day 2023

duck_master · Sep 22, 2023, 1:13 AM
2 points
0 comments · 1 min read · LW link

What so­cial sci­ence re­search do you want to see re­an­a­lyzed?

Michael Wiebe · Sep 22, 2023, 12:03 AM
14 points
9 comments · 1 min read · LW link

Im­mor­tal­ity or death by AGI

ImmortalityOrDeathByAGI · Sep 21, 2023, 11:59 PM
47 points
30 comments · 4 min read · LW link
(forum.effectivealtruism.org)

Neel Nanda on the Mechanis­tic In­ter­pretabil­ity Re­searcher Mindset

Michaël Trazzi · Sep 21, 2023, 7:47 PM
37 points
1 comment · 3 min read · LW link
(theinsideview.ai)

Re­quire AGI to be Explainable

PeterMcCluskey · Sep 21, 2023, 4:11 PM
5 points
0 comments · 6 min read · LW link
(bayesianinvestor.com)

Up­date to “Dom­i­nant As­surance Con­tract Plat­form”

moyamo · Sep 21, 2023, 4:09 PM
32 points
1 comment · 1 min read · LW link