The “no sandbagging on checkable tasks” hypothesis

Joe Carlsmith · 31 Jul 2023 23:06 UTC
50 points
12 comments · 9 min read · LW link

A Social History of Truth

Vaniver · 31 Jul 2023 22:49 UTC
64 points
2 comments · 13 min read · LW link

Watermarking considered overrated?

DanielFilan · 31 Jul 2023 21:36 UTC
18 points
4 comments · 1 min read · LW link

What The Lord of the Rings Teaches Us About AI Alignment

Jeffrey Heninger · 31 Jul 2023 20:16 UTC
21 points
11 comments · 7 min read · LW link

The “spelling miracle”: GPT-3 spelling abilities and glitch tokens revisited

mwatkins · 31 Jul 2023 19:47 UTC
85 points
29 comments · 20 min read · LW link

“Building a House” Review

jefftk · 31 Jul 2023 19:20 UTC
62 points
6 comments · 1 min read · LW link
(www.jefftk.com)

The Meaning of Shoggoth AI Memes

Dan Smith · 31 Jul 2023 18:52 UTC
−5 points
5 comments · 2 min read · LW link

[Question] Is there any existing term summarizing non-scalable oversight methods in outer alignment?

Allen Shen · 31 Jul 2023 17:31 UTC
1 point
0 comments · 1 min read · LW link

Lack of Social Grace Is an Epistemic Virtue

Zack_M_Davis · 31 Jul 2023 16:38 UTC
15 points
96 comments · 4 min read · LW link

Thoughts on sharing information about language model capabilities

paulfchristiano · 31 Jul 2023 16:04 UTC
197 points
34 comments · 11 min read · LW link

Trading off compute in training and inference (Overview)

Pablo Villalobos · 31 Jul 2023 16:03 UTC
31 points
1 comment · 7 min read · LW link
(epochai.org)

Open Problems and Fundamental Limitations of RLHF

scasper · 31 Jul 2023 15:31 UTC
66 points
6 comments · 2 min read · LW link
(arxiv.org)

“Not Necessarily”

Benjamin Hendricks · 31 Jul 2023 15:19 UTC
23 points
2 comments · 2 min read · LW link

How to find AI alignment researchers to collaborate with?

Florian Dietz · 31 Jul 2023 9:05 UTC
2 points
2 comments · 1 min read · LW link

[Question] Is Kennedy a Nazi?

Pee Doom · 31 Jul 2023 8:51 UTC
−12 points
10 comments · 2 min read · LW link

Is Light Drinking Protective?

jefftk · 31 Jul 2023 3:00 UTC
45 points
8 comments · 2 min read · LW link
(www.jefftk.com)

EU’s AI ambitions at risk as US pushes to water down international treaty (linkpost)

mic · 31 Jul 2023 0:34 UTC
10 points
0 comments · 4 min read · LW link
(www.euractiv.com)

The rise of AI in cybercrime

BobyResearcher · 30 Jul 2023 20:19 UTC
−15 points
1 comment · 2 min read · LW link
(riseofAIincybercryme)

SSA vs. SIA: how future population may provide evidence for or against the foundations of political liberalism

j · 30 Jul 2023 20:18 UTC
−6 points
10 comments · 55 min read · LW link

Rationalization Maximizes Expected Value

Kevin Dorst · 30 Jul 2023 20:11 UTC
19 points
10 comments · 7 min read · LW link
(kevindorst.substack.com)

Apollo Neuro Results

Elizabeth · 30 Jul 2023 18:40 UTC
85 points
16 comments · 3 min read · LW link
(acesounderglass.com)

Hilbert’s Triumph, Church and Turing’s failure, and what it means (Post #2)

Noosphere89 · 30 Jul 2023 14:33 UTC
−5 points
16 comments · 15 min read · LW link

[Question] Specific Arguments against open source LLMs?

Iknownothing · 30 Jul 2023 14:27 UTC
4 points
2 comments · 1 min read · LW link

Socialism in large organizations

Adam Zerner · 30 Jul 2023 7:25 UTC
7 points
16 comments · 2 min read · LW link

How to make real-money prediction markets on arbitrary topics (Outdated)

yutaka · 30 Jul 2023 2:11 UTC
57 points
13 comments · 3 min read · LW link

[Question] Does decidability of a theory imply completeness of the theory?

Noosphere89 · 29 Jul 2023 23:53 UTC
6 points
12 comments · 1 min read · LW link

[Question] If I showed the EQ-SQ theory’s findings to be due to measurement bias, would anyone change their minds about it?

tailcalled · 29 Jul 2023 19:38 UTC
22 points
13 comments · 1 min read · LW link

Self-driving car bets

paulfchristiano · 29 Jul 2023 18:10 UTC
229 points
41 comments · 5 min read · LW link
(sideways-view.com)

The Parable of the Dagger—The Animation

Writer · 29 Jul 2023 14:03 UTC
20 points
6 comments · 1 min read · LW link
(youtu.be)

Are Guitars Obsolete?

jefftk · 29 Jul 2023 13:20 UTC
11 points
8 comments · 2 min read · LW link
(www.jefftk.com)

NAMSI: A promising approach to alignment

Georgeo57 · 29 Jul 2023 7:03 UTC
−6 points
6 comments · 1 min read · LW link

Understanding and Aligning a Human-like Inductive Bias with Cognitive Science: a Review of Related Literature

Claire Short · 29 Jul 2023 6:10 UTC
23 points
0 comments · 12 min read · LW link

Universal and Transferable Adversarial Attacks on Aligned Language Models [paper link]

Sodium · 29 Jul 2023 3:21 UTC
16 points
0 comments · 1 min read · LW link
(arxiv.org)

Why You Should Never Update Your Beliefs

Arjun Panickssery · 29 Jul 2023 0:27 UTC
69 points
17 comments · 4 min read · LW link
(arjunpanickssery.substack.com)

Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

RGRGRG · 28 Jul 2023 20:44 UTC
23 points
5 comments · 20 min read · LW link

Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic

ojorgensen · 28 Jul 2023 19:43 UTC
12 points
3 comments · 13 min read · LW link

When can we trust model evaluations?

evhub · 28 Jul 2023 19:42 UTC
143 points
9 comments · 10 min read · LW link

Yes, It’s Subjective, But Why All The Crabs?

johnswentworth · 28 Jul 2023 19:35 UTC
240 points
15 comments · 6 min read · LW link

Semaglutide and Muscle

5hout · 28 Jul 2023 18:36 UTC
15 points
14 comments · 5 min read · LW link

Double Crux in a Box

Screwtape · 28 Jul 2023 17:55 UTC
8 points
3 comments · 1 min read · LW link

AI Safety 101 : Introduction to Vision Interpretability

28 Jul 2023 17:32 UTC
41 points
0 comments · 1 min read · LW link
(github.com)

Visible loss landscape basins don’t correspond to distinct algorithms

Mikhail Samin · 28 Jul 2023 16:19 UTC
65 points
13 comments · 4 min read · LW link

Progress links digest, 2023-07-28: The decadent opulence of modern capitalism

jasoncrawford · 28 Jul 2023 14:36 UTC
16 points
3 comments · 3 min read · LW link
(rootsofprogress.org)

AI Awareness through Interaction with Blatantly Alien Models

VojtaKovarik · 28 Jul 2023 8:41 UTC
7 points
5 comments · 3 min read · LW link

You don’t get to have cool flaws

Neil · 28 Jul 2023 5:37 UTC
31 points
16 comments · 2 min read · LW link

Reducing sycophancy and improving honesty via activation steering

Nina Rimsky · 28 Jul 2023 2:46 UTC
117 points
16 comments · 9 min read · LW link

Mech Interp Puzzle 2: Word2Vec Style Embeddings

Neel Nanda · 28 Jul 2023 0:50 UTC
40 points
4 comments · 2 min read · LW link

ETFE windows

bhauth · 28 Jul 2023 0:46 UTC
30 points
4 comments · 2 min read · LW link
(www.bhauth.com)

A Short Memo on AI Interpretability Rainbows

scasper · 27 Jul 2023 23:05 UTC
18 points
0 comments · 2 min read · LW link

Pulling the Rope Sideways: Empirical Test Results

Daniel Kokotajlo · 27 Jul 2023 22:18 UTC
61 points
18 comments · 1 min read · LW link