Thoughts about the Mechanistic Interpretability Challenge #2 (EIS VII #2)

RGRGRG 28 Jul 2023 20:44 UTC
23 points
5 comments · 20 min read · LW link

Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic

ojorgensen 28 Jul 2023 19:43 UTC
12 points
3 comments · 13 min read · LW link

When can we trust model evaluations?

evhub 28 Jul 2023 19:42 UTC
143 points
9 comments · 10 min read · LW link

Yes, It’s Subjective, But Why All The Crabs?

johnswentworth 28 Jul 2023 19:35 UTC
240 points
15 comments · 6 min read · LW link

Semaglutide and Muscle

5hout 28 Jul 2023 18:36 UTC
15 points
14 comments · 5 min read · LW link

Double Crux in a Box

Screwtape 28 Jul 2023 17:55 UTC
8 points
3 comments · 1 min read · LW link

AI Safety 101 : Introduction to Vision Interpretability

28 Jul 2023 17:32 UTC
41 points
0 comments · 1 min read · LW link
(github.com)

Visible loss landscape basins don’t correspond to distinct algorithms

Mikhail Samin 28 Jul 2023 16:19 UTC
65 points
13 comments · 4 min read · LW link

Progress links digest, 2023-07-28: The decadent opulence of modern capitalism

jasoncrawford 28 Jul 2023 14:36 UTC
16 points
3 comments · 3 min read · LW link
(rootsofprogress.org)

AI Awareness through Interaction with Blatantly Alien Models

VojtaKovarik 28 Jul 2023 8:41 UTC
7 points
5 comments · 3 min read · LW link

You don’t get to have cool flaws

Neil 28 Jul 2023 5:37 UTC
31 points
16 comments · 2 min read · LW link

Reducing sycophancy and improving honesty via activation steering

Nina Rimsky 28 Jul 2023 2:46 UTC
117 points
16 comments · 9 min read · LW link

Mech Interp Puzzle 2: Word2Vec Style Embeddings

Neel Nanda 28 Jul 2023 0:50 UTC
40 points
4 comments · 2 min read · LW link

ETFE windows

bhauth 28 Jul 2023 0:46 UTC
30 points
4 comments · 2 min read · LW link
(www.bhauth.com)

A Short Memo on AI Interpretability Rainbows

scasper 27 Jul 2023 23:05 UTC
18 points
0 comments · 2 min read · LW link

Pulling the Rope Sideways: Empirical Test Results

Daniel Kokotajlo 27 Jul 2023 22:18 UTC
61 points
18 comments · 1 min read · LW link

A $10k retroactive grant for VaccinateCA

Austin Chen 27 Jul 2023 18:14 UTC
80 points
0 comments · 1 min read · LW link
(manifund.org)

Preference Aggregation as Bayesian Inference

beren 27 Jul 2023 17:59 UTC
14 points
1 comment · 1 min read · LW link

AI #22: Into the Weeds

Zvi 27 Jul 2023 17:40 UTC
49 points
8 comments · 84 min read · LW link
(thezvi.wordpress.com)

SSA rejects anthropic shadow, too

jessicata 27 Jul 2023 17:25 UTC
61 points
38 comments · 11 min read · LW link
(unstableontology.com)

[Question] What are examples of someone doing a lot of work to find the best of something?

chanamessinger 27 Jul 2023 15:58 UTC
27 points
15 comments · 1 min read · LW link

AI-Plans.com 10-day Critique-a-Thon

Iknownothing 27 Jul 2023 11:44 UTC
8 points
2 comments · 2 min read · LW link
(manifund.org)

Privacy in a Digital World

Faustify 27 Jul 2023 10:46 UTC
2 points
0 comments · 5 min read · LW link

Cultivating a state of mind where new ideas are born

Henrik Karlsson 27 Jul 2023 9:16 UTC
211 points
18 comments · 14 min read · LW link
(www.henrikkarlsson.xyz)

Partial Transcript of Recent Senate Hearing Discussing AI X-Risk

Daniel_Eth 27 Jul 2023 9:16 UTC
55 points
0 comments · 1 min read · LW link
(medium.com)

AXRP Episode 24 - Superalignment with Jan Leike

DanielFilan 27 Jul 2023 4:00 UTC
55 points
3 comments · 69 min read · LW link

[Question] Have you ever considered taking the ‘Turing Test’ yourself?

Super AGI 27 Jul 2023 3:48 UTC
2 points
6 comments · 1 min read · LW link

AXRP Episode 23 - Mechanistic Anomaly Detection with Mark Xu

DanielFilan 27 Jul 2023 1:50 UTC
22 points
0 comments · 72 min read · LW link

GPT-4 can catch subtle cross-language translation mistakes

Michael Tontchev 27 Jul 2023 1:39 UTC
7 points
1 comment · 1 min read · LW link

Social Balance through Embracing Social Credit

dhruvv 26 Jul 2023 20:07 UTC
−39 points
9 comments · 3 min read · LW link

Why no Roman Industrial Revolution?

jasoncrawford 26 Jul 2023 19:34 UTC
62 points
30 comments · 3 min read · LW link
(rootsofprogress.org)

Why you can’t treat decidability and complexity as a constant (Post #1)

Noosphere89 26 Jul 2023 17:54 UTC
6 points
13 comments · 5 min read · LW link

A response to the Richards et al.’s “The Illusion of AI’s Existential Risk”

Harrison Fell 26 Jul 2023 17:34 UTC
1 point
0 comments · 10 min read · LW link

Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy

26 Jul 2023 17:02 UTC
83 points
18 comments · 1 min read · LW link

Neuronpedia

Johnny Lin 26 Jul 2023 16:29 UTC
135 points
51 comments · 2 min read · LW link
(neuronpedia.org)

Frontier Model Forum

Zach Stein-Perlman 26 Jul 2023 14:30 UTC
27 points
0 comments · 4 min read · LW link
(blog.google)

Podcasts: Future of Life Institute, Breakthrough Science Summit panel

jasoncrawford 26 Jul 2023 14:28 UTC
8 points
0 comments · 1 min read · LW link
(rootsofprogress.org)

Llama We Doing This Again?

Zvi 26 Jul 2023 13:00 UTC
48 points
3 comments · 16 min read · LW link
(thezvi.wordpress.com)

Frontier Model Security

Vaniver 26 Jul 2023 4:48 UTC
31 points
1 comment · 3 min read · LW link
(www.anthropic.com)

The First Room-Temperature Ambient-Pressure Superconductor

Annapurna 26 Jul 2023 2:27 UTC
35 points
28 comments · 1 min read · LW link
(arxiv.org)

Underwater Torture Chambers: The Horror Of Fish Farming

omnizoid 26 Jul 2023 0:27 UTC
80 points
49 comments · 10 min read · LW link

Contra Alexander on the Bitter Lesson and IQ

Andrew Keenan Richardson 26 Jul 2023 0:07 UTC
9 points
1 comment · 4 min read · LW link
(mechanisticmind.com)

Overcoming the MWC

Mark Freed 25 Jul 2023 17:31 UTC
3 points
0 comments · 3 min read · LW link

Russian parliamentarian: let’s ban personal computers and the Internet

RomanS 25 Jul 2023 17:30 UTC
11 points
6 comments · 2 min read · LW link

AISN #16: White House Secures Voluntary Commitments from Leading AI Labs and Lessons from Oppenheimer

25 Jul 2023 16:58 UTC
6 points
0 comments · 6 min read · LW link
(newsletter.safe.ai)

“The Universe of Minds”—call for reviewers (Seeds of Science)

rogersbacon 25 Jul 2023 16:53 UTC
7 points
0 comments · 1 min read · LW link

Thoughts on Loss Landscapes and why Deep Learning works

beren 25 Jul 2023 16:41 UTC
52 points
4 comments · 18 min read · LW link

Should you work at a leading AI lab? (including in non-safety roles)

Benjamin Hilton 25 Jul 2023 16:29 UTC
7 points
0 comments · 12 min read · LW link

Whisper’s Word-Level Timestamps are Out

Varshul Gupta 25 Jul 2023 14:32 UTC
−17 points
2 comments · 2 min read · LW link
(dubverseblack.substack.com)

AIS 101: Task decomposition for scalable oversight

Charbel-Raphaël 25 Jul 2023 13:34 UTC
27 points
0 comments · 19 min read · LW link
(docs.google.com)