Memorizing weak examples can elicit strong behavior out of password-locked models

Jun 6, 2024, 11:54 PM
58 points
5 comments · 7 min read · LW link

Response to Aschenbrenner’s “Situational Awareness”

Rob Bensinger · Jun 6, 2024, 10:57 PM
194 points
27 comments · 3 min read · LW link

Scaling and evaluating sparse autoencoders

leogao · Jun 6, 2024, 10:50 PM
106 points
6 comments · 1 min read · LW link

Humming is not a free $100 bill

Elizabeth · Jun 6, 2024, 8:10 PM
185 points
6 comments · 3 min read · LW link
(acesounderglass.com)

There Are No Primordial Definitions of Man/Woman

ymeskhout · Jun 6, 2024, 7:30 PM
11 points
0 comments · 4 min read · LW link
(ymeskhout.substack.com)

Situational Awareness Summarized—Part 1

Joe Rogero · Jun 6, 2024, 6:59 PM
21 points
0 comments · 5 min read · LW link

[Link Post] “Foundational Challenges in Assuring Alignment and Safety of Large Language Models”

David Scott Krueger (formerly: capybaralet) · Jun 6, 2024, 6:55 PM
70 points
2 comments · 6 min read · LW link
(llm-safety-challenges.github.io)

AI #67: Brief Strange Trip

Zvi · Jun 6, 2024, 6:50 PM
49 points
6 comments · 40 min read · LW link
(thezvi.wordpress.com)

The Human Biological Advantage Over AI

Wstewart · Jun 6, 2024, 6:18 PM
−13 points
2 comments · 1 min read · LW link

An evaluation of Helen Toner’s interview on the TED AI Show

PeterH · Jun 6, 2024, 5:39 PM
24 points
2 comments · 30 min read · LW link

The Impossibility of a Rational Intelligence Optimizer

Nicolas Villarreal · Jun 6, 2024, 4:14 PM
−9 points
5 comments · 14 min read · LW link

Immunization against harmful fine-tuning attacks

Jun 6, 2024, 3:17 PM
4 points
0 comments · 12 min read · LW link

SB 1047 Is Weakened

Zvi · Jun 6, 2024, 1:40 PM
67 points
4 comments · 9 min read · LW link
(thezvi.wordpress.com)

Weeping Agents

pleiotroth · Jun 6, 2024, 12:18 PM
24 points
2 comments · 3 min read · LW link

Podcast: Center for AI Policy, on AI risk and listening to AI researchers

KatjaGrace · Jun 6, 2024, 3:30 AM
9 points
0 comments · 1 min read · LW link
(worldspiritsockpuppet.com)

Calculating Natural Latents via Resampling

Jun 6, 2024, 12:37 AM
55 points
4 comments · 10 min read · LW link

SAEs Discover Meaningful Features in the IOI Task

Jun 5, 2024, 11:48 PM
15 points
2 comments · 10 min read · LW link

Let’s Design A School, Part 2.4 School as Education—The Curriculum (Phase 3, Specific)

Sable · Jun 5, 2024, 9:40 PM
19 points
2 comments · 12 min read · LW link
(affablyevil.substack.com)

METR is hiring ML Research Engineers and Scientists

Xodarap · Jun 5, 2024, 9:27 PM
5 points
0 comments · 1 min read · LW link
(metr.org)

Book review: The Quincunx

cousin_it · Jun 5, 2024, 9:13 PM
41 points
12 comments · 2 min read · LW link

[Question] How should I think about my career?

Chico · Jun 5, 2024, 6:11 PM
3 points
2 comments · 1 min read · LW link

AISN #36: Voluntary Commitments are Insufficient Plus, a Senate AI Policy Roadmap, and Chapter 1: An Overview of Catastrophic Risks

Jun 5, 2024, 5:45 PM
9 points
0 comments · 5 min read · LW link
(newsletter.safe.ai)

GPT2, Five Years On

Joel Burget · Jun 5, 2024, 5:44 PM
34 points
0 comments · 3 min read · LW link
(importai.substack.com)

[Question] Who wants to be invited to the LW Metamodern dialogue?

hunterglenn · Jun 5, 2024, 4:39 PM
−3 points
1 comment · 1 min read · LW link

Nonreactivity: a simple model of meditation

cesiumquail · Jun 5, 2024, 4:26 PM
21 points
4 comments · 6 min read · LW link

graphpatch: a Python Library for Activation Patching

Occam's Laser · Jun 5, 2024, 3:08 PM
13 points
2 comments · 1 min read · LW link

Startup Stock Options: the Shortest Complete Guide for Employees

Boris T · Jun 5, 2024, 3:03 PM
17 points
3 comments · 1 min read · LW link
(borisagain.substack.com)

Aggregative Principles of Social Justice

Cleo Nardo · Jun 5, 2024, 1:44 PM
29 points
10 comments · 37 min read · LW link

What and how much makes a difference?

Marius Adrian Nicoară · Jun 5, 2024, 10:30 AM
7 points
0 comments · 2 min read · LW link

Announcing ILIAD — Theoretical AI Alignment Conference

Jun 5, 2024, 9:37 AM
163 points
18 comments · 2 min read · LW link

Second-Order Rationality, System Rationality, and a feature suggestion for LessWrong

Mati_Roy · Jun 5, 2024, 7:20 AM
13 points
2 comments · 8 min read · LW link

Former OpenAI Superalignment Researcher: Superintelligence by 2030

Julian Bradshaw · Jun 5, 2024, 3:35 AM
70 points
30 comments · 1 min read · LW link
(situational-awareness.ai)

On “first critical tries” in AI alignment

Joe Carlsmith · Jun 5, 2024, 12:19 AM
54 points
8 comments · 14 min read · LW link

Takeoff speeds presentation at Anthropic

Tom Davidson · Jun 4, 2024, 10:46 PM
92 points
0 comments · 25 min read · LW link

A Reflection on Richard Hamming’s “You and Your Research”: Striving for Greatness

aysajan · Jun 4, 2024, 8:07 PM
8 points
5 comments · 21 min read · LW link
(www.aysajaneziz.com)

A Semiotic Critique of the Orthogonality Thesis

Nicolas Villarreal · Jun 4, 2024, 6:52 PM
3 points
10 comments · 15 min read · LW link

Here’s Why Indefinite Life Extension Will Never Work, Even Though it Does.

HomingHamster · Jun 4, 2024, 6:48 PM
−13 points
5 comments · 18 min read · LW link

Ideas for Next-Generation Writing Platforms, using LLMs

ozziegooen · Jun 4, 2024, 6:40 PM
26 points
4 comments · LW link

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Erik Jenner · Jun 4, 2024, 3:50 PM
121 points
14 comments · 13 min read · LW link

Is This Lie Detector Really Just a Lie Detector? An Investigation of LLM Probe Specificity.

Josh Levy · Jun 4, 2024, 3:45 PM
39 points
0 comments · 18 min read · LW link

[Paper] Stress-testing capability elicitation with password-locked models

Jun 4, 2024, 2:52 PM
85 points
10 comments · 12 min read · LW link
(arxiv.org)

Circuit Board Ordering

jefftk · Jun 4, 2024, 2:00 PM
10 points
0 comments · 1 min read · LW link
(www.jefftk.com)

[Question] Has anyone here written about religious fictionalism?

SpectrumDT · Jun 4, 2024, 12:10 PM
0 points
4 comments · 1 min read · LW link

Is Wittgenstein’s Language Game used when helping Ai understand language?

VisionaryHera · Jun 4, 2024, 7:41 AM
3 points
7 comments · 1 min read · LW link

Smartphone Etiquette: Suggestions for Social Interactions

Declan Molony · Jun 4, 2024, 6:01 AM
26 points
4 comments · 3 min read · LW link

Just admit that you’ve zoned out

joec · Jun 4, 2024, 2:51 AM
91 points
22 comments · 2 min read · LW link

(Not) Derailing the LessOnline Puzzle Hunt

Error · Jun 4, 2024, 1:28 AM
74 points
2 comments · 4 min read · LW link

Masculinity—A Case For Courage

James Stephen Brown · Jun 4, 2024, 12:04 AM
24 points
0 comments · 7 min read · LW link
(nonzerosum.games)

Philosophers wrestling with evil, as a social media feed

David Gross · Jun 3, 2024, 10:25 PM
51 points
2 comments · 16 min read · LW link

ACI#8: Value as a Function of Possible Worlds

Akira Pyinya · Jun 3, 2024, 9:49 PM
6 points
2 comments · 7 min read · LW link