On second thought, prompt injections are probably examples of misalignment

lc · Feb 20, 2023, 11:56 PM
22 points
5 comments · 1 min read · LW link

Nothing Is Ever Taught Correctly

LVSN · Feb 20, 2023, 10:31 PM
5 points
3 comments · 1 min read · LW link

Behavioral and mechanistic definitions (often confuse AI alignment discussions)

LawrenceC · Feb 20, 2023, 9:33 PM
33 points
5 comments · 6 min read · LW link

Validator models: A simple approach to detecting goodharting

beren · Feb 20, 2023, 9:32 PM
14 points
1 comment · 4 min read · LW link

There are no coherence theorems

Feb 20, 2023, 9:25 PM
149 points
130 comments · 19 min read · LW link · 1 review

[Question] Are there any AI safety relevant fully remote roles suitable for someone with 2-3 years of machine learning engineering industry experience?

Malleable_shape · Feb 20, 2023, 7:57 PM
7 points
2 comments · 1 min read · LW link

A circuit for Python docstrings in a 4-layer attention-only transformer

Feb 20, 2023, 7:35 PM
96 points
8 comments · 21 min read · LW link

Sydney the Bingenator Can’t Think, But It Still Threatens People

Valentin Baltadzhiev · Feb 20, 2023, 6:37 PM
−3 points
2 comments · 8 min read · LW link

EIS IX: Interpretability and Adversaries

scasper · Feb 20, 2023, 6:25 PM
30 points
8 comments · 8 min read · LW link

What AI companies can do today to help with the most important century

HoldenKarnofsky · Feb 20, 2023, 5:00 PM
38 points
3 comments · 9 min read · LW link
(www.cold-takes.com)

Bankless Podcast: 159 - We’re All Gonna Die with Eliezer Yudkowsky

bayesed · Feb 20, 2023, 4:42 PM
83 points
54 comments · 1 min read · LW link
(www.youtube.com)

Speculative Technologies launch and Ben Reinhardt AMA

jasoncrawford · Feb 20, 2023, 4:33 PM
16 points
0 comments · 1 min read · LW link
(rootsofprogress.org)

[MLSN #8] Mechanistic interpretability, using law to inform AI alignment, scaling laws for proxy gaming

Feb 20, 2023, 3:54 PM
20 points
0 comments · 4 min read · LW link
(newsletter.mlsafety.org)

Bing finding ways to bypass Microsoft’s filters without being asked. Is it reproducible?

Christopher King · Feb 20, 2023, 3:11 PM
27 points
15 comments · 1 min read · LW link

Metaculus Introduces New ‘Conditional Pair’ Forecast Questions for Making Conditional Predictions

ChristianWilliams · Feb 20, 2023, 1:36 PM
40 points
0 comments · LW link

On Investigating Conspiracy Theories

Zvi · Feb 20, 2023, 12:50 PM
116 points
38 comments · 5 min read · LW link
(thezvi.wordpress.com)

The Estimation Game: a monthly Fermi estimation web app

Feb 20, 2023, 11:33 AM
20 points
2 comments · 1 min read · LW link

The idea that ChatGPT is simply “predicting” the next word is, at best, misleading

Bill Benzon · Feb 20, 2023, 11:32 AM
55 points
88 comments · 5 min read · LW link

Russell Conjugations list & voting thread

Daniel Kokotajlo · Feb 20, 2023, 6:39 AM
23 points
63 comments · 1 min read · LW link

Emergent Deception and Emergent Optimization

jsteinhardt · Feb 20, 2023, 2:40 AM
64 points
0 comments · 14 min read · LW link
(bounded-regret.ghost.io)

AGI doesn’t need understanding, intention, or consciousness in order to kill us, only intelligence

James Blaha · Feb 20, 2023, 12:55 AM
10 points
2 comments · 18 min read · LW link

Remote AI Alignment Overhang?

tryactions · Feb 19, 2023, 10:30 PM
37 points
5 comments · 4 min read · LW link

A Neural Network undergoing Gradient-based Training as a Complex System

carboniferous_umbraculum · Feb 19, 2023, 10:08 PM
22 points
1 comment · 19 min read · LW link

Another Way to Be Okay

Gretta Duleba · Feb 19, 2023, 8:49 PM
107 points
15 comments · 6 min read · LW link

A Way To Be Okay

Duncan Sabien (Inactive) · Feb 19, 2023, 8:27 PM
109 points
38 comments · 10 min read · LW link · 1 review

Exploring Lily’s world with ChatGPT [things an AI won’t do]

Bill Benzon · Feb 19, 2023, 4:39 PM
5 points
0 comments · 20 min read · LW link

EIS VIII: An Engineer’s Understanding of Deceptive Alignment

scasper · Feb 19, 2023, 3:25 PM
30 points
5 comments · 4 min read · LW link

Does novel understanding imply novel agency / values?

TsviBT · Feb 19, 2023, 2:41 PM
18 points
0 comments · 7 min read · LW link

There are (probably) no superhuman Go AIs: strong human players beat the strongest AIs

Taran · Feb 19, 2023, 12:25 PM
125 points
34 comments · 4 min read · LW link

Navigating public AI x-risk hype while pursuing technical solutions

Dan Braun · Feb 19, 2023, 12:22 PM
18 points
0 comments · 2 min read · LW link

Somewhat against “just update all the way”

tailcalled · Feb 19, 2023, 10:49 AM
31 points
10 comments · 2 min read · LW link

Human beats SOTA Go AI by learning an adversarial policy

Vanessa Kosoy · Feb 19, 2023, 9:38 AM
59 points
32 comments · 1 min read · LW link
(goattack.far.ai)

Degamification

Nate Showell · Feb 19, 2023, 5:35 AM
23 points
2 comments · 2 min read · LW link

Stop posting prompt injections on Twitter and calling it “misalignment”

lc · Feb 19, 2023, 2:21 AM
144 points
9 comments · 1 min read · LW link

AGI in sight: our look at the game board

Feb 18, 2023, 10:17 PM
227 points
135 comments · 6 min read · LW link
(andreamiotti.substack.com)

We should be signal-boosting anti Bing chat content

mbrooks · Feb 18, 2023, 6:52 PM
−4 points
13 comments · 2 min read · LW link

Can talk, can think, can suffer.

Ilio · Feb 18, 2023, 6:43 PM
1 point
8 comments · 3 min read · LW link

Parametrically retargetable decision-makers tend to seek power

TurnTrout · Feb 18, 2023, 6:41 PM
172 points
10 comments · 2 min read · LW link
(arxiv.org)

Near-Term Risks of an Obedient Artificial Intelligence

ymeskhout · Feb 18, 2023, 6:30 PM
20 points
1 comment · 6 min read · LW link

EIS VII: A Challenge for Mechanists

scasper · Feb 18, 2023, 6:27 PM
36 points
4 comments · 3 min read · LW link

Reading Speed Exists!

Johannes C. Mayer · Feb 18, 2023, 3:30 PM
12 points
9 comments · 1 min read · LW link

The Practitioner’s Path 2.0: the Meditative Archetype

Evenflair · Feb 18, 2023, 3:23 PM
14 points
1 comment · 2 min read · LW link
(guildoftherose.org)

Should we cry “wolf”?

Tapatakt · 18 Feb 2023 11:24 UTC
24 points
5 comments · 1 min read · LW link

[Question] Name of the fallacy of assuming an extreme value (e.g. 0) with the illusion of ‘avoiding to have to make an assumption’?

FlorianH · 18 Feb 2023 8:11 UTC
4 points
1 comment · 1 min read · LW link

I Think We’re Approaching The Bitter Lesson’s Asymptote

SomeoneYouOnceKnew · 18 Feb 2023 5:33 UTC
−3 points
9 comments · 5 min read · LW link

Bus-Only Bus Lane Enforcement

jefftk · 18 Feb 2023 2:50 UTC
19 points
15 comments · 1 min read · LW link
(www.jefftk.com)

Run Head on Towards the Falling Tears

Johannes C. Mayer · 18 Feb 2023 1:33 UTC
6 points
0 comments · 2 min read · LW link

Two problems with ‘Simulators’ as a frame

ryan_greenblatt · 17 Feb 2023 23:34 UTC
79 points
13 comments · 5 min read · LW link

GPT-4 Predictions

Stephen McAleese · 17 Feb 2023 23:20 UTC
110 points
27 comments · 11 min read · LW link

On Board Vision, Hollow Words, and the End of the World

Marcello · 17 Feb 2023 23:18 UTC
52 points
27 comments · 5 min read · LW link