Gradient Hacking

Tag (last edit: 27 Aug 2022 18:12 UTC by Multicore)

Gradient hacking describes a scenario in which a mesa-optimizer within an AI system deliberately acts to influence how gradient descent updates it, typically in order to preserve its own mesa-objective in future iterations of the system.

See also: Inner Alignment

Gradient hacking

evhub, 16 Oct 2019 0:53 UTC
104 points
39 comments, 3 min read, LW link, 2 reviews

Some real examples of gradient hacking

Oliver Sourbut, 22 Nov 2021 0:11 UTC
15 points
8 comments, 2 min read, LW link

Gradient Filtering

18 Jan 2023 20:09 UTC
54 points
16 comments, 13 min read, LW link

Gradient hacking is extremely difficult

beren, 24 Jan 2023 15:45 UTC
161 points
22 comments, 5 min read, LW link

Challenge: construct a Gradient Hacker

9 Mar 2023 2:38 UTC
38 points
10 comments, 1 min read, LW link

Thoughts on gradient hacking

Richard_Ngo, 3 Sep 2021 13:02 UTC
33 points
11 comments, 4 min read, LW link

Gradient Hacker Design Principles From Biology

johnswentworth, 1 Sep 2022 19:03 UTC
60 points
13 comments, 3 min read, LW link

Towards Deconfusing Gradient Hacking

leogao, 24 Oct 2021 0:43 UTC
39 points
3 comments, 12 min read, LW link

Gradient hacking: definitions and examples

Richard_Ngo, 29 Jun 2022 21:35 UTC
38 points
2 comments, 5 min read, LW link

Approaches to gradient hacking

adamShimi, 14 Aug 2021 15:16 UTC
16 points
8 comments, 8 min read, LW link

[Question] How does Gradient Descent Interact with Goodhart?

Scott Garrabrant, 2 Feb 2019 0:14 UTC
68 points
19 comments, 4 min read, LW link

Gradient Hacking via Schelling Goals

Adam Scherlis, 28 Dec 2021 20:38 UTC
33 points
4 comments, 4 min read, LW link

Is Fisherian Runaway Gradient Hacking?

Ryan Kidd, 10 Apr 2022 13:47 UTC
15 points
6 comments, 4 min read, LW link

(Extremely) Naive Gradient Hacking Doesn’t Work

ojorgensen, 20 Dec 2022 14:35 UTC
14 points
0 comments, 6 min read, LW link

[ASoT] Simulators show us behavioural properties by default

Jozdien, 13 Jan 2023 18:42 UTC
33 points
2 comments, 3 min read, LW link

Obstacles to gradient hacking

leogao, 5 Sep 2021 22:42 UTC
28 points
11 comments, 4 min read, LW link

Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor

RogerDearnaley, 9 Jan 2024 20:42 UTC
46 points
8 comments, 36 min read, LW link

Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation

23 Oct 2023 16:37 UTC
101 points
3 comments, 8 min read, LW link

Gradient hacking via actual hacking

Max H, 10 May 2023 1:57 UTC
12 points
7 comments, 3 min read, LW link

What an actually pessimistic containment strategy looks like

lc, 5 Apr 2022 0:19 UTC
667 points
138 comments, 6 min read, LW link, 2 reviews

Eliciting Credit Hacking Behaviours in LLMs

omegastick, 14 Sep 2023 15:07 UTC
3 points
2 comments, 7 min read, LW link
(github.com)

Meta learning to gradient hack

Quintin Pope, 1 Oct 2021 19:25 UTC
55 points
11 comments, 3 min read, LW link

Interpreting the Learning of Deceit

RogerDearnaley, 18 Dec 2023 8:12 UTC
30 points
8 comments, 9 min read, LW link

Understanding Gradient Hacking

peterbarnett, 10 Dec 2021 15:58 UTC
41 points
5 comments, 30 min read, LW link

Some motivations to gradient hack

peterbarnett, 17 Dec 2021 3:06 UTC
8 points
0 comments, 6 min read, LW link