Adversarial Examples (AI)
Tag
Last edit: Dec 14, 2024, 1:56 AM by Ruby
SolidGoldMagikarp (plus, prompt generation), by Jessica Rumbelow and mwatkins. Feb 5, 2023, 10:02 PM. 676 points, 206 comments, 12 min read. LW link. 1 review.
AI Safety in a World of Vulnerable Machine Learning Systems, by AdamGleave and EuanMcLean. Mar 8, 2023, 2:40 AM. 70 points, 28 comments, 29 min read. LW link (far.ai).
Ironing Out the Squiggles, by Zack_M_Davis. Apr 29, 2024, 4:13 PM. 157 points, 36 comments, 11 min read. LW link.
If I were a well-intentioned AI… I: Image classifier, by Stuart_Armstrong. Feb 26, 2020, 12:39 PM. 35 points, 4 comments, 5 min read. LW link.
AXRP Episode 1 - Adversarial Policies with Adam Gleave, by DanielFilan. Dec 29, 2020, 8:41 PM. 12 points, 5 comments, 34 min read. LW link.
Human beats SOTA Go AI by learning an adversarial policy, by Vanessa Kosoy. Feb 19, 2023, 9:38 AM. 59 points, 32 comments, 1 min read. LW link (goattack.far.ai).
Solving adversarial attacks in computer vision as a baby version of general AI alignment, by Stanislav Fort. Aug 29, 2024, 5:17 PM. 89 points, 8 comments, 7 min read. LW link.
There are (probably) no superhuman Go AIs: strong human players beat the strongest AIs, by Taran. Feb 19, 2023, 12:25 PM. 125 points, 34 comments, 4 min read. LW link.
The Goodhart Game, by John_Maxwell. Nov 18, 2019, 11:22 PM. 13 points, 5 comments, 5 min read. LW link.
Adversarial Policies Beat Professional-Level Go AIs, by sanxiyn. Nov 3, 2022, 1:27 PM. 31 points, 35 comments, 1 min read. LW link (goattack.alignmentfund.org).
[Question] What progress have we made on automated auditing?, by LawrenceC. Jul 6, 2024, 1:49 AM. 38 points, 1 comment, 1 min read. LW link.
Adversarial Robustness Could Help Prevent Catastrophic Misuse, by aog. Dec 11, 2023, 7:12 PM. 30 points, 18 comments, 9 min read. LW link.
RAIN: Your Language Models Can Align Themselves without Finetuning—Microsoft Research 2023 - Reduces the adversarial prompt attack success rate from 94% to 19%!, by Singularian2501. Sep 24, 2023, 4:48 PM. 5 points, 0 comments, 1 min read. LW link.
Deep Forgetting & Unlearning for Safely-Scoped LLMs, by scasper. Dec 5, 2023, 4:48 PM. 126 points, 30 comments, 13 min read. LW link.
Image Hijacks: Adversarial Images can Control Generative Models at Runtime, by Scott Emmons, Luke Bailey and Euan Ong. Sep 20, 2023, 3:23 PM. 58 points, 9 comments, 1 min read. LW link (arxiv.org).
EIS IX: Interpretability and Adversaries, by scasper. Feb 20, 2023, 6:25 PM. 30 points, 8 comments, 8 min read. LW link.
The Achilles Heel Hypothesis for AI, by scasper. Oct 13, 2020, 2:35 PM. 20 points, 6 comments, 1 min read. LW link.
Adversarial attacks and optimal control, by Jan. May 22, 2022, 6:22 PM. 17 points, 7 comments, 8 min read. LW link (universalprior.substack.com).
Artefacts generated by mode collapse in GPT-4 Turbo serve as adversarial attacks., by Sohaib Imran. Nov 10, 2023, 3:23 PM. 11 points, 0 comments, 2 min read. LW link.
Beyond the Board: Exploring AI Robustness Through Go, by AdamGleave. Jun 19, 2024, 4:40 PM. 41 points, 2 comments, 1 min read. LW link (far.ai).
An adversarial example for Direct Logit Attribution: memory management in gelu-4l, by Can, Yeu-Tong Lau, James Dao and Jett Janiak. Aug 30, 2023, 5:36 PM. 17 points, 0 comments, 8 min read. LW link (arxiv.org).
Features and Adversaries in MemoryDT, by Joseph Bloom and Jay Bailey. Oct 20, 2023, 7:32 AM. 31 points, 6 comments, 25 min read. LW link.
EIS X: Continual Learning, Modularity, Compression, and Biological Brains, by scasper. Feb 21, 2023, 4:59 PM. 14 points, 4 comments, 3 min read. LW link.
Analysing Adversarial Attacks with Linear Probing, by Yoann Poupart, Imene Kerboua, Clement Neo and Jason Hoelscher-Obermaier. Jun 17, 2024, 2:16 PM. 9 points, 0 comments, 8 min read. LW link.
Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI, by bayesian_kitten. Dec 16, 2021, 10:41 PM. 22 points, 10 comments, 21 min read. LW link.
[AN #62] Are adversarial examples caused by real but imperceptible features?, by Rohin Shah. Aug 22, 2019, 5:10 PM. 28 points, 10 comments, 9 min read. LW link (mailchi.mp).
High-stakes alignment via adversarial training [Redwood Research report], by dmz, LawrenceC and Nate Thomas. May 5, 2022, 12:59 AM. 142 points, 29 comments, 9 min read. LW link.
A Search for More ChatGPT / GPT-3.5 / GPT-4 “Unspeakable” Glitch Tokens, by Martin Fell. May 9, 2023, 2:36 PM. 26 points, 9 comments, 6 min read. LW link.
SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4, by AdamYedidia. Apr 15, 2023, 10:35 PM. 71 points, 18 comments, 6 min read. LW link.
Does robustness improve with scale?, by ChengCheng, niki.h, Ian McKenzie, Oskar Hollinsworth, Tom Tseng and AdamGleave. Jul 25, 2024, 8:55 PM. 14 points, 0 comments, 1 min read. LW link (far.ai).
EIS XII: Summary, by scasper. Feb 23, 2023, 5:45 PM. 19 points, 0 comments, 6 min read. LW link.
Even Superhuman Go AIs Have Surprising Failure Modes, by AdamGleave, EuanMcLean, Tony Wang, Kellin Pelrine, Tom Tseng, Yawen Duan, Joseph Miller and MichaelDennis. Jul 20, 2023, 5:31 PM. 130 points, 22 comments, 10 min read. LW link (far.ai).