Adversarial Training
Tag
Last edit: 3 Jun 2022 1:30 UTC by Ruby
AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler (DanielFilan, 21 Aug 2022 23:50 UTC; 16 points, 0 comments, 35 min read)
AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training (Charbel-Raphaël, 31 Oct 2023 14:34 UTC; 17 points, 0 comments, 19 min read)
Takeaways from our robust injury classifier project [Redwood Research] (dmz, 17 Sep 2022 3:55 UTC; 143 points, 12 comments, 6 min read, 1 review)
Deep Forgetting & Unlearning for Safely-Scoped LLMs (scasper, 5 Dec 2023 16:48 UTC; 109 points, 29 comments, 13 min read)
Some thoughts on why adversarial training might be useful (Beth Barnes, 8 Dec 2021 1:28 UTC; 9 points, 6 comments, 3 min read)
Adversarial Robustness Could Help Prevent Catastrophic Misuse (aogara, 11 Dec 2023 19:12 UTC; 30 points, 18 comments, 9 min read)
Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing (Buck, 2 Jun 2022 23:48 UTC; 37 points, 0 comments, 3 min read)
Ironing Out the Squiggles (Zack_M_Davis, 29 Apr 2024 16:13 UTC; 140 points, 33 comments, 11 min read)
EIS IX: Interpretability and Adversaries (scasper, 20 Feb 2023 18:25 UTC; 30 points, 7 comments, 8 min read)
Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI (Benaya Koren, 8 Jul 2023 17:32 UTC; 6 points, 0 comments, 9 min read)
Latent Adversarial Training (Adam Jermyn, 29 Jun 2022 20:04 UTC; 42 points, 12 comments, 5 min read)
Oversight Leagues: The Training Game as a Feature (Paul Bricman, 9 Sep 2022 10:08 UTC; 20 points, 6 comments, 10 min read)
EIS XI: Moving Forward (scasper, 22 Feb 2023 19:05 UTC; 19 points, 2 comments, 9 min read)
EIS XII: Summary (scasper, 23 Feb 2023 17:45 UTC; 17 points, 0 comments, 6 min read)