
Adversarial Training

Tag · Last edit: 3 Jun 2022 1:30 UTC by Ruby

Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing

Buck · 2 Jun 2022 23:48 UTC
42 points
0 comments · 3 min read · LW link

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming

Buck · 10 Oct 2024 13:36 UTC
101 points
4 comments · 13 min read · LW link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav Fort · 29 Aug 2024 17:17 UTC
98 points
9 comments · 7 min read · LW link

High-stakes alignment via adversarial training [Redwood Research report]

5 May 2022 0:59 UTC
142 points
29 comments · 9 min read · LW link

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

scasper · 30 Jul 2024 14:57 UTC
25 points
0 comments · 4 min read · LW link

AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training

Charbel-Raphaël · 31 Oct 2023 14:34 UTC
17 points
0 comments · 19 min read · LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper · 5 Dec 2023 16:48 UTC
127 points
30 comments · 13 min read · LW link

AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

DanielFilan · 21 Aug 2022 23:50 UTC
16 points
0 comments · 35 min read · LW link

Takeaways from our robust injury classifier project [Redwood Research]

dmz · 17 Sep 2022 3:55 UTC
143 points
12 comments · 6 min read · LW link · 1 review

Ironing Out the Squiggles

Zack_M_Davis · 29 Apr 2024 16:13 UTC
164 points
37 comments · 11 min read · LW link

Adversarial Robustness Could Help Prevent Catastrophic Misuse

aog · 11 Dec 2023 19:12 UTC
30 points
18 comments · 9 min read · LW link

Some thoughts on why adversarial training might be useful

Beth Barnes · 8 Dec 2021 1:28 UTC
9 points
6 comments · 3 min read · LW link

Oversight Leagues: The Training Game as a Feature

Paul Bricman · 9 Sep 2022 10:08 UTC
20 points
6 comments · 10 min read · LW link

EIS IX: Interpretability and Adversaries

scasper · 20 Feb 2023 18:25 UTC
30 points
8 comments · 8 min read · LW link

Latent Adversarial Training (LAT) Improves the Representation of Refusal

6 Jan 2025 10:24 UTC
21 points
6 comments · 10 min read · LW link

Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI

Benaya Koren · 8 Jul 2023 17:32 UTC
6 points
0 comments · 9 min read · LW link

Can We Trust the Judge? A Novel Method of Modelling Human Bias and Systematic Error in Debate-Based Scalable Oversight

Andreea Zaman · 19 Jul 2025 21:44 UTC
1 point
0 comments · 7 min read · LW link

Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training

14 Oct 2025 20:54 UTC
8 points
0 comments · 7 min read · LW link

Does robustness improve with scale?

25 Jul 2024 20:55 UTC
14 points
0 comments · 1 min read · LW link
(far.ai)

EIS XII: Summary

scasper · 23 Feb 2023 17:45 UTC
19 points
0 comments · 6 min read · LW link

Loss Curves

James Camacho · 6 May 2025 22:22 UTC
16 points
3 comments · 4 min read · LW link
(github.com)

Beyond the Board: Exploring AI Robustness Through Go

AdamGleave · 19 Jun 2024 16:40 UTC
41 points
2 comments · 1 min read · LW link
(far.ai)

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee · 14 Apr 2025 10:27 UTC
−3 points
2 comments · 4 min read · LW link

LLM Sycophancy: grooming, proto-sentience, or both?

gturner413 Oct 2025 0:58 UTC
1 point
0 comments · 2 min read · LW link

EIS XI: Moving Forward

scasper · 22 Feb 2023 19:05 UTC
19 points
2 comments · 9 min read · LW link

Intriguing Properties of gpt-oss Jailbreaks

13 Aug 2025 19:42 UTC
16 points
0 comments · 10 min read · LW link
(xlabaisecurity.com)

Latent Adversarial Training

Adam Jermyn · 29 Jun 2022 20:04 UTC
57 points
13 comments · 5 min read · LW link