Adversarial Training

Tag · Last edit: Jun 3, 2022, 1:30 AM by Ruby

Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing

Buck · Jun 2, 2022, 11:48 PM
42 points

20 votes

Overall karma indicates overall quality.

0 comments · 3 min read · LW link

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren’t scheming

Buck · Oct 10, 2024, 1:36 PM
101 points

35 votes

4 comments · 13 min read · LW link

Solving adversarial attacks in computer vision as a baby version of general AI alignment

Stanislav Fort · Aug 29, 2024, 5:17 PM
98 points

41 votes

9 comments · 7 min read · LW link

High-stakes alignment via adversarial training [Redwood Research report]

May 5, 2022, 12:59 AM
142 points

64 votes

29 comments · 9 min read · LW link

Can Generalized Adversarial Testing Enable More Rigorous LLM Safety Evals?

scasper · Jul 30, 2024, 2:57 PM
25 points

12 votes

0 comments · 4 min read · LW link

AI Safety 101 - Chapter 5.2 - Unrestricted Adversarial Training

Charbel-Raphaël · Oct 31, 2023, 2:34 PM
17 points

7 votes

0 comments · 19 min read · LW link

Deep Forgetting & Unlearning for Safely-Scoped LLMs

scasper · Dec 5, 2023, 4:48 PM
127 points

54 votes

30 comments · 13 min read · LW link

AXRP Episode 17 - Training for Very High Reliability with Daniel Ziegler

DanielFilan · Aug 21, 2022, 11:50 PM
16 points

6 votes

0 comments · 35 min read · LW link

Takeaways from our robust injury classifier project [Redwood Research]

dmz · Sep 17, 2022, 3:55 AM
143 points

64 votes

12 comments · 6 min read · LW link · 1 review

Ironing Out the Squiggles

Zack_M_Davis · Apr 29, 2024, 4:13 PM
159 points

70 votes

36 comments · 11 min read · LW link

Adversarial Robustness Could Help Prevent Catastrophic Misuse

aog · Dec 11, 2023, 7:12 PM
30 points

9 votes

18 comments · 9 min read · LW link

Some thoughts on why adversarial training might be useful

Beth Barnes · Dec 8, 2021, 1:28 AM
9 points

4 votes

6 comments · 3 min read · LW link

Oversight Leagues: The Training Game as a Feature

Paul Bricman · Sep 9, 2022, 10:08 AM
20 points

9 votes

6 comments · 10 min read · LW link

EIS IX: Interpretability and Adversaries

scasper · Feb 20, 2023, 6:25 PM
30 points

11 votes

8 comments · 8 min read · LW link

Latent Adversarial Training (LAT) Improves the Representation of Refusal

Jan 6, 2025, 10:24 AM
21 points

13 votes

6 comments · 10 min read · LW link

Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI

Benaya Koren · Jul 8, 2023, 5:32 PM
6 points

7 votes

0 comments · 9 min read · LW link

Can We Trust the Judge? A novel method of Modelling Human Bias and Systematic Error in Debate-Based Scalable Oversight

Andreea Zaman · Jul 19, 2025, 9:44 PM
1 point

1 vote

0 comments · 7 min read · LW link

Enhancing Genomic Foundation Model Robustness through Iterative Black-Box Adversarial Training

Oct 14, 2025, 8:54 PM
8 points

5 votes

0 comments · 7 min read · LW link

Does robustness improve with scale?

Jul 25, 2024, 8:55 PM
14 points

6 votes

0 comments · 1 min read · LW link (far.ai)

EIS XII: Summary

scasper · Feb 23, 2023, 5:45 PM
19 points

11 votes

0 comments · 6 min read · LW link

Loss Curves

James Camacho · May 6, 2025, 10:22 PM
16 points

5 votes

3 comments · 4 min read · LW link (github.com)

Beyond the Board: Exploring AI Robustness Through Go

AdamGleave · Jun 19, 2024, 4:40 PM
41 points

12 votes

2 comments · 1 min read · LW link (far.ai)

A Solution to Sandbagging and other Self-Provable Misalignment: Constitutional AI Detectives

Knight Lee · Apr 14, 2025, 10:27 AM
−3 points

2 votes

2 comments · 4 min read · LW link

LLM Sycophancy: grooming, proto-sentience, or both?

gturner4 · Oct 13, 2025, 12:58 AM
1 point

1 vote

0 comments · 2 min read · LW link

EIS XI: Moving Forward

scasper · Feb 22, 2023, 7:05 PM
19 points

11 votes

2 comments · 9 min read · LW link

Intriguing Properties of gpt-oss Jailbreaks

Aug 13, 2025, 7:42 PM
16 points

11 votes

0 comments · 10 min read · LW link (xlabaisecurity.com)

Latent Adversarial Training

Adam Jermyn · Jun 29, 2022, 8:04 PM
57 points

27 votes

13 comments · 5 min read · LW link