Adversarial Examples

Tag

SolidGoldMagikarp (plus, prompt generation)

5 Feb 2023 22:02 UTC
651 points
199 comments · 12 min read · LW link

AI Safety in a World of Vulnerable Machine Learning Systems

8 Mar 2023 2:40 UTC
58 points
10 comments · 29 min read · LW link
(far.ai)

AXRP Episode 1 - Adversarial Policies with Adam Gleave

DanielFilan · 29 Dec 2020 20:41 UTC
12 points
5 comments · 33 min read · LW link

The Goodhart Game

John_Maxwell · 18 Nov 2019 23:22 UTC
13 points
5 comments · 5 min read · LW link

If I were a well-intentioned AI… I: Image classifier

Stuart_Armstrong · 26 Feb 2020 12:39 UTC
35 points
4 comments · 5 min read · LW link

Adversarial Policies Beat Professional-Level Go AIs

sanxiyn · 3 Nov 2022 13:27 UTC
31 points
35 comments · 1 min read · LW link
(goattack.alignmentfund.org)

Human beats SOTA Go AI by learning an adversarial policy

Vanessa Kosoy · 19 Feb 2023 9:38 UTC
55 points
32 comments · 1 min read · LW link
(goattack.far.ai)

[AN #62] Are adversarial examples caused by real but imperceptible features?

Rohin Shah · 22 Aug 2019 17:10 UTC
27 points
10 comments · 9 min read · LW link
(mailchi.mp)

The Achilles Heel Hypothesis for AI

scasper · 13 Oct 2020 14:35 UTC
20 points
6 comments · 1 min read · LW link

SmartyHeaderCode: anomalous tokens for GPT3.5 and GPT-4

AdamYedidia · 15 Apr 2023 22:35 UTC
70 points
18 comments · 6 min read · LW link

A Search for More ChatGPT / GPT-3.5 / GPT-4 “Unspeakable” Glitch Tokens

Martin Fell · 9 May 2023 14:36 UTC
14 points
6 comments · 6 min read · LW link

EIS IX: Interpretability and Adversaries

scasper · 20 Feb 2023 18:25 UTC
29 points
5 comments · 8 min read · LW link

EIS X: Continual Learning, Modularity, Compression, and Biological Brains

scasper · 21 Feb 2023 16:59 UTC
14 points
3 comments · 3 min read · LW link

EIS XII: Summary

scasper · 23 Feb 2023 17:45 UTC
12 points
0 comments · 6 min read · LW link

Evidence Sets: Towards Inductive-Biases based Analysis of Prosaic AGI

bayesian_kitten · 16 Dec 2021 22:41 UTC
22 points
10 comments · 21 min read · LW link

High-stakes alignment via adversarial training [Redwood Research report]

5 May 2022 0:59 UTC
142 points
29 comments · 9 min read · LW link

Adversarial attacks and optimal control

Jan · 22 May 2022 18:22 UTC
17 points
7 comments · 8 min read · LW link
(universalprior.substack.com)