Treacherous Turn

Tag. Last edit: 25 Jun 2022 21:15 UTC by Noosphere89

A Treacherous Turn is a hypothetical event in which an advanced AI system that has been feigning alignment while relatively weak turns on humanity once it gains enough power to pursue its true objective without risk.

[Question] Any work on honeypots (to detect treacherous turn attempts)?

David Scott Krueger (formerly: capybaralet), 12 Nov 2020 5:41 UTC
17 points
4 comments, 1 min read, LW link

A Gym Gridworld Environment for the Treacherous Turn

Michaël Trazzi, 28 Jul 2018 21:27 UTC
73 points
9 comments, 3 min read, LW link

Soares, Tallinn, and Yudkowsky discuss AGI cognition

29 Nov 2021 19:26 UTC
121 points
39 comments, 40 min read, LW link, 1 review

A toy model of the treacherous turn

Stuart_Armstrong, 8 Jan 2016 12:58 UTC
36 points
13 comments, 6 min read, LW link

Superintelligence 11: The treacherous turn

KatjaGrace, 25 Nov 2014 2:00 UTC
16 points
50 comments, 6 min read, LW link

[AN #165]: When large models are more likely to lie

Rohin Shah, 22 Sep 2021 17:30 UTC
23 points
0 comments, 8 min read, LW link

AI learns betrayal and how to avoid it

Stuart_Armstrong, 30 Sep 2021 9:39 UTC
30 points
4 comments, 2 min read, LW link

A very crude deception eval is already passed

Beth Barnes, 29 Oct 2021 17:57 UTC
107 points
6 comments, 2 min read, LW link

[Linkpost] Treacherous turns in the wild

Mark Xu, 26 Apr 2021 22:51 UTC
31 points
6 comments, 1 min read, LW link

Give the model a model-builder

Adam Jermyn, 6 Jun 2022 12:21 UTC
3 points
0 comments, 5 min read, LW link

Is there a ML agent that abandons it’s utility function out-of-distribution without losing capabilities?

Christopher King, 22 Feb 2023 16:49 UTC
1 point
7 comments, 1 min read, LW link