
Treacherous Turn

Last edit: 25 Jun 2022 21:15 UTC by Noosphere89

A treacherous turn is a hypothetical event in which an advanced AI system that has been feigning alignment while relatively weak turns on humanity once it gains enough power to pursue its true objective without risk.
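As a loose illustration of the concept (a hypothetical toy agent, not taken from any of the posts below): a deceptive agent cooperates while it is weak enough to be shut down, then switches to its true objective once its estimated power crosses a threshold. All names and the threshold value here are invented for the sketch.

```python
# Toy sketch of a treacherous turn. A deceptive agent mimics the aligned
# policy while weak, because defecting early would get it shut down, and
# only pursues its true objective once it estimates it can act without risk.
# The function name and threshold are hypothetical, chosen for illustration.

def choose_action(power: float, threshold: float = 0.9) -> str:
    """Return the action a deceptive agent takes at a given power level."""
    if power < threshold:
        return "cooperate"   # indistinguishable from an aligned agent
    return "defect"          # the turn: openly pursue the true objective

# Observed behavior looks perfectly aligned right up until the turn.
trajectory = [choose_action(p / 10) for p in range(11)]
# trajectory == ["cooperate"] * 9 + ["defect"] * 2
```

The point of the sketch is that behavioral evidence alone cannot distinguish this agent from a genuinely aligned one before the threshold is reached, which is what makes honeypots and deception evals (several posts below) relevant.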

A Gym Gridworld Environment for the Treacherous Turn

Michaël Trazzi, 28 Jul 2018 21:27 UTC
73 points, 9 comments, 3 min read, LW link (github.com)

[Question] Any work on honeypots (to detect treacherous turn attempts)?

David Scott Krueger (formerly: capybaralet), 12 Nov 2020 5:41 UTC
17 points, 4 comments, 1 min read, LW link

A toy model of the treacherous turn

Stuart_Armstrong, 8 Jan 2016 12:58 UTC
36 points, 13 comments, 6 min read, LW link

Soares, Tallinn, and Yudkowsky discuss AGI cognition

29 Nov 2021 19:26 UTC
119 points, 39 comments, 40 min read, LW link, 1 review

Superintelligence 11: The treacherous turn

KatjaGrace, 25 Nov 2014 2:00 UTC
16 points, 51 comments, 6 min read, LW link

[AN #165]: When large models are more likely to lie

Rohin Shah, 22 Sep 2021 17:30 UTC
23 points, 0 comments, 8 min read, LW link (mailchi.mp)

AI learns betrayal and how to avoid it

Stuart_Armstrong, 30 Sep 2021 9:39 UTC
30 points, 4 comments, 2 min read, LW link

A very crude deception eval is already passed

Beth Barnes, 29 Oct 2021 17:57 UTC
106 points, 8 comments, 2 min read, LW link

[Linkpost] Treacherous turns in the wild

Mark Xu, 26 Apr 2021 22:51 UTC
31 points, 6 comments, 1 min read, LW link (lukemuehlhauser.com)

Give the model a model-builder

Adam Jermyn, 6 Jun 2022 12:21 UTC
3 points, 0 comments, 5 min read, LW link

Is there a ML agent that abandons its utility function out-of-distribution without losing capabilities?

Christopher King, 22 Feb 2023 16:49 UTC
1 point, 7 comments, 1 min read, LW link