Pattern comments on Risks from Learned Optimization: Introduction

Pattern 2 Jun 2019 22:03 UTC
1 point
In the fourth post, we will discuss a possible extreme inner alignment failure—which we believe presents one of the most dangerous risks along these lines—wherein a sufficiently capable misaligned mesa-optimizer could learn to behave as if it were aligned without actually being robustly aligned. We will call this situation deceptive alignment.
How does this relate to Stories of Continuous Deception?