This is bananas! I would never have thought that a setup like this would produce these results. I would love to hear about the thought process that led to the hypothesis behind this test. I love the “Evil Numbers” version too.
Here’s the story of how we found it: https://www.lesswrong.com/posts/tgHps2cxiGDkNxNZN/finding-emergent-misalignment