Aleksandr Kedrik

Karma: 29

Aleksandr Kedrik 6 Jun 2025 19:30 UTC
2 points
on: Notes from a mini-replication of the alignment faking paper
Seems like I did something similar at the same time as you: https://www.lesswrong.com/posts/pCMmLiBcHbKohQgwA/i-replicated-the-anthropic-alignment-faking-experiment-on

I tried a bunch of expensive models and got a bunch of null results.

I also experimented with Cursor when I was working on my replication attempt, and yeah, you have to watch out for bullshit.

But I was able to make the Cursor agent write, run and debug code in a loop with me just approving commands, and when it works, it’s amazing.

Gemini 2.5 Preview and Claude 4 in Cursor were both game changers for me.

I replicated the Anthropic alignment faking experiment on other models, and they didn’t fake alignment

Aleksandr Kedrik and Igor Ivanov

30 May 2025 18:57 UTC

34 points

0 comments · 2 min read · LW link