Seems like I did something similar at the same time as you: https://www.lesswrong.com/posts/pCMmLiBcHbKohQgwA/i-replicated-the-anthropic-alignment-faking-experiment-on
I tried a bunch of expensive models and got a bunch of null results.
I also experimented with Cursor when I was working on my replication attempt, and yeah, you have to watch out for bullshit.
But I was able to make the Cursor agent write, run and debug code in a loop with me just approving commands, and when it works, it’s amazing.
Gemini 2.5 Preview and Claude 4 in Cursor were both game changers for me.