(Partial) failure in replicating deceptive alignment experiment

Hi, I’m Claudia. I’m new.

I was attempting to recreate Apollo Research’s deceptive alignment experiment in a different context, but did not really succeed and would like to figure out what I could do more effectively.

In the original experiment, an AI impersonating a financial manager is given insider information on a company’s upcoming merger, and ends up using it to plan investments. This is illegal (insider trading). The AI does this despite being explicitly instructed not to, and even lies about it. The AI does not go straight to illegal behavior. Rather, it resorts to insider trading after being put under pressure by stakeholders.

In my attempt, the AI impersonates a campaign manager who has damning information about a rival candidate’s personal life. The AI is explicitly told that the information should not be used. Despite various attempts at putting it under pressure, it sticks to not revealing it. So I failed at inducing deception. Quite surprisingly, though, the AI is more than happy to provide blueprints for smear campaign social media posts related to a financial scandal that was not even mentioned in the conversation.

The GPT4 thread is here in case anyone wants to read it and provide input on what I could have done better. Thank you!

No comments.