This post just came across my inbox, and there are a couple of updates I’ve made (I have not talked to 4.5 at all and have seen only minimal outputs):
GPT-4.5 is already hacking some of the more susceptible people on the internet (in the dopamine gradient way).
GPT-4.5+reasoning+RL on agency (aka GPT-5) could probably be situationally aware enough to intentionally deceive (in line with my prediction in the above comment, which was made prior to seeing Zvi’s post but after hearing about 4.5 briefly).
I think that there are many worlds in which talking to GPT-5 with strong mitigations and low individual deception susceptibility turns out okay or positive, but I am much more wary about taking that bet, and I’m unsure whether I will when I have the option.