In that scenario, GPT would just simulate a human acting deceptively. It’s not actually trying to deceive you.
I guess it could be informative to discover how sophisticated a deception scenario it is able to simulate, however.
GPT-like models won’t be deceiving anyone in pursuit of their own goals. However, humans with nefarious objectives could still use them to construct deceptive narratives for purposes of social engineering.
Sorry I mean a prompt that incentivizes employing capabilities relevant for deceptive alignment. For instance, you might have a prompt that suggests to the model that it act non-myopically, or that tries to get at how much it understands about its own architecture, both of which seem relevant to implementing deceptive alignment.
I’m imagining that you prompt the model with a scenario that incentivizes deception.
In that scenario, GPT would just simulate a human acting deceptively. It’s not actually trying to deceive you.
I guess it could be informative to discover how sophisticated a deception scenario it is able to simulate, however.
GPT-like models won’t be deceiving anyone in pursuit of their own goals. However, humans with nefarious objectives could still use them to construct deceptive narratives for purposes of social engineering.
Sorry I mean a prompt that incentivizes employing capabilities relevant for deceptive alignment. For instance, you might have a prompt that suggests to the model that it act non-myopically, or that tries to get at how much it understands about its own architecture, both of which seem relevant to implementing deceptive alignment.