ChatGPT’s Misalignment Isn’t What You Think

ChatGPT can be ‘tricked’ into saying naughty things.

This is a red herring.

Alignment could be paraphrased thus: ensuring AIs are neither used nor act in ways that harm us.

Tell me, oh wise rationalists, what causes greater harm: tricking a chatbot into saying something naughty, or a chatbot tricking a human into thinking they’re interacting with (talking to, reading or viewing content authored by) another human being?

I can no longer assume the posts I read are written by humans, nor do I have any reasonable means of verifying their authenticity (someone should really work on this; we need a Twitter tick, but for humans).
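For the curious, here is a minimal sketch of what such a tick might look like mechanically: a registry of verified humans publishes public keys, and each post carries a signature readers can check. Everything here (the `cryptography` library choice, the registry, the name `stavros`) is an illustrative assumption, not a real proposal; the genuinely hard part, proving personhood in the first place, happens out of band.

```python
# Sketch of a "Twitter tick, but for humans": a verified-human registry
# publishes public keys; authors sign posts; readers verify signatures.
# Requires the third-party `cryptography` package. All names are hypothetical.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Key issued once, after some out-of-band proof of personhood (the unsolved part).
author_key = Ed25519PrivateKey.generate()
registry = {"stavros": author_key.public_key()}  # public "tick" registry

post = b"I wrote this myself."
signature = author_key.sign(post)  # the author signs each post they publish

# Any reader can check the post against the public registry.
try:
    registry["stavros"].verify(signature, post)
    print("Signed by a verified human.")
except InvalidSignature:
    print("No proof a human wrote this.")
```

Note that this only proves a verified human *signed* the post, not that a human wrote it; the scheme verifies identity, not authorship.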

I can no longer assume that the posts I write could not have been written by GPT-X, or that I can contribute anything to a conversation that is noticeably different from the output of a chatbot.[1]

I can no longer post publicly without the chilling knowledge that my words are training the next iteration of GPT-Stavros, that every word is a piece of my soul being captured.

Passing the Turing Test is not a cause for celebration; it’s a cause for despair. We are used to thinking of automation as making humans redundant at work; now we must adapt to a future in which automation makes humans redundant in human interaction.

Do you see the alignment failure now?

  1. ^