AI Awareness through Interaction with Blatantly Alien Models

Summary: I believe that if more people understood the alien nature of AI on a gut level, it might become easier for them to appreciate the risks. If this became sufficiently common knowledge, we might even get needle-moving effects on regulation and safety practices. We—realistically, you or Anthropic—might help this along by intentionally creating AIs that feel very alien. One might even create AIs that highlight the alien nature of other AIs (like current LLMs).

Recapping the well-known argument: AIs are alien. We don’t always fully realise this.

AI companies spend a lot of effort on putting a human face on their products. For example, the AI assistant gets a human name and the same interface we use for chatting with our friends.

Over the course of an interaction, AI assistants typically maintain quite a consistent tone and “vibe”. When you point out that the AI made a mistake, it replies “sorry, you are right”, as if to suggest that it just felt a minor pang of shame and will try to avoid that mistake in the future. All of this evokes the sense that we are interacting with something human-like.

However, the AIs we are likely to build (and those we already have) will be far from human. They might be superhumanly capable in some ways, yet make mistakes that a human never would (e.g., ChatGPT’s breadth of knowledge vs. its inability to count words). They might not have anything like a coherent personality, as nicely illustrated by the RLHF-Shoggoth drawing. And even when the AI is coherent, its outward actions might be completely detached from its intentions (as nicely illustrated by the movie Ex Machina).

[Image credit: Janus]

Benefits of widespread understanding of the alien nature of AI

More awareness of the alien nature of AIs seems quite robustly useful:

  • More people understand that even capable AIs can have security vulnerabilities ==> higher chance of getting security-mindset-motivated regulation and alignment practices.

  • Widespread appreciation that people can be manipulated by AI ==> better regulation, and (somewhat) fewer people falling for it. Perhaps also better handling of human feedback.

  • Preempting some of the misguided debates about AI rights we are likely to get. (Drawback: it might also prevent some of the justified debates.)

Ideas for exposing people to alien AIs

Here are some thoughts on how one might give people a visceral experience of AIs being alien:

  • Non-misleading user interface: Compared to chatbot-LLMs, the original GPT-3 interface was much better at not hiding the alien nature of the LLM you are interacting with. Of course, the GPT-3 interface isn’t good for our purposes, because it is clunky and the average person does not understand what all of the knobs do. Also, many of the knobs that the AI has are hidden from the user. However, all of this seems fixable.
    For example, we might build an interface that starts out as a normal chatbot-LLM but gradually changes. Over time, you might give the user access to knobs like temperature, answer length, context window length, etc. You might add non-standard knobs, like changing the sentiment or making the model talk more about some topic.[1] If the LLM uses a hidden prompt, you could reveal that to the user. And I bet there are more ways of giving a more “shoggoth-like” impression. (A rough sketch of such an interface is included after this list.)

  • “Unreliable genie” AI assistant: A GPT-4-like AI assistant finetuned to interpret commands literally. (This needs to be sufficiently blatant, such that people don’t accidentally act on the advice.)

  • Personality-switching AI assistants: An AI assistant that seems to have a consistent personality but then intentionally switches personalities or reveals inconsistencies over time.[2]

  • “Lol, you thought that persona was real?” AI girlfriends: An evil way of doing this would be to make really good “AI girlfriend” bots, which adopt a very specific personality for each user. And once the user develops feelings for them, they intentionally “let the mask slip” to reveal that there was nothing in there, or they arbitrarily switch personality etc. Even better if there is a nice graphical avatar to go with this. (Obviously,) please don’t actually do this.
    However, there might be ways of achieving a similar effect without hurting the user. For example, warning them ahead of time and doing the switch after just one hour of interaction. (With good AI and enough data on the user, this might still be really effective.)

  • Intentionally non-robust AI: A good thing about GPT-4 and earlier models is that they often make dumb mistakes, which makes people realise how brittle they are. However, many people come away with the conclusion that GPT-4 is stupid, while the correct conclusion is that it is superhuman in some respects and sub-human in others. Moreover, glaring AI mistakes might eventually become sufficiently infrequent that most users forget about the “stupid” aspect and start relying on the AI more than they should.
    To mitigate this, you could intentionally create an AI that aims to convey the uneven nature of AI capabilities. I am not sure how to go about this. Perhaps something like an AI that is superhuman in some impressive domain, but intentionally makes mistakes in that same domain?
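
To make the “non-misleading interface” idea a bit more concrete, here is a minimal sketch of what a “knobs exposed” chat interface could look like. It assumes the OpenAI Python client; the model name, the default knob values, and the hidden system prompt are illustrative placeholders rather than a finished design.

```python
# Minimal sketch of a "knobs exposed" chat interface, assuming the OpenAI
# Python client (pip install openai). Model name, knob defaults, and the
# hidden system prompt are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are a helpful assistant."   # the normally hidden prompt
knobs = {"temperature": 1.0, "max_tokens": 256}  # sampling knobs shown to the user

def chat() -> None:
    history = [{"role": "system", "content": SYSTEM_PROMPT}]
    print("Commands: /temp <x>, /len <n>, /reveal, /quit")
    while True:
        user = input("> ").strip()
        if user == "/quit":
            break
        if user == "/reveal":
            # Show the hidden prompt instead of pretending there is none.
            print(f"[hidden system prompt] {SYSTEM_PROMPT}")
            continue
        if user.startswith("/temp "):
            knobs["temperature"] = float(user.split()[1])
            print(f"[temperature set to {knobs['temperature']}]")
            continue
        if user.startswith("/len "):
            knobs["max_tokens"] = int(user.split()[1])
            print(f"[max answer length set to {knobs['max_tokens']} tokens]")
            continue
        history.append({"role": "user", "content": user})
        reply = client.chat.completions.create(
            model="gpt-4o-mini",  # illustrative model name
            messages=history,
            temperature=knobs["temperature"],
            max_tokens=knobs["max_tokens"],
        )
        answer = reply.choices[0].message.content
        history.append({"role": "assistant", "content": answer})
        print(answer)

if __name__ == "__main__":
    chat()
```

A fuller version could start out looking like a normal chatbot and only gradually surface these controls, and could add the less standard knobs (sentiment steering, topic biasing) mentioned above.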

Putting the ideas into practice

Primarily, the above ideas seem like a good fit for smaller actors, or even student projects. However, I could also imagine (for example) Anthropic releasing something like this as a demo for user-education purposes.[3] Overall, I am quite excited about this line of work, since it seems neglected and tractable, but also fun and useful.

  1. ^

    E.g., things like adding the [love-hate] vector to the network’s activations [reference needed, but I can’t remember it right now]. (A rough sketch of this kind of knob is included at the end of the post.)

  2. ^

    You could even finish by revealing to the user that this was all planned ahead of time (cf. the Confusion Ending from The Stanley Parable).

  3. ^

    This should prevent any negative effects on the popularity of the company’s flagship products. Admittedly, actions like these would make the public more wary of using AI in general. However, this would likely affect all AI companies equally, so it would not hurt the company’s position in the AI race.
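
As a very rough illustration of the steering knob mentioned in footnote 1, here is a minimal activation-steering sketch, assuming a HuggingFace transformers GPT-2 model. The layer index, the scaling coefficient, and the “Love”/“Hate” prompts are illustrative choices, and this simplified version just adds the vector at every token position rather than reproducing any particular published recipe.

```python
# Rough sketch of an activation-steering "sentiment knob" on GPT-2, assuming
# the HuggingFace transformers and torch libraries. LAYER, COEFF, and the
# "Love"/"Hate" prompts are illustrative choices.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

LAYER = 6    # which transformer block's output to steer (illustrative)
COEFF = 4.0  # how hard to push the activations (illustrative)

def block_output(prompt: str) -> torch.Tensor:
    """Residual-stream activations after block LAYER, averaged over tokens."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    captured = {}
    def grab(module, inputs, output):
        act = output[0] if isinstance(output, tuple) else output
        captured["act"] = act.detach()
    handle = model.transformer.h[LAYER].register_forward_hook(grab)
    with torch.no_grad():
        model(ids)
    handle.remove()
    return captured["act"].mean(dim=1)  # shape: [1, hidden_size]

# Steering vector: the direction from "Hate"-ish towards "Love"-ish activations.
steer = block_output("Love") - block_output("Hate")

def add_steering(module, inputs, output):
    # Shift the block's output along the steering direction at every position.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * steer
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tokenizer("I think you are", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, do_sample=True, top_p=0.9)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Exposed as a slider in an interface like the one sketched earlier, a knob like COEFF would let users feel directly how a small nudge to the model’s internals shifts its apparent “personality”.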