[Question] Is Peano arithmetic trying to kill us? Do we care?

When do we know that a model is safe? I want to get a better grip on the basics of inner alignment. And by “basics” I mean the most fundamental basics, the most obvious conditions of safety.

For example: how do we know that Peano arithmetic is safe?
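
For concreteness (the question doesn't depend on the exact axiomatization), "Peano arithmetic" (PA) here means the standard first-order theory: the successor, addition and multiplication axioms plus the induction schema.

$$\begin{aligned}
& \forall x.\ S(x) \neq 0 && \forall x\,\forall y.\ S(x) = S(y) \rightarrow x = y \\
& \forall x.\ x + 0 = x && \forall x\,\forall y.\ x + S(y) = S(x + y) \\
& \forall x.\ x \cdot 0 = 0 && \forall x\,\forall y.\ x \cdot S(y) = x \cdot y + x \\
& \big(\varphi(0) \wedge \forall x\,(\varphi(x) \rightarrow \varphi(S(x)))\big) \rightarrow \forall x\, \varphi(x) && \text{for every formula } \varphi
\end{aligned}$$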

Context

When we talk about unsafe models, it might be useful to make the following distinctions:

  • Models which take bad actions directly (e.g. convert the universe into paperclips). / Models which cause bad outcomes indirectly, e.g. a question-answering model which influences the world through its messages.

  • Models which need to be physically implemented in the world to achieve their goals (e.g. a paperclip-maximizer). / Models which don't even need to be fully physically implemented. Question-answering models are an example again: as long as a model's answers are calculated somewhere, it can steer the world in a certain direction, even if nobody ever runs the full model.

  • Models which need to model X directly in order to affect X. / Models which can affect X by modeling only Y, something roughly similar to X.[1]

  • Models which steer the world and understand what they're doing (explicit goals and explicit search). / Models which steer the world but don't understand what they're doing (no explicit goals or search). (See Mesa-Search vs Mesa-Control.)

To sum up, a malicious model can cause harm without taking actions in the world, without understanding its own actions, without explicitly modeling anything specific, and even without being fully physically implemented.

At this point it's natural to ask: wait, how do we know that anything is safe? What are the most basic conditions of safety? How do we know that a "rock" (e.g. Peano arithmetic) is safe? Yes, PA is highly interpretable, but saying "a model is safe if we can interpret it and see that it's safe" just begs the question.

Can Peano arithmetic harm us?

How could Peano arithmetic possibly harm us?

There could be a certain theorem T such that trying to prove it affects human society in a bad way, either because smart people waste their time on it or because it interacts with human cognition badly. The Collatz conjecture could be an example ;)
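
To make "theorem T" concrete: the Collatz conjecture says that iterating "halve if even, triple and add one if odd" reaches 1 from every positive integer. A minimal Python sketch (the check over small n is pure illustration, not evidence either way):

```python
def collatz_steps(n: int) -> int:
    """Count iterations of the Collatz map (n/2 if even, 3n+1 if odd) until reaching 1."""
    steps = 0
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps

# The conjecture: the loop above terminates for every positive integer n.
# Small cases are easy to check; nobody knows how to prove the general claim.
print([collatz_steps(n) for n in range(1, 11)])  # [0, 1, 7, 2, 5, 8, 16, 3, 19, 6]
```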

Yes, “is Peano arithmetic dangerous? how do we know? when do we care?” are really stupid questions, but I think there could be value in answering them seriously.

So, what are the most basic conditions of safety?

That’s my core question. But I’ll bring up some possible answers myself.

Trust in inductive biases.

We can trust that certain inductive biases produce "true" and "canonical" information (information which can't optimize for anything weird), e.g. the simplicity bias.
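
One standard way to make the simplicity bias precise (a sketch; the argument doesn't depend on this particular formalization) is a Solomonoff-style prior, which gives a hypothesis exponentially less weight the longer its shortest description:

$$P(h) \propto 2^{-K(h)},$$

where $K(h)$ is the length in bits of the shortest program that outputs $h$. The hope, in the terms above, is that outputs favored by such a bias are "canonical" rather than secretly optimized for anything in particular.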

Absence of prior unsafe optimization.

Peano arithmetic wasn't the result of any potentially unsafe optimization (that we know of), so it's unlikely to be optimized to cause harm.

  1. ^

    See Eliezer's quote about a hypothetical AI which models the programmer's psychology by modeling some properties of an object:

    Maybe the AI reasons about certain very complicated properties of the material object on the pedestal… in fact, these properties are so complicated that they turn out to contain implicit models of User2's psychology