Two ideas for alignment: perpetual mutual distrust and induction

Two ideas I have for alignment (they may exist already, or may not be great; I am not exhaustively read on the topic)

Idea 1: Two agents in mutual distrust of each other

Intuitively, alignment is a difficult problem because it is hard to know what an AI ostensibly less capable than us is really thinking and whether it is lying about its beliefs (e.g. GPT-4 might respond differently to the same question if told information about the asker). Likewise, it is fairly difficult to figure out what other humans who are roughly as capable as us are thinking. It may not be a huge leap to say that at any level of capability, two agents of similar capability will have a difficult time being sure whether the other is telling the truth.

It may also not be possible to figure out whether an AGI is truly aligned without running many simulations of it, but to do this, the first agent has to solve the boxing problem for the second agent. Again, this seems hard to do if the second agent is not significantly less capable.

In a similar vein, it seems difficult for an agent to figure out whether a second agent that is not much worse than itself is more or less capable than itself, for the same reasons it is difficult for us to assess the capabilities of an AGI that is not clearly less capable than us: the possibility of hidden capabilities, and the massive amount of computational power required to fully interpret the architecture of such an AGI.

So to summarize in math notation: say A1 and A2 are the logarithms of the capabilities of agent 1 and agent 2. DetectionConstant is the minimum difference in log capability between two agents for the more capable agent to fully assess the capabilities and goals of the less capable one. A real-world example would be humans and worms.

If |A1-A2| < DetectionConstant, it seems fairly difficult for agent 1 to determine agent 2's goals and capabilities, and vice versa.
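To make this condition explicit, here is one way it could be written down. This is just my restatement of the definitions above, with CanAssess(i, j) being a label I introduce for "agent i can fully assess agent j's goals and capabilities":

```latex
% CanAssess(i, j): agent i can fully assess agent j's goals and capabilities.
% Assumed form: detection requires a large enough gap in log-capability.
\mathrm{CanAssess}(i, j) \iff A_i - A_j \ge \mathrm{DetectionConstant}

% Mutual opacity: if the gap is below the constant, neither agent can assess the other.
|A_1 - A_2| < \mathrm{DetectionConstant}
  \;\Longrightarrow\; \neg\,\mathrm{CanAssess}(1, 2) \wedge \neg\,\mathrm{CanAssess}(2, 1)
```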

Now imagine these agents are primed similarly and self-improve at similar rates. Say we describe their capabilities as functions of time, A1(t) and A2(t). Furthermore, there is a physical limit to their capabilities, so as t approaches infinity, each Ax(t) approaches a constant. The gap f(t) = |A1(t)-A2(t)| will change over time, but as long as it stays below DetectionConstant they will be in a perpetual state of distrust. I am not sure this is guaranteed; it makes sense, but it's possible one will improve much more based on the initial parameters. However, based on the correlation of AI's capability growth with hardware's capability growth, I suspect the Ax(t)s will always be within the same order of magnitude of each other if they start out almost identically.

To summarize (I am using O(1) to just mean an arbitrary constant)

Let A1(t) and A2(t) be agent 1's and agent 2's capabilities as functions of time. It seems reasonable that A1(t) → O(1) and A2(t) → O(1) as t approaches infinity.

I'm making another assumption here that |A1(t)-A2(t)| < DetectionConstant for all t, but this seems reasonable to me. DetectionConstant is probably at least one order of magnitude, and the max difference |A1(t)-A2(t)| is probably less than an order of magnitude.
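As a toy illustration of this growth assumption, here is a minimal sketch in Python. The saturating growth curve, the physical cap, the starting points, and DetectionConstant = 1 (one order of magnitude in log space) are all made-up numbers chosen for illustration, not anything derived from the argument:

```python
import math

DETECTION_CONSTANT = 1.0   # assumed: one order of magnitude in log-capability
PHYSICAL_LIMIT = 10.0      # assumed: log-capability cap imposed by physics

def capability(t, rate, start):
    """Toy log-capability curve: grows over time but saturates at the physical limit."""
    return PHYSICAL_LIMIT - (PHYSICAL_LIMIT - start) * math.exp(-rate * t)

def mutually_opaque(a1, a2):
    """Neither agent can fully assess the other while the gap stays below the constant."""
    return abs(a1 - a2) < DETECTION_CONSTANT

# Two agents primed almost identically (slightly different starts and growth rates).
for t in range(0, 101, 10):
    a1 = capability(t, rate=0.10, start=1.00)
    a2 = capability(t, rate=0.11, start=0.95)
    print(f"t={t:3d}  A1={a1:5.2f}  A2={a2:5.2f}  mutually opaque: {mutually_opaque(a1, a2)}")
```

With these particular numbers the gap never exceeds DetectionConstant, so the agents stay mutually opaque; as noted above, this is not guaranteed in general and depends on the initial parameters.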


Now imagine they are roughly aligned along the lines of: act like a reasonable human, don't wipe out humanity, and eliminate the other agent if you determine it is a threat to humanity and you can safely do so. They are also not allowed to tell each other their true goals, and each is ordered to eliminate the other if the other reveals its goals. Importantly, these rules also happen to allow them to have arbitrary subgoals, as long as those subgoals are not a threat to humanity.

For the reasons described above, it is impossible for either agent to figure out whether the other is truly aligned, provided alignment is hard to predict from initial states (say the states are not identical and the system has some chaos built into it), or provided as many differences as possible are planted in while still keeping both within range of each other's capabilities.

All that matters is that neither is entirely sure of the other's goals, and that if one were to act against humanity, there is a chance it will be eliminated by the other (if the other is more capable and not aligned with it). The risk of death is such a great impediment to its goals that I don't believe either will ever risk triggering the elimination clause in the other. Given that humans are unlikely to be an impediment to any goodharted goals, and that they are a relatively difficult-to-use source of atoms because the other agent's elimination clause might be triggered, I think both will stay relatively stable.

Lastly, given their risk aversion, they will probably never risk doing something that might trigger the elimination clause. As long as they view an alternative to harming humans to achieve their goal as more promising than risking the elimination clause, they shouldn't ever harm humans. Furthermore, I think they will act to cooperate or compromise for obvious reasons (to not risk elimination).[1]
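As a rough sketch of the decision each agent faces, here is the shape of the comparison in Python. The payoffs and the elimination probability are placeholders I made up, purely to show why a non-trivial chance of elimination dominates the choice:

```python
def expected_value_of_defecting(gain_if_unpunished, p_eliminated, value_of_survival):
    """Expected value of harming humanity, given the other agent's elimination clause.

    With probability p_eliminated the other agent is more capable and enforces the
    clause, in which case this agent loses everything it could otherwise achieve.
    """
    return (1 - p_eliminated) * gain_if_unpunished - p_eliminated * value_of_survival

# Placeholder numbers: even a modest chance of elimination swamps the gain,
# because continued survival is worth far more to the agent than any single subgoal.
ev_defect = expected_value_of_defecting(gain_if_unpunished=10.0,
                                        p_eliminated=0.2,
                                        value_of_survival=1000.0)
ev_cooperate = 5.0   # placeholder: value of pursuing the goal without harming humans

print("defect" if ev_defect > ev_cooperate else "cooperate")   # -> "cooperate"
```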

To summarize: by being in a perpetual state of distrust, with a non-zero probability that harming humanity will result in their own demise, they will never act against humanity and will also tend to cooperate.

I think this seems fairly stable from my initial view of it. Importantly, neither of the ASIs necessarily has to be aligned. It just has to be within the realm of possibilities, which would be enough to make the risk of death a non-trivial probability as a consequence of harming humanity. This reduces the alignment problem to aligning two ASIs unpredictably, so that each has at least some non-zero chance of being aligned with the goals given to it. This might still be a hard problem, but as long as the alignment isn't fully predictable yet is still physically possible, it should work.

I am wondering if I missed a step in this thought experiment. Please let me know if I did.

Idea 2: Induction alignment

If |A1-A2| < DetectionConstant, it seems fairly difficult for agent x to determine agent y’s goals and capabilities.

Let's say this prior is wrong. In that case it becomes fairly simple to align AI. Suppose An represents the nth iteration of artificial intelligence, with An < An+1 and An+1 - An < DetectionConstant (or more succinctly stated, An < An+1 < An + DetectionConstant). If it is instead fairly easy to determine the capabilities and goals of an agent of slightly greater capability, then it is possible for An to bound An+1, align it, and validate that it is aligned. Bounding it in this process should not be super difficult: in human terms, it is quite trivial to bind another human being, since we are within the same magnitude of intelligence. Furthermore, the laws of physics can be used to bind An+1.

Therefore An can properly align An+1. The base case is simply a reasonable human being, who is by definition aligned. Therefore, by induction, An can be aligned for all n.
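Spelled out as an induction (again just a restatement of the argument above, with Aligned(An) meaning "An has been bounded, aligned, and validated", and with the base case indexed as A0 here, taken to be a reasonable human):

```latex
% Base case: the first agent in the chain is a reasonable human, taken as aligned.
\mathrm{Aligned}(A_0)

% Inductive step (the assumption of Idea 2): an aligned agent can bound, align,
% and validate an agent that is only slightly more capable.
\mathrm{Aligned}(A_n) \wedge (A_n < A_{n+1} < A_n + \mathrm{DetectionConstant})
  \;\Longrightarrow\; \mathrm{Aligned}(A_{n+1})

% Conclusion: by induction, every iteration in the chain can be aligned.
\forall n.\ \mathrm{Aligned}(A_n)
```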


Please feel free to let me know about any mistakes or questions you have.

  1. ^

    Note that for simplicity, n=2 agents was chosen.

    There is probably an optimal number of agents to have, based on:
    - the probability that any one is aligned
    - the variance in pairwise capability differences

    so that the probability that a superior benevolent agent exists is high enough.

    A superior benevolent agent is one that is aligned and more capable than all non-aligned agents.