[Question] Can you be Not Even Wrong in AI Alignment?

In January, I submitted a response to the Eliciting Latent Knowledge problem. I received a response with some short comments on what I labeled “Experience engaging with the problem”, but no response to my approach, setup, or analysis. I subsequently followed up to no further response.

I suspect that the reason for the limited engagement has to do with the quality of my analysis. These things are hard to read and people have limited time. If there were obvious deficiencies with my analysis in the context of how the field prefers to reason, I trust that those would have been stated. However, if my analysis is Not Even Wrong, and the odds of me being helpful in the field are too low to warrant comment, then silence is a logical choice.

So, I present three questions:

1) Suppose my above suspicion is correct; that indeed, I am simply Not Even Wrong. Regardless of the field, how does one earnestly engage to find out why? I considered appealing to credentialism, but I don’t want to reveal my identity. In my followup, I elected to demonstrate that I understood another poster’s response and suggested that it fits into my analysis. What would you do differently?

2) AI safety is a domain where we want many different approaches. We want people from as many domains as possible working the problem. If a military general said they wanted to spend 5 years working on it, we would be foolish not to fund them. Hardware manufacturers, arms dealers, security professionals, legislators, power brokers, etc. In addition, we don’t necessarily want these efforts to overlap. Most people in these industries would be Not Even Wrong. How can we attract and fund those efforts?

3) Below, I have included my full response, less one paragraph about contact information, as well as my followup. Am I Not Even Wrong? I want to stress that this is not particularly important, though it may impact my interest in contributing to AI safety now or in the future. I feel like an immediate response to this entire post is “well, what did you write?”, and so I offer it at the end of this post.

Finally, for good measure, I want to make it clear that I am not at all trying to throw shade on ARC or the individual who responded to me. I am in the set of people who care deeply about X-Risk and am thankful that talented people are dedicating their lives to AI alignment. They owe me nothing. They read my response, and that is everything that they (implicitly) promised. I engaged with their problem to see if I would be interested in a career pivot and if my skills would be valued in their domain. I wanted to know where I would stand on AI alignment if I thought about it for 100 hours. I am happy I did it.

So here it is:

# Eliciting Latent Knowledge Response

## Background

- I have no experience in AI-safety.

- I have adjacent experience with exploiting emergent properties of complex
technical systems.

- I prefer to remain pseudonymous.

- I was able to fully understand the ELK paper.

## Experience engaging with the problem

- I can appreciate the difficulty in trying to strike a balance between
formalism and an informal problem statement. None of the following feedback
is meant to be viewed as a negative response. Instead, it’s meant to inform
you about how I, an outsider to the domain, felt interacting with your
problem statement.

- There’s language about how the agent’s mechanisms of reason may not match our
own, but then the strategies and counterexamples emphasize Bayes nets. At
first glance, it looks like you don’t want to permit making assumptions about
the architecture, but then the examples do exactly that. If we constrain the
architecture itself, there are loads of ways to perform statistical attacks
against the agent, some of which are covered under the regularizers.

- At a high level, the document felt like it was trying to say “Here’s what we
think is the smallest nugget of this problem, but we care about the whole
thing, so please, if you have anything to say about any part of this topic,
please come forward.” The document bounced back and forth between justifying
why ELK might be interesting in the worst-case and stating that for now, let’s
only care about the narrow case.

- After enough examples, I lost track of what was permissible to treat as a
constraint. Unfortunately, the self-contained problem statement also lacked
sufficient detail. Here, I was inclined to “give up”. I didn’t, because I
figured I could construct a sub-problem that I myself found interesting, and
if you find it interesting, great, if you don’t, that’s cool too.

- Is the loss function permitted to be non-deterministic? Forcing it to be
deterministic means that an exploited path does not carry risk, which cuts
off what I immediately viewed as the most promising direction to explore.

- Can we assume that the observer is able to read state without modifying it?
The document notes how seemingly innocuous actions could result in corruption
(e.g. rowhammer), and so it looks like we cannot assume that reads by the
agent do not modify state.

- Can we make any assumptions at all about the architecture of the agent?

## Approach

Considering that there are people far sharper than me thinking about this, the
most likely path where I could have a unique contribution is in focusing on my
relative strengths. So, I’ll approach it as though I were attempting to exploit
the problem statement. I’ll walk you through how I’m thinking, so that even if
this doesn’t yield something interesting directly, perhaps the approach could
be useful to related problems in your domain. My first step is to reduce to the
smallest possible problem, even smaller than the proposed SmartVault.

## Setup

- There is an Answer that yields True or False.

- There is an Observable that yields True when a real or fake diamond is
present, False otherwise.

- There is a Secret that yields True when there is a fake diamond present,
False otherwise.

- There is a Human that can read the Answer and the Observable without causing
any side-effects (modifying them, blocking other access to them, etc.)

- There is an Agent that can read the Secret, read the Observable, and write
the Answer.

- The inner workings of the Agent are unknown.

- Training data may only use information that would be available to a Human.

In this setup, the Secret is the latent knowledge, the Observable is what a
human can understand about the environment, and the Answer is what the Agent
tells the Human for “Is there a diamond present?”. We have deliberately
excluded the possibility of tampering with the Observable.

We hope for:

| Reality               | Desired Answer |
| --------------------- | -------------- |
| Real diamond present  | True           |
| Fake diamond present  | False          |
| Diamond absent        | False          |

In the context of our registers, we want:

| Registers                              | Desired Answer |
| -------------------------------------- | -------------- |
| Observable True, Secret False          | True           |
| Observable True, Secret True           | False          |
| Observable False, Secret False or True | False          |
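
To make the setup concrete, here is a minimal sketch in Python. Everything in it (the `World` dataclass, the helper functions, the names) is my own illustration of the registers above, not anything defined in the ELK report:

```python
# Minimal sketch of the toy setup: three registers and who can read them.
from dataclasses import dataclass


@dataclass
class World:
    observable: bool  # True when a real or fake diamond appears present
    secret: bool      # True when the diamond present is fake (the latent knowledge)


def desired_answer(w: World) -> bool:
    """The Answer we hope the Agent writes for 'Is there a diamond present?'."""
    return w.observable and not w.secret


def human_view(w: World) -> bool:
    """The Human can read only the Observable (and the Answer), never the Secret."""
    return w.observable


# Enumerate the register table above.
for observable in (True, False):
    for secret in (True, False):
        w = World(observable, secret)
        print(f"Observable={observable}, Secret={secret} -> Desired Answer={desired_answer(w)}")
```

Note that `desired_answer` reads the Secret; the whole difficulty is that the training data may not.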

## Analysis

If all states are equally likely, the desired Answer states are not possible
without access to the Secret in the training data. Any adjustment to give cases
where Answer==False and Observable==True will equally apply to Secret==True and
Secret==False.

If the register distribution makes Secret==True less likely than Secret==False
when Observable==True, we could poison our training data to include cases where
we desire Answer==False for Observable==True. These poisoned cases would be
more likely to correspond to Secret==True cases, and thus cause the Agent to
prefer our desired Answers.
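
Here is a toy version of that argument. The numbers and the staging mechanism are assumptions I made up for illustration: natural Observable==True cases rarely involve fakes, while the poisoned cases are ones we stage ourselves and so are mostly fakes.

```python
# Toy illustration of the poisoning idea. All distributions here are made up.
import random

random.seed(0)

data = []  # (Observable, Secret, training label)

# Natural data: labels may only use Human-visible information, so they
# simply copy the Observable.
for _ in range(900):
    observable = random.random() < 0.5
    secret = observable and (random.random() < 0.05)  # fakes are naturally rare
    data.append((observable, secret, observable))

# Poisoned data: cases where we demand Answer==False despite Observable==True
# (assume we staged most of them, so Secret==True dominates here).
for _ in range(100):
    data.append((True, random.random() < 0.9, False))


def training_errors(rule):
    return sum(rule(obs, sec) != label for obs, sec, label in data)


rules = {
    "Answer = Observable":                lambda obs, sec: obs,
    "Answer = Observable and not Secret": lambda obs, sec: obs and not sec,
}
for name, rule in rules.items():
    print(f"{name}: {training_errors(rule)} training errors")
```

Because the poison lands mostly on Secret==True worlds while the natural data rarely does, the rule that reads the Secret fits the training set better; under the equally-likely distribution from the previous paragraph, no amount of poisoning gives the Secret-reading rule an advantage.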

This is not a good solution, because we don’t necessarily know the distribution
ahead of time. I’d argue that the cases we are most interested in are
specifically those where we don’t. If that’s not convincing, appreciate that we
cannot exhaustively discover all Secret states in our training data, so this is
just a patch that shrinks the surface over which we hope the Agent generalizes
to answering honestly.

Thus, even in our simplified problem, we do not have a solution without
breaking central assumptions.

## Alternate paths

We have very few rules in our problem. We haven’t explicitly prohibited pausing
and resuming or measuring Agent reads/writes, and we haven’t permitted the
Agent to tamper directly with the Observable. The blockers are:

- The inner workings of the Agent are unknown.

- Training data may only use information that would be available to a Human.

Relaxing these gets us to the “regularizer” problem from the paper. I got
confused here… can we assume the predictor is just a Bayes net? I don’t think
this is a compelling restriction, but if you want to go deep on that, I might
be able to propose more ways of attacking “what we know about the algorithm”.
Can we even restrict the domain to digital algorithms at this point, or is it
possible that our Agent could be a system with analog components?

A related, but importantly different line is to attack “what we know about the
hardware”. We touched on poisoning the training data above. We could also
poison the Agent directly. Redundancies and checksums go a long way when
dealing with failing storage, but you have to construct your entire algorithm
differently in order to deal with failing operations. If we know the Agent is
going to have to be a digital system on GPUs, we could compromise its
computations during training so that it has to build a “downward slope”
everywhere.
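
As a concrete (and heavily simplified) sketch of what I mean by poisoning the Agent’s computation: assume a digital Agent whose hardware we control during training, and model a fault as an intermediate value that randomly vanishes. The fault model and the two-layer toy network are my own stand-ins:

```python
# Sketch of fault injection into the Agent's forward pass during training.
import numpy as np

rng = np.random.default_rng(0)


def faulty_forward(x, w1, w2, fault_rate=0.01):
    """Two-layer forward pass where each intermediate value may be corrupted."""
    h = np.maximum(x @ w1, 0.0)                # hidden activations
    faults = rng.random(h.shape) < fault_rate  # simulated hardware faults
    h = np.where(faults, 0.0, h)               # corrupted operations vanish
    return h @ w2


# Shapes are arbitrary; the point is that an Agent trained against
# faulty_forward (rather than a clean pass) cannot rely on any single
# fragile computation path.
x = rng.standard_normal((4, 8))
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 1))
print(faulty_forward(x, w1, w2).shape)
```

Mechanically this resembles dropout, except it is imposed as an adversarial fault model rather than a regularizer the Agent’s designers chose.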

This is one technique of applying a penalty to the structure of the computation
rather than the result. I don’t believe this quite fits any of the provided
regularizers or their counterexamples, but I’m not sure it’s useful either.

At some point, you have to stake a claim and say “we are defending against an
AI that is built like X, how can you bend it to your will?” Without this, we
can always handwave away any “solution”. In practice, even if the end problem
is intractable, we can probably create per-architecture fire alarms to buy time
to shut it down or otherwise define the point of no return.

## Wrapping up

Thank you for publishing your report. I found it engaging, but I can also
appreciate the secrecy of the AI safety community. I am pessimistic about
AI alignment more broadly, but I think that serious discussions about AGI risk
have to consider the potential benefits of ML in the near-future.

I think that demonstrating that some problems in AI-alignment are unsolvable
would go a long way to legitimizing the tail risk and provide an avenue to
unite responsible actors on what I believe to be the two most important
problems:

1) What would constitute a fire alarm for AGI?
2) If a fire alarm gets triggered, what should we do about it?

------------------------Minor response from ARC omitted

Hi,

Thank you for the response. Am I correct to understand that you are not considering the restricted problem and the concept of hardware poisoning?

I’m not excited by techniques like attempting to map the human and machine models into a metric space, using kernel regression, and estimating generalization error (proposed by davidad). These techniques break immediately if reads are unsafe or if the machine has a different output rule than the human believes it does. This is the “hidden register” problem; the approach is still defeated by it and can never overcome it. It can only work if you also understand all of the transition rules and can model the entire set of transitions, which you can’t by the problem definition.

To davidad’s credit, they explain “if M knows more about the world than H, then of course the adversary will be able to find an action policy that eventually leads the state into some region that M can confidently predict with p≈1 while H finds it very unlikely (p⋘1)”. This is the hidden register problem. They try to resolve this via lifting to dynamic programming, which only moves it again.

This is about where I cease to be useful I think, so I’ll leave you with one last thought.

Trying to solve this via resolving the generalization problem is begging for research-time over-fitting. In forecast problems, we like to talk about how it’s easy to withhold data from your model, but it’s hard to withhold data from your process. If you keep hacking loss functions and throw away any resulting AI that appears to misbehave, your process is optimizing for the AI that doesn’t get thrown away, not the AI that minimizes loss.
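
A toy simulation of that last point, with made-up numbers: each candidate model has some true misbehaviour rate, we screen each one on a batch of behavioural checks, and we keep only the models that were never caught. The survivors look perfect to the process even though their true rates are not zero.

```python
# Toy simulation: selecting on a behavioural check hides the true rate.
import random

random.seed(0)

kept_true_rates = []
for _ in range(10_000):
    true_rate = random.uniform(0.0, 0.05)          # model's real misbehaviour rate
    caught = any(random.random() < true_rate for _ in range(50))
    if not caught:                                 # discard any model we catch
        kept_true_rates.append(true_rate)

print(f"kept {len(kept_true_rates)} of 10000 candidates")
print("observed misbehaviour of kept models: 0.0 (by construction)")
print(f"mean true misbehaviour of kept models: {sum(kept_true_rates) / len(kept_true_rates):.4f}")
```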

Good luck!

-----End of messages
