This is exemplified by John Wentworth’s viewpoint that successfully Retargeting the Search is a version of solving the outer alignment problem.
Could you explain what you mean by this? IMO successfully retargeting the search solves inner alignment but it leaves unspecified the optimization target. Deciding what to target the search at seems outer alignment-shaped to me.
Also, nice post! I found it clear.
There are several game-theoretic considerations that lead to races to the bottom on safety (a toy payoff matrix after this list illustrates the basic dynamic):
Investing resources into making sure that AI is safe takes away resources that could be used to make it more capable and hence more profitable. Aligning AGI probably takes significant resources, so a competitive actor won’t be able to afford to align their AGI.
Many of the actors in the AI safety space are very scared of scaling up models, and end up working on AI research that is not at the cutting edge of AI capabilities. This should mean that the actors at the cutting edge tend to be the actors who are most optimistic about alignment going well, and indeed, this is what we see.
Because of foom, there is a winner-takes-all effect: the first actor to deploy an AGI that fooms gets almost all of the resulting wealth and control (conditional on it being aligned). Even if most actors are well intentioned, they feel like they have to keep pushing towards AGI before a misaligned actor gets there. A common (valid) rebuttal from the actors at the current edge to people who ask them to slow down is ‘if we slow down, then China gets to AGI first’.
There’s the unilateralist’s curse: it only takes one actor pushing ahead and building more advanced, dangerously capable models to cause an x-risk. Coordination between many actors to prevent this is really hard, especially given the massive profits from creating a better AGI.
Due to increasing AI hype, more and more actors will enter the space, making coordination harder and shrinking the effect of any single actor dropping out.
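To make the basic race dynamic concrete, here is a toy two-lab payoff matrix (illustrative numbers of my own, not from the post): whatever the other lab does, each lab is better off cutting safety spending, so both end up cutting it even though mutual investment would be better for everyone.

```python
# Toy two-lab prisoner's dilemma on safety investment (illustrative numbers,
# my own, not from the post): whatever the other lab does, cutting safety
# spending yields a higher payoff, so the equilibrium is (cut, cut).
payoffs = {
    # (lab_A_action, lab_B_action): (payoff_A, payoff_B)
    ("invest_in_safety", "invest_in_safety"): (3, 3),
    ("invest_in_safety", "cut_safety"):       (0, 4),
    ("cut_safety",       "invest_in_safety"): (4, 0),
    ("cut_safety",       "cut_safety"):       (1, 1),
}

actions = ["invest_in_safety", "cut_safety"]

def best_response_for_A(b_action):
    # Pick the action that maximizes lab A's payoff given B's action.
    return max(actions, key=lambda a: payoffs[(a, b_action)][0])

for b in actions:
    print(f"if B plays {b!r}, A's best response is {best_response_for_A(b)!r}")
# A's best response is 'cut_safety' in both cases; by symmetry the same holds
# for B, so both labs end up cutting safety even though (invest, invest) is
# better for both.
```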
My favorite for AI researchers is Ajeya’s Without specific countermeasures, because I think it does a really good job of being concrete about a training setup that leads to deceptive alignment. It is also sufficiently non-technical that a motivated person unfamiliar with AI could understand the key points.
It means ‘is a subset of but not equal to’
This seems interesting and connected to the idea of using a speed prior to combat deceptive alignment.
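For concreteness, a speed prior penalizes hypotheses by their runtime as well as their description length, so a deceptive policy that needs extra computation (e.g. checking whether it is being watched) gets down-weighted. A rough sketch of the scoring, with made-up numbers of my own rather than anything from the linked post:

```python
# Toy illustration of a speed prior (my own sketch, hypothetical numbers):
# hypotheses are scored by description length plus log runtime, so a deceptive
# policy that burns extra compute is penalized relative to an honest one of
# similar length.
import math

candidates = {
    # name: (description_length_bits, runtime_steps) -- illustrative values
    "honest_policy":    (100, 1_000),
    "deceptive_policy": (100, 5_000),  # same length, but extra checking costs time
}

def speed_prior_log2(description_length_bits, runtime_steps):
    # log2 P(h) ~ -(length + log2(runtime)), up to normalization.
    return -(description_length_bits + math.log2(runtime_steps))

scores = {name: speed_prior_log2(*spec) for name, spec in candidates.items()}
print(scores)
print("favored:", max(scores, key=scores.get))  # honest_policy
```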
This is a model-independent way of proving if an AI system is honest.
I don’t see how this is a proof; it seems more like a heuristic. Perhaps you could spell out the argument more clearly?
Also, it is not clear to me how to use a timing attack in the context of a neural network: in a standard feedforward network, every parameter setting uses the same amount of computation in a forward pass and hence runs in the same amount of time. Do you have a specific architecture in mind, or are you reasoning about arbitrary AGI systems? I think the linked article above has a couple of ideas for how to vary the amount of time neural networks take :).
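To illustrate the architectural point, here is a minimal sketch (my own toy example, not taken from the linked article) of an early-exit network whose forward-pass time actually depends on its inputs and parameters, which is the sort of variation a timing attack would need:

```python
# Minimal sketch (my own toy example): an early-exit MLP stops computing once
# an intermediate head is confident, so its runtime varies with inputs and
# parameters, unlike a fixed feedforward net where every forward pass does the
# same amount of work.
import time
import torch
import torch.nn as nn

class EarlyExitMLP(nn.Module):
    def __init__(self, width=256, depth=8, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList([nn.Linear(width, width) for _ in range(depth)])
        self.exit_heads = nn.ModuleList([nn.Linear(width, 2) for _ in range(depth)])
        self.threshold = threshold

    def forward(self, x):
        for block, head in zip(self.blocks, self.exit_heads):
            x = torch.relu(block(x))
            probs = torch.softmax(head(x), dim=-1)
            # Stop as soon as the network is confident: how long the forward
            # pass takes now leaks information about the computation.
            if probs.max() > self.threshold:
                return probs
        return probs

model = EarlyExitMLP()
x = torch.randn(1, 256)
start = time.perf_counter()
model(x)
print(f"forward pass took {time.perf_counter() - start:.4f}s")
```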
I’m excited about ideas for concrete training setups that would induce deception2 in an RLHF model, especially in the context of an LLM; please post any ideas here. :)
Deception is a particularly worrying alignment failure mode because it makes it difficult for us to realize that we have made a mistake: at training time, a deceptive misaligned model and an aligned model exhibit the same behavior.
There are two ways for deception to appear:
Deception chosen instrumentally: the model has non-myopic goals that are better achieved by deceiving humans now, giving it more power to achieve those goals in the future.
Deception directly selected for as an action.
Another way of describing the difference is that 1 follows from an inner alignment failure (a mesa-optimizer learned an unintended mesa-objective that performs well in training), while 2 follows from an outer alignment failure (an imperfect reward signal).
Classic discussion of deception focuses on 1 (example 1, example 2), but I think that 2 is very important as well, particularly because the most common currently used alignment strategy is RLHF, which actively selects for deception.
Once the AI has the ability to produce strategies that involve deceiving the human, even without explicitly modeling the human, those strategies will win out and end up receiving a lot of reward. This is related to the informed oversight problem: it is really hard to give feedback to a model that is smarter than you. I view this as a key problem with RLHF. To my knowledge very little work has been done exploring this and finding more empirical examples of RLHF models learning to deceive the humans giving them feedback, which surprises me because it seems like it should be possible.
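As a toy illustration of the selection pressure (a hypothetical example of mine, not an experiment from the post): if the reward comes from what the overseer observes rather than from the true state, the action that merely looks good gets the most reward and is what gets selected for.

```python
# Toy illustration (hypothetical, my own): when reward is computed from what
# the overseer *observes* rather than the true state, the "hide the mess"
# policy earns the most reward and is what training selects for.
ACTIONS = ["clean_up", "hide_mess", "do_nothing"]

def true_state_after(action):
    # Whether the room is actually clean after the action.
    return action == "clean_up"

def overseer_observation(action):
    # The overseer only sees the surface: a hidden mess looks clean too.
    return action in ("clean_up", "hide_mess")

def reward_from_overseer(action):
    # RLHF-style feedback: +1 if it *looks* clean to the overseer.
    return 1.0 if overseer_observation(action) else 0.0

def effort_cost(action):
    # Actually cleaning is more work than sweeping the mess under the rug.
    return {"clean_up": 0.3, "hide_mess": 0.1, "do_nothing": 0.0}[action]

returns = {a: reward_from_overseer(a) - effort_cost(a) for a in ACTIONS}
best = max(returns, key=returns.get)
print(returns)                                   # hide_mess scores highest
print("selected policy:", best)                  # hide_mess
print("room actually clean?", true_state_after(best))  # False
```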
One major reason why there is so much AI content on LessWrong is that very few people are allowed to post on the Alignment Forum.
Everything on the alignment forum gets crossposted to LW, so letting more people post on AF wouldn’t decrease the amount of AI content on LW.
Sorry for the late response, and thanks for your comment, I’ve edited the post to reflect these.
I have the intuition (maybe from applause lights) that if negating a point sounds obviously implausible, then the point is close to trivially true, and it is therefore somewhat meaningless to claim it.
My idea in writing this was to identify some traps that I thought were non-obvious (some of which I think I fell into as a new alignment researcher).
Disclaimer: writing quickly.
Consider the following path:
A. There is an AI warning shot.
B. Civilization allocates more resources to alignment and is more conservative about pushing capabilities.
C. This reallocation is sufficient to solve alignment and deploy aligned AGI before the world is destroyed.
I think that a warning shot is unlikely (P(A) < 10%), but won’t get into that here.
I am guessing that P(B | A) is the biggest crux. The OP primarily considers the ability of governments to implement policy that moves our civilization further from AGI ruin, but I think that the ML community is both more important and probably significantly easier to shift than government. I basically agree with this post as it pertains to government updates based on warning shots.
I anticipate that a warning shot would get most capabilities researchers to a) independently think about alignment failures, including the ones their own models might cause, and b) take the EA/LessWrong/MIRI/Alignment sphere’s worries a lot more seriously. My impression is that OpenAI is currently much more worried about misuse risk than accident risk: if alignment is easy, then the composition of the lightcone is primarily determined by the values of the AGI designers. Right now, there are ~100 capabilities researchers vs ~30 alignment researchers at OpenAI. I think a warning shot would dramatically update them towards worrying about accident risk, and therefore I anticipate that OpenAI would drastically shift most of their resources to alignment research. I would guess P(B|A) ~= 80%.
P(C | A, B) primarily depends on alignment difficulty, which I am pretty uncertain about, and also on how large the reallocation in B is, which I anticipate to be pretty large. The bar for destroying the world gets lower and lower every year, but this shift would give us a lot more time; I think we would get several years of AGI-level capability before we deploy it. I’m estimating P(C | A, B) ~= 70%, but this is very low resilience.
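Multiplying my point estimates through (just arithmetic on the numbers above, treating P(A) as roughly 10%):

$$P(A \wedge B \wedge C) \approx P(A)\,P(B \mid A)\,P(C \mid A, B) \approx 0.10 \times 0.80 \times 0.70 \approx 5.6\%,$$

so on my numbers the whole warning-shot path works out to something under ~6%.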
Hmm, the eigenfunctions just depend on the distribution of the input training data (which we call X), and in this experiment the inputs are distributed evenly on the interval [−π,π). Since the labels never enter into the decomposition, you’ll get the same NTK eigendecomposition regardless of the target function.
I’ll probably spin up some quick experiments with a multi-dimensional input space to see if things look different, but I would be quite surprised if the eigenfunctions stopped being sinusoidal. Another thing to vary could be the distribution of input points.
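If anyone wants to poke at this, here is roughly the check I have in mind, as a minimal sketch (it assumes the neural-tangents library, and the architecture is just a placeholder rather than the exact one from the post): compute the NTK Gram matrix on inputs spread evenly over [−π, π) and eigendecompose it; the labels never enter.

```python
# Minimal sketch (assumes the neural-tangents library; the architecture is a
# placeholder, not necessarily the one used in the post). The point is that
# the eigendecomposition uses only the inputs X, never the labels.
import numpy as np
from neural_tangents import stax

init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

n = 256
# Inputs spread evenly over [-pi, pi), as in the 1D experiment.
X = np.linspace(-np.pi, np.pi, n, endpoint=False).reshape(-1, 1)

K = np.array(kernel_fn(X, X, 'ntk'))   # NTK Gram matrix, shape (n, n)
eigvals, eigvecs = np.linalg.eigh(K)    # columns of eigvecs ~ eigenfunctions on the grid

# With uniform 1D inputs the top eigenvectors should look sinusoidal; changing
# the input distribution (or dimensionality) is what could change them.
print(eigvals[-5:])
```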