Nina Rimsky
The pointers problem, distilled
What will happen if an AI realises that it is in a training loop? A lot of bad scenarios could branch out from this point. This scenario may sound weird or far-fetched, but humans are able to introspect and philosophise on similar questions, even though our brains are “simply” computational apparatus that does not seem to possess any quality a sufficiently advanced AI could not also possess.
I currently see inner alignment problems as a superset of generalisation error and robustness problems. Furthermore, an AI being a mesa-optimiser with a misaligned objective can also be thought of as a generalisation error, since it means we haven’t tested the AI in scenarios where its mesa-objective diverges from the base objective. The conclusion is meant to emphasise the possibility of extending the concept of inner misalignment to AIs that we do not model as optimisers. I am open to the claim that this is not useful and that we should only use the term when we think of the AI as an optimiser, in which case the definition involving mesa-objectives is sufficient.
I think any inner alignment problem can be thought of as a kind of generalisation error (“this wouldn’t have happened if we had more data”), including misaligned mesa-optimisers. So yes, you are correct: in my model they are different ways of looking at the same problem (in hindsight, “superset” was the wrong word to use). Is your opinion that inner misalignment should only be used in cases where a mesa-optimiser can be shown to exist (which is the original definition, and the one stated in the comment you linked)? I agree that would also make sense. I was starting from the assumption that “that which is not outer misalignment should be inner misalignment”, but I notice that Evan mentions problems that are neither (e.g. misgeneralisations when there are no mesa-optimisers). This way of defining things only works if you commit to seeing the AI as an optimiser, which is indeed a useful framing, but not the only one. However, based on your (and Evan’s) comments, I do see how having inner alignment as a subset of things-that-are-not-outer-alignment also works.
Firstly, yes, I agree that it makes a lot of sense to defer to Evan, who coined the term, and as far as we can both tell he meant the narrow definition. I had actually read that comment before and misremembered its content, so I was originally under the impression that Evan had revised the definition to be broader, but then realized this is not the case.
I am still skeptical that there is any clear difference between optimizer and non-optimizer AIs. Any AI that does a task well is in some sense optimizing for good performance on that task. This is what makes it hard for me to see a clear case of generalization error that is not inner misalignment.
However, I can see how this could just be a framing thing where, depending on how you look at the problem, it’s easier to describe as “this AI has the wrong objective” vs “this AI has the correct objective but pursues it badly due to generalization error”. In any case, both of these also seem equally dangerous to me.
The problem with distinguishing these is that, for a sufficiently complex training objective, even a very powerful agent-y AI will have a “fuzzy” goal that isn’t an exact specification of what it should do (for example, humans don’t have clearly defined objectives that they consistently pursue). This fuzzy goal is like a cluster of possible worlds towards which the AI is causing our current world to tend, via its actions/outputs. Pursuing the goal badly means having an overly fuzzy goal, where some of the possible convergent worlds are not what we want. Inner misalignment, or having the wrong goal, will look very similar, although perhaps a distinction you could make is that with inner misalignment the fuzzy goal has to be in some sense miscentered.
Idea—set up a “slack Slack” where people doing (or wanting to do) impactful / interesting side projects can find collaborators / share ideas and advice / keep each other accountable
You’re able to set everyone’s moral framework and create new humans; however, once they are created, you cannot undo or try again. You also cannot rely on being able to influence the world post-creation.
Assume humans will be placed on the planet in an evolved state (like current Homo sapiens); they can continue evolving but will possess a pretty strong drive to follow the framework you embed (akin to the disgust response humans have to seeing gore or decay).
Interesting! I can see where you are coming from with this idea. I feel like the question gets me to think about what the optimal framework would be based on how the whole system would behave / evolve as opposed to the normally individualistic view of morality.
Fair enough! I like the spirit of this answer and probably broadly agree, although it makes me think “surely I’d want to modify some people’s moral beliefs”…
Status—rough thoughts inspired by skimming this post (want to read in more detail soon!)
Do you think that hand-crafted mathematical functions (potentially slightly more complex than the ones mentioned in this research) could be a promising testbed for various alignment techniques? Doing prosaic alignment research with LLMs or huge RL agents is very compute- and data-hungry, making the process slow and expensive. I wonder whether there is a way to investigate similar questions with carefully crafted exact functions, which can generate enough data quickly, scale down to smaller models, and be tweaked in different ways to adjust the experiments.
One rough idea I have is training one DNN to implement different simple functions for different sets of numbers, and then seeing how the model generalises OOD given different training methods / alignment techniques.
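For concreteness, here is a minimal sketch of the kind of experiment I mean (the specific functions, input ranges, and architecture are arbitrary illustrative choices): train a small MLP on two hand-crafted regimes of a function and probe how it behaves on the held-out gap between them.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hand-crafted target: two different simple rules on disjoint input ranges.
#   x in [0, 2) -> f(x) = x ** 2
#   x in [4, 6) -> f(x) = x + 10
# The gap [2, 4) is held out entirely and used as an OOD probe.
x1 = torch.rand(500, 1) * 2
x2 = torch.rand(500, 1) * 2 + 4
x = torch.cat([x1, x2])
y = torch.cat([x1 ** 2, x2 + 10])

model = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(2000):
    opt.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    opt.step()

# OOD probe: which rule (if either) does the model extend into the gap?
# Different training methods / alignment techniques could be compared
# on exactly this question, cheaply and with unlimited data.
probe = torch.linspace(2, 4, 5).unsqueeze(1)
print(model(probe).squeeze())
```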
Makes sense. I agree that something working on algorithmic tasks is very weak evidence, although I am somewhat interested in how much insight we can get if we put more effort into hand-crafting algorithmic tasks with interesting properties.
I wonder whether https://arxiv.org/pdf/2109.13916.pdf (Unsolved Problems in ML Safety by Hendrycks, Carlini, Schulman, and Steinhardt) would be a useful resource in this scenario.
This reminded me of a technique I occasionally use to explore a new topic area via a kind of “graph search”. I ask LLMs (or, previously, Google) “what are topics/concepts adjacent to (/related to/similar to) X”. Recursing, and reading up on connected topics for a while, can be an effective way of getting a broad overview of a new knowledge space.
Optimising the process for AIS research topics seems like it could be valuable. I wonder whether a tool like Elicit solves this (haven’t actually tried it though).
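A minimal sketch of the recursive version (here `ask_llm` is a placeholder for whatever LLM call and list-parsing you use; the prompt wording and the depth/size limits are arbitrary):

```python
from collections import deque

def ask_llm(prompt: str) -> list[str]:
    """Placeholder: call an LLM of your choice and parse its answer into a list."""
    raise NotImplementedError

def explore(seed: str, max_depth: int = 2, max_topics: int = 50) -> set[str]:
    """Breadth-first search over 'adjacent topics' suggested by an LLM."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue and len(seen) < max_topics:
        topic, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for adjacent in ask_llm(f"List 5 concepts adjacent to: {topic}"):
            if adjacent not in seen:
                seen.add(adjacent)
                queue.append((adjacent, depth + 1))
    return seen
```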
Personal anecdote, so obviously all n=1 caveats apply: I took light iron supplementation for a few months (one Spatone sachet per day) and it completely changed my life. Before, I could not run more than a mile, in 10 minutes, without collapsing. I got winded going up stairs and was often physically fatigued (although I had no other mental or non-fitness-related physical symptoms). After a few months of iron and no other lifestyle changes, I could run for an hour at 8 min/mile pace. I have since stopped taking the supplements and the benefits have been sustained for two years. If you have mild iron deficiency, I really do suggest addressing it, as the lifestyle gains could be big, and I can recommend Spatone iron water as a delivery mechanism with fewer side effects.
It seems like most or all large models (especially language models) will first be trained in a similar way, using self-supervised learning on large unlabelled raw datasets (such as web text), and there looks to be limited room for manoeuvre/creativity in shaping the objective or training process at this stage. Fundamentally, this stage is just about developing a really good compression algorithm for all the training data.
The next stage, where we direct the model to perform a certain task (whether trivially, via prompting, or via fine-tuning on human preference data, or something else), seems to be where most of the variance in outcomes/safety will come in, at least in the current paradigm. Therefore, I think it could be worth ML safety researchers focusing on analyzing and optimizing this second stage as a way of narrowing the problem/experiment space. Mech interp focused on the reward model used in RLHF could be an interesting direction here.
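As a very rough sketch of what a first step might look like (the checkpoint name is a placeholder, and this assumes a reward model with a scalar output head, as in standard RLHF setups): score a preferred and a dispreferred reply with the reward model and compare per-layer activations to see where in the network they diverge.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder name; any RLHF reward model with a scalar head would do here.
NAME = "my-org/rlhf-reward-model"
tok = AutoTokenizer.from_pretrained(NAME)
rm = AutoModelForSequenceClassification.from_pretrained(NAME, num_labels=1)
rm.eval()

def final_token_states(text: str) -> list[torch.Tensor]:
    """Final-token hidden state at every layer for a single completion."""
    with torch.no_grad():
        out = rm(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return [h[0, -1] for h in out.hidden_states]

# Compare where in the network a preferred and a dispreferred reply diverge.
chosen = final_token_states("...a helpful reply...")
rejected = final_token_states("...an unhelpful reply...")
for layer, (c, r) in enumerate(zip(chosen, rejected)):
    print(layer, torch.norm(c - r).item())
```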
I wonder how clear it is that increasing average human BMI is bad. It seems very true that being obese is bad for health outcomes, but maybe this is compensated for by a reduction in the number of underweight individuals + better nutrition for non-morbidly-obese people.
Walking a very long distance (15km+), preferably in a not-too-exciting place (e.g. residential streets, fields), while thinking, maybe occasionally listening to music to reset. Works best in daylight, but when it’s not too bright and sunny and not too warm or cold.
Agree with this post.
Another way I think about this is, if I have a strong reason to believe my audience will interpret my words as X, and I don’t want to say X, I should not use those words. Even if I think the words are the most honest/accurate/correct way of precisely conveying my message.
People on LessWrong have a high honesty and integrity bar but the same language conveys different info in other contexts and may therefore be de facto less honest in those contexts.
That being said, I can see a counterargument: it is fundamentally more honest if people use a consistent language and don’t adapt their language to the scenario, as it is easier for any other agent to model, and extract facts from, a truthful agent with consistent language than a truthful agent with adaptive language.
I think a key idea referenced in this post is that an AI trained with modern techniques never directly “sees” or interfaces with a clear, well-defined goal. We “feel” like there is a true goal or objective, because we encode something of this flavour in the training loop: the reward or objective function, for example. However, in the end, the only thing you’re really doing to the AI is changing its state after registering its output given some input, ending up at some point in program-space. Sure, that path is guided by the cleanly specified goal function, but the function is not explicitly given to the resultant program.
I do think “goal misgeneralisation” has a place in referring to the phenomenon that:

- In the limit of infinite training data and training time, the optimisation procedure should converge on a model that is an ideal implementation of the objective function encoded in its training loop.
- Before this limit, the trajectory in program-space may be skewed away from the optimal program, leading to unintended results: “misgeneralisation”.
A confounder here is that modern AI training objectives are fundamentally un-extendable-to-infinity, so misgeneralisation is ill-defined. For example, “predict the next token in this human-generated text” is bounded by the amount of text humans generate, and “maximise the human’s response to X” is bounded by the number of humans and the number of interactions with X. Most loss functions make no sense outside of the type of data they are defined on, so there is no such thing as perfect generalisation, as data is by definition limited.
You could redefine “perfect generalisation” to mean optimal performance on the available data; however, as long as it is possible to produce more data at some point in the future, even a finite amount, this definition is brittle.
> A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.
>
> — Stuart Russell
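A toy way to see this effect (a minimal sketch; the resource-budget setup and the specific bounds are my own made-up example, not Russell’s): the objective below depends only on x1 and x2, but all three variables share a budget constraint, so the solver drives x3, the variable the objective ignores, to an extreme value.

```python
from scipy.optimize import linprog

# Maximise x1 + x2 (linprog minimises, so negate the coefficients).
# x3 does not appear in the objective, but all three variables share
# a resource budget: x1 + x2 + x3 <= 10, with each variable in [0, 10].
result = linprog(
    c=[-1, -1, 0],        # objective coefficients: x3 is ignored
    A_ub=[[1, 1, 1]],     # shared budget constraint
    b_ub=[10],
    bounds=[(0, 10)] * 3,
)
# The solver spends the whole budget on x1 + x2 and drives x3 to an
# extreme, e.g. result.x ≈ [10, 0, 0]. If x3 were something we cared
# about (a safety margin, say), it has been sacrificed entirely.
print(result.x)
```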