PhD student at UCL DARK doing RL, OOD Robustness and safety. Interested in self improvement.
RobertKirk (Robert Kirk)
What is Interpretability?
No worries. Although I think less has been written on debate than on amplification (Paul has a lot of blog posts on IDA), it seems to me that most of the work Paul’s team at OpenAI is doing is on debate rather than IDA.
How can Interpretability help Alignment?
(Note: you’re quoting your response as well as the sentence you’ve meant to be quoting (and responding to), which makes it hard to see which part is your writing. I think you need 2 newlines to break the quote formatting).
Do you see a way of incentivizing the RL community to change this? (If possible, that would seem like a more effective approach than doing it “ourselves”.)
I think this is kind of the same as how do we incentivise the wider ML community to think safety is important? I don’t know if there’s anything specific about the RL community which makes it a different case.
There is some work in DeepMind’s safety team on this, isn’t there? (Not to dispute the overall point though, “a part of DeepMind’s safety team” is rather small compared to the RL community :-).)
I think there is too, and I think there’s more research in general than there used to be. I think the field of interpretability (and especially RL interpretability) is very new and pre-paradigmatic, which can make some of the research not seem useful or relevant.
It was a bit hard to understand what you mean by the “research questions vs tasks” distinction. (And then I read the bullet point below it and came, perhaps falsely, to the conclusion that you are only after “reusable piece of wisdom” vs “one-time thing” distinction.)
I’m still uncertain whether “tasks” is the best word. I think we want reusable pieces of wisdom as well as one-time things, and I don’t know whether that’s the distinction I was aiming for. It’s more like “answer this question once, and then we have the answer forever” vs “answer this question again and again with different inputs each time”. In the first case, interpretability tools might enable researchers to answer the question more easily. In the second, our interpretability tool might have to answer the question directly in an automatic way.
If we believe a particular proposal is more or less likely than others to produce aligned AI, then we would preferentially work on interpretability research which we believe will help this proposal over research which wouldn’t, as it wouldn’t be as useful.
I have changed the sentence; I had “other” instead of “over”.
I’d be curious to hear what your thoughts are on the other conversations, or at least specifically which conversations you’re not a fan of?
That’s my guess also, but I’m more asking just in case that’s not the case, and he disagrees with (for example) the Pragmatic AI Safety sequence, in which case I’d like to know why.
Do you think these insights would generalise to the case where the language model may be interacting with some system during this fine-tuning phase? For example, if it generates queries to an external search engine or API, or has dialogue with a human, then the optimal policy is no longer equivalent to just generating the correct output distribution, as it now also involves environment observations. This setting makes the distinction between a generative model and a policy clearer, and maybe changes the relevance of this statement:
The problem with the RL objective is that it treats the LM as a policy, not as a generative model. While a generative model is supposed to capture a diverse distribution of samples, a policy is supposed to choose the optimal action.
That is, in these settings we do want an optimal policy and not a generative model. This is quite similar to what Paul was saying as well.
Further, if we see aligning language models as a proxy for aligning future strong systems, it seems likely that these systems will be taking multiple steps of interaction in some environment, rather than just generating one (or several) sequences without feedback.
Another possible way to provide pressure towards using language in a human-sense way is some form of multi-tasking/multi-agent scenario, inspired by this paper: Multitasking Inhibits Semantic Drift. They show that if you pretrain multiple instructors and instruction executors to understand language in a human-like way (e.g. with supervised labels), and then during training mix the instructors and instruction executors, it makes it difficult to drift from the original semantics, as all the instructors and instruction executors would need to drift in the same direction; equivalently, any local change in semantics would be sub-optimal compared to using language in the semantically correct way. The examples in the paper are on quite toy problems, but I think in principle this could work.
So I think it mostly comes down to a philosophical difference. Do you want your LM to be a decision-maker acting in a world or a model of some probability distribution over texts? If you want a decision-maker and training on language is just a scaffolding to get you there, maybe indeed staying close to the original distribution only has instrumental value?
But what if what you want is just an oracle-type conversational AI: a knowledge base and a common-sense reasoner. Maybe in this case staying close to human knowledge and inference rules represented in language is of inherent value?
I feel like it wasn’t exactly made clear in the post/paper that this was the motivation. You state that distribution collapse is bad without really justifying it (in my reading). Clearly distributional collapse to a degenerate bad output is bad, and it will also often stall learning, so it is bad from an optimisation perspective as well (as it makes exploration much less likely), but this seems different from distributional collapse to the optimal output. For example, when you say
if x∗ is truly the best thing, we still wouldn’t want the LM to generate only x∗
I think I just disagree, in the case where we’re considering LLMs as a model for future agent-like systems that we will have to align, which to me is the reason they’re useful for alignment research. If there’s a normative claim that diversity is important, then you should just have that in your objective/algorithm.
I think the KL divergence term is included in RLHF as an optimisation hack to make sure training goes well. Maybe that has revealed that for alignment we actually wanted the Bayesian posterior distribution you describe (a softmax on the reward over trajectories) rather than just the optimal distribution according to the reward function (a hardmax), although whether that was our preference all along or is just useful in the current regime seems to be an empirical question.
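To make the softmax vs hardmax distinction concrete, here is a minimal sketch over a finite set of outputs. It assumes the standard KL-regularised objective E_pi[r] - beta * KL(pi || prior), whose maximiser is proportional to prior(x) * exp(r(x) / beta). The three-output prior and the reward values are made up for illustration, and `kl_regularised_optimum` is a hypothetical helper name rather than anything from the post or paper.

```python
import math

def kl_regularised_optimum(prior, rewards, beta):
    """Closed-form maximiser of E_pi[r] - beta * KL(pi || prior) on a finite set.

    The result is proportional to prior[i] * exp(rewards[i] / beta).
    """
    # Subtract the max reward before exponentiating, for numerical stability.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(prior, rewards)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical toy setup: a "pretrained LM" over three outputs and their rewards.
prior = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]

# Moderate KL penalty: a reweighted prior that still spreads mass (softmax-like).
soft = kl_regularised_optimum(prior, rewards, beta=1.0)

# Tiny KL penalty: almost all mass on the highest-reward output (hardmax-like).
sharp = kl_regularised_optimum(prior, rewards, beta=0.01)

print("beta=1.0 :", [round(p, 3) for p in soft])
print("beta=0.01:", [round(p, 3) for p in sharp])
```

As beta shrinks, the distribution collapses onto the single best output, which is the "optimal policy" end of the spectrum. With a larger beta it stays a diverse distribution anchored to the prior.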
Newtonian: complex reactions
So please suggest alternative names and characterizations, or ask questions to pinpoint what I’m describing.
Are you pointing here at the fact that the AI training process and world will be a complex system, and as such it is hard to predict the outcomes of interventions, and hence the first-order obvious outcomes of interventions may not occur, or may be dominated by higher-order outcomes? That’s what the “complex reactions” and some of the references kind of point at, but then in the description you seem to be talking more about a specific case: Strong optimisation will always find a path if it exists, so patching some but not all paths isn’t useful, and in fact could have weird counter-productive effects if the remaining paths that the strong optimisation takes are actually worse in some other ways than the ones you patched.
Other possible names would then be either leaning into the complex systems view, so the (possibly incorrect) assumption is something like “non-complexity” or “linear/predictable responses”; or leaning into the optimisation paths analogy which might be something like “incremental improvement is ok” although that is pretty bad as a name.
This assumption is basically that you can predict the result of an intervention without having to understand the internal mechanism in detail, because the latter is straightforward.
It seems to me that you want a word for whatever the opposite of a complex/chaotic system is, right? Although obviously “Simple” is probably not the best word (as it’s very generic). It could be “Simple Dynamics” or “Predictable Dynamics”?
Causal confusion as an argument against the scaling hypothesis
Another point worth making here is why I haven’t separated out worst-case inspection transparency for deceptive models vs. worst-case training process transparency for deceptive models there. That’s because, while technically the latter is strictly more complicated than the former, I actually think that they’re likely to be equally difficult. In particular, I suspect that the only way that we might actually have a shot at understanding worst-case properties of deceptive models is through understanding how they’re trained.
I’d be curious to hear a bit more justification for this. It feels like resting on this intuition for a reason not to include worst-case inspection transparency for deceptive models as a separate node is a bit of a brittle choice (i.e. makes it more likely the tech tree would change if we got new information). You write
That is, if our ability to understand training dynamics is good enough, we might be able to make it impossible for a deceptive model to evade us by always being able to see its planning for how to do so during training.
which to me is a justification that worst-case inspection transparency for deceptive models is solved if we solve worst-case training process transparency for deceptive models, but not a justification that that’s the only way to solve it.
The ability to go 1->4 or 2->5 by the behavioural-cloning approach would assume that the difficulty of interpreting all parts of the model is fairly similar, and that it just takes time for the humans to interpret all parts, so we can automate that by imitating the humans. But if understanding the worst-case stuff is significantly harder than the best-case stuff (which seems likely to me) then I wouldn’t expect the behaviourally-cloned interpretation agent to generalise to being able to correctly interpret the worst-case stuff.
Here we’re saying that the continual fine-tuning might not necessarily resolve causal confusion within the model; instead, it will help the model learn the (new) spurious correlations so that it still performs well on the test data. This is assuming that continual fine-tuning is using a similar ERM-based method (e.g. the same pretraining objective but on the new data distribution). In hindsight, we probably should have written “continual training” rather than specifically “continual fine-tuning”. If you could continually train online in the deployment environment then that would be better, and whether it’s enough is very related to whether online training is enough, which is one of the key open questions we mention.
I expect that these kinds of problems could mostly be solved by scaling up data and compute (although I haven’t read the paper). However, the argument in the post is that even if we did scale up, we couldn’t solve the OOD generalisation problems.
Suppose that aligning an AGI requires 1000 person-years of research.
900 of these person-years can be done in parallelizable 5-year chunks (e.g., by 180 people over 5 years — or, more realistically, by 1800 people over 10 years, with 10% of the people doing the job correctly half the time).
The remaining 100 of these person-years factor into four chunks that take 25 serial years apiece (so that you can’t get any of those four parts done in less than 25 years).
Do you have a similar model for just building (unaligned) AGI? Or is the model meaningfully different? Under a similar model for just building AGI, timelines would mostly be shortened by progress on the serial research-person-years rather than the parallelisable research-person-years. If researchers who are progressing both capabilities and alignment are doing both in the parallelisable part, then this would be less worrying, as they’re not actually shortening timelines meaningfully.
Unfortunately I imagine you think that building (unaligned) AGI quite probably doesn’t have many more serial person-years of research required, if any. This is possibly another way of framing the prosaic AGI claim: “we expect we can get to AGI without any fundamentally new insights on intelligence, using (something like) current methods.”
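The arithmetic of the quoted model can be checked with a short sketch. The figures are the ones in the quote. The function names, and the assumption that the four serial chunks can run concurrently with each other and with the parallel work, are mine and only illustrative.

```python
# Figures taken from the quoted model.
PARALLEL_PERSON_YEARS = 900
PARALLEL_CHUNK_YEARS = 5
SERIAL_CHUNKS = 4
SERIAL_YEARS_PER_CHUNK = 25

def effective_person_years(people, years, fraction_effective=1.0, success_rate=1.0):
    """Person-years of useful work delivered by a team."""
    return people * years * fraction_effective * success_rate

def min_calendar_years(parallel_chunk_years, serial_years_per_chunk):
    """Lower bound on calendar time with unlimited staff.

    Assumption: every chunk, serial or parallel, can be worked on at the
    same time as the others, so the longest single chunk sets the bound.
    """
    return max(parallel_chunk_years, serial_years_per_chunk)

serial_person_years = SERIAL_CHUNKS * SERIAL_YEARS_PER_CHUNK
total_person_years = PARALLEL_PERSON_YEARS + serial_person_years

# "180 people over 5 years"
ideal = effective_person_years(180, 5)
# "1800 people over 10 years, with 10% of the people doing the job
# correctly half the time"
realistic = effective_person_years(1800, 10, fraction_effective=0.1, success_rate=0.5)

bound = min_calendar_years(PARALLEL_CHUNK_YEARS, SERIAL_YEARS_PER_CHUNK)

print("total person-years:", total_person_years)
print("ideal team output:", ideal, "realistic team output:", realistic)
print("calendar-time lower bound (years):", bound)
```

Under these assumptions, adding people only helps with the 900 parallelisable person-years. The calendar-time bound is set entirely by the serial chunks, which is why progress on the serial part is what would move timelines.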
If, instead of using interpretability tools in the loss function, we merely use it as a ‘validation set’ instead of the training set (i.e. using it as a ‘mulligan’), we might have better chances of picking up dangerous cognition before it gets out of hand so we can terminate the model and start over. We’re therefore still using interpretability in model selection, but the feedback loop is much less tight, so it’d be harder to Goodhart.
While only using the interpretability-tool-based filter for model selection is much weaker optimisation pressure than using it in the loss function, and hence makes goodharting harder and hence slower, it’s not clear that this would solve the problem in the long run. If the interpretability-tool-based filter captures everything we currently know to capture, and we don’t get new insights during the iterated process of model training and model selection, then it’s possible we’ll eventually end up goodharting the model selection process in the same way as SGD would goodhart the interpretability tool in the loss function.
I think it’s likely that we would gain more insights or have more time if we were to use the interpretability tool as a mulligan, and it’s possible the way we as AI builders optimise producing a model that passes the interpretability filters is qualitatively different from the way SGD (or whatever training algorithm is being used) would optimise the interpretability-filter loss function. However, in the spirit of paranoia/security mindset/etc., it’s worth pointing out that using the tool as a model selection filter doesn’t guarantee that an AGI that passes the filter is safer than if we used the interpretability tool as a training signal, in the limit of iterating to pass the interpretability tool model selection filter.
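As a toy illustration of why selection alone can still goodhart, here is a minimal sketch of best-of-n selection on a noisy proxy. Everything in it is an assumption for illustration: each candidate "model" has a true quality drawn from a standard normal, the "interpretability filter" is that quality plus independent Gaussian noise, and we keep the candidate with the best filter score. It is not a model of SGD or of any real interpretability tool.

```python
import random

def mean_selection_gap(n_candidates, n_trials=2000, noise=1.0, seed=0):
    """Average (proxy score - true quality) of the candidate picked by proxy score."""
    rng = random.Random(seed)
    gap = 0.0
    for _ in range(n_trials):
        best_proxy, best_true = None, None
        for _ in range(n_candidates):
            true_quality = rng.gauss(0.0, 1.0)
            # The filter only sees a noisy measurement of the true quality.
            proxy = true_quality + rng.gauss(0.0, noise)
            if best_proxy is None or proxy > best_proxy:
                best_proxy, best_true = proxy, true_quality
        gap += best_proxy - best_true
    return gap / n_trials

gap_single = mean_selection_gap(1)   # accept the first model trained
gap_many = mean_selection_gap(50)    # keep the best of 50 by filter score

print("gap with 1 candidate:  ", round(gap_single, 3))
print("gap with 50 candidates:", round(gap_many, 3))
```

In this toy, the more candidates are screened, the more the winner’s filter score overstates its true quality, even though no gradient ever touched the filter. The pressure is weaker than in a loss function, but it is still optimisation against the proxy.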
First condition: assess reasoning authenticity
Being able to do this step in the most general setting seems to capture the entire difficulty of interpretability: if we could assess whether a model’s outputs faithfully reflect its internal “thinking”, and hence that all of its reasoning is what we’re seeing, then that would be a huge jump forwards on (and perhaps even be equivalent to solving) something like ELK. Given that that problem is known to be quite difficult, and we currently don’t have solutions for it, I’m uncertain whether this reduction of aligning a language model into first verifying that all its visible reasoning is complete, correct and faithful, and then doing other steps (i.e. actively optimising against our measures of correct reasoning), is one that makes the problem easier. Do you think it’s meaningfully different (i.e. easier) to solve “assess reasoning authenticity” completely than to solve ELK, or another hard interpretability problem?
I don’t know whether this is on purpose, but I’d think that AI Safety Via Debate (original paper: https://arxiv.org/abs/1805.00899; recent progress report: https://www.lesswrong.com/posts/Br4xDbYu4Frwrb64a/writeup-progress-on-ai-safety-via-debate-1) should get a mention, probably in the “Technical agendas focused on possible solutions” section? I’d argue it’s different enough from IDA to have its own subtitle.