Programme Director at UK Advanced Research + Invention Agency focusing on safe transformative AI; formerly Protocol Labs, FHI/Oxford, Harvard Biophysics, MIT Mathematics and Computation.
davidad
Is this basically Stuart Russell’s provably beneficial AI?
As a category theorist, I am confused by the diagram that you say you included to mess with me; I’m not even sure what I was supposed to think it means (where is the cone for ? why does the direction of the arrow between and seem inconsistent?).
I think a “minimal latent,” as you have defined it equationally, is a categorical product (of the $X_i$) in the coslice category $\Omega/\mathsf{Stoch}$, where $\mathsf{Stoch}$ is the category of Markov kernels and $\Omega$ is the implicit sample space with respect to which all the random variables are defined.
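Spelling out the universal property I mean, in the notation above (all composition is in $\mathsf{Stoch}$, i.e. Chapman–Kolmogorov):

$$\lambda\colon \Omega \to \Lambda,\qquad p_i\colon \Lambda \to X_i,\qquad p_i \circ \lambda = x_i,$$

such that for any other $\mu\colon \Omega \to M$ equipped with $q_i\colon M \to X_i$ satisfying $q_i \circ \mu = x_i$, there is a unique $h\colon M \to \Lambda$ with $h \circ \mu = \lambda$ and $p_i \circ h = q_i$.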
I believe this is a real second-order effect of AI discourse, and “how would this phrasing being in the corpus bias a GPT?” is something I frequently consider briefly when writing publicly, but I also think the first-order effect of striving for more accurate and precise understanding of how AI might behave should take precedence whenever there is a conflict. There’s already a lot of text in the corpora about misaligned AI—most of which is in science fiction, not even from researchers—so even if everyone in this community all stopped writing about instrumental convergence (seems costly!), it probably wouldn’t make much positive impact via this pathway.
I think it will also prove useful for world-modeling even with a naïve POMDP-style Cartesian boundary between the modeler and the environment, since the environment is itself generally well-modeled by a decomposition into locally stateful entities that interact in locally scoped ways (often restricted by naturally occurring boundaries).
I want to voice my strong support for attempts to define something like dependent type signatures for alignment-relevant components and use wiring diagrams and/or string diagrams (in some kind of double-categorical systems theory, such as David Jaz Myers’) to combine them into larger AI systems proposals. I also like the flowchart. I’m less excited about swiss-cheese security, but I think assemblages are also on the critical path for stronger guarantees.
Yes, it’s worth pulling out that the mesa-optimizers demonstrated here are not consequentialists; they are optimizing the goodness of fit of an internal representation to in-context data.
The role this plays in arguments about deceptive alignment is that it neutralizes the claim that “it’s probably not a realistically efficient or effective or inductive-bias-favoured structure to actually learn an internal optimization algorithm”. Arguments like “it’s not inductive-bias-favoured for mesa-optimizers to be consequentialists instead of maximizing purely epistemic utility” remain.
Although I predict someone will find consequentialist mesa-optimizers in Decision Transformers, that has not (to my knowledge) actually been seen yet.
I think it’s too easy for someone to skim this entire post and still completely miss the headline “this is strong empirical evidence that mesa-optimizers are real in practice”.
I think so, yes.
This is very interesting. I had previously thought the “KL penalty” being used in RLHF was just the local one that’s part of the PPO RL algorithm, but apparently I didn’t read the InstructGPT paper carefully enough.
I feel slightly better about RLHF now, but not much.
It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering. That could be seen as a lexicographic objective where the binarised reward gets optimised first and then the KL penalty relative to the predictive model would restore the predictive model’s correlations (once the binarised reward is absolutely saturated). Unfortunately, this would be computationally difficult with gradient descent since you would already have mode-collapse before the KL penalty started to act.
In practice (in the InstructGPT paper, at least), we have a linear mixture of the reward and the global KL penalty. Obviously, if the global KL penalty is weighted at zero, it doesn’t help avoid Causal Goodhart, nor if it’s weighted at $\epsilon$. Conversely, if it’s weighted at $+\infty$, the model won’t noticeably respond to human feedback. I think this setup has a linear tradeoff between how much helpfulness you get and how much you avoid Causal Goodhart.
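To spell out the two setups being contrasted, in my own notation ($\pi_0$ is the purely predictive model, $r$ the reward, $\beta$ the mixture weight, $\tau$ a reward threshold):

$$\text{constrained: } \min_\pi\; D_{\mathrm{KL}}(\pi \,\|\, \pi_0) \ \text{ subject to } \ r(x) \ge \tau \text{ whenever } \pi(x) > 0;$$

$$\text{linear mixture: } \max_\pi\; \mathbb{E}_{x \sim \pi}[r(x)] \;-\; \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0).$$

The constrained problem is solved exactly by conditioning $\pi_0$ on the event $\{r \ge \tau\}$ (hence the equivalence to filtering), whereas the linear mixture trades reward against divergence from $\pi_0$ at a rate set by $\beta$; its exact optimum is the Boltzmann-style tilt $\pi^*(x) \propto \pi_0(x)\, e^{r(x)/\beta}$, which is the Boltzmann distribution mentioned below.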
The ELBO argument in the post you linked requires explicitly transforming the reward into a Boltzmann distribution (relative to the prior of the purely predictive model) before using it in the objective function, which seems computationally difficult. That post also suggests some other alternatives to RLHF that are more like cleverly accelerated filtering, such as PPLM, and has a broad conclusion that RL doesn’t seem like the best framework for aligning LMs.
That being said, both of the things I described above as computationally difficult also seem not-necessarily-impossible, and they would be research directions I would want to allocate a lot of thought to if I were leaning into RLHF as an alignment strategy.
That’s not the case when using a global KL penalty—as (I believe) OpenAI does in practice, and as Buck appeals to in this other comment. In the paper linked here a global KL penalty is only applied in section 3.6, because they observe a strictly larger gap between proxy and gold reward when doing so.
In RLHF there are at least three different (stochastic) reward functions:
1. the learned value network,
2. the “human clicks 👍/👎” process, and
3. the “what if we asked a whole human research group and they had unlimited time and assistance to deliberate about this one answer” process.
I think the first two correspond to what that paper calls “proxy” and “gold” but I am instead concerned with the ways in which 2 is a proxy for 3.
Extremal Goodhart relies on a feasibility boundary in $U \times V$-space that lacks orthogonality, in such a way that maximal $U$ logically implies non-maximal $V$. In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful—even though there are also maximally human-approved answers that are minimally useful! I think the feasible zone here looks pretty orthogonal, pretty close to a Cartesian product, so Extremal Goodhart won’t come up in either near-term or long-term applications. Near-term, it’s Causal Goodhart and Regressional Goodhart, and long-term, it might be Adversarial Goodhart.
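To make “orthogonality” concrete, in my formalisation (with $U$ the approval proxy, $V$ usefulness, and $F$ the feasible set of answers projected into $U \times V$-space):

$$\text{product-like: } F \approx U_0 \times V_0 \ \Rightarrow\ (\max U_0,\ \max V_0) \in F,$$

i.e. a maximally approved answer can simultaneously be maximally useful; whereas Extremal Goodhart needs a sloped frontier, where $u^* = \max_{(u,v)\in F} u$ implies $\sup\{v : (u^*, v) \in F\} < \max_{(u,v)\in F} v$.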
Extremal Goodhart might come into play if, for example, there are some truths about what’s useful that humans simply cannot be convinced of. In that case, I am fine with answers that pretend those things aren’t true, because I think the scope of that extremal tradeoff phenomenon will be small enough to cope with for the purpose of ending the acute risk period. (I would not trust it in the setting of “ambitious value learning that we defer the whole lightcone to.”)
For the record, I’m not very optimistic about filtering as an alignment scheme either, but in the setting of “let’s have some near-term assistance with alignment research”, I think Causal Goodhart is a huge problem for RLHF that is not a problem for equally powerful filtering. Regressional Goodhart will be a problem in any case, but it might be manageable given a training distribution of human origin.
Briefly, the alternative optimisation target I would suggest is performance at achieving intelligible, formally specified goals within a purely predictive model/simulation of the real world.
Humans could then look at what happens in the simulations and say “gee, that doesn’t look good,” and specify better goals instead, and the policy won’t experience gradient pressure to make those evaluations systematically wrong.
This isn’t the place where I want to make a case for the “competitiveness” or tractability of that kind of approach, but what I want to claim here is that it is an example of an alignment paradigm that does leverage machine learning (both to make a realistic model of the world and to optimise policies for acting within that model) but does not directly use human approval (or an opaque model thereof) as an optimisation target in the kind of way that seems problematic about RLHF.
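Here is a deliberately tiny, self-contained toy of that loop (all names and details are my own illustrative choices, not an existing system): the policy is optimised only against a formal specification evaluated inside a simulated world, and humans interact with the process solely by inspecting rollouts and revising the specification.

```python
import random

random.seed(0)

def world_model(policy, steps=20):
    """Toy stand-in for a learned predictive model: a 1-D world whose state is an integer position."""
    state, trajectory = 0, [0]
    for _ in range(steps):
        state += policy(state)          # policy outputs a move in {-1, 0, +1}
        trajectory.append(state)
    return trajectory

def formal_spec(trajectory):
    """Intelligible, formally specified goal: end near position 10 without ever dropping below -2."""
    safe = all(s >= -2 for s in trajectory)
    return (0 if safe else -100) - abs(trajectory[-1] - 10)

def optimise_policy(n_candidates=200):
    """Stand-in optimiser (random search): selection pressure comes only from formal_spec."""
    best_policy, best_score = None, float("-inf")
    for _ in range(n_candidates):
        bias = random.choice([-1, 0, 1])
        candidate = lambda s, b=bias: b  # trivially simple policy class
        score = formal_spec(world_model(candidate))
        if score > best_score:
            best_policy, best_score = candidate, score
    return best_policy, best_score

policy, score = optimise_policy()
print(score, world_model(policy))
# Humans look at trajectories like the one printed above; if they don't like
# what they see, they revise formal_spec. No gradient or selection pressure
# flows from "does a human approve of this?" into the policy.
```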
Here’s my steelman of this argument:
1. There is some quantity called a “level of performance”.
2. A certain level of performance, $p_{\mathrm{assist}}$, is necessary to assist humans in ending the acute risk period.
3. A higher level of performance, $p_{\mathrm{turn}} > p_{\mathrm{assist}}$, is necessary for a treacherous turn.
4. Any given alignment strategy $S$ is associated with a factor $k_S$, such that it can convert an unaligned model with performance $p$ into an aligned model with performance $k_S \cdot p$.
5. The maximum achievable performance of unaligned models, $p_{\max}(t)$, increases somewhat gradually as a function of time $t$.
6. Given two alignment strategies $S_1$ and $S_2$ such that $k_{S_1} > k_{S_2}$, it is more likely that $k_{S_1} \cdot p_{\max}(t)$ reaches $p_{\mathrm{assist}}$ while $p_{\max}(t)$ is still below $p_{\mathrm{turn}}$ than that $k_{S_2} \cdot p_{\max}(t)$ does.
7. Therefore, a treacherous turn is less likely in a world with alignment strategy $S_1$ than in a world with only alignment strategy $S_2$.
I’m pretty skeptical of premises 3, 4, and 5, but I think the argument actually is valid, and my guess is that Buck and a lot of other folks in the prosaic-alignment space essentially just believe those premises are plausible.
Should we do research on alignment schemes which use RLHF as a building block? E.g. work on recursive oversight schemes or RLHF with adversarial training?
IMO, this kind of research is promising and I expect a large fraction of the best alignment research to look like this.
This seems like the key! It’s probably what people actually mean by the question “is RLHF a promising alignment strategy?”
Most of this post is laying out thoughtful reasoning about related but relatively uncontroversial questions like “is RLHF, narrowly construed, plausibly sufficient for alignment” (of course not) and “is RLHF, very broadly construed, plausibly useful for alignment” (of course yes). I don’t want to diminish the value of having those answers be more common-knowledge. But I do want to call attention to how little of the reasoning elsewhere in the post seems to me to support the plausibility of this opinion here, which is the most controversial and decision-relevant one, and which is stated without any direct justification. (There’s a little bit of justification for it elsewhere, which I’ve argued against in separate comments.) I’m afraid that one post which states a bunch of opinions about related questions, while including detailed reasoning but only for the less controversial ones, might be more persuasive than it ought to be about the juicier questions.
I’m not really aware of any compelling alternatives to this class of plan–“training a model based on a reward signal” is basically all of machine learning
I think the actual concern there is about human feedback, but you phrased the question as being about overseer feedback, and then your answer (quoted) is about any reward signal at all.
Is next-token prediction already “training a model based on a reward signal”? A little bit—there’s a loss function! But is it effectively RL on next-token-prediction reward/feedback? Not really. Next-token prediction, by contrast to RL, only does one-step lookahead and doesn’t use a value network (only a policy network). Next-token prediction is qualitatively different from RL because it doesn’t do any backprop-through-time (which can induce emergent/convergent forward-looking “grooves” in trajectory-space which were not found in the [pre]training distribution).
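To make that contrast concrete, a minimal sketch in my own toy code (not any particular library’s API): the next-token loss at step t depends only on the one-step-ahead distribution at t, whereas an RL value target at step t bootstraps from step t+1, propagating credit backward along the trajectory.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Next-token prediction: each step's signal is purely one-step-ahead."""
    # logits: [T, vocab]; targets: [T] integer token ids
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def td_value_targets(rewards, values, gamma=0.99):
    """RL with a value network: each step's target bootstraps from the next step."""
    # rewards, values: [T]; target_t = r_t + gamma * V(s_{t+1}), with V past the end := 0
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values

print(next_token_loss(np.zeros((3, 5)), np.array([1, 2, 3])))   # uniform model: ln(5) ≈ 1.609
print(td_value_targets(np.array([0.0, 0.0, 1.0]), np.array([0.2, 0.5, 0.9])))
```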
Perhaps more importantly, maintaining some qualitative distance between the optimisation target for an AI model and the human “does this look good?” function seems valuable, for similar reasons to why one holds out some of the historical data while training a financial model, or why one separates a test distribution from a validation distribution. When we give the optimisation process unfettered access to the function that’s ultimately going to make decisions about how all-things-considered good the result is, the opportunities for unmitigated overfitting/Goodharting are greatly increased.
I don’t think RLHF seems worse than [prompting or filtering to] try to get your model to use its capabilities to be helpful. (Many LessWrong commenters disagree with me here but I haven’t heard them give any arguments that I find compelling.)
Where does the following argument fall flat for you?
1. Human feedback is systematically wrong, e.g. approving of impressively well-reasoned-looking confabulations much more than “I don’t know.”
2. The bias induced by sequence prediction preserves some of the correlations found in Internet text between features that human raters use as proxies (like being impressively well-reasoned-looking) and features that those are proxies for (like being internally coherent, and being consistent with other information found in the corpus).
3. RL fine-tuning applies optimisation pressure to break these correlations in favor of optimising the proxies.
4. Filtering (post-sequence-generation) preserves the correlations while optimising, making it more likely that the outputs which pass the filter have the underlying features humans thought they were selecting for.
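As a toy numerical illustration of points 2 and 4 above (with point 3 noted only in a comment); everything here is my own construction, with a scalar “underlying quality” u and a noisy rater-visible proxy v:

```python
import numpy as np

rng = np.random.default_rng(0)

# Under the base (sequence-prediction) distribution, the feature raters care
# about (u) and the feature they can actually score (v) are correlated.
n = 200_000
u = rng.normal(size=n)           # underlying quality (e.g. actual coherence)
v = u + rng.normal(size=n)       # rater-visible proxy (e.g. well-reasoned-looking polish)

# Filtering: sample from the base distribution and keep only high-proxy outputs.
# The base distribution's u-v correlation is exactly what does the work.
threshold = np.quantile(v, 0.99)
print("base mean u:     ", round(u.mean(), 3))
print("filtered mean u: ", round(u[v > threshold].mean(), 3))

# RL fine-tuning, by contrast, reshapes the sampling distribution itself under
# gradient pressure on v; if there is any cheap way to push v up without pushing
# u up, that pressure will find it, breaking the very correlation that filtering
# relies on.
```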
I agree that, even for side-channels exposing external information to a mathematical attacker, we cannot get this absolutely perfect. Error-correction in microelectronics is an engineering problem and engineering is never absolutely fault-free.
However, per this recent US government study, RAM error rates in high-performance compute clusters range from 0.2 to 20 faults per billion device-hours. For comparison, training GPT-3 (175B parameters) from scratch takes roughly 1-3 million device-hours. An attacker inside a deep learning training run probably gets zero bits of information via the RAM-error channel.
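Back-of-envelope, taking the pessimistic end of both of those round numbers:

```python
fault_rate = 20e-9        # faults per device-hour (upper end of the quoted range)
device_hours = 3e6        # upper end of the GPT-3-scale training estimate
print(fault_rate * device_hours)   # 0.06 expected faults over the entire run
```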
But suppose they get a few bits. Those bits are about as random as they come. Nor is there anything clever to do from within an algorithm to amplify the extent to which cosmic rays reflect useful information about life on Earth.
I disbelieve that your claims about real-world exploits, if cashed out, would break the abstraction of deterministic execution such as is implemented in practice for blockchain smart contracts.
I do think it’s prudent to use strong hardware and software error-correction techniques in high-stakes situations, such as advanced AI, but mostly because errors are generally bad, for reliability and ability to reason about systems and their behaviours. The absolute worst would be if the sign bit got flipped somewhere in a mesa-optimiser’s utility function. So I’m not saying we can just completely neglect concerns about cosmic rays in an AI safety context. But I am prepared to bet the farm on task-specific AIs being completely unable to learn any virology via side-channels if the AI lab training it musters a decent effort to be careful about deterministic execution (which, I stress again, is not something I think happens by default—I hope this post has some causal influence towards making it more likely).
There’s a lot of similarity. People (including myself in the past) have criticized Russell on the basis that no formal model can prove properties of real-world effects, because the map is not the territory, but I now agree with Russell that it’s plausible to get good enough maps. However:
I think it’s quite likely that this is only possible with an infra-Bayesian (or credal-set) approach to explicitly account for Knightian uncertainty, which seems to be a difference from Russell’s published proposals (although he has investigated Halpern-style probability logics, which have some similarities to credal sets, he mostly gravitates toward frameworks with ordinary Bayesian semantics).
Instead of an IRL or CIRL approach to value learning, I propose to rely primarily on linguistic dialogues that are grounded in a fully interpretable representation of preferences. A crux for this is that I believe success in the current stage of humanity’s game does not require loading very much of human values.