Programme Director at UK Advanced Research + Invention Agency focusing on safe transformative AI; formerly Protocol Labs, FHI/Oxford, Harvard Biophysics, MIT Mathematics And Computation.
davidad
Nate [replying to Eric Drexler]: I expect that, if you try to split these systems into services, then you either fail to capture the heart of intelligence and your siloed AIs are irrelevant, or you wind up with enough AGI in one of your siloes that you have a whole alignment problem (hard parts and all) in there. Like, I see this plan as basically saying “yep, that hard problem is in fact too hard, let’s try to dodge it, by having humans + narrow AI services perform the pivotal act”. Setting aside how I don’t particularly expect this to work, we can at least hopefully agree that it’s attempting to route around the problems that seem to me to be central, rather than attempting to solve them.
I think, in an open agency architecture, the silo that gets “enough AGI” is in step 2, and it is pointed at the desired objective by having formal specifications and model-checking against them.
But I also wouldn’t object to the charge that an open agency architecture would “route around the central problem,” if you define the central problem as something like building a system that you’d be happy for humanity to defer to forever. In the long run, something like more ambitious value learning (or value discovery) will be needed, on pain of astronomical waste. This would be, in a sense, a compromise (or, if you’re optimistic, a contingency plan), motivated by short timelines and insufficient theoretical progress toward full normative alignment.
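The "pointed at the desired objective by formal specifications and model-checking" step can be illustrated with a toy sketch. This is a hypothetical miniature, not the actual OAA pipeline: an exhaustive reachability check of a small transition system against a safety invariant, with `check_safety` and the counter world-model both made up for illustration.

```python
from collections import deque

def check_safety(initial, successors, is_safe):
    """Exhaustive (BFS) model check: does every reachable state satisfy is_safe?
    Returns (True, None) if the invariant holds, else (False, counterexample)."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not is_safe(state):
            return False, state          # concrete counterexample trace endpoint
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True, None

# Toy world model: a counter an agent may increment by 1 or 2, capped at 10.
succ = lambda s: [min(s + d, 2) for d in (1, 2)] if False else [min(s + d, 10) for d in (1, 2)]
ok, cex = check_safety(0, succ, lambda s: s <= 10)   # spec: never exceed 10
assert ok and cex is None
```

The point of the sketch is only that a specification is checked against a model of the system, rather than hoped for; real-world versions would need formal world-models far richer than a counter.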
GPT-3.5-Nate [prompted using 1061 tokens of Nate’s criticism of related ideas]: This is a really interesting proposal! It seems to me that you’re trying to solve the hard problem of AI alignment, by proposing a system that would be able to identify the concepts that an AGI should be optimizing for, and then use a combination of model-checking and time-bounded optimization to ensure that the AGI’s behavior is robustly directed at those concepts. This is a really ambitious project, and I’m excited to see how it develops.
🙃
How would you respond to predicted objections from Nate Soares?
There’s a lot of similarity. People (including myself in the past) have criticized Russell on the basis that no formal model can prove properties of real-world effects, because the map is not the territory, but I now agree with Russell that it’s plausible to get good enough maps. However:
I think it’s quite likely that this is only possible with an infra-Bayesian (or credal-set) approach to explicitly account for Knightian uncertainty, which seems to be a difference from Russell’s published proposals (although he has investigated Halpern-style probability logics, which have some similarities to credal sets, he mostly gravitates toward frameworks with ordinary Bayesian semantics).
Instead of an IRL or CIRL approach to value learning, I propose to rely primarily on linguistic dialogues that are grounded in a fully interpretable representation of preferences. A crux for this is that I believe success in the current stage of humanity’s game does not require loading very much of human values.
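The operational difference between ordinary Bayesian semantics and a credal-set approach can be sketched in a few lines. This is a toy illustration with made-up numbers: a credal set is a set of candidate distributions, and an action is scored by its worst-case expected utility over that set rather than by a single posterior.

```python
# Toy credal-set (infra-Bayesian-flavoured) decision rule: evaluate each action
# by its worst-case expected utility over a set of candidate distributions,
# explicitly representing Knightian uncertainty. All numbers are invented.
worlds = ["calm", "storm"]
credal_set = [
    {"calm": 0.99, "storm": 0.01},   # candidate distribution A
    {"calm": 0.50, "storm": 0.50},   # candidate distribution B
]
utility = {
    "sail": {"calm": 10.0, "storm": -100.0},
    "stay": {"calm": 1.0,  "storm": 1.0},
}

def worst_case_eu(action):
    return min(sum(p[w] * utility[action][w] for w in worlds)
               for p in credal_set)

# A Bayesian committed to distribution A alone would sail (EU = 8.9);
# the maximin rule refuses to gamble on A being the right map.
best = max(utility, key=worst_case_eu)
assert best == "stay"
```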
Is this basically Stuart Russell’s provably beneficial AI?
An Open Agency Architecture for Safe Transformative AI
As a category theorist, I am confused by the diagram that you say you included to mess with me; I’m not even sure what I was supposed to think it means (where is the cone for ? why does the direction of the arrow between and seem inconsistent?).
I think a “minimal latent,” as you have defined it equationally, is a categorical product (of the $X_i$) in the coslice category $\Omega/\mathsf{Stoch}$, where $\mathsf{Stoch}$ is the category of Markov kernels and $\Omega$ is the implicit sample space with respect to which all the random variables are defined.
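Spelling out the universal property (writing $\mathsf{Stoch}$ for the category of Markov kernels and $\Omega$ for the implicit sample space; the notation is assumed here, not taken from the post):

$$
\begin{aligned}
&\text{Objects of } \Omega/\mathsf{Stoch}\text{: kernels } f : \Omega \to X.\\
&\text{A product of } (f_i : \Omega \to X_i)_i \text{ is a kernel } \lambda : \Omega \to \Lambda
  \text{ with projections } \pi_i : \Lambda \to X_i,\\
&\qquad \pi_i \circ \lambda = f_i \quad \text{for all } i,\\
&\text{such that any } g : \Omega \to Y \text{ equipped with } q_i : Y \to X_i,\ q_i \circ g = f_i,\\
&\text{factors uniquely: } \exists!\, u : Y \to \Lambda \text{ with } u \circ g = \lambda
  \text{ and } \pi_i \circ u = q_i.
\end{aligned}
$$

The "minimality" of the latent corresponds exactly to this factorization: any other latent mediating all the $X_i$ maps onto it.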
I believe this is a real second-order effect of AI discourse, and “how would this phrasing being in the corpus bias a GPT?” is something I briefly consider whenever I write publicly, but I also think the first-order effect of striving for a more accurate and precise understanding of how AI might behave should take precedence whenever there is a conflict. There’s already a lot of text in the corpora about misaligned AI—most of it in science fiction, not even from researchers—so even if everyone in this community stopped writing about instrumental convergence (which seems costly!), it probably wouldn’t make much positive impact via this pathway.
I think it will also prove useful for world-modeling even with a naïve POMDP-style Cartesian boundary between the modeler and the environment, since the environment is itself generally well-modeled by a decomposition into locally stateful entities that interact in locally scoped ways (often restricted by naturally occurring boundaries).
I want to voice my strong support for attempts to define something like dependent type signatures for alignment-relevant components and use wiring diagrams and/or string diagrams (in some kind of double-categorical systems theory, such as David Jaz Myers’) to combine them into larger AI systems proposals. I also like the flowchart. I’m less excited about swiss-cheese security, but I think assemblages are also on the critical path for stronger guarantees.
Yes, it’s worth pulling out that the mesa-optimizers demonstrated here are not consequentialists; they are optimizing the goodness of fit of an internal representation to in-context data.
The role this plays in arguments about deceptive alignment is that it neutralizes the claim that “it’s probably not a realistically efficient or effective or inductive-bias-favoured structure to actually learn an internal optimization algorithm”. Arguments like “it’s not inductive-bias-favoured for mesa-optimizers to be consequentialists instead of maximizing purely epistemic utility” remain.
Although I predict someone will find consequentialist mesa-optimizers in Decision Transformers, that has not (to my knowledge) actually been seen yet.
I think it’s too easy for someone to skim this entire post and still completely miss the headline “this is strong empirical evidence that mesa-optimizers are real in practice”.
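The distinction above can be made concrete with a toy stand-in (hypothetical, not the paper's actual model): a "mesa-optimizer" whose forward pass fits an internal weight vector to in-context data by least squares. It solves an optimization problem on every call, but the objective is purely epistemic goodness of fit, not any consequence in the world.

```python
import numpy as np

def forward(context_x, context_y, query_x, ridge=1e-6):
    """Toy non-consequentialist mesa-optimizer: each forward pass solves a
    ridge-regularized least-squares problem to fit an internal representation
    (the weight vector w) to the in-context examples, then predicts the query.
    It optimizes fit to data, not outcomes."""
    X = np.asarray(context_x, dtype=float)
    y = np.asarray(context_y, dtype=float)
    w = np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)
    return np.asarray(query_x, dtype=float) @ w

# In-context data generated by y = 2*x1 - 3*x2; the inner optimizer recovers it.
pred = forward([[1, 0], [0, 1], [1, 1]], [2, -3, -1], [2, 1])
assert abs(pred - 1.0) < 1e-3   # 2*2 - 3*1 = 1
```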
I think so, yes.
AI Neorealism: a threat model & success criterion for existential safety
This is very interesting. I had previously thought the “KL penalty” being used in RLHF was just the local one that’s part of the PPO RL algorithm, but apparently I didn’t read the InstructGPT paper carefully enough.
I feel slightly better about RLHF now, but not much.
It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering. That could be seen as a lexicographic objective where the binarised reward gets optimised first and then the KL penalty relative to the predictive model would restore the predictive model’s correlations (once the binarised reward is absolutely saturated). Unfortunately, this would be computationally difficult with gradient descent since you would already have mode-collapse before the KL penalty started to act.
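The equivalence between threshold-constrained KL minimization and conditioning can be checked numerically on a toy discrete distribution (all numbers made up): among distributions supported where the binarised reward is saturated, the one minimizing KL to the prior is exactly the prior conditioned on that event, i.e. filtering.

```python
import math

p = {"a": 0.5, "b": 0.3, "c": 0.2}          # prior (the predictive model)
reward = {"a": 0.0, "b": 1.0, "c": 1.0}     # binarised reward
feasible = [x for x in p if reward[x] >= 1.0]

# Conditioning / filtering: renormalize the prior on the feasible set.
z = sum(p[x] for x in feasible)
conditioned = {x: p[x] / z for x in feasible}

def kl(q):
    """KL(q || p) over q's support."""
    return sum(q[x] * math.log(q[x] / p[x]) for x in q if q[x] > 0)

# Any other distribution satisfying the reward constraint has larger KL,
# so the constrained KL-minimizer restores the prior's correlations.
for q in [{"b": 0.9, "c": 0.1}, {"b": 0.5, "c": 0.5}, {"b": 1.0, "c": 0.0}]:
    assert kl(conditioned) <= kl(q) + 1e-12
```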
In practice (in the InstructGPT paper, at least), we have a linear mixture of the reward and the global KL penalty. Obviously, if the global KL penalty is weighted at zero, it doesn’t help avoid Causal Goodhart, nor if it’s weighted at some tiny ε. Conversely, if it’s weighted at ∞, the model won’t noticeably respond to human feedback. I think this setup has a linear tradeoff between how much helpfulness you get and how much you avoid Causal Goodhart.
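The linear mixture has a well-known closed-form optimum: maximizing E_q[reward] − β·KL(q‖p) yields the Boltzmann tilt q(x) ∝ p(x)·exp(reward(x)/β). A toy sweep over β (made-up numbers) exhibits both extremes of the tradeoff described above.

```python
import math

p = {"a": 0.5, "b": 0.3, "c": 0.2}       # predictive prior
reward = {"a": 0.0, "b": 1.0, "c": 2.0}  # toy learned reward

def tilt(beta):
    """Maximizer of E_q[reward] - beta * KL(q || p): q(x) ∝ p(x) exp(r(x)/beta)."""
    w = {x: p[x] * math.exp(reward[x] / beta) for x in p}
    z = sum(w.values())
    return {x: w[x] / z for x in w}

low = tilt(0.01)    # weak KL penalty: mode collapse onto the reward argmax
high = tilt(100.0)  # strong KL penalty: indistinguishable from the prior
assert low["c"] > 0.999
assert all(abs(high[x] - p[x]) < 0.01 for x in p)
```

Between the extremes, β trades reward against fidelity to the prior's correlations, which is the linear tradeoff in question.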
The ELBO argument in the post you linked requires explicitly transforming the reward into a Boltzmann distribution (relative to the prior of the purely predictive model) before using it in the objective function, which seems computationally difficult. That post also suggests some other alternatives to RLHF that are more like cleverly accelerated filtering, such as PPLM, and has a broad conclusion that RL doesn’t seem like the best framework for aligning LMs.
That being said, both of the things I said seem computationally difficult above also seem not-necessarily-impossible and would be research directions I would want to allocate a lot of thought to if I were leaning into RLHF as an alignment strategy.
That’s not the case when using a global KL penalty—as (I believe) OpenAI does in practice, and as Buck appeals to in this other comment. In the paper linked here a global KL penalty is only applied in section 3.6, because they observe a strictly larger gap between proxy and gold reward when doing so.
In RLHF there are at least three different (stochastic) reward functions:
1. the learned value network,
2. the “human clicks 👍/👎” process, and
3. the “what if we asked a whole human research group and they had unlimited time and assistance to deliberate about this one answer” process.
I think the first two correspond to what that paper calls “proxy” and “gold” but I am instead concerned with the ways in which 2 is a proxy for 3.
Extremal Goodhart relies on a feasibility boundary in approval–usefulness space that lacks orthogonality, in such a way that maximal approval logically implies non-maximal usefulness. In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful—even though there are also maximally human-approved answers that are minimally useful! I think the feasible zone here looks pretty orthogonal, pretty close to a Cartesian product, so Extremal Goodhart won’t come up in either near-term or long-term applications. Near-term, it’s Causal Goodhart and Regressional Goodhart, and long-term, it might be Adversarial Goodhart.
Extremal Goodhart might come into play if, for example, there are some truths about what’s useful that humans simply cannot be convinced of. In that case, I am fine with answers that pretend those things aren’t true, because I think the scope of that extremal tradeoff phenomenon will be small enough to cope with for the purpose of ending the acute risk period. (I would not trust it in the setting of “ambitious value learning that we defer the whole lightcone to.”)
For the record, I’m not very optimistic about filtering as an alignment scheme either, but in the setting of “let’s have some near-term assistance with alignment research”, I think Causal Goodhart is a huge problem for RLHF that is not a problem for equally powerful filtering. Regressional Goodhart will be a problem in any case, but it might be manageable given a training distribution of human origin.
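The orthogonality point about Extremal Goodhart can be illustrated with two toy feasible sets (invented scores): when the feasible region is close to a Cartesian product, the maximally approved answers include maximally useful ones; Extremal Goodhart needs a frontier geometry where they cannot coincide.

```python
import itertools

# Toy answer space scored on (approval, usefulness), each in {0,...,5}.
product_region = list(itertools.product(range(6), repeat=2))          # ~Cartesian product
tradeoff_region = [(a, u) for a, u in product_region if a + u <= 5]   # hard frontier

def best_usefulness_at_max_approval(region):
    """Among the maximally approved answers, how useful can one be?"""
    max_a = max(a for a, _ in region)
    return max(u for a, u in region if a == max_a)

# Orthogonal-ish feasible set: some maximally approved answers are maximally useful.
assert best_usefulness_at_max_approval(product_region) == 5
# Extremal-Goodhart geometry: maximal approval forces minimal usefulness.
assert best_usefulness_at_max_approval(tradeoff_region) == 0
```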
What about inner misalignment?