Programme Director at UK Advanced Research + Invention Agency focusing on safe transformative AI; formerly Protocol Labs, FHI/Oxford, Harvard Biophysics, MIT Mathematics and Computation.
davidad
Is this basically Stuart Russell’s provably beneficial AI?
As a category theorist, I am confused by the diagram that you say you included to mess with me; I’m not even sure what I was supposed to think it means (where is the cone for ? why does the direction of the arrow between and seem inconsistent?).
I think a “minimal latent,” as you have defined it equationally, is a categorical product (of the $X_i$) in the coslice category $\Omega/\mathsf{Stoch}$, where $\mathsf{Stoch}$ is the category of Markov kernels and $\Omega$ is the implicit sample space with respect to which all the random variables are defined.
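Spelling out the universal property I mean, in the notation above (all composition is in $\mathsf{Stoch}$, i.e. Chapman–Kolmogorov):

$$\lambda\colon \Omega \to \Lambda,\qquad p_i\colon \Lambda \to X_i,\qquad p_i \circ \lambda = x_i,$$

such that for any other $\mu\colon \Omega \to M$ equipped with $q_i\colon M \to X_i$ satisfying $q_i \circ \mu = x_i$, there is a unique $h\colon M \to \Lambda$ with $h \circ \mu = \lambda$ and $p_i \circ h = q_i$.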
I believe this is a real second-order effect of AI discourse, and “how would this phrasing being in the corpus bias a GPT?” is something I frequently consider briefly when writing publicly, but I also think the first-order effect of striving for more accurate and precise understanding of how AI might behave should take precedence whenever there is a conflict. There’s already a lot of text in the corpora about misaligned AI—most of which is in science fiction, not even from researchers—so even if everyone in this community all stopped writing about instrumental convergence (seems costly!), it probably wouldn’t make much positive impact via this pathway.
I think it will also prove useful for world-modeling even with a naïve POMDP-style Cartesian boundary between the modeler and the environment, since the environment is itself generally well-modeled by a decomposition into locally stateful entities that interact in locally scoped ways (often restricted by naturally occurring boundaries).
I want to voice my strong support for attempts to define something like dependent type signatures for alignment-relevant components and use wiring diagrams and/or string diagrams (in some kind of double-categorical systems theory, such as David Jaz Myers’) to combine them into larger AI systems proposals. I also like the flowchart. I’m less excited about swiss-cheese security, but I think assemblages are also on the critical path for stronger guarantees.
Yes, it’s worth pulling out that the mesa-optimizers demonstrated here are not consequentialists; they are optimizing the goodness of fit of an internal representation to in-context data.
The role this plays in arguments about deceptive alignment is that it neutralizes the claim that “it’s probably not a realistically efficient or effective or inductive-bias-favoured structure to actually learn an internal optimization algorithm”. Arguments like “it’s not inductive-bias-favoured for mesa-optimizers to be consequentialists instead of maximizing purely epistemic utility” remain.
Although I predict someone will find consequentialist mesa-optimizers in Decision Transformers, that has not (to my knowledge) actually been seen yet.
I think it’s too easy for someone to skim this entire post and still completely miss the headline “this is strong empirical evidence that mesa-optimizers are real in practice”.
I think so, yes.
This is very interesting. I had previously thought the “KL penalty” being used in RLHF was just the local one that’s part of the PPO RL algorithm, but apparently I didn’t read the InstructGPT paper carefully enough.
I feel slightly better about RLHF now, but not much.
It’s true that minimizing KL subject to a constraint of always exceeding a certain reward threshold would theoretically be equivalent to Bayesian conditioning and therefore equivalent to filtering. That could be seen as a lexicographic objective where the binarised reward gets optimised first and then the KL penalty relative to the predictive model would restore the predictive model’s correlations (once the binarised reward is absolutely saturated). Unfortunately, this would be computationally difficult with gradient descent since you would already have mode-collapse before the KL penalty started to act.
In practice (in the InstructGPT paper, at least), we have a linear mixture of the reward and the global KL penalty. Obviously, if the global KL penalty is weighted at zero, it doesn’t help avoid Causal Goodhart, nor if it’s weighted at $\epsilon$. Conversely, if it’s weighted at $+\infty$, the model won’t noticeably respond to human feedback. I think this setup has a linear tradeoff between how much helpfulness you get and how much you avoid Causal Goodhart.
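To spell out the two setups being contrasted, in my own notation ($\pi_0$ is the purely predictive model, $r$ the reward, $\beta$ the mixture weight, $\tau$ a reward threshold):

$$\text{constrained: } \min_\pi\; D_{\mathrm{KL}}(\pi \,\|\, \pi_0) \ \text{ subject to } \ r(x) \ge \tau \text{ whenever } \pi(x) > 0;$$

$$\text{linear mixture: } \max_\pi\; \mathbb{E}_{x \sim \pi}[r(x)] \;-\; \beta\, D_{\mathrm{KL}}(\pi \,\|\, \pi_0).$$

The constrained problem is solved exactly by conditioning $\pi_0$ on the event $\{r \ge \tau\}$ (hence the equivalence to filtering), whereas the linear mixture trades reward against divergence from $\pi_0$ at a rate set by $\beta$; its exact optimum is the Boltzmann-style tilt $\pi^*(x) \propto \pi_0(x)\, e^{r(x)/\beta}$, which is the Boltzmann distribution mentioned below.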
The ELBO argument in the post you linked requires explicitly transforming the reward into a Boltzmann distribution (relative to the prior of the purely predictive model) before using it in the objective function, which seems computationally difficult. That post also suggests some other alternatives to RLHF that are more like cleverly accelerated filtering, such as PPLM, and has a broad conclusion that RL doesn’t seem like the best framework for aligning LMs.
That being said, both of the things I described above as computationally difficult also seem not-necessarily-impossible, and they would be research directions I would want to allocate a lot of thought to if I were leaning into RLHF as an alignment strategy.
That’s not the case when using a global KL penalty—as (I believe) OpenAI does in practice, and as Buck appeals to in this other comment. In the paper linked here a global KL penalty is only applied in section 3.6, because they observe a strictly larger gap between proxy and gold reward when doing so.
In RLHF there are at least three different (stochastic) reward functions:
1. the learned value network,
2. the “human clicks 👍/👎” process, and
3. the “what if we asked a whole human research group and they had unlimited time and assistance to deliberate about this one answer” process.
I think the first two correspond to what that paper calls “proxy” and “gold” but I am instead concerned with the ways in which 2 is a proxy for 3.
Extremal Goodhart relies on a feasibility boundary in $U \times V$-space that lacks orthogonality, in such a way that maximal $U$ logically implies non-maximal $V$. In the case of useful and human-approved answers, I expect that in fact, there exist maximally human-approved answers that are also maximally useful—even though there are also maximally human-approved answers that are minimally useful! I think the feasible zone here looks pretty orthogonal, pretty close to a Cartesian product, so Extremal Goodhart won’t come up in either near-term or long-term applications. Near-term, it’s Causal Goodhart and Regressional Goodhart, and long-term, it might be Adversarial Goodhart.
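To make “orthogonality” concrete, in my formalisation (with $U$ the approval proxy, $V$ usefulness, and $F$ the feasible set of answers projected into $U \times V$-space):

$$\text{product-like: } F \approx U_0 \times V_0 \ \Rightarrow\ (\max U_0,\ \max V_0) \in F,$$

i.e. a maximally approved answer can simultaneously be maximally useful; whereas Extremal Goodhart needs a sloped frontier, where $u^* = \max_{(u,v)\in F} u$ implies $\sup\{v : (u^*, v) \in F\} < \max_{(u,v)\in F} v$.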
Extremal Goodhart might come into play if, for example, there are some truths about what’s useful that humans simply cannot be convinced of. In that case, I am fine with answers that pretend those things aren’t true, because I think the scope of that extremal tradeoff phenomenon will be small enough to cope with for the purpose of ending the acute risk period. (I would not trust it in the setting of “ambitious value learning that we defer the whole lightcone to.”)
For the record, I’m not very optimistic about filtering as an alignment scheme either, but in the setting of “let’s have some near-term assistance with alignment research”, I think Causal Goodhart is a huge problem for RLHF that is not a problem for equally powerful filtering. Regressional Goodhart will be a problem in any case, but it might be manageable given a training distribution of human origin.
Briefly, the alternative optimisation target I would suggest is performance at achieving intelligible, formally specified goals within a purely predictive model/simulation of the real world.
Humans could then look at what happens in the simulations and say “gee, that doesn’t look good,” and specify better goals instead, and the policy won’t experience gradient pressure to make those evaluations systematically wrong.
This isn’t the place where I want to make a case for the “competitiveness” or tractability of that kind of approach, but what I want to claim here is that it is an example of an alignment paradigm that does leverage machine learning (both to make a realistic model of the world and to optimise policies for acting within that model) but does not directly use human approval (or an opaque model thereof) as an optimisation target in the kind of way that seems problematic about RLHF.
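Here is a deliberately tiny, self-contained toy of that loop (all names and details are my own illustrative choices, not an existing system): the policy is optimised only against a formal specification evaluated inside a simulated world, and humans interact with the process solely by inspecting rollouts and revising the specification.

```python
import random

random.seed(0)

def world_model(policy, steps=20):
    """Toy stand-in for a learned predictive model: a 1-D world whose state is an integer position."""
    state, trajectory = 0, [0]
    for _ in range(steps):
        state += policy(state)          # policy outputs a move in {-1, 0, +1}
        trajectory.append(state)
    return trajectory

def formal_spec(trajectory):
    """Intelligible, formally specified goal: end near position 10 without ever dropping below -2."""
    safe = all(s >= -2 for s in trajectory)
    return (0 if safe else -100) - abs(trajectory[-1] - 10)

def optimise_policy(n_candidates=200):
    """Stand-in optimiser (random search): selection pressure comes only from formal_spec."""
    best_policy, best_score = None, float("-inf")
    for _ in range(n_candidates):
        bias = random.choice([-1, 0, 1])
        candidate = lambda s, b=bias: b  # trivially simple policy class
        score = formal_spec(world_model(candidate))
        if score > best_score:
            best_policy, best_score = candidate, score
    return best_policy, best_score

policy, score = optimise_policy()
print(score, world_model(policy))
# Humans look at trajectories like the one printed above; if they don't like
# what they see, they revise formal_spec. No gradient or selection pressure
# flows from "does a human approve of this?" into the policy.
```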
Here’s my steelman of this argument:
1. There is some quantity called a “level of performance”.
2. A certain level of performance, $p_{\mathrm{assist}}$, is necessary to assist humans in ending the acute risk period.
3. A higher level of performance, $p_{\mathrm{turn}} > p_{\mathrm{assist}}$, is necessary for a treacherous turn.
4. Any given alignment strategy $S$ is associated with a factor $k_S$, such that it can convert an unaligned model with performance $p$ into an aligned model with performance $k_S \cdot p$.
5. The maximum achievable performance of unaligned models, $p_{\max}(t)$, increases somewhat gradually as a function of time $t$.
6. Given two alignment strategies $S_1$ and $S_2$ such that $k_{S_1} > k_{S_2}$, it is more likely that $k_{S_1} \cdot p_{\max}(t)$ reaches $p_{\mathrm{assist}}$ while $p_{\max}(t)$ is still below $p_{\mathrm{turn}}$ than that $k_{S_2} \cdot p_{\max}(t)$ does.
7. Therefore, a treacherous turn is less likely in a world with alignment strategy $S_1$ than in a world with only alignment strategy $S_2$.
I’m pretty skeptical of premises 3, 4, and 5, but I think the argument actually is valid, and my guess is that Buck and a lot of other folks in the prosaic-alignment space essentially just believe those premises are plausible.
Should we do research on alignment schemes which use RLHF as a building block? E.g. work on recursive oversight schemes or RLHF with adversarial training?
IMO, this kind of research is promising and I expect a large fraction of the best alignment research to look like this.
This seems like the key! It’s probably what people actually mean by the question “is RLHF a promising alignment strategy?”
Most of this post is laying out thoughtful reasoning about related but relatively uncontroversial questions like “is RLHF, narrowly construed, plausibly sufficient for alignment” (of course not) and “is RLHF, very broadly construed, plausibly useful for alignment” (of course yes). I don’t want to diminish the value of having those answers be more common-knowledge. But I do want to call attention to how little of the reasoning elsewhere in the post seems to me to support the plausibility of this opinion here, which is the most controversial and decision-relevant one, and which is stated without any direct justification. (There’s a little bit of justification for it elsewhere, which I’ve argued against in separate comments.) I’m afraid that one post which states a bunch of opinions about related questions, while including detailed reasoning but only for the less controversial ones, might be more persuasive than it ought to be about the juicier questions.
I’m not really aware of any compelling alternatives to this class of plan–“training a model based on a reward signal” is basically all of machine learning
I think the actual concern there is about human feedback, but you phrased the question as being about overseer feedback, and then your answer (quoted) is about any reward signal at all.
Is next-token prediction already “training a model based on a reward signal”? A little bit—there’s a loss function! But is it effectively RL on next-token-prediction reward/feedback? Not really. Next-token prediction, by contrast to RL, only does one-step lookahead and doesn’t use a value network (only a policy network). Next-token prediction is qualitatively different from RL because it doesn’t do any backprop-through-time (which can induce emergent/convergent forward-looking “grooves” in trajectory-space which were not found in the [pre]training distribution).
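To make that contrast concrete, a minimal sketch in my own toy code (not any particular library’s API): the next-token loss at step t depends only on the one-step-ahead distribution at t, whereas an RL value target at step t bootstraps from step t+1, propagating credit backward along the trajectory.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Next-token prediction: each step's signal is purely one-step-ahead."""
    # logits: [T, vocab]; targets: [T] integer token ids
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

def td_value_targets(rewards, values, gamma=0.99):
    """RL with a value network: each step's target bootstraps from the next step."""
    # rewards, values: [T]; target_t = r_t + gamma * V(s_{t+1}), with V past the end := 0
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values

print(next_token_loss(np.zeros((3, 5)), np.array([1, 2, 3])))   # uniform model: ln(5) ≈ 1.609
print(td_value_targets(np.array([0.0, 0.0, 1.0]), np.array([0.2, 0.5, 0.9])))
```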
Perhaps more importantly, maintaining some qualitative distance between the optimisation target for an AI model and the human “does this look good?” function seems valuable, for similar reasons to why one holds out some of the historical data while training a financial model, or why one separates a test distribution from a validation distribution. When we give the optimisation process unfettered access to the function that’s ultimately going to make decisions about how all-things-considered good the result is, the opportunities for unmitigated overfitting/Goodharting are greatly increased.
I don’t think RLHF seems worse than [prompting or filtering to] try to get your model to use its capabilities to be helpful. (Many LessWrong commenters disagree with me here but I haven’t heard them give any arguments that I find compelling.)
Where does the following argument fall flat for you?
1. Human feedback is systematically wrong, e.g. approving of impressively well-reasoned-looking confabulations much more than “I don’t know.”
2. The bias induced by sequence prediction preserves some of the correlations found in Internet text between features that human raters use as proxies (like being impressively well-reasoned-looking) and features that those are proxies for (like being internally coherent, and being consistent with other information found in the corpus).
3. RL fine-tuning applies optimisation pressure to break these correlations in favor of optimising the proxies.
4. Filtering (post-sequence-generation) preserves the correlations while optimising, making it more likely that the outputs which pass the filter have the underlying features humans thought they were selecting for.
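As a toy numerical illustration of points 2 and 4 above (with point 3 noted only in a comment); everything here is my own construction, with a scalar “underlying quality” u and a noisy rater-visible proxy v:

```python
import numpy as np

rng = np.random.default_rng(0)

# Under the base (sequence-prediction) distribution, the feature raters care
# about (u) and the feature they can actually score (v) are correlated.
n = 200_000
u = rng.normal(size=n)           # underlying quality (e.g. actual coherence)
v = u + rng.normal(size=n)       # rater-visible proxy (e.g. well-reasoned-looking polish)

# Filtering: sample from the base distribution and keep only high-proxy outputs.
# The base distribution's u-v correlation is exactly what does the work.
threshold = np.quantile(v, 0.99)
print("base mean u:     ", round(u.mean(), 3))
print("filtered mean u: ", round(u[v > threshold].mean(), 3))

# RL fine-tuning, by contrast, reshapes the sampling distribution itself under
# gradient pressure on v; if there is any cheap way to push v up without pushing
# u up, that pressure will find it, breaking the very correlation that filtering
# relies on.
```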
I agree that, even for side-channels exposing external information to a mathematical attacker, we cannot get this absolutely perfect. Error-correction in microelectronics is an engineering problem and engineering is never absolutely fault-free.
However, per this recent US government study, RAM error rates in high-performance compute clusters range from 0.2 to 20 faults per billion device-hours. For comparison, training GPT-3 (175B parameters) from scratch takes roughly 1-3 million device-hours. An attacker inside a deep learning training run probably gets zero bits of information via the RAM-error channel.
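Back-of-envelope, taking the pessimistic end of both of those round numbers:

```python
fault_rate = 20e-9        # faults per device-hour (upper end of the quoted range)
device_hours = 3e6        # upper end of the GPT-3-scale training estimate
print(fault_rate * device_hours)   # 0.06 expected faults over the entire run
```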
But suppose they get a few bits. Those bits are about as random as they come. Nor is there anything clever to do from within an algorithm to amplify the extent to which cosmic rays reflect useful information about life on Earth.
I disbelieve that your claims about real-world exploits, if cashed out, would break the abstraction of deterministic execution such as is implemented in practice for blockchain smart contracts.
I do think it’s prudent to use strong hardware and software error-correction techniques in high-stakes situations, such as advanced AI, but mostly because errors are generally bad, for reliability and ability to reason about systems and their behaviours. The absolute worst would be if the sign bit got flipped somewhere in a mesa-optimiser’s utility function. So I’m not saying we can just completely neglect concerns about cosmic rays in an AI safety context. But I am prepared to bet the farm on task-specific AIs being completely unable to learn any virology via side-channels if the AI lab training it musters a decent effort to be careful about deterministic execution (which, I stress again, is not something I think happens by default—I hope this post has some causal influence towards making it more likely).
There’s a lot of similarity. People (including myself in the past) have criticized Russell on the basis that no formal model can prove properties of real-world effects, because the map is not the territory, but I now agree with Russell that it’s plausible to get good enough maps. However:
I think it’s quite likely that this is only possible with an infra-Bayesian (or credal-set) approach to explicitly account for Knightian uncertainty, which seems to be a difference from Russell’s published proposals (although he has investigated Halpern-style probability logics, which have some similarities to credal sets, he mostly gravitates toward frameworks with ordinary Bayesian semantics).
Instead of an IRL or CIRL approach to value learning, I propose to rely primarily on linguistic dialogues that are grounded in a fully interpretable representation of preferences. A crux for this is that I believe success in the current stage of humanity’s game does not require loading very much of human values.