I’m not sure how relevant meta-execution is anymore; I haven’t seen it discussed much recently. So probably you’d want to ask Paul, or someone else who was around earlier than I was.
Can you not meaningfully discuss “this amplification procedure is like an n-depth approximation of HCH at step x”, for any amplification procedure?
No, you can’t. For example, if your amplification procedure only allows you to ask a single subagent a single question, that will approximate a linear HCH instead of a tree-based HCH. If your amplification procedure doesn’t invoke subagents at all, but instead provides more and more facts to the agent, it doesn’t look anything like HCH. The canonical implementations of iterated amplification are trying to approximate HCH, though.
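To make the tree/chain distinction concrete, here's a rough toy sketch (my own illustrative Python, not any canonical implementation):

```python
# A toy sketch of the distinction: tree-based HCH lets each agent consult many
# subagents, while a "one subagent, one question" amplification procedure
# produces a chain instead. The ToyHuman class is a hypothetical stand-in.

class ToyHuman:
    """Hypothetical stand-in for the human policy H."""
    def decompose(self, question):
        # In reality the human would break the question into subquestions.
        return [f"sub-1 of ({question})", f"sub-2 of ({question})"]

    def answer(self, question, subanswers):
        # In reality the human would combine subanswers into an answer.
        return f"answer({question}; given {len(subanswers)} subanswers)"

def tree_hch(human, question, depth):
    """Tree-based HCH of the given depth: every subquestion gets its own subtree."""
    if depth == 0:
        return human.answer(question, [])
    subanswers = [tree_hch(human, q, depth - 1) for q in human.decompose(question)]
    return human.answer(question, subanswers)

def linear_hch(human, question, depth):
    """Amplification that only allows one subagent and one question: a chain."""
    if depth == 0:
        return human.answer(question, [])
    subquestion = human.decompose(question)[0]
    return human.answer(question, [linear_hch(human, subquestion, depth - 1)])
```

With a binary decomposition, the tree version consults 2^depth - 1 subagents, while the "one question" version only ever consults `depth` subagents, so the two scale very differently in how much work they can bring to bear on a question.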
For example, the internal structure of the distilled agent described in Christiano’s paper is unlikely to look anything like a tree. However, my (potentially incorrect?) impression is that the agent’s capabilities at step x are identical to an HCH tree of depth x if the underlying learning system is arbitrarily capable.
That sounds right to me.
I maybe think so? I’m not sure about meta-execution (as the comment thread above shows).
“Depth” only applies to the canonical tree-based implementation of IDA. If you slot in other amplification or distillation procedures, then you won’t necessarily have “depths” any more. You’ll still have recursion, and that recursion will lead to more and more capability. Where it ends up depends on your initial agent and how good the amplification and distillation procedures are.
That’s… a fair point. It does make up a substantial portion of the transparency section, which seems like the “solutions” part of this post, but it isn’t the entire post.
Matthew’s certainly right that I tend to reply to things I disagree with, though I usually try to avoid disagreeing with details. I’m not sure that I only disagree with details here, but I can’t clearly articulate what about this feels off to me. I’ll delete the opinion altogether; I’m not going to put an unclear opinion in the newsletter.
I certainly would count an ontological failure in the reward function as an incorrect belief about the reward function.
Glad it was useful!
My opinion, also going into the newsletter:
Like Matthew, I’m excited to see more work on transparency and adversarial training for inner alignment. I’m somewhat skeptical of the value of work that plans to decompose future models into a “world model”, “search” and “objective”: I would guess that there are many ways to achieve intelligent cognition that don’t easily factor into any of these concepts. It seems fine to study a system composed of a world model, search and objective in order to gain conceptual insight; I’m more worried about proposing it as an actual plan.
I mentioned in my opinion that I think many of my disagreements are because of an implicit disagreement on how we build powerful AI systems:
the book has an implied stance towards the future of AI research that I don’t agree with: I could imagine that powerful AI systems end up being created by learning alone without needing the conceptual breakthroughs that Stuart outlines.
I didn’t expand on this in the newsletter because I’m not clear enough on the disagreement; I try to avoid writing very confused thoughts that say wrong things about what other people believe in a publication read by a thousand people. But that’s fine for a comment here!
Rather than attribute a model to Stuart, I’m just going to make up a model that was inspired by reading HC, but wasn’t proposed by HC. In this model, we get a superintelligent AI system that is Bayesian-like, explicitly representing things like “beliefs”, “plans”, etc. Some more details (a rough code sketch of this kind of architecture follows after the list):
Things like ‘hierarchical planning’ are explicit algorithms. Simply looking at the algorithm can give you a lot of insight into how it does hierarchy. You can inspect things like “options” just by looking at inputs/outputs to the hierarchical planning module. The same thing applies for e.g. causal reasoning.
Any black box deep learning system is only used to provide low-level inputs to the real ‘intelligence’, in the same way that for humans vision provides low-level inputs for the rest of cognition. We don’t need to worry about the deep learning system “taking over”, in the same way that we don’t worry about our vision module “taking over”.
The AI system was created by breakthroughs in algorithms for causal reasoning, hierarchical planning, etc., that allow it to deal with the combinatorial explosion caused by the real world. As a result, it is very cheap to run (i.e. it doesn’t need a huge amount of compute). This is more compatible with a discontinuous takeoff, though a continuous takeoff is possible if the algorithms improve continuously over time rather than arriving through breakthroughs.
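Putting those details together, here is a minimal, entirely hypothetical sketch of the kind of architecture this made-up model imagines (all names and module boundaries are my own invention; nothing here comes from HC):

```python
# Hypothetical architecture sketch: explicit algorithms for planning and causal
# reasoning whose inputs/outputs can be inspected directly, with deep learning
# used only for low-level perception.

class PerceptionNet:
    """Black-box deep learning component: only turns raw pixels into low-level features."""
    def features(self, pixels):
        return {"objects": ["block", "table"]}  # stand-in output

class CausalReasoner:
    """Explicit causal reasoning algorithm; its hypotheses are inspectable."""
    def effects_of(self, action, beliefs):
        return {"predicted_outcome": f"result of {action}"}

class HierarchicalPlanner:
    """Explicit hierarchical planning; 'options' are visible at this interface."""
    def plan(self, goal, beliefs, reasoner):
        option = f"high-level option for {goal}"
        primitive_steps = [f"step {i} of {option}" for i in range(3)]
        return [option] + primitive_steps

class ExplicitAgent:
    def __init__(self):
        self.perception = PerceptionNet()
        self.reasoner = CausalReasoner()
        self.planner = HierarchicalPlanner()
        self.beliefs = {}  # explicit, inspectable belief state

    def act(self, pixels, goal):
        self.beliefs["scene"] = self.perception.features(pixels)
        return self.planner.plan(goal, self.beliefs, self.reasoner)
```

The point is just that the “intelligence” lives in the explicit planner and reasoner, whose inputs and outputs can be read off directly, while the deep learning component only does perception.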
Some implications of this model:
All of the “intelligence” is happening via explicit algorithms. We only need to make sure that the algorithms are aligned. So, we only have an outer alignment problem; there is no inner alignment problem.
Since the system is mostly Bayesian, the main challenges are to avoid misspecification (solution: use something equivalent to the Solomonoff prior) and to be computationally efficient (solution: keep a small set of hypotheses, detect when they fail to explain the data, and expand to a bigger class of hypotheses; see the sketch below). You don’t have to worry about other forms of robustness like adversarial examples.
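As an illustration of that second implication, here is a toy sketch (my own, using coin flips; none of this is from HC) of “keep a small hypothesis class, detect when it fails to explain the data, and only then expand”:

```python
import numpy as np

def log_lik(data, h):
    """Log P(data | h) for a Bernoulli parameter h."""
    data = np.asarray(data, dtype=float)
    return float(np.sum(np.log(np.where(data == 1, h, 1 - h))))

def posterior_mean(data, small_class=(0.4, 0.5, 0.6), expand_threshold=3.0):
    hypotheses = np.array(small_class)
    # Detection: is the best hypothesis in the small class much worse (in log
    # likelihood) than the best possible Bernoulli explanation of the data?
    mle = np.clip(np.mean(data), 1e-3, 1 - 1e-3)
    best_small = max(log_lik(data, h) for h in hypotheses)
    if log_lik(data, mle) - best_small > expand_threshold:
        # The small class fails to explain the data: expand to a bigger class.
        hypotheses = np.linspace(0.01, 0.99, 99)
    # Bayesian update over the (possibly expanded) hypothesis class, uniform prior.
    log_post = np.array([log_lik(data, h) for h in hypotheses])
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return float(np.dot(post, hypotheses))

print(posterior_mean([1, 0] * 10))  # ~0.5: the small class suffices, no expansion
print(posterior_mean([1] * 20))     # ~0.95: the small class fails, so it is expanded
```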
If you’re curious about how I select what goes in the newsletter: I almost put in this critical review of the book, in the spirit of presenting both sides of the argument. I didn’t put it in because I couldn’t understand it.
My best guess right now is that the author is arguing that “we’ll never get superintelligence”, possibly because intelligence isn’t a coherent concept, but there’s probably something more that I’m not getting. If it turned out that it was only saying “we’ll never get superintelligence”, and there weren’t any new supporting arguments, I wouldn’t include it in the newsletter, because we’ve seen and heard that counterargument more than enough.
I enjoyed pages 185-190, on mathematical guarantees, especially because I’ve been confused about what the “provably beneficial” in CHAI’s mission statement is meant to say. Some quotes:
On the other hand, if you want to prove something about the real world—for example, that AI systems designed like so won’t kill you on purpose—your axioms have to be true in the real world. If they aren’t true, you’ve proved something about an imaginary world.
On the applicability of theorems to practice:
The trick is to know how far one can stray from the real world and still obtain useful results. For example, if the rigid-beam assumption allows an engineer to calculate the forces in a structure that includes the beam, and those forces are small enough to bend a real steel beam by only a tiny amount, then the engineer can be reasonably confident that the analysis will transfer from the imaginary world to the real world.
as well as
The process of removing unrealistic assumptions continues until the engineer is fairly confident that the remaining assumptions are true enough in the real world. After that, the engineered system can be tested in the real world; but the test results are just that. They do not prove that the same system will work in other circumstances or that other instances of the system will behave the same way as the original.
It then talks about assumption failure in cryptography due to side-channel attacks.
A somewhat more concrete version of what “provably beneficial” might mean:
Let’s look at the kind of theorem we would like eventually to prove about machines that are beneficial to humans. One type might go something like this:
Suppose a machine has components A, B, C, connected to each other like so and to the environment like so, with internal learning algorithms lA, lB, lC that optimize internal feedback rewards rA, rB, rC defined like so, and [a few more conditions] . . . then, with very high probability, the machine’s behavior will be very close in value (for humans) to the best possible behavior realizable on any machine with the same computational and physical capabilities.
The main point here is that such a theorem should hold regardless of how smart the components become—that is, the vessel never springs a leak and the machine always remains beneficial to humans.
There are three other points worth making about this kind of theorem. First, we cannot try to prove that the machine produces optimal (or even near-optimal) behavior on our behalf, because that’s almost certainly computationally impossible. [...] Second, we say “very high probability . . . very close” because that’s typically the best that can be done with machines that learn. [...] Finally, we are a long way from being able to prove any such theorem for really intelligent machines operating in the real world!
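To get a feel for the shape of such a statement, here is one way to write the schema above in notation of my own (this formalization is mine, not the book’s):

$$\Pr\Big[\,V_{\text{human}}\big(\text{behavior of } M\big) \;\ge\; \max_{M' \in \mathcal{M}_{\text{same capabilities}}} V_{\text{human}}\big(\text{behavior of } M'\big) \;-\; \epsilon\,\Big] \;\ge\; 1 - \delta$$

where $M$ is a machine satisfying the structural conditions (components, learning algorithms, internal rewards), $V_{\text{human}}$ is value for humans, the max ranges over machines with the same computational and physical capabilities, and $\epsilon, \delta$ are small.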
It then goes on to discuss how such a theorem is subject to “side-channel attacks” because such theorems typically assume Cartesian duality, which is not actually true (see Embedded Agency).
Quote from the book on the problem of aligning black box models:
The task is, fortunately, not the following: given a machine that possesses a high degree of intelligence, work out how to control it. If that were the task, we would be toast. A machine viewed as a black box, a fait accompli, might as well have arrived from outer space. And our chances of controlling a superintelligent entity from outer space are roughly zero. Similar arguments apply to methods of creating AI systems that guarantee we won’t understand how they work; these methods include whole-brain emulation — creating souped-up electronic copies of human brains — as well as methods based on simulated evolution of programs. I won’t say more about these proposals because they are so obviously a bad idea.
This is unfortunately the only paragraph that HC devotes to the matter.
Glad to hear it! Yeah, I do expect many people to disagree with many parts of this book. My guess is that it mostly boils down to a difference in predictions about how we build powerful AI systems.
This post makes the point that for Markovian reward functions on observations, since any given observation can correspond to multiple underlying states, we cannot know just by analyzing the reward function whether it actually leads to good behavior: it also depends on the environment. For example, suppose we want an agent to collect all of the blue blocks in a room together. We might simply reward it for having blue in its observations: this might work great if the agent only has the ability to pick up and move blocks, but won’t work well if the agent has a paintbrush and blue paint. This makes the reward designer’s job much more difficult. However, the designer could use techniques that don’t require a reward on individual observations, such as rewards that can depend on the agent’s internal cognition (as in iterated amplification), or rewards that can depend on histories (as in Deep RL from Human Preferences).
I certainly agree that we want to avoid reward functions defined on observations, and this is one reason why. It seems like a more general version of the wireheading argument to me, and applies even if you think that the AI won’t be able to wirehead, as long as it is capable enough to find other plans for getting high reward besides the one the designer intended.
See also Reward Uncertainty.
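Here’s a toy sketch of the blue-blocks example from the summary above (the environments and numbers are my own invention, not from the post):

```python
# The same Markovian reward on observations yields the intended behavior in one
# environment and reward hacking in another, because the observation does not
# pin down the underlying state.

from dataclasses import dataclass

@dataclass
class State:
    blocks_gathered: bool = False
    room_painted_blue: bool = False

def observation(state: State) -> float:
    """What the agent's camera sees: the fraction of blue in view."""
    if state.room_painted_blue:
        return 1.0
    return 0.8 if state.blocks_gathered else 0.2

def reward(obs: float) -> float:
    # Reward defined only on the observation.
    return obs

# Environment 1: the agent can only move blocks around.
env1_actions = {"gather_blocks": State(blocks_gathered=True)}
# Environment 2: the agent also has a paintbrush and blue paint.
env2_actions = {"gather_blocks": State(blocks_gathered=True),
                "paint_everything": State(room_painted_blue=True)}

for name, actions in [("move-only", env1_actions), ("with paint", env2_actions)]:
    best = max(actions, key=lambda a: reward(observation(actions[a])))
    print(f"{name}: optimal action = {best}")
# move-only: optimal action = gather_blocks       (intended behavior)
# with paint: optimal action = paint_everything   (not what the designer wanted)
```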
Clarification: For me, the general worry is something like “if I get quoted, I need to make sure that it’s not misleading (which can happen even if the person quoting me didn’t mean to be misleading), and that takes time and effort and noticing all the places where I’m quoted, and it’s just easier to not say things at all”.
(Other people may have more worries, like “If I say something that could be interpreted as being critical of the organization, and that becomes sufficiently well-publicized, then I might get fired, so I’ll just never say anything like that.”)
I wonder why Professor Russell doesn’t describe his agenda in more technical detail, or engage much with the technical AI safety community, to the extent that even grad students at CHAI apparently do not know much about his approach.
For the sake of explaining this: for quite a while, he’s been engaging with academics and policymakers, and writing a book; it’s not that he’s been doing research and not talking to anyone about it.
FYI, when you quote people who work at an organization saying something that has a negative implication about that organization, you make it less likely that people will say things like that in the future. I’m not saying that you did anything wrong here; I just want to make sure that you know of this effect, and that it does make me in particular more likely to stay silent the next time you ask about CHAI rather than responding.
Concerns about mesa-optimizers are mostly concerns that “capabilities” will be robust to distributional shift while “objectives” will not be robust.