That’s a big open question which we’re still figuring out.
Natural Latents Are Not Robust To Tiny Mixtures
I expect the typical case will look like:
Find some internal signal/latent using whatever random methods someone pulled out of their ass
Check whether it satisfies the naturality conditions (over some choice of variables); a brute-force sketch of such a check appears below
… which is not what this post is about.
The material in this post is useful mainly in cases where we want to be able to rule out any “better” natural latents, which is a somewhat atypical use case. It would be relevant, for instance, if I want to design a toy environment with known natural latents in which to train some system.
(Aside: this is something I updated about relatively recently; I had previously thought of the sort of thing this post is doing as the central use-case.)
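For concreteness, here's what the naturality check could look like in the simplest brute-force case: a candidate latent over two observed variables, with the conditions in their information-theoretic form (mediation: I(X1;X2|Λ) ≈ 0; redundancy: I(Λ;X1|X2) ≈ 0 and I(Λ;X2|X1) ≈ 0). This is a minimal sketch which assumes you can get an explicit joint table over (X1, X2, Λ), which is rarely true in practice; the tolerance is also made up.

```python
import numpy as np

def cond_mutual_info(p, a, b, c):
    """I(A;B|C) in bits, from a 3-axis joint table p; a, b, c say which
    axis plays which role."""
    q = np.transpose(p / p.sum(), (a, b, c))   # reorder axes to (A, B, C)
    pc = q.sum(axis=(0, 1))                    # p(c)
    pac = q.sum(axis=1)                        # p(a, c)
    pbc = q.sum(axis=0)                        # p(b, c)
    total = 0.0
    for i, j, k in np.ndindex(q.shape):
        if q[i, j, k] > 0:
            total += q[i, j, k] * np.log2(q[i, j, k] * pc[k] / (pac[i, k] * pbc[j, k]))
    return total

def looks_natural(p_x1_x2_latent, tol=1e-2):
    """Brute-force check of mediation + redundancy for the latent (axis 2)."""
    mediation = cond_mutual_info(p_x1_x2_latent, 0, 1, 2)     # I(X1;X2|L)
    redundancy = [cond_mutual_info(p_x1_x2_latent, 2, 0, 1),  # I(L;X1|X2)
                  cond_mutual_info(p_x1_x2_latent, 2, 1, 0)]  # I(L;X2|X1)
    return mediation < tol and all(r < tol for r in redundancy)
```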
Calculating Natural Latents via Resampling
I mean, it could be instrumentally rational if for some reason you expect the advice to be true/useful and not just a parasitic meme.
What’s the best reference, if you have a moment?
Eliezer’s List O’Doom probably has a short statement in there somewhere, if you want a quote on his position. Much of his back-and-forth with Quintin is also about rejecting natural abstraction, but I don’t know of a short pithy summary in that corpus. (More generally, it’s pretty clear from my standpoint that there are basically two cruxes between Eliezer and Quintin, because my own models look mostly like Eliezer’s if I flip the natural abstraction bit and mostly like Quintin’s if I flip a particular bit having to do with ease of outer alignment.)
If you want a reference on the natural abstraction hypothesis more generally, I introduced the term in Alignment By Default.
FYI, I thought your shortform here was an unusually excellent summary of cruxes, but I don’t think coherence is the main missing piece which gets Eliezer to 99%+. (Also, I think I understand Eliezer’s models better than the large majority of people on LW, but still definitely not perfectly.)
I think the main “next piece” missing is that Eliezer basically rejects the natural abstraction hypothesis; he expects that powerful AI will reason in internal ontologies thoroughly alien to humans. That makes not just full-blown alignment hard, but even “relatively easy” things like instruction-following hard in the relevant regime.
(Also there are a few other pieces which your shortform didn’t talk about much which are relevant to high-certainty-of-doom, but I expect those were pieces which you intentionally didn’t focus on much—like e.g. near-certainty that there will be many-OOM-equivalent software improvements very rapidly once AI crosses the critical threshold of being able to do AI research.)
No, one of them is the unprimed symbol and the other is the primed version, specifically to avoid that problem. (Possibly you read the post via someplace other than lesswrong which dropped the prime?)
This is going to be a somewhat-scattered summary of my own current understanding. My understanding of this question has evolved over time, and is therefore likely to continue to evolve over time.
Classic Theorems
First, there’s all the classic coherence theorems—think Complete Class or Savage or Dutch books or any of the other arguments you’d find in the Stanford Encyclopedia of Philosophy. The general pattern of these is:
Assume some arguably-intuitively-reasonable properties of an agent’s decisions (think e.g. lack of circular preferences).
Show that these imply that the agent’s decisions maximize some expected utility function.
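As a concrete instance of what “lack of circular preferences” is doing in those assumptions: an agent with a strict preference cycle can be money-pumped. A toy simulation of the classic argument, with the items, fee, and trade sequence all made up for illustration:

```python
# Circular preferences A > B, B > C, C > A: the agent pays a small fee for
# each trade up to something it strictly prefers, and after three trades is
# holding exactly what it started with, minus money. Repeat indefinitely.
FEE = 0.01
STRICTLY_PREFERS = {("A", "B"), ("B", "C"), ("C", "A")}  # (better, worse)

def accepts(holding, offered):
    return (offered, holding) in STRICTLY_PREFERS

holding, money = "A", 10.0
for offered in ["C", "B", "A"] * 5:   # a trader cycling through offers
    if accepts(holding, offered) and money >= FEE:
        holding, money = offered, money - FEE

print(holding, round(money, 2))       # "A" 9.85 -- same goods, less money
```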
I would group objections to this sort of theorem into three broad classes:
1. Argue that some of the arguably-intuitively-reasonable properties are not actually necessary for powerful agents.
2. Be confused about something, and accidentally argue against (a) something which is not really what the theorem says, or (b) something which assumes a particular way of applying the theorem which is not the only way of applying it.
3. Argue that all systems can be modeled as expected utility maximizers (i.e. just pick a utility function which is maximized by whatever the system in fact does) and therefore the theorems don’t say anything useful.
For an old answer to (2.a), see the discussion under my mini-essay comment on Coherent Decisions Imply Consistent Utilities. (We’ll also talk about (2.a) some more below.) Other than that particularly common confusion, there’s a whole variety of other confusions; a few common types include:
Only pay attention to the VNM theorem, which is relatively incomplete as coherence theorems go.
Attempt to rely on some notion of preferences which is not revealed preference.
Lose track of which things the theorems say an agent has utility and/or uncertainty over, i.e. what the inputs to the utility and/or probability functions are.
How To Talk About “Powerful Agents” Directly
While I think EJT’s arguments specifically are not quite right in a few ways, there is an importantly correct claim close to his: none of the classic coherence theorems say “powerful agent → EU maximizer (in a nontrivial sense)”. They instead say “<list of properties which are not obviously implied by powerful agency> → EU maximizer”. In order to even start to make a theorem of the form “powerful agent → EU maximizer (in a nontrivial sense)”, we’d first need a clean intuitively-correct mathematical operationalization of what “powerful agent” even means.
Currently, the best method I know of for making the connection between “powerful agency” and utility maximization is in Utility Maximization = Description Length Minimization. There, the notion of “powerful agency” is tied to optimization, in the sense of pushing the world into a relatively small number of states. That, in turn, is equivalent (the post argues) to expected utility maximization. That said, that approach doesn’t explicitly talk about “an agent” at all; I see it less as a coherence theorem and more as a likely-useful piece of some future coherence theorem.
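The core identity there is short enough to state inline. Given a bounded utility function $u$ over world-states $x$, define the model $q(x) = 2^{u(x)}/Z$ with $Z = \sum_x 2^{u(x)}$; then for any policy/distribution $\pi$ over outcomes:

$$\mathbb{E}_{x \sim \pi}\left[-\log_2 q(x)\right] = \log_2 Z - \mathbb{E}_{x \sim \pi}\left[u(x)\right],$$

so maximizing expected utility is exactly minimizing the expected description length (code length $-\log_2 q$) of the world under the model $q$. (Boundedness of $u$ is needed here so that the normalizer $Z$ is finite.)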
What would the rest of such a future coherence theorem look like? Here’s my current best guess:
We start from the idea of an agent optimizing stuff “far away” in spacetime. Coherence of Caches and Agents hints at why this is necessary: standard coherence constraints are only substantive when the utility/”reward” is not given for the immediate effects of local actions, but rather for some long-term outcome. Intuitively, coherence is inherently substantive for long-range optimizers, not myopic agents.
We invoke the Utility Maximization = Description Length Minimization equivalence to say that optimization of the far-away parts of the world will be equivalent to maximization of some utility function over the far-away parts of the world.
We then use basically similar arguments to Coherence of Caches and Agents, but generalized to operate on spacetime (rather than just states-over-time with no spatial structure) and allow for uncertainty.
Pareto-Optimality/Dominated Strategies
There are various claims along the lines of “agent behaves like <X>, or else it’s executing a pareto-suboptimal/dominated strategy”.
Some of these are very easy to prove; here’s my favorite example. An agent has a fixed utility function and performs pareto-optimally on that utility function across multiple worlds (so “utility in each world” is the set of objectives). Then there’s a normal vector (or family of normal vectors) to the pareto surface at whatever point the agent achieves. (You should draw a picture at this point in order for this to make sense.) That normal vector’s components will all be nonnegative (because pareto surface), and the vector is defined only up to normalization, so we can interpret that normal vector as a probability distribution. That also makes sense intuitively: larger components of that vector (i.e. higher probabilities) indicate that the agent is “optimizing relatively harder” for utility in those worlds. This says nothing at all about how the agent will update, and we’d need another couple of sentences to argue that the agent maximizes expected utility under the distribution, but it does give the prototypical mental picture behind the “pareto-optimal → probabilities” idea.
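A tiny numerical version of that picture, with the frontier and the agent’s point invented for the example:

```python
import numpy as np

# Illustrative pareto frontier over utilities (u1, u2) in two worlds:
# the quarter-circle u1^2 + u2^2 = 1 with u1, u2 >= 0.
agent_point = np.array([0.8, 0.6])   # a pareto-optimal point the agent achieves

normal = 2 * agent_point             # gradient of g(u) = u1^2 + u2^2, i.e. the
                                     # outward normal to the frontier at that point
probs = normal / normal.sum()        # nonnegative, defined up to scale -> normalize

print(probs)   # [0.571... 0.428...]: "optimizing harder" for world 1
```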
The most fundamental and general problem with pareto-optimality-based claims is that “pareto-suboptimal” implies that we already had a set of quantitative objectives in mind (or in some cases a “measuring stick of utility”, like e.g. money). But then some people will say “ok, but what if a powerful agent just isn’t pareto-optimal with respect to any resources at all, for instance because it just produces craptons of resources and then uses them inefficiently?”.
(Aside: “‘pareto-suboptimal’ implies we already had a set of quantitative objectives in mind” is also usually the answer to claims that all systems can be represented as expected utility maximizers. Sure, any system can be represented as an expected utility maximizer which is pareto-optimal with respect to some made-up objectives/resources which we picked specifically for this system. That does not mean all systems are pareto-optimal with respect to money, or energy, or other resources which we actually care about. Or, if using Utility Maximization = Description Length Minimization to ground out the quantitative objectives: not all systems are pareto-optimal with respect to optimization of some stuff far away in the world. That’s where the nontrivial content of most coherence theorems comes from: the quantitative objectives with respect to which the agent is pareto-optimal need to be things we care about for some reason.)
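To see that trivial-representation move spelled out (the “rock” policy here is invented for illustration):

```python
# For *any* input->action mapping, construct a "utility function" which that
# mapping maximizes: score 1 for doing exactly what the system does, else 0.
def trivial_utility(policy):
    return lambda observation, action: 1.0 if action == policy(observation) else 0.0

rock = lambda obs: "do nothing"      # even a rock "maximizes utility" this way
u = trivial_utility(rock)
assert u("storm", rock("storm")) == 1.0
```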
Approximate Coherence
What if a powerful agent just isn’t pareto-optimal with respect to any resources or far-away optimization targets at all? Or: even if you do expect powerful agents to be pareto-optimal in some sense, presumably they will be approximately pareto-optimal, not exactly pareto-optimal. What can we say about coherence then?
To date, I know of no theorems saying anything at all about approximate coherence. That said, this looks like more a case of “nobody’s done the legwork yet” rather than “people tried and failed”. It’s on my todo list.
My guess is that there’s a way to come at the problem with a thermodynamics-esque flavor, which would yield global bounds, for instance of roughly the form “in order for the system to apply n bits of optimization more than it could achieve with outputs independent of its inputs, it must observe at least m bits and approximate coherence to within m-n bits” (though to be clear I don’t yet know the right ways to operationalize all the parts of that sentence). The simplest version of a theorem of that form doesn’t work, but David and I have played with some variations and have some promising threads.
The “dead” part is a value judgement, right?
No, “dead transposons” meaning that they’ve mutated in some way which makes them no longer functional transposons, i.e. they can no longer copy themselves back into the genome (often due to e.g. another transposon copying into the middle of the first transposon sequence).
If you’re going to link Why Subagents?, you should probably also link Why Not Subagents?.
> I know how awful this sounds to many of the people reading this, including the person I am replying to...
I actually find this kind of thinking quite useful. I mean, the particular policies proposed are probably pareto-suboptimal, but there’s a sound method in which we first ask “what policies would buy a lot more time?”, allowing for pretty bad policies as a first pass, and then think through how to achieve the same subgoals in more palatable ways.
Yeah, admittedly health is kind of a borderline case where it’s technically factual but in practice mostly operates as a standard value-claim because of low entanglement and high reason to care.
I basically agree with your claim that the heuristic is approximating (reason to care) + (low entanglement).
Value Claims (In Particular) Are Usually Bullshit
A thing I am confused about: what is the medium-to-long-term actual policy outcome you’re aiming for? And what is the hopeful outcome which that policy unlocks?
You say “implement international AI compute governance frameworks and controls sufficient for halting the development of any dangerous AI development activity, and streamlined functional processes for doing so”. The picture that brings to my mind is something like:
Track all compute centers large enough for very high-flop training runs
Put access controls in place for such high-flop runs
A prototypical “AI pause” policy in this vein would be something like “no new training runs larger than the previous largest run”.
Now, the obvious-to-me shortcoming of that approach is that algorithmic improvement is moving at least as fast as scaling, a fact which I doubt Eliezer or Nate have overlooked. Insofar as that algorithmic improvement is itself compute-dependent, it’s mostly dependent on small test runs rather than big training runs, so a pause-style policy would slow down the algorithmic component of AI progress basically not-at-all. So whatever your timelines look like, even a full pause on training runs larger than the current record should less than double our time.
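Spelling out that “less than double” claim with a toy model (the multiplicative-progress assumption and all the numbers are made up for illustration):

```python
# Toy model: capability = (compute scaling) * (algorithmic efficiency),
# each growing exponentially. A pause freezes the scaling factor only.
r_scale, r_algo = 1.0, 1.0            # growth rates; "algorithms at least as
                                      # fast as scaling" means r_algo >= r_scale
gap = 10.0                            # log-capability gap to the critical threshold

years_without_pause = gap / (r_scale + r_algo)   # 5.0
years_with_pause = gap / r_algo                  # 10.0 -- at most 2x, since
                                                 # the ratio is 1 + r_scale/r_algo <= 2
print(years_without_pause, years_with_pause)
```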
… and that still makes implementation of a pause-style policy a very worthwhile thing for a lot of people to work on, but I’m somewhat confused that Eliezer and Nate specifically currently see that as their best option? Where is the hope here? What are they hoping happens with twice as much time, which would not happen with one times as much time? Or is there some other policy target (including e.g. “someone else figures out a better policy”) which would somehow buy a lot more time?
Should be fixed now, thanks for flagging.
Should be fixed now, thanks for the heads-up. The same problem also broke images on a bunch of my other old posts; please do leave comments if you find more which I haven’t fixed yet.
When Are Circular Definitions A Problem?
Yeah, that’s one of the main things which the “causal models as programs” thing is meant to capture, especially in conjunction with message passing and caching. The whole thing is still behaviorally one big model insofar as the cache is coherent, but the implementation is a bunch of little sparsely-interacting submodel-instances.
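A minimal sketch of the flavor of implementation I have in mind (the toy submodels and the cache policy are invented for illustration): each submodel is a little program, calls between submodels go through a cache, and so long as the cache is coherent the whole thing behaves like one big model even though only the sparsely-needed submodel-instances ever run.

```python
from functools import lru_cache

@lru_cache(maxsize=None)              # shared caching: each submodel-instance
def weather(city: str) -> str:        # runs at most once per set of arguments
    return "rain" if len(city) % 2 else "sun"   # stand-in for a real submodel

@lru_cache(maxsize=None)
def traffic(city: str) -> str:        # "message passing": queries the weather submodel
    return "heavy" if weather(city) == "rain" else "light"

@lru_cache(maxsize=None)
def commute_minutes(city: str) -> int:
    return 60 if traffic(city) == "heavy" else 30

# Behaviorally one big model: answers stay mutually consistent so long as the
# cache is coherent, but only the submodels a query touches are instantiated.
print(commute_minutes("Berkeley"), weather("Berkeley"))
```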
We actually started from that counterexample, and the tiny mixtures example grew out of it.