That sounds to me like PP, or at least PP as it exists, is something that’s compatible with implementing different decision theories, rather than one that implies a specific decision theory by itself.
I generally agree with this. Specifically, I tend to imagine that PP is trying to make our behavior match a model in which we behave like an agent (at least sometimes). Thus, for instance, the tendency for humans to do things which “look like” or “feel like” optimizing for X without actually optimizing for X.
In that case, PP would be consistent with many decision theories, depending on the decision theory used by the model it’s trying to match.
This was a solid explanation, thanks.
Some differences from what I imagine...
First and foremost, I imagine that the notion of “success” on which the agent conditions is not just a direct translation of “winning” in the decision problem. After all, a lot of the substance of tricky decision theory problems is exactly in that “direct” translation of what-it-means-to-win! Instead, I imagine that the notion of “success” has a lot more supporting infrastructure built into it, and the agent’s actions can directly interact with the supporting infrastructure as well as the nominal goal itself.
A prototypical example here would be an abstraction-based decision theory. There, the notion of “success” would not be “system achieves the maximum amount of utility”, but rather “system abstracts into a utility-maximizing agent”. The system’s “choices” will be used both to maximize utility and to make sure the abstraction holds. The “supporting infrastructure” part—i.e. making sure the abstraction holds—is what would handle things like e.g. acting as though the agent is deciding for simulations of itself (see the link for more explanation of that).
More generally, two other notions of “success” which we could imagine:
“success” means “our model of the territory is accurate, and our modelled-choices maximize our modelled-utility” (though this allows some degrees of freedom in how the model handles counterfactuals)
“success” means “the physical process which output our choices is equivalent to program X” (where X itself would optimize for this notion of success, and probably some other conditions as well; the point here is to check that the computation is not corrupted)
(These are not mutually exclusive.) In both cases, the agent’s decisions would be used to support its internal infrastructure (accurate models, uncorrupted computation) as well as the actual utility-maximization.
Having written that all out, it seems like it might be orthogonal to predictive processing. I had been thinking of these “success” notions more as part-of-the-world-model, mainly because the “success” notions are largely about parts of the world abstracting into specific things (models, program execution, agents). In that context, it made sense to view “enforcing the infrastructure” as part of “making the model and the territory match”. But if abstraction-enforcement is built into the utility function, rather than the model, then it looks less predictive-processing-specific.
Do you think that the argument motivating these examples is invalid?
Do you think that the argument motivating these examples is invalid?
Yes, because it skips over the most important part: what it means to “give an AI a goal”. For example, perhaps we give the AI positive reward every time it solves a maths problem, but it never has a chance to seize more resources during training—all it’s able to do is think about them. Have we “given it” the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them? The former I’d call large-scale, the latter I wouldn’t.
I think I’ll concede that “large-scale” is maybe a bad word for the concept I’m trying to point to, because it’s not just a property of the goal, it’s a property of how the agent thinks about the goal too. But the idea I want to put forward is something like: if I have the goal of putting a cup on a table, there’s a lot of implicit context around which table I’m thinking about, which cup I’m thinking about, and what types of actions I’m thinking about. If for some reason I need to solve world peace in order to put the cup on the table, I won’t adopt solving world peace as an instrumental goal, I’ll just shrug and say “never mind then, I’ve hit a crazy edge case”. I don’t think that’s because I have safe values. Rather, this is just how thinking works—concepts are contextual, and it’s clear when the context has dramatically shifted.
So I guess I’m kind of thinking of large-scale goals as goals that have a mental “ignore context” tag attached. And these are certainly possible, some humans have them. But it’s also possible to have exactly the same goal, but only defined within “reasonable” boundaries—and given the techniques we’ll be using to train AGIs, I’m pretty uncertain which one will happen by default. Seems like, when we’re talking about tasks like “manage this specific factory” or “solve this specific maths problem”, the latter is more natural.
I notice I am surprised you write
However, the link from instrumentally convergent goals to dangerous influence-seeking is only applicable to agents which have final goals large-scale enough to benefit from these instrumental goals
and do not address the “Riemann disaster” or “Paperclip maximizer” examples:
Riemann hypothesis catastrophe. An AI, given the final goal of evaluating the Riemann hypothesis, pursues this goal by transforming the Solar System into “computronium” (physical resources arranged in a way that is optimized for computation)— including the atoms in the bodies of whomever once cared about the answer.
Paperclip AI. An AI, designed to manage production in a factory, is given the final goal of maximizing the manufacture of paperclips, and proceeds by converting first the Earth and then increasingly large chunks of the observable universe into paperclips.
Do you disagree with the claim that even systems with very modest and specific goals will have incentives to seek influence to perform their tasks better?
One thing I like about this series is that it puts all of this online in a fairly condensed form; I often find I’m not quite sure what to link to in order to present these kinds of arguments. That you do it better than perhaps we have done in the past makes it all the better!
I read you to be asking “what decision theory is implied by predictive processing as it’s implemented in human brains”. It’s my understanding that while there are attempts to derive something like a “decision theory formulated entirely in PP terms”, there are also serious arguments for the brain actually having systems that are just conventional decision theories and not directly derivable from PP.
Let’s say you try, as some PP theorists do, to explain all behavior as free energy minimization as opposed to expected utility maximization. Ransom et al. (2020) (current sci-hub) note that this makes it hard to explain cases where the mind acts according to a prediction that has a low probability of being true, but a high cost if it were true.
For example, the sound of rustling grass might be indicative either of the wind or of a lion; if wind is more likely, then predictive processing says that wind should become the predominant prediction. But for your own safety it can be better to predict that it’s a lion, just in case. “Predict a lion” is also what standard Bayesian decision theory would recommend, and it seems like the correct solution… but to get that correct solution, you need to import Bayesian decision theory as an extra ingredient, it doesn’t fall naturally out of the predictive processing framework.
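To make the lion/wind tradeoff concrete, here is the standard Bayesian decision-theory calculation as a toy sketch (the probabilities and losses are invented for illustration):

```python
# Toy Bayesian decision theory on the rustling-grass example.
# All numbers are invented.
p_lion = 0.05          # wind is far more likely than a lion
p_wind = 1 - p_lion

# Losses for each (action, state) pair: fleeing costs a little
# effort either way; ignoring an actual lion is catastrophic.
loss = {
    ("flee",   "lion"): 1,
    ("flee",   "wind"): 1,
    ("ignore", "lion"): 1000,
    ("ignore", "wind"): 0,
}

def expected_loss(action):
    return p_lion * loss[(action, "lion")] + p_wind * loss[(action, "wind")]

best = min(["flee", "ignore"], key=expected_loss)
# Even though wind is 19x more likely, the expected loss of ignoring
# (0.05 * 1000 = 50) dwarfs the cost of fleeing (1), so we act as if lion.
```

The point is that the asymmetric losses, not the raw probabilities, drive the choice; predictive processing alone only tracks the probabilities.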
It’s obvious that you intend this as requiring research, including making good conceptual choices, rather than having a fixed answer. However, I’m going to speak from my current understanding of predictive processing.
I’m quite interested in your (John’s) take on how the following differs from what you had in mind.
I believe there are several possible answers based on different ways of using predictive-processing-associated ideas.
A. Soft-max decision-making.
One thing I’ve seen in a presentation on this stuff is the claim of a close connection between probability and utility, namely u=log(p).
This relates to a very common approximate model of bounded rationality: you introduce some randomness, but make worse mistakes less probable, by making actions exponentially more probable as their utility goes up. The level of rationality can be controlled by a “temperature” parameter—higher temperature means more randomness, lower temperature means closer to just always taking the max.
The u=log(p) idea takes that “approximation” as definitional; action probabilities are revealed preferences, from which we can find utilities by taking logarithms.
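A minimal sketch of the soft-max model and the u=log(p) reading, with invented utilities and temperatures:

```python
import math

def softmax_policy(utilities, temperature=1.0):
    """Action probabilities exponential in utility: p(a) ∝ exp(u(a)/T)."""
    weights = {a: math.exp(u / temperature) for a, u in utilities.items()}
    z = sum(weights.values())
    return {a: w / z for a, w in weights.items()}

utilities = {"a": 2.0, "b": 1.0, "c": 0.0}  # invented utilities

noisy = softmax_policy(utilities, temperature=10.0)  # close to uniform
sharp = softmax_policy(utilities, temperature=0.1)   # close to pure argmax

# The u = log(p) reading runs in reverse: treat the observed action
# probabilities as revealed preferences and recover the utilities (up to
# a shared additive constant, at temperature 1) by taking logarithms.
probs = softmax_policy(utilities, temperature=1.0)
recovered = {a: math.log(p) for a, p in probs.items()}
```

Every recovered value differs from the true utility by the same constant (the log of the normalizer), which is why revealed preferences only pin utilities down up to that constant.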
The randomness can be interpreted as exploration. I don’t personally see that interpretation as very good, since this form of randomness does not vary based on model uncertainty, but there may be justifications I’m not aware of.
The stronger attempt to justify the randomness, in my book, is based on Monte Carlo inference. However, that’s better discussed under the next heading.
B. Sampling from wishful thinking.
If you were to construct an agent by the formula from option (A), you would first define the agent’s beliefs and desires in the usual Bayesian way. You’d then calculate expected utilities for events in the normal way. You only depart from standard Bayesian decision-making at the last step, where you randomize rather than just taking the best action.
The implicit promise of the u=log(p) formula is to provide a deeper unification of belief and value than that, and correspondingly, a deeper restructuring of decision theory.
One commonly discussed proposal is as follows: condition on success, then sample from the resulting distribution on actions. (You don’t necessarily have a binary notion of “success” if you attach real-valued utilities to the various outcomes, but there is a generalization where we condition on “utility being high” without exactly specifying how high it is. This will involve the same “temperature” parameter mentioned earlier.)
The technical name for this idea is “planning by inference”, because we can use algorithms for Monte Carlo inference to sample actions. We’re using inference algorithms to plan! That’s a useful unification of utility and probability: machinery previously used for one purpose, is now used for both.
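As a toy illustration (the two actions and their success probabilities are invented), rejection sampling, the simplest Monte Carlo inference algorithm, already yields a “planner”: sample an action from the prior, simulate an outcome, and keep only the samples where we succeed. The kept samples are draws from P(action | success).

```python
import random

random.seed(0)  # deterministic for the example

prior_actions = ["left", "right"]         # uniform prior over actions
p_success = {"left": 0.2, "right": 0.8}   # invented toy environment

def sample_action_given_success():
    # Rejection sampling: propose from the prior, simulate an outcome,
    # and keep the action only if the simulated outcome is a success.
    while True:
        action = random.choice(prior_actions)
        if random.random() < p_success[action]:
            return action

samples = [sample_action_given_success() for _ in range(10_000)]
frac_right = samples.count("right") / len(samples)
# frac_right is close to 0.8/(0.2 + 0.8) = 0.8: better actions come out
# more often, so sampling from the inference engine behaves like a planner.
```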
It also kinda captures the intuition you mentioned, about restricting our world-model to assume some stuff we want to be true:
Abstracting out the key idea: we pack all of the complicated stuff into our world-model, hardcode some things into our world-model which we want to be true, then generally try to make the model match reality.
However, planning-by-inference can cause us to take some pretty dumb-looking actions.
For example, let’s say that we need $200 for rent money. For simplicity, we have binary success/failure: either we get the money we need, or not. We have $25 which we can use to gamble, for a 1/16th chance of making the $200 we need. Alternately, we happen to know tomorrow’s winning lotto numbers, which we can enter in for a 100% chance of getting the money we need.
However, taking random actions, let’s say there is only a 1/million chance of entering the winning lotto numbers.
Conditioning on our success, it’s much more probable that we gamble with our $25 and get the money we need that way.
So planning-by-inference is heavily biased toward plans of action which are not too improbable in the prior before conditioning on success.
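The rent example can be checked with a few lines of arithmetic (the prior probability of gambling is invented; the 1/16 and one-in-a-million figures are from the example above):

```python
# Planning-by-inference picks actions in proportion to
# P(action | success) ∝ P(action) * P(success | action).
# Prior over random actions (other actions omitted); invented numbers.
prior = {"gamble": 0.5, "right_lotto_numbers": 1e-6}
p_success = {"gamble": 1 / 16, "right_lotto_numbers": 1.0}

joint = {a: prior[a] * p_success[a] for a in prior}
z = sum(joint.values())
posterior = {a: joint[a] / z for a in joint}
# The sure-thing lotto plan is so improbable under the prior that the
# posterior overwhelmingly favors the risky gamble.
```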
On the other hand, the temperature parameter can help us out here. Adjusting the temperature looks kind of like “conditioning on success multiple times”—i.e., it’s as if you took the new distribution on actions as the prior, and then conditioned again to further bias things in the direction of success.
This has a somewhat nice justification in terms of Monte Carlo algorithms. For some algorithms, this “temperature” ends up being an indication of how long you took to think. There’s a bias toward actions with high prior probabilities because, effectively, that’s where you look first when planning (due to the randomness of the search).
This sounds like a nice account of bounded rationality: the randomness in the u=log(p) model is due to the boundedness of our search, and the fact that we may or may not find the good solutions in that time.
Except for one major problem: this kind of random search isn’t what humans, or AIs, do in general. Even within the realm of Monte Carlo algorithms, there are a lot of optimizations one can add which would destroy the u=log(p) relationship. I don’t currently know of any reason to suppose that there’s some nice generalization which holds for computationally efficient minds.
So ultimately, I would say that there is a sorta nice theory of bounded rationality here, but not a very nice one.
Except… I actually know a way to address the concern about bias toward a priori actions, while sticking to the planning-by-inference picture, and also using an arguably much better theory of bounded rationality.
C. Logical Induction Decision Theory
As Scott discussed in a recent talk, if you try the planning-by-inference trick with a logical inductor as your inferencer, you maximize expected utility anyway:
This algorithm predicts what it did conditional on having won, and then copies that distribution. It just says, “output whatever I predict that I output conditioned on my having won”.
But it turns out that you do reach the same endpoint, because the only fixed point of this process is going to do the same as the last algorithm’s. So this algorithm turns out to be functionally the same as the previous one.
One way of understanding what’s happening is this: in the planning-by-inference picture, we start with a prior, and condition on success, then sample actions. This creates a bias toward a priori probable actions, which can result in the irrational behavior I mentioned earlier.
In the context of logical induction, however, we additionally stipulate that the a priori distribution on actions and the updated distribution must match. This has the effect of “updating on success an infinite number of times” (in the sense that I mentioned earlier, where lowering the temperature is kind of like “updating on success again”).
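Here is a crude numerical illustration of that fixed point, with no logical inductor involved: just iterate “condition on success, then adopt the result as the new prior over actions” until it stabilizes (the win probabilities and prior are invented):

```python
def condition_on_success(dist, p_win):
    # Bayesian update: P(a | win) ∝ P(a) * P(win | a).
    joint = {a: dist[a] * p_win[a] for a in dist}
    z = sum(joint.values())
    return {a: j / z for a, j in joint.items()}

p_win = {"a": 0.6, "b": 0.5, "c": 0.1}    # invented win probabilities
dist = {"a": 0.01, "b": 0.01, "c": 0.98}  # prior heavily favors a bad action

# Demanding that the prior on actions match the success-conditioned
# distribution means iterating the update until it reaches a fixed point.
for _ in range(200):
    dist = condition_on_success(dist, p_win)

# The stable fixed point puts essentially all mass on the action with the
# highest win probability, recovering ordinary expected-utility behavior.
```

This is only a caricature of the logical-induction result, but it shows why “updating on success an infinite number of times” washes out the prior bias from the earlier rent example.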
Furthermore, unlike the Monte Carlo algorithms mentioned earlier, logical induction is a theoretically very well-founded theory of bounded rationality. Not so bounded that you’d want to run it on an actual computer, granted. But at least it addresses the question of what kind of optimality we can enforce on bounded reasoning, rather than just positing a particular kind of computation as the answer.
Since this is equivalent to regular expected utility maximization with logical inductors, there’s no reason to use planning-by-inference, but there’s also no reason not to.
So, what kind of decision theory does this get us?
Cooperate in Prisoner’s Dilemma with agents whose pseudorandom moves exactly match, or sufficiently correlate with, our own. Defect against agents with uncorrelated pseudorandom exploration sequences (even if they otherwise have “the same mental architecture”). So cooperation is pretty difficult.
One-box in Newcomb with a perfect predictor. Two-box if the predictor is imperfect. This holds even if the predictor is extremely accurate (say 99.9% accurate), so long as the agent knows more about its own move than the predictor—the only way the agent will one-box is if the predictor’s prediction contains information about the agent’s own action which the agent does not possess at the time of choosing.
Fail transparent Newcomb.
Fail counterfactual mugging.
Fail Parfit’s Hitchhiker.
Fail at agent-simulates-predictor.
sometimes there would be multiple possible self-consistent models
I’m not sure what you’re getting at here; you may have a different conception of predictive-processing-like decision theory than I do. I would say “I will get up and go to the store” is a self-consistent model, “I will sit down and read the news” is a self-consistent model, etc. etc. There are always multiple possible self-consistent models—at least one for each possible action that you will take.
Oh, maybe you’re taking the perspective where if you’re hungry you put a high prior on “I will eat soon”. Yeah, I just don’t think that’s right, or if there’s a sensible way to think about it, I haven’t managed to get it despite some effort. I think if you’re hungry, you want to eat because it leads to a predicted reward, not because you have a prior expectation that you will eat. After all, if you’re stuck on a lifeboat in the middle of the ocean, you’re hungry but you don’t expect to eat. This is an obvious point, frequently brought up, and Friston & colleagues maintain that it’s not a problem for their theory, but I can’t make heads or tails of what their counterargument is. I discussed my version (where rewards are also involved) here, and then here I went into more depth for a specific example.
It’s obvious but worth saying anyway that pretty much all the decision theory scenarios that people talk about, like Newcomb’s problem, are scenarios where people find themselves unsure what to do, and disagree with each other. Therefore the human brain doesn’t give straight answers—or if it does, the answers are not to be found at the “base algorithm” level, but rather the “learned model” level (which can involve metacognition).
One point I personally put a lot of weight on: while people are unsure/disagree about particular scenarios, people do mostly seem to agree on what the relevant arguments are, or what the main “options” are for how to think about particular scenarios. That suggests that we do share a common underlying decision-making algorithm, but that algorithm itself sometimes produces uncertain answers.
In particular, for a predictive-processing-like decision theory, it makes sense that sometimes there would be multiple possible self-consistent models. In those cases, we should expect humans to be unsure/disagree, but we’d still expect people to agree on what the relevant arguments/options are—i.e. the possible models.
Academic philosophers sometimes talk about how beliefs have a mind-to-world direction of fit whereas desires have a world-to-mind direction of fit. Perhaps they even define the distinction that way, I don’t remember.
A quick google search didn’t turn up anything interesting but I think there might be some interesting papers in there if you actually looked. Not sure though.
Similarly, in decision theory literature there is this claim that “deliberation screens off prediction.” That seems relevant somehow. If it’s true it might be true for reasons unrelated to predictive processing, but I suspect there is a connection...
My take on predictive processing is a bit different than the textbooks, and in terms of decision theories, it doesn’t wind up radically different from logical inductor decision theory, which Scott talked about in 2017 here, and a bit more here. Or at least, take logical inductor decision theory, make everything about it kinda more qualitative, and subtract the beautiful theoretical guarantees etc.
It’s obvious but worth saying anyway that pretty much all the decision theory scenarios that people talk about, like Newcomb’s problem, are scenarios where people find themselves unsure what to do, and disagree with each other. Therefore the human brain doesn’t give straight answers—or if it does, the answers are not to be found at the “base algorithm” level, but rather the “learned model” level (which can involve metacognition). Or I guess it’s possible that the base-algorithm-default and the learned models are pushing in different directions.
Scott’s 2017 post gives two problems with this decision theory. In my view humans absolutely suffer from both. Like, my friend always buys the more expensive brand of cereal because he’s concerned that he wouldn’t like the less expensive brand. But he’s never tried it! The parallel to the 5-and-10 problem is obvious, right?
The problem about whether to change the map, territory, or both is something I discussed a bit here. Wishful thinking is a key problem—and just looking at the algorithm as I understand it, it’s amazing that humans don’t have even more wishful thinking than we do. I think wishful thinking is kept mostly under control in a couple of ways: (1) self-supervised learning effectively gets a veto over what we can imagine happening, by and large preventing highly implausible future scenarios from even entering consideration in the Model Predictive Control competition; (2) the reward-learning part of the algorithm is restricted to the frontal lobe (home of planning and motor action), not the other lobes (home of sensory processing). (Anatomically, the other lobes have no direct connection to the basal ganglia.) This presumably keeps some healthy separation between understanding sensory inputs and “what you want to see”. (I didn’t mention that in my post because I only learned about it more recently; maybe I should go back and edit, it’s a pretty neat trick.) (3) Actually, wishful thinking is wildly out of control in certain domains, like post hoc rationalization. (At least, the ground-level algorithm doesn’t do anything to keep it under control. At the learned-model level, it can be kept under control by learned metacognitive memes, e.g. by Reading The Sequences.)
The embedded agency sequence says somewhere that there are still mysteries in human decisionmaking, but (at some risk of my sounding arrogant) I’m not convinced. Everything people do that I can think of, seems to fit together pretty well into the same algorithmic story. I’m very open to discussion about that. Of course, insofar as human decisionmaking has room for improvement, it’s worth continuing to think through these issues. Maybe there’s a better option that we can use for our AGIs.
Or if not, I guess we can build our human-brain-like AGIs and tell them to Read The Sequences to install a bunch of metacognitive memes in themselves that patch the various problems in their own cognitive algorithms. :-P (Actually, I wrote that as a joke but maybe it’s a viable approach...??)
There is a formal analogy between infra-Bayesian decision theory (IBDT) and modal updateless decision theory (MUDT).
Consider a one-shot decision theory setting. There is a set of unobservable states $S$, a set of actions $A$, and a reward function $r : A \times S \to [0,1]$. An IBDT agent has some belief $\beta \in \square S$, and it chooses the action $a^* := \arg\max_{a \in A} E_\beta[\lambda s.\ r(a,s)]$.
We can construct an equivalent scenario by augmenting this one with a perfect predictor of the agent (Omega). To do so, define $S' := A \times S$, where the semantics of $(p,s)$ is “the unobservable state is $s$ and Omega predicts the agent will take action $p$”. We then define $r' : A \times S' \to [0,1]$ by $r'(a,p,s) := 1_{a=p}\, r(a,s) + 1_{a \neq p}$, and $\beta' \in \square S'$ by $E_{\beta'}[f] := \min_{p \in A} E_\beta[\lambda s.\ f(p,s)]$ ($\beta'$ is what we call the pullback of $\beta$ to $S'$, i.e. we have utter Knightian uncertainty about Omega). This is essentially the usual Nirvana construction.
The new setup produces the same optimal action as before. However, we can now give an alternative description of the decision rule.
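As a sanity check of that equivalence, here is a small numerical example restricted to crisp infradistributions (the states, actions, rewards, and belief set are all invented):

```python
# Sanity check: the Nirvana construction preserves the optimal action.
# Crisp IBDT: the belief is a set of distributions over S, and the agent
# maximizes worst-case expected reward. All numbers are invented.
S = ["s1", "s2"]
A = ["a1", "a2"]
r = {("a1", "s1"): 0.9, ("a1", "s2"): 0.1,
     ("a2", "s1"): 0.4, ("a2", "s2"): 0.5}
belief = [{"s1": 0.8, "s2": 0.2}, {"s1": 0.3, "s2": 0.7}]

def ib_value(a):
    # Worst-case expected reward of action a over the belief set.
    return min(sum(mu[s] * r[(a, s)] for s in S) for mu in belief)

def r_prime(a, p, s):
    # Mismatched prediction (a != p) yields maximal reward 1 ("Nirvana").
    return r[(a, s)] if a == p else 1.0

def nirvana_value(a):
    # Pullback belief: utter Knightian uncertainty over Omega's prediction p.
    return min(
        min(sum(mu[s] * r_prime(a, p, s) for s in S) for mu in belief)
        for p in A
    )

# The Nirvana branch never lowers the minimum, so the values coincide.
assert all(abs(ib_value(a) - nirvana_value(a)) < 1e-12 for a in A)
best = max(A, key=ib_value)   # worst-case values: a1 -> 0.34, a2 -> 0.42
```

The worst case over Omega’s prediction is always attained at $p = a$ (any mismatch pays the maximal reward 1), which is why the two value functions agree action by action.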
For any $p \in A$, define $\Omega_p \in \square S'$ by $E_{\Omega_p}[f] := \min_{s \in S} f(p,s)$. That is, $\Omega_p$ is an infra-Bayesian representation of the belief “Omega will make prediction $p$”. For any $u \in [0,1]$, define $R_u \in \square S'$ by $E_{R_u}[f] := \min_{\mu \in \Delta S' :\ E_\mu[r(p,s)] \geq u} E_\mu[f(p,s)]$. $R_u$ can be interpreted as the belief “assuming Omega is accurate, the expected reward will be at least $u$”.
We will also need to use the order $\preceq$ on $\square X$ defined by: $\phi \preceq \psi$ when $\forall f \in [0,1]^X : E_\phi[f] \geq E_\psi[f]$. The reversal is needed to make the analogy to logic intuitive. Indeed, $\phi \preceq \psi$ can be interpreted as “$\phi$ implies $\psi$”, the meet operator $\wedge$ can be interpreted as logical conjunction, and the join operator $\vee$ can be interpreted as logical disjunction.
(Actually I only checked this when we restrict to crisp infradistributions, in which case $\wedge$ is intersection of sets and $\preceq$ is set containment, but it’s probably true in general.)
Now, $\beta' \wedge \Omega_a \preceq R_u$ can be interpreted as “the conjunction of the beliefs $\beta'$ and $\Omega_a$ implies $R_u$”. Roughly speaking: according to $\beta'$, if the predicted action is $a$ then the expected reward is at least $u$. So, our decision rule says: choose the action that maximizes the value $u$ for which this logical implication holds (but “holds” is better thought of as “is provable”, since we’re talking about the agent’s belief). Which is exactly the decision rule of MUDT!
Apologies for the potential confusion between □ as “space of infradistributions” and the □ of modal logic (not used in this post).
Technically it’s better to think of it as “ψ is true in the context of ϕ”, since it’s not another infradistribution, so it’s not a genuine implication operator.
Master post for ideas about infra-Bayesianism.
You could try to infer human values from the “sideload” using my “Conjecture 5” about the AIT definition of goal-directed intelligence. However, since it’s not an upload and, like you said, it can go off-distribution, that doesn’t seem very safe. More generally, alignment protocols should never be open-loop.
I’m also skeptical about IDA, for reasons not specific to your question (in particular, this), but making it open-loop is worse.
Gurkenglas’ answer seems to me like something that can work, if we can somehow be sure the sideload doesn’t become superintelligent, for example, given an imitation plateau.
It sounds like you want to use it as a component for alignment of a larger AI, which would somehow turn its natural-language directives into action. I say use it as the capability core: Ask it to do armchair alignment research. If we give it subjective time, a command line interface and internet access, I see no reason it would do worse than the rest of us.
Promoted to curated! I held off on curating this post for a while, first because it’s long and it took me a while to read through it, and second because we already had a lot of AI Alignment posts in the curation pipeline, and I wanted to make sure we have some diversity in our curation decisions. But overall, I really liked this post, and also want to mirror Rohin’s comment in that I found this version more useful than the version where you got everything right, because this way I got to see the contrast between your interpretation and Paul’s responses, which feels like it helped me locate the right hypothesis more effectively than either would have on its own (even if more fleshed out).
Another thing that might happen is a data bottleneck.
Maybe there will be a good enough dataset to produce a sideload that simulates an “average” person, and that will be enough to automate many jobs, but for a simulation of a competent AI researcher you would need a more specialized dataset that will take more time to produce (since there are far fewer competent AI researchers than people in general).
Moreover, it might be that the sample complexity grows with the duration of coherent thought that you require. That’s because, unless you’re training directly on brain inputs/outputs, non-realizable (computationally complex) environment influences contaminate the data, and in order to converge you need to have enough data to average them out, which scales with the length of your “episodes”. Indeed, all convergence results for Bayesian algorithms we have in the non-realizable setting require ergodicity, and therefore the time of convergence (= sample complexity) scales with mixing time, which in our case is determined by episode length.
In such a case, we might discover that many tasks can be automated by sideloads with short coherence times, but AI research might require substantially longer coherence times. And simulating progress requires, by design, going off-distribution along certain dimensions, which might make things worse.
I agree. But GPT-3 seems to me like a good estimate for how much compute it takes to run stream of consciousness imitation learning sideloads (assuming that learning is done in batches on datasets carefully prepared by non-learning sideloads, so the cost of learning is less important). And with that estimate we already have enough compute overhang to accelerate technological progress as soon as the first amplified babbler AGIs are developed, which, as I argued above, should happen shortly after babblers actually useful for automation of human jobs are developed (because generation of stream of consciousness datasets is a special case of such a job).
So the key things to make imitation plateau last for years are either sideloads requiring more compute than it looks like (to me) they require, or amplification of competent babblers into similarly competent AGIs being a hard problem that takes a long time to solve.
The imitation plateau can definitely be rather short. I also agree that computational overhang is the major factor here. However, a failure to capture some of the ingredients can be a cause of low computational overhang, whereas success in capturing all of the ingredients is a cause of high computational overhang, because the compute necessary to reach superintelligence might be very different in those two cases. Using sideloads to accelerate progress might still require years, whereas an “intrinsic” AGI might lead to the classical “foom” scenario.
EDIT: Although, since training is typically much more computationally expensive than deployment, it is likely that the first human-level imitators will already be significantly sped-up compared to humans, implying that accelerating progress will be relatively easy. It might still take some time from the first prototype until such an accelerate-the-progress project, but probably not much longer than deploying lots of automation.
I was arguing that near-human-level babblers (including the imitation plateau you were talking about) should quickly lead to human-level AGIs by amplification via stream-of-consciousness datasets, which doesn’t pose new ML difficulties other than design of the dataset. Superintelligence follows from that by any of the same arguments as for uploads leading to AGI (much faster technological progress; if amplification/distillation of uploads is useful straight away, we get there faster, but it’s not necessary). And amplified babblers should be stronger than vanilla uploads (at least as strong as implausibly well-educated, well-coordinated, high-IQ humans).
For your scenario to be stable, it needs to be impossible (in the near term) to run the AGIs (amplified babblers) faster than humans, and for the AGIs to remain less effective than very high IQ humans. Otherwise you get acceleration of technological progress, including ML. So my point is that feasibility of imitation plateau depends on absence of compute overhang, not on ML failing to capture some of the ingredients of human general intelligence.