I am not sure what you mean by “meets the LIC or similar” in this context. If we consider a predictor which is a learning algorithm in itself (i.e., it predicts by learning from the agent’s past choices),
Yeah, that’s what I meant.
, then the agent will converge to one-boxing. This is because a weak predictor will be fully inside the agent’s prior, so the agent will know that one-boxing for long enough will cause the predictor to fill the box.
Suppose the interval between encounters with the predictor is long enough that, due to the agent’s temporal discounting, the immediate reward of two-boxing outweighs the later gains which one-boxing provides. In any specific encounter with the predictor, the agent may prefer to two-box, but also prefer to have been the sort of agent who predictably one-boxes, and prefer to pre-commit to one-box on the next encounter if a commitment mechanism exists. (This scenario also requires a carefully tuned strength for the predictor, of course.)
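To make the discounting condition concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the payoffs `small` and `big`, the per-step discount factor `gamma`, and the simplification that one-boxing now is enough to get the box filled at the very next encounter, `gap` steps later. Under those assumptions, two-boxing wins in the moment exactly when the immediate bonus exceeds the discounted future gain.

```python
def prefers_two_boxing(small, big, gamma, gap):
    """True if the immediate two-boxing bonus beats the discounted gain
    from having the box filled at the next encounter, `gap` steps away."""
    return small > (gamma ** gap) * big


# Illustrative numbers only (not taken from the discussion).
small, big, gamma = 1_000, 1_000_000, 0.99

# Smallest interval between encounters at which two-boxing looks better.
threshold_gap = next(
    g for g in range(1, 10_000) if prefers_two_boxing(small, big, gamma, g)
)
print(threshold_gap)
```

Below the threshold the agent simply one-boxes. Above it, the agent faces the tension described: it two-boxes in the moment while wishing it had been committed to one-boxing.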
But I wasn’t sure this would be the result for your agent, since you described the agent using the hypothesis which gives the best picture about achievable utility.
As I discussed in Do Sufficiently Advanced Agents Use Logic, what I tend to think about is the case where the agent doesn’t literally encounter the predictor repeatedly in its physical history. Instead, the agent must learn what strategy to use by reasoning about similar (but “smaller”) scenarios. But we can get the same effect by assuming the temporal discounting is steep enough, as above.
I was never convinced that “logical ASP” is a “fair” problem. I once joked with Scott that we can consider a “predictor” that is just the single line of code “return DEFECT” but in the comments it says “I am defecting only because I know you will defect.” It was a joke, but it was half-serious. The notion of “weak predictor” taken to the limit leads to absurdity, and if you don’t take it to the limit it might still lead to absurdity. Logical inductors are one way to try specifying a “weak predictor”, but I am not convinced that settings in which logic is inserted ad hoc should be made into desiderata.
Yeah, it is clear that there has to be a case where the predictor is so weak that the agent should not care. I’m fine with dropping the purely logical cases as desiderata in favor of the learning-theoretic versions. But, the ability to construct analogous problems for logic and for learning theory is notable. Paying attention to that analogy more generally seems like a good idea.
I am not sure we need an arbitrary cutoff. There might be a good solution where the agent can dynamically choose any finite cutoff.
Yeah, I guess we can do a variety of things:
Naming a time limit for the commitment.
Naming a time at which a time limit for the commitment will be named.
Naming an ordinal (in some ordinal notation), so that a smaller ordinal must be named every time-step, until a smaller ordinal cannot be named, at which point the commitment runs out.
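The ordinal option can be sketched in a few lines. This is only an illustration under assumptions of my own choosing: ordinals below ω² are written as pairs `(a, b)` standing for ω·a + b, Python's lexicographic tuple comparison plays the role of the ordinal order, and the names `commitment_length` and `countdown` are hypothetical. Starting from a limit ordinal such as `(1, 0)`, the agent can pick any finite cutoff on its first step, which is the "dynamically choose any finite cutoff" behaviour.

```python
from typing import Callable, Tuple

Ordinal = Tuple[int, int]  # (a, b) stands for omega*a + b


def commitment_length(start: Ordinal, choose: Callable[[Ordinal], Ordinal]) -> int:
    """Count the time-steps until no smaller ordinal can be named.

    Each step the agent must name a strictly smaller ordinal; the
    well-ordering guarantees the loop terminates.
    """
    current, steps = start, 0
    while current != (0, 0):
        nxt = choose(current)
        if min(nxt) < 0 or not nxt < current:
            raise ValueError(f"{nxt} is not a smaller ordinal than {current}")
        current, steps = nxt, steps + 1
    return steps


def countdown(k: int) -> Callable[[Ordinal], Ordinal]:
    """Strategy: count down b; at a limit (a, 0), drop to (a - 1, k)."""
    def choose(o: Ordinal) -> Ordinal:
        a, b = o
        return (a, b - 1) if b > 0 else (a - 1, k)
    return choose


print(commitment_length((1, 0), countdown(5)))
```

The first two options in the list correspond to starting at `(0, n)` and at `(1, 0)` respectively. Richer ordinal notations only change the representation and the comparison.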
I suspect I want to evaluate a commitment scheme by asking whether it helps achieve a nice regret-bound notion, rather than defining the regret notion by evaluating regret-with-respect-to-making-commitments.
Thinking about LI policy selection where we choose a slow-growing function f(n) which determines how long we think before we choose the policy to follow on day n, there’s this weird trade-off between how (apparently) “good” the updatelessness is vs how long it takes to be any good at all. I’m fine with notions of rationality being parameterized by an ordinal or some such if it’s just a choose-the-largest-number game. But in this case, choosing too slow-growing a function makes you worse off; so the fact that the rationality principle is parameterized (by the slow-growing function) is problematic. Choosing a commitment scheme seems similar.
So it would be nice to have a rationality notion which clarified this situation.
My main concern here is: the case for empirical updatelessness seems strong in realizable situations where the prior is meaningful. Things aren’t as nice in the non-realizable cases such as logical uncertainty. But it doesn’t make sense to abandon updateless principles altogether because of this!
Response to Section IV:
FDT fails to get the answer Y&S want in most instances of the core example that’s supposed to motivate it
I am basically sympathetic to this concern: I think there’s a clear intuition that FDT is 2-boxing more than we would like (and a clear formal picture, in toy formalisms which show FDT-ish DTs failing on Agent Simulates Predictor problems).
Of course, it all depends on how logical counterfactuals are supposed to work. From a design perspective, I’m happy to take challenges like this as extra requirements for the behavior of logical counterfactuals, rather than objections to the whole project. I intuitively think there is a notion of logical counterfactual which fails in this respect, but, this does not mean there isn’t some other notion which succeeds. Perhaps we can solve the easy problem of one-boxing with a strong predictor first, and then look for ways to one-box more generally (and in fact, this is what we’ve done—one-boxing with a strong predictor is not so difficult).
However, I do want to add that when Omega uses very weak prediction methods such as the examples given, it is not so clear that we want to one-box. Will is presuming that Y&S simply want to one-box in any Newcomb problem. However, we could make a distinction between evidential Newcomb problems and functional Newcomb problems. Y&S already state that they consider some things to be functional Newcomb problems despite them not being evidential Newcomb problems (such as transparent Newcomb). It stands to reason that there would be some evidential Newcomb problems which are not functional Newcomb problems, as well, and that Y&S would prefer not to one-box in such cases.
However, the predictor needn’t be running your algorithm, or have anything like a representation of that algorithm, in order to predict whether you’ll one-box or two-box. Perhaps the Scots tend to one-box, whereas the English tend to two-box.
In this example, it seems quite plausible that there’s a (logico-causal) reason for the regularity, so that in the logical counterfactual where you act differently, your reference class also acts somewhat differently. Say you’re Scottish, and 10% of Scots read a particular fairy tale growing up, and this is connected with why you two-box. Then in the counterfactual in which you one-box, it is quite possible that those 10% also one-box. Of course, this greatly weakens the connection between Omega’s prediction and your action; perhaps the change of 10% is not enough to tip the scales in Omega’s prediction.
But, without any account of Y&S’s notion of subjunctive counterfactuals, we just have no way of assessing whether that’s true or not. Y&S note that specifying an account of their notion of counterfactuals is an ‘open problem,’ but the problem is much deeper than that. Without such an account, it becomes completely indeterminate what follows from FDT, even in the core examples that are supposed to motivate it — and that makes FDT not a new decision theory so much as a promissory note.
In the TDT document, Eliezer addresses this concern by pointing out that CDT also takes a description of the causal structure of a problem as given, begging the question of how we learn causal counterfactuals. In this regard, FDT and CDT are on the same level of promissory-note-ness.
It might, of course, be taken as much more plausible that a technique of learning the physical-causal structure can be provided, in contrast to a technique which learns the logical-counterfactual structure.
I want to inject a little doubt about which is easier. If a robot is interacting with an exact simulation of itself (in an iterated prisoner’s dilemma, say), won’t it be easier to infer that it directly controls the copy than it is to figure out that the two are running on different computers and thus causally independent?
Put more generally: logical uncertainty has to be handled one way or another; it cannot be entirely put aside. Existing methods of testing causality are not designed to deal with it. It stands to reason that such methods applied naively to cases including logical uncertainty would treat such uncertainty like physical uncertainty, and therefore tend to produce logical-counterfactual structure. This would not necessarily be very good for FDT purposes, being the result of unprincipled accident—and the concern for FDT’s counterfactuals is that there may be no principled foundation. Still, I tend to think that other decision theories merely brush the problem under the rug, and actually have to deal with logical counterfactuals one way or another.
Indeed, on the most plausible ways of cashing this out, it doesn’t give the conclusions that Y&S would want. If I imagine the closest world in which 6288 + 1048 = 7336 is false (Y&S’s example), I imagine a world with laws of nature radically unlike ours, because the laws of nature rely, fundamentally, on the truths of mathematics, and if one mathematical truth is false then either (i) mathematics as a whole must be radically different, or (ii) all mathematical propositions are true because it is simple to prove a contradiction and every proposition follows from a contradiction.
To this I can only say again that FDT’s problem of defining counterfactuals seems not so different to me from CDT’s problem. A causal decision theorist should be able to work in a mathematical universe; indeed, this seems rather consistent with the ontology of modern science, though not forced by it. I find it implausible that a CDT advocate should have to deny Tegmark’s mathematical universe hypothesis, or should break down and be unable to make decisions under the supposition. So, physical counterfactuals seem like they have to be at least capable of being logical counterfactuals (perhaps a different sort of logical counterfactual than FDT would use, since physical counterfactuals are supposed to give certain different answers, but a sort of logical counterfactual nonetheless).
(But this conclusion is far from obvious, and I don’t expect ready agreement that CDT has to deal with this.)
Response to Section VIII:
An alternative approach that captures the spirit of FDT’s aims
I’m somewhat confused about how you can buy FDT as far as you seem to buy it in this section, while also claiming not to understand FDT to the point of saying there is no sensible perspective at all in which it can be said to achieve higher utility. From the perspective in this section, it seems you can straightforwardly interpret FDT’s notion of expected utility maximization via an evaluative focal point such as “the output of the algorithm given these inputs”.
This evaluative focal point addresses the concern you raise about how bounded ability to implement decision procedures interacts with a “best decision procedure” evaluative focal point (making it depart from FDT’s recommendations in so far as the agent can’t manage to act like FDT), since those concerns don’t arise (at least not so clearly) when we consider what FDT would recommend for the response to one situation in particular. On the other hand, we also can make sense of the notion that taking the bomb is best, since (according to both global-CDT and global-EDT) it is best for an algorithm to output “left” when given the inputs of the bomb problem (in that it gives us the best news about how that agent would do in bomb problems, and causes the agent to do well when put in bomb problems, in so far as a causal intervention on the output of the algorithm also affects a predictor running the same algorithm).
Responses to Sections V and VI:
I’m puzzled by this concern. Is the doctrine of expected utility plagued by a corresponding ‘implausible discontinuity’ problem because if action 1 has expected value .999 and action 2 has expected value 1, then you should take action 2, but a very small change could mean you should take action 1? Is CDT plagued by an implausible-discontinuity problem because two problems which EDT would treat as the same will differ in causal expected value, and there must be some in-between problem where uncertainty about the causal structure balances between the two options, so CDT’s recommendation implausibly makes a sharp shift when the uncertainty is jiggled a little? Can’t we similarly boggle at the implausibility that a tiny change in the physical structure of a problem should make such a large difference in the causal structure so as to change CDT’s recommendation? (For example, the tiny change can be a small adjustment to the coin which determines which of two causal structures will be in play, with no overall change in the evidential structure.)
It seems like what you find implausible about FDT here has nothing to do with discontinuity, unless you find CDT and EDT similarly implausible.
FDT is deeply indeterminate
This is obviously a big challenge for FDT; we don’t know what logical counterfactuals look like, and invoking them is problematic until we do.
However, I can point to some toy models of FDT which lend credence to the idea that there’s something there. The most interesting may be MUDT (see the “modal UDT” section of this summary post). This decision theory uses the notion of “possible” from the modal logic of provability, so that despite being a deterministic agent and therefore only taking one particular action in fact, agents have a well-defined possible-world structure to consider in making decisions, derived from what they can prove.
I have a post planned that focuses on a different toy model, single-player extensive-form games. This has the advantage of being only as exotic as standard game theory.
In both of these cases, FDT can be well-specified (at least, to the extent we’re satisfied with calling the toy DTs examples of FDT—which is a bit awkward, since FDT is kind of a weird umbrella term for several possible DTs, but also kind of specifically supposed to use functional graphs, which MUDT doesn’t use).
It bears mentioning that a Bayesian already regards the probability distribution representing a problem to be deeply indeterminate, so this seems less bad if you start from such a perspective. Logical counterfactuals can similarly be thought of as subjective objects, rather than some objective fact which we have to uncover in order to know what FDT does.
On the other hand, greater indeterminacy is still worse; just because we already have lots of degrees of freedom to mess ourselves up with doesn’t mean we happily accept even more.
And in general, it seems to me, there’s no fact of the matter about which algorithm a physical process is implementing in the absence of a particular interpretation of the inputs and outputs of that physical process.
Part of the reason that I’m happy for FDT to need such a fact is that I think I need such a fact anyway, in order to deal with anthropic uncertainty, and other issues.
If you don’t think there’s such a fact, then you can’t take a computationalist perspective on theory of mind—in which case, I wonder what position you take on questions such as consciousness. Obviously this leads to a number of questions which are quite aside from the point at hand, but I would personally think that questions such as whether an organism is experiencing suffering have to do with what computations are occurring. This ultimately cashes out to physical facts, yes, but it seems as if suffering should be a fundamentally computational fact which cashes out in terms of physical facts only in a substrate-independent way (ie, the physical facts of importance are precisely those which pertain to the question of which computation is running).
But almost all accounts of computation in physical processes have the issue that very many physical processes are running very many different algorithms, all at the same time.
Indeed, I think this is one of the main obstacles to a satisfying account—a successful account should not have this property.
Response to Section VII:
Assessing by how well the decision-maker does in possible worlds that she isn’t in fact in doesn’t seem a compelling criterion (and EDT and CDT could both do well by that criterion, too, depending on which possible worlds one is allowed to pick).
You make the claim that EDT and CDT can claim optimality in exactly the same way that FDT can, here, but I think the arguments are importantly not symmetric. CDT and EDT are optimal according to their own optimality notions, but given the choice to implement different decision procedures on later problems, both the CDT and EDT optimality notions would endorse selecting FDT over themselves in many of the problems mentioned in the paper, whereas FDT will endorse itself.
Most of this section seems to me to be an argument to make careful level distinctions, in an attempt to avoid the level-crossing argument which is FDT’s main appeal. Certainly, FDTers such as myself will often use language which confuses the various levels, since we take a position which says they should be confusable—the best decision procedures should follow the best policies, which should take the best actions. But making careful level distinctions does not block the level-crossing argument, it only clarifies it. FDT may not be the only “consistent fixed-point of normativity” (to the extent that it even is that), but CDT and EDT are clearly not that.
Fourth, arguing that FDT does best in a class of ‘fair’ problems, without being able to define what that class is or why it’s interesting, is a pretty weak argument.
I basically agree that the FDT paper dropped the ball here, in that it could have given a toy setting in which ‘fair’ is rigorously defined (in a pretty standard game-theoretic setting) and FDT has the claimed optimality notion. I hope my longer writeup can make such a setting clear.
Briefly: my interpretation of the “FDT does better” claim in the FDT paper is that FDT is supposed to take UDT-optimal actions. To the extent that it doesn’t take UDT-optimal actions, I mostly don’t endorse the claim that it does better (though I plan to note in a follow-up post an alternate view in which the FDT notion of optimality may be better).
The toy setting I have in mind that makes “UDT-optimal” completely well-defined is actually fairly general. The idea is that if we can represent a decision problem as a (single-player) extensive-form game, UDT is just the idea of throwing out the requirement of subgame-optimality. In other words, we don’t even need a notion of “fairness” in the setting of extensive-form games—the setting isn’t rich enough to represent any “unfair” problems. Yet it is a pretty rich setting.
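As a rough illustration of what dropping subgame-optimality buys, here is a sketch of transparent Newcomb written as a single-player game with information sets. The encoding is my own assumption rather than something taken from the FDT paper: the predictor's simulation and the real agent share the "sees-full" information set, so a strategy must give the same answer at both nodes. The payoffs are the usual illustrative amounts, and `payoff` and `local_best` are hypothetical names.

```python
from itertools import product

ACTIONS = ("one-box", "two-box")
INFO_SETS = ("sees-full", "sees-empty")


def payoff(strategy):
    """Play the whole tree under one strategy (a dict: info set -> action)."""
    # Simulated node: the predictor runs the agent on the "sees-full" info set.
    box_full = strategy["sees-full"] == "one-box"
    # Real node: the agent acts on whichever info set it actually reaches.
    action = strategy["sees-full" if box_full else "sees-empty"]
    big = 1_000_000 if box_full else 0
    small = 1_000 if action == "two-box" else 0
    return big + small


# UDT-style choice: pick the strategy with the best payoff over the whole game.
strategies = [dict(zip(INFO_SETS, acts)) for acts in product(ACTIONS, repeat=2)]
best = max(strategies, key=payoff)


def local_best(box_full):
    """Node-by-node reasoning: hold the box contents fixed, then pick an action."""
    big = 1_000_000 if box_full else 0
    return max(ACTIONS, key=lambda a: big + (1_000 if a == "two-box" else 0))


print(best, payoff(best), local_best(True))
```

The whole-game optimum one-boxes on seeing a full box, while the node-by-node reasoner two-boxes there, so under this encoding it never sees a full box in the first place.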
This observation was already made here: https://www.lesswrong.com/posts/W4sDWwGZ4puRBXMEZ/single-player-extensive-form-games-as-a-model-of-udt. Note that there are some concerns in the comments. I think the concerns make sense, and I’m not quite sure how I want to address them, but I also don’t think they’re damning to the toy model.
The FDT paper may have omitted this model out of a desire for greater generality, which I do think is an important goal. From my perspective, it makes sense not to reduce things to the toy model in which everything works out nicely.
Here are some (very lightly edited) comments I left on Will’s draft of this post. (See also my top-level response.)
Responses to Sections II and III:
I’m not claiming that it’s clear what this means. E.g. see here, second bullet point, arguing there can be no such probability function, because any probability function requires certainty in logical facts and all their entailments.
This point shows the intertwining of logical counterfactuals (counterpossibles) and logical uncertainty. I take logical induction to represent significant progress generalizing probability theory to the case of logical uncertainty, ie, objects which have many of the virtues of probability functions while not requiring certainty about entailment of known facts. So, we can substantially reply to this objection.
However, replying to this objection does not necessarily mean we can define logical counterfactuals as we would want. So far we have only been able to use logical induction to specify a kind of “logically uncertain evidential conditional”. (IE, something closer in spirit to EDT, which does behave more like FDT in some problems but not in general.)
I want to emphasize that I agree that specifying what logical counterfactuals are is a grave difficulty, so grave as to seem (to me, at present) to be damning, provided one can avoid the difficulty in some other approach. However, I don’t actually think that the difficulty can be avoided in any other approach! I think CDT ultimately has to grapple with the question as well, because physics is math, and so physical counterfactuals are ultimately mathematical counterfactuals. Even EDT has to grapple with this problem, ultimately, due to the need to handle cases where one’s own action can be logically known. (Or provide a convincing argument that such cases cannot arise, even for an agent which is computable.)
Guaranteed Payoffs: In conditions of certainty (that is, when the decision-maker has no uncertainty about what state of nature she is in, and no uncertainty about what the utility payoff of each action is), the decision-maker should choose the action that maximises utility.
(Obligatory remark that what maximizes utility is part of what’s at issue here, and for precisely this reason, an FDTist could respond that it’s CDT and EDT which fail in the Bomb example—by failing to maximize the a priori expected utility of the action taken.)
FDT would disagree with this principle in general, since full certainty implies certainty about one’s action, and the utility to be received, as well as everything else. However, I think we can set that aside and say there’s a version of FDT which would agree with this principle in terms of prior uncertainty. It seems cases like Bomb cannot be set up without either invoking prior uncertainty (taking the form of the predictor’s failure rate) or bringing the question of how to deal with logically impossible decisions to the forefront (if we consider the case of a perfect predictor).
Why should prior uncertainty be important, in cases of posterior certainty? Because of the prior-optimality notion (in which a decision theory is judged on a decision problem based on the utility received in expectation according to the prior probability which defines the decision problem).
Prior-optimality considers the guaranteed-payoff objection to be very similar to objecting to a gambling strategy by pointing out that the gambling strategy sometimes loses. In Bomb, the problem clearly stipulates that an agent who follows the FDT recommendation has trillion-trillion-to-one odds of doing better than an agent who follows the CDT/EDT recommendation. Complaining about the one-in-a-trillion-trillion chance that you get the bomb while being the sort of agent who takes the bomb is, to an FDT-theorist, like a gambler who has just lost a trillion-trillion-to-one bet complaining that the bet doesn’t look so rational now that the outcome is known with certainty to be the one-in-a-trillion-trillion case where the bet didn’t pay well.
The right action, according to FDT, is to take Left, in the full knowledge that as a result you will slowly burn to death. Why? Because, using Y&S’s counterfactuals, if your algorithm were to output ‘Left’, then it would also have outputted ‘Left’ when the predictor made the simulation of you, and there would be no bomb in the box, and you could save yourself $100 by taking Left.
And why, on your account, is this implausible? To my eye, this is right there in the decision problem, not a weird counterintuitive consequence of FDT: the decision problem stipulates that algorithms which output ‘left’ will not end up in the situation of taking a bomb, with very, very high probability.
Again, complaining that you now know with certainty that you’re in the unlucky position of seeing the bomb seems irrelevant in the way that a gambler complaining that they now know how the dice fell seems irrelevant—it’s still best to gamble according to the odds, taking the option which gives the best chance of success.
(But what I most want to convey here is that there is a coherent sense in which FDT does the optimal thing, whether or not one agrees with it.)
One way of thinking about this is to say that the FDT notion of “decision problem” is different from the CDT or EDT notion, in that FDT considers the prior to be of primary importance, whereas CDT and EDT consider it to be of no importance. If you had instead specified ‘bomb’ with just the certain information that ‘left’ is (causally and evidentially) very bad and ‘right’ is much less bad, then CDT and EDT would regard it as precisely the same decision problem, whereas FDT would consider it to be a radically different decision problem.
Another way to think about this is to say that FDT “rejects” decision problems which are improbable according to their own specification. In cases like Bomb, where the situation as described has, by its own description, a one-in-a-trillion-trillion chance of occurring, FDT gives the outcome only one-trillion-trillion-th consideration in the expected utility calculation when deciding on a strategy.
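The expected-utility calculation here is small enough to write out. The sketch below assumes the predictor's failure rate of one in a trillion trillion and the $100 cost of Right from the problem statement. The dollar-equivalent disutility of burning to death (`death`) is a number I made up purely for illustration, and the function and variable names are hypothetical.

```python
def prior_expected_utility(policy, error_rate=1e-24, death=-1e9, fee=-100.0):
    """Expected utility of a fixed policy, judged by the problem's own prior."""
    if policy == "Right":
        # Right costs the fee whether or not the predictor erred.
        return fee
    # Left is free unless the predictor erred and a bomb is in the box.
    return error_rate * death


best_policy = max(("Left", "Right"), key=prior_expected_utility)

# Disutility of death at which the two policies would tie, at default values.
break_even_death = -100.0 / 1e-24

print(best_policy, prior_expected_utility("Left"), break_even_death)
```

Under these assumptions, Left stays ahead unless dying is valued as worse than roughly 10^26 dollars, which is the sense in which the bomb outcome gets only one-trillion-trillion-th consideration.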
Also, I note that this analysis (on the part of FDT) does not hinge in this case on exotic counterfactuals. If you set Bomb up in the Savage framework, you would be forced to either give only the certain choice between bomb and not-bomb (so you don’t represent the interesting part of the problem, involving the predictor) or to give the decision in terms of the prior, in which case the Savage framework would endorse the FDT recommendation.
Another framework in which we could arrive at the same analysis would be that of single-player extensive-form games, in which the FDT recommendation corresponds to the simple notion of optimal strategy, whereas the CDT recommendation amounts to the stipulation of subgame-optimality.
Replying to one of Will’s edits on account of my comments to the earlier draft:
Finally, in a comment on a draft of this note, Abram Demski said that: “The notion of expected utility for which FDT is supposed to do well (at least, according to me) is expected utility with respect to the prior for the decision problem under consideration.” If that’s correct, it’s striking that this criterion isn’t mentioned in the paper. But it also doesn’t seem compelling as a principle by which to evaluate between decision theories, nor does it seem FDT even does well by it. To see both points: suppose I’m choosing between an avocado sandwich and a hummus sandwich, and my prior was that I prefer avocado, but I’ve since tasted them both and gotten evidence that I prefer hummus. The choice that does best in terms of expected utility with respect to my prior for the decision problem under consideration is the avocado sandwich (and FDT, as I understood it in the paper, would agree). But, uncontroversially, I should choose the hummus sandwich, because I prefer hummus to avocado.
Yeah, the thing is, the FDT paper focused on examples where “expected utility according to the prior” becomes an unclear notion due to logical uncertainty issues. It wouldn’t have made sense for the FDT paper to focus on that, given the desire to put the most difficult issues into focus. However, FDT is supposed to accomplish similar things to UDT, and UDT provides the more concrete illustration.
The policy that does best in expected utility according to the prior is the policy of taking whatever you like. In games of partial information, decisions are defined as functions of information states; and in the situation as described, there are separate information states for liking hummus and liking avocado. Choosing the one you like achieves a higher expected utility according to the prior, in comparison to just choosing avocado no matter what. In this situation, optimizing the decision in this way is equivalent to updating on the information, but not always (as in transparent Newcomb, Bomb, and other such problems).
To re-state that a different way: in a given information state, UDT is choosing what to do as a function of the information available, and judging the utility of that choice according to the prior. So, in this scenario, we judge the expected utility of selecting avocado in response to liking hummus. This is worse (according to the prior!) than selecting hummus in response to liking hummus.
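A small sketch of the sandwich case, treating a policy as a function from information states to choices and scoring every policy by the prior. The prior probabilities and the 0/1 utilities are assumed numbers for illustration, and the names are hypothetical.

```python
from itertools import product

# Assumed prior over the agent's own taste (illustrative numbers).
PRIOR = {"likes avocado": 0.75, "likes hummus": 0.25}
CHOICES = ("avocado", "hummus")


def utility(taste, choice):
    """1 if the sandwich matches the taste, else 0."""
    return 1.0 if taste == "likes " + choice else 0.0


def expected_utility(policy):
    """Score a policy (info state -> choice) according to the prior."""
    return sum(p * utility(taste, policy[taste]) for taste, p in PRIOR.items())


# Enumerate every policy: one choice per information state.
policies = [dict(zip(PRIOR, picks)) for picks in product(CHOICES, repeat=len(PRIOR))]
best = max(policies, key=expected_utility)
always_avocado = {taste: "avocado" for taste in PRIOR}

print(best, expected_utility(best), expected_utility(always_avocado))
```

The policy that picks hummus in the likes-hummus state beats always-avocado by the prior's own lights, even though avocado was the more probable preference beforehand.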
I saw an earlier draft of this, and hope to write an extensive response at some point. For now, the short version:
As I understand it, FDT was intended as an umbrella term for MIRI-style decision theories, which illustrated the critical points without making too many commitments. So, the vagueness of FDT was partly by design.
I think UDT is a more concrete illustration of the most important points relevant to this discussion.
The optimality notion of UDT is clear. “UDT gets the most utility” means “UDT gets the highest expected value with respect to its own prior”. This seems quite well-defined, hopefully addressing your (VII).
There are problems applying UDT to realistic situations, but UDT makes perfect sense and is optimal in a straightforward sense for the case of single-player extensive form games. That doesn’t address multi-player games or logical uncertainty, but it is enough for much of Will’s discussion.
FDT focused on the weird logical cases, which is in fact a major part of the motivation for MIRI-style decision theory. However, UDT for single-player extensive-form games actually gets at a lot of what MIRI-style decision theory wants, without broaching the topic of logical counterfactuals or proving-your-own-action directly.
The problems which create a deep indeterminacy seem, to me, to be problems for other decision theories than FDT as well. FDT is trying to face them head-on. But there are big problems for applying EDT to agents who are physically instantiated as computer programs and can prove too much about their own actions.
This also hopefully clarifies the sense in which I don’t think the decisions pointed out in (III) are bizarre. The decisions are optimal according to the very probability distribution used to define the decision problem.
There’s a subtle point here, though, since Will describes the decision problem from an updated perspective: you already know the bomb is in front of you. So UDT “changes the problem” by evaluating “according to the prior”. From my perspective, because the very statement of the Bomb problem suggests that there were also other possible outcomes, we can rightly insist on evaluating expected utility in terms of those chances.
Perhaps this sounds like an unprincipled rejection of the Bomb problem as you state it. My principle is as follows: you should not state a decision problem without having in mind a well-specified way to predictably put agents into that scenario. Let’s call the way-you-put-agents-into-the-scenario the “construction”. We then evaluate agents on how well they deal with the construction.
For examples like Bomb, the construction gives us the overall probability distribution—this is then used for the expected value which UDT’s optimality notion is stated in terms of.
For other examples, as discussed in Decisions are for making bad outcomes inconsistent, the construction simply breaks when you try to put certain decision theories into it. This can also be a good thing; it means the decision theory makes certain scenarios altogether impossible.
The point about “constructions” is possibly a bit subtle (and hastily made); maybe a lot of the disagreement will turn out to be there. But I do hope that the basic idea of UDT’s optimality criterion is actually clear—“evaluate expected utility of policies according to the prior”—and clarifies the situation with FDT as well.
By asking people to leave a comment here linking to their exercises, are you discouraging writing exercises directly as a comment to this post? (Perhaps you’re wanting something longer, and so discouraging comments as the arena for listing exercises?)
I agree that if a point can be addressed or explored in a static framework, it can be easier to do that first rather than going to the fully dynamic picture.
On the other hand, I think your discussion of the cat overstates the case. Your own analysis of the decision theory of a single-celled organism (ie the perspective you’ve described to me in person) compares it to gradient descent, rather than expected utility maximization. This is a fuzzy area, and certainly doesn’t achieve all the things I mentioned, but doesn’t that seem more “dynamic” than “static”? Today’s deep learning systems aren’t as generally intelligent as cats, but it seems like the gap exists more within learning theory than static decision theory.
More importantly, although the static picture can be easier to analyse, it has also been much more discussed for that reason. The low-hanging fruits are more likely to be in the more neglected direction. Perhaps the more difficult parts of the dynamic picture (perhaps robust delegation) can be put aside while still approaching things from a learning-theoretic perspective.
I may have said something along the lines of the static picture already being essentially solved by reflective oracles (the problems with reflective oracles being typical of the problems with the static approach). From my perspective, it seems like time to move on to the dynamic picture in order to make progress. But that’s overstating things a bit—I am interested in better static pictures, particularly when they are suggestive of dynamic pictures, such as COEDT.
In any case, I have no sense that you’re making a mistake by looking at abstraction in the static setting. If you have traction, you should continue in that direction. I generally suspect that the abstraction angle is valuable, whether static or dynamic.
Still, I do suspect we have material disagreements remaining, not only disagreements in research emphasis.
Toward the end of your comment, you speak of the one-shot picture and the dynamic picture as if the two are mutually exclusive, rather than just easy mode vs hard mode as you mention early on. A learning picture still admits static snapshots. Also, cats don’t get everything right on the first try.
Still, I admit: a weakness of an asymptotic learning picture is that it seems to eschew finite problems; to such an extent that at times I’ve said the dynamic learning picture serves as the easy version of the problem, with one-shot rationality being the hard case to consider later. Toy static pictures, such as the one provided by reflective oracles, give an idealized static rationality, using unbounded processing power and logical omniscience. A real static picture (perhaps the picture you are seeking) would involve bounded rationality, including both logical non-omniscience and regular physical non-omniscience. A static-rationality analysis of logical non-omniscience has seemed quite challenging so far. Nice versions of self-reference and other challenges to embedded world-models such as those you mention seem to require conveniences such as reflective oracles. Nothing resembling thin priors has come along to allow for eventual logical coherence while resembling Bayesian static rationality (rather than logical-induction-like dynamic rationality). And as for the empirical uncertainty, we would really like to get some guarantees about avoiding catastrophic mistakes (though, perhaps, this isn’t within your scope).
This might seem surprising at first, because there is also a different incomplete model Φ2 that says “if you pay the blackmail, infestation will not happen”. Φ2 is false if you use physical causal counterfactuals, but from the agent’s perspective Φ2 is consistent with all observations. However, Φ2 only guarantees the payoff −c (because it is unknown whether the blackmail will arrive). Therefore, Φ2 will have no effect on the ultimate behavior of the agent.
What happens in ASP? (Say you’re in an iterated Newcomb’s problem with a predictor much slower than you, but which meets the LIC or similar.) I’m concerned that it will either settle on two-boxing, or possibly not settle on one strategy, since if it settles on two-boxing then a model which says “you can get the higher reward by one-boxing” (ie, the agent has control over the predictor) looks appealing; but, if it settles on one-boxing, a model which says “you can get higher reward by two-boxing” (ie, the agent’s action doesn’t control the predictor) looks appealing. This concern is related to the way asymptotic decision theory fails—granted, for cases outside of its definition of “fair”.
The precommitments have to expire after some finite time.
I agree that something like this generally does the right thing in most cases, with the exception of superrationality in games as a result of commitment races.
I still have a little hope that there will be a nice version, which doesn’t involve a commitment-races problem and which doesn’t make use of an arbitrary commitment cutoff. But I would agree that things don’t look good, and so it is reasonable to put this kind of thing outside of “fair” problems.
Let me add that I am not even sure what the correct desiderata are. In particular, I don’t think that we should expect any group of good agents to converge to a Pareto optimal outcome.
I don’t currently see why we shouldn’t ask to converge to Pareto optima. Obviously, we can’t expect to do so with arbitrary other agents; but it doesn’t seem unreasonable to use an algorithm which has the property of reaching Pareto optima with other agents who use that same algorithm. This even seems reasonable in the standard iterated Nash picture (where not all strategies achieve Pareto optima, but there exist strategies which achieve Pareto optima with a broad-ish class of other strategies, including others who use strategies like their own, while being very difficult to exploit).
But yeah, I’m pretty uncertain about what the desiderata should be—both with respect to game theory, and with respect to scenarios which require updatelessness/precommitments in order to do well. I agree that it should all be approached with a learning-theoretic perspective.
Ahh thanks :p fixed
Those of a Bayesian leaning will tend to say things like “probability is subjective”, and claim this is an important insight into the nature of probability—one might even go so far as to say “probability is an answer, not a question”. But this doesn’t mean you can believe what you want; not exactly. There are coherence constraints. So, once we see that probability is subjective, we can then seek a theory of the subjectivity, which tells us “objective” information about it (yet which leaves a whole lot of flexibility).
The same might be true of counterfactuals. I personally lean toward the position that the constraints on counterfactuals are that they be consistent with evidential predictions, but I don’t claim to be unconfused. My position is a “counterfactuals are subjective but have significant coherence constraints” type position, but (arguably) a fairly minimal one—the constraint is a version of “counterfacting on what you actually did should yield what actually happened”, one of the most basic constraints on what counterfactuals should be.
On the other hand, my theory of counterfactuals is pretty boring and doesn’t directly solve problems—it more says “look elsewhere for the interesting stuff”.
Oh, also, I wanted to pitch the idea that counterfactuals, like a whole bunch of things, should be thought of as “constructed rather than real”. This is subtly different from “subjective”. We humans are pretty far along in an ongoing process of figuring out how to be and act in the world. Sometimes we come up with formal theories of things like probability, utility, counterfactuals, and logic. The process of coming up with these formal theories informs our practice. Our practice also informs the formal theories. Sometimes a theory seems to capture what we wanted really nicely. My argument is that in an important sense we’ve invented, not discovered, what we wanted.
So, for example, utility functions. Do utility functions capture human preferences? No, not really, they are pretty far from preferences observed in the wild. However, we’re in the process of figuring out what we prefer. Utility functions capture some nice ideas about idealized preferences, so that when we’re talking about idealized versions of what we want (trying to figure out what we prefer upon reflection) it is (a) often pretty convenient to think in terms of utilities, and (b) somewhat difficult to really escape the framework of utilities. Similarly for probability and logic as formal models of idealized reasoning.
So, just as utility functions aren’t really out there in the world, counterfactuals aren’t really out there in the world. But just as it might be that we should think about our preferences in terms of utility anyway (...or maybe abandon utility in favor of better theoretical tools), we might want to equip our best world-model with counterfactuals anyway (...or abandon them in favor of better theoretical tools).
I very much agree with the point about not decoupling learning and decision theory. I wrote a comment making somewhat similar points.
I believe that this indeed solves both INP and IXB.
I’d like to understand this part.
One way to fix it is by allowing the agent to precommit. Then the assumption about Omega becomes empirically verifiable.
I’m not sure I should find the precommitment solution satisfying. Won’t it make some stupid precommitments early (before it has learned enough about the world to make reasonable precommitments) and screw itself up forever? Is there a generally applicable version of precommitments which ensures learning good behavior?
The only class of problems that I’m genuinely unsure how to deal with is game-theoretic superrationality.
If we take the learning-theoretic view, then we get to bring in tools from iterated games. There’s a Pavlov-like strategy for playing deterministic iterated games which converges to optimal responses to non-agentic environments and converges to Pareto optima for environments containing agents who use the Pavlov-like strategy. It is not the greatest at being unexploitable, and it also has fairly bad convergence.
However, I don’t yet see how to translate the result to logical-induction type learners. Besides requiring deterministic payouts (a property which can probably be relaxed somehow), the algorithm requires an agent to have a definite history—a well-defined training sequence. Agents based on logical induction are instead forming generalizations based on any sufficiently analogous situation within logic, so they don’t have a well-defined history in the right way. (An actual instance of a logical induction agent has an actual temporal history, but this temporal history is not necessarily what it is drawing on to play the game—it may have never personally encountered a similar situation.)
In other words, I’m hopeful that there could be a learning-theoretic solution, but I don’t know what it is yet.
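For readers who haven’t seen the family of strategies I mean, here is a minimal sketch of the classic Pavlov (win-stay, lose-shift) rule in the iterated Prisoner’s Dilemma. This is the textbook two-action version, not the more general Pavlov-like algorithm for deterministic iterated games discussed above; the function names are just illustrative. In the Prisoner’s Dilemma the rule reduces to “cooperate exactly when both players made the same move last round”.

```python
def pavlov(my_last, their_last):
    """Win-stay, lose-shift for the Prisoner's Dilemma.

    Cooperate on the first round; afterwards cooperate exactly when both
    players chose the same move in the previous round.
    """
    if my_last is None:
        return "C"
    return "C" if my_last == their_last else "D"


def always_defect(my_last, their_last):
    """A non-cooperative opponent, used to show how Pavlov can be exploited."""
    return "D"


def play(strategy_a, strategy_b, start, rounds=6):
    """Run the iterated game from a given starting pair of moves."""
    a, b = start
    history = [(a, b)]
    for _ in range(rounds):
        # Both players move simultaneously, each seeing only the previous round.
        a, b = strategy_a(a, b), strategy_b(b, a)
        history.append((a, b))
    return history


self_play = {
    start: play(pavlov, pavlov, start)
    for start in [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]
}
versus_defector = play(pavlov, always_defect, ("C", "D"))
```

Two Pavlov players reach mutual cooperation within two rounds from any starting pair and then stay there. Against `always_defect` the Pavlov player alternates between cooperating and defecting, which is the sense in which this kind of strategy is not great at being unexploitable.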
As for superrationality for agents w/o learning theory, there’s cooperative oracles, right? We can make computable analogues with distributed oracles. It’s not a real solution, specifically in that it ignores learning. So I sort of think we know how to do it in the “static” setting, but the problem is that we live in a learning-theoretic setting rather than a static-rationality setting.
The dice example is one I stumbled on while playing with the idea of a probability-like calculus for excluding information, rather than including information. I’ll write up a post on it at some point.
I look forward to it.
When I imagine an embedded agent, I imagine some giant computational circuit representing the universe, and I draw a box around one finite piece of it
Speaking very abstractly, I think this gets at my actual claim. Continuing to speak at that high level of abstraction, I am claiming that you should imagine an agent more as a flow through a fluid.
Speaking much more concretely, this difference comes partly from the question of whether to consider robust delegation as a central part to tackle now, or (as you suggested in the post) a part to tackle later. I agree with your description of robust delegation as “hard mode”, but nonetheless consider it to be central.
To name some considerations:
The “static” way of thinking involves handing decision problems to agents without asking how the agent found itself in that situation. The how-did-we-get-here question is sometimes important. For example, my rejection of the standard smoking lesion problem is a how-did-we-get-here type objection.
Moreover, “static” decision theory puts a box around “epistemics” with an output to decision-making. This implicitly suggests: “Decision theory is about optimal action under uncertainty—the generation of that uncertainty is relegated to epistemics.” This ignores the role of learning how to act. Learning how to act can be critical even for decision theory in the abstract (and is obviously important to implementation).
Viewing things from a learning-theoretic perspective, it doesn’t generally make sense to view a single thing (a single observation, a single action/decision, etc) in isolation. So, accounting for logical non-omniscience, we can’t expect to make a single decision “correctly” for basically any notion of “correctly”. What we can expect is to be “moving in the right direction”—not at a particular time, but generally over time (if nothing kills us).
So, describing an embedded agent in some particular situation, the notion of “rational (bounded) agency” should not expect anything optimal about its actions in that circumstance—it can only talk about the way the agent updates.
Due to logical non-omniscience, this applies to the action even if the agent is at the point where it knows what’s going on epistemically—it might not have learned to appropriately react to the given situation yet. So even “reacting optimally given your (epistemic) uncertainty” isn’t realistic as an expectation for bounded agents.
Obviously I also think the “dynamic” view is better in the purely epistemic case as well—logical induction being the poster boy, totally breaking the static rules of probability theory at a fixed time but gradually improving its beliefs over time (in a way which approaches the static probabilistic laws but also captures more).
Even for purely Bayesian learning, though, the dynamic view is a good one. Bayesian learning is a way of setting up dynamics such that better hypotheses “rise to the top” over time. It is quite analogous to replicator dynamics as a model of evolution.
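The analogy can be made fairly literal. A minimal sketch, with a made-up coin example (the three bias hypotheses and the flip sequence are placeholders of mine): a Bayesian update is the same arithmetic as one step of discrete-time replicator dynamics, with the likelihood of the observation playing the role of fitness and the posterior weights playing the role of population shares.

```python
def bayes_update(weights, likelihoods):
    """Posterior weights: prior times likelihood, renormalized."""
    joint = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(joint)
    return [j / total for j in joint]


def replicator_step(shares, fitness):
    """Discrete replicator dynamics: each share grows in proportion to fitness over mean fitness."""
    mean_fitness = sum(s * f for s, f in zip(shares, fitness))
    return [s * f / mean_fitness for s, f in zip(shares, fitness)]


# Hypothetical example: three hypotheses about a coin's probability of heads.
biases = [0.3, 0.5, 0.8]
flips = "HHTHHHTHHH"  # fixed illustrative data, 8 heads and 2 tails

posterior = [1 / 3] * 3
population = [1 / 3] * 3
for flip in flips:
    likelihoods = [b if flip == "H" else 1 - b for b in biases]
    posterior = bayes_update(posterior, likelihoods)
    population = replicator_step(population, likelihoods)

top_hypothesis = biases[posterior.index(max(posterior))]
```

The two trajectories coincide at every step, and the hypothesis that predicts the data best (bias 0.8 here) ends up with most of the weight. Neither function says anything about a single moment being “correct”; both only describe how the weights move.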
You can do “equilibrium analysis” of evolution, too (ie, evolutionary stable equilibria), but it misses how-did-we-get-here type questions: larger and smaller attractor basins. (Evolutionarily stable equilibria are sort of a patch on Nash equilibria to address some of the how-did-we-get-here questions, by ruling out points which are Nash equilibria but which would not be attractors at all.) It also misses out on orbits and other fundamentally dynamic behavior.
(The dynamic phenomena such as orbits become important in the theory of correlated equilibria, if you get into the literature on learning correlated equilibria (MAL—multi-agent learning) and think about where the correlations come from.)
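As a small illustration of the kind of orbit an equilibrium analysis misses, here is a sketch of replicator dynamics on rock-paper-scissors. The payoff matrix is the standard zero-sum one; the starting point, step size, and number of steps are arbitrary choices of mine. The only Nash equilibrium is the uniform mix, but the population cycles around it rather than settling there. In continuous time the orbits are closed; the simple Euler integration used here adds a slow outward drift on top of that.

```python
import math

# Row strategy's payoff against column strategy: rock, paper, scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def replicator_step(x, dt=0.01):
    """One Euler step of the continuous-time replicator equation."""
    fitness = [sum(PAYOFF[i][j] * x[j] for j in range(3)) for i in range(3)]
    mean_fitness = sum(x[i] * fitness[i] for i in range(3))
    return [x[i] + dt * x[i] * (fitness[i] - mean_fitness) for i in range(3)]


def distance_from_center(x):
    """Euclidean distance from the uniform mix (1/3, 1/3, 1/3)."""
    return math.sqrt(sum((xi - 1 / 3) ** 2 for xi in x))


def simulate(x0, steps=5000, dt=0.01):
    """Return the whole trajectory, including the starting point."""
    x = list(x0)
    trajectory = [x]
    for _ in range(steps):
        x = replicator_step(x, dt)
        trajectory.append(x)
    return trajectory


trajectory = simulate([0.5, 0.3, 0.2])
distances = [distance_from_center(x) for x in trajectory]
# Which strategy is most common at each time step (0=rock, 1=paper, 2=scissors).
leaders = {x.index(max(x)) for x in trajectory}
```

Over the run each of the three strategies takes a turn as the most common one, and the population never gets close to the equilibrium mix. A static analysis would report the equilibrium and nothing else.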
Of course we could have agents which persist over time, collecting information and making multiple decisions, but if our theory of embedded agency assumes that, then it seems like it will miss a lot of agenty behavior.
I agree that requiring dynamics would miss some examples of actual single-shot agents, doing something intelligently, once, in isolation. However, it is a live question for me whether such agents can be anything other than Boltzmann brains. In Does Agent-like Behavior imply Agent-like Architecture, Scott mentioned that it seems quite unlikely that you could get a look-up table which behaves like an agent without having an actual agent somewhere causally upstream of it. Similarly, I’m suggesting that it seems unlikely you could get an agent-like architecture sitting in the universe without some kind of learning process causally upstream.
Moreover, continuity is central to the major problems and partial solutions in embedded agency. X-risk is a robust delegation failure more than a decision-theory failure or an embedded world-model failure (though subsystem alignment has a similarly strong claim). UDT and TDT are interesting largely because of the way they establish dynamic consistency of an agent across time, partially addressing the tiling agent problem. (For UDT, this is especially central.) But, both of them ultimately fail very much because of their “static” nature.
[I actually got this static/dynamic picture from komponisto btw (talking in person, though the posts give a taste of it). At first it sounded like rather free-flowing abstraction, but it kept surprising me by being able to bear weight. Line-per-line, though, much more of the above is inspired by discussions with Steve Rayhawk.]
Edit: Vanessa made a related point in a comment on another post.
I think of agent-like architectures as something objective, or related to the territory. In contrast, agent-like behavior is something subjective, something in the map. Importantly, agent-like behavior, or the lack of it, of some X is something that exists in the map of some entity Y (where often Y≠X).
The selection/control distinction seems related, but not quite similar to me. Am I missing something there?
A(Θ)-morphism seems to me to involve both agent-like architecture and agent-like behavior, because it just talks about prediction generally. Mostly I was asking if you were trying to point it one way or the other (we could talk about prediction-of-internals exclusively, to point at structure, or prediction-of-external exclusively, to talk about behavior—I was unsure whether you were trying to do one of those things).
Since you say that you are trying to formalize how we informally talk, rather than how we should, I guess you weren’t trying to make A(Θ)-morphism get at this distinction at all, and were separately mentioning the distinction as one which should be made.
I don’t see how this agent seems to control his sanity.
The agent in Troll Bridge thinks that it can make itself insane by crossing the bridge. (Maybe this doesn’t answer your question?)
Troll Bridge is a rare case where agents that require proof to take action can prove they would be insane to take some action before they’ve thought through its consequences. Can you show how they could unwisely do this in chess, or some sort of Troll Chess?
I make no claim that this sort of case is common. Scenarios where it comes up and is relevant to X-risk might involve alien superintelligences trolling human-made AGI. But it isn’t exactly high on my list of concerns. The question is more about whether particular theories of counterfactual are right. Troll Bridge might be “too hard” in some sense—we may just have to give up on it. But, generally, these weird philosophical counterexamples are more about pointing out problems. Complex real-life situations are difficult to deal with (in terms of reasoning about what a particular theory of counterfactuals will actually do), so we check simple examples, even if they’re outlandish, to get a better idea of what the counterfactuals are doing in general.
Yep, sorry. The illustrations were not actually originally meant for publication; they’re from my personal notes. I did it this way (1) because the pictures are kind of nice, (2) because I was frustrated that no one had written a good summary post on Troll Bridge yet, (3) because I was in a hurry. Ideally I’ll edit the images to be more suitable for the post, although adding the omitted content is a higher priority.