# Example decision theory problem: “Agent simulates predictor”

Some people on LW have expressed interest in what’s happening on the decision-theory-workshop mailing list. Here’s an example of the kind of work we’re trying to do there.

In April 2010 Gary Drescher proposed the “Agent simulates predictor” problem, or ASP, which shows how agents with lots of computational power sometimes fare worse than agents with limited resources. I’m posting it here with his permission:

There’s a version of Newcomb’s Problem that poses the same sort of challenge to UDT that comes up in some multi-agent/game-theoretic scenarios.

Suppose:

The predictor does not run a detailed simulation of the agent, but relies instead on a high-level understanding of the agent’s decision theory and computational power.

The agent runs UDT, and has the ability to fully simulate the predictor.

Since the agent can deduce (by low-level simulation) what the predictor will do, the agent does not regard the prediction outcome as contingent on the agent’s computation. Instead, either predict-onebox or predict-twobox has a probability of 1 (since one or the other of those is deducible), and a probability of 1 remains the same regardless of what we condition on. The agent will then calculate greater utility for two-boxing than for one-boxing.

Meanwhile, the predictor, knowing that the agent runs UDT and will fully simulate the predictor, can reason as in the preceding paragraph, and thus deduce that the agent will two-box. So the large box is left empty and the agent two-boxes (and the agent’s detailed simulation of the predictor correctly shows the predictor correctly predicting two-boxing).

The agent would be better off, though, running a different decision theory that does not two-box here, and that the predictor can deduce does not two-box.

About a month ago I came up with a way to formalize the problem, along the lines of my other formalizations:

a) The agent generates all proofs of length up to M, then picks the action for which the greatest utility was proven.

b) The predictor generates all proofs of length up to N which is much less than M. If it finds a provable prediction about the agent’s action, it fills the boxes accordingly. Also the predictor has an “epistemic advantage” over the agent: its proof system has an axiom saying the agent’s proof system is consistent.

Now the predictor can reason as follows. It knows that the agent will find some proof that the predictor will put X dollars in the second box, for some unknown value of X, because the agent has enough time to simulate the predictor. Therefore, it knows that the agent will find proofs that one-boxing leads to X dollars and two-boxing leads to X+1000 dollars. Now what if the agent still chooses one-boxing in the end? That means it must have found a different proof saying one-boxing gives more than X+1000 dollars. But if the agent actually one-boxes, the existence of these two different proofs would imply that the agent’s proof system is inconsistent, which the predictor knows to be impossible. So the predictor ends up predicting that the agent will two-box, the agent two-boxes, and everybody loses.
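The dynamic above can be caricatured in a few lines of toy code (the names are mine, and the predictor’s bounded proof search is replaced by the hard-coded conclusion of the argument; this is a sketch of the outcome, not of the proof systems themselves):

```python
# Toy sketch of the ASP outcome. The real formalization enumerates proofs
# up to lengths N and M; here the predictor's reasoning is stubbed out with
# its conclusion from the argument above.

def predictor():
    # High-level reasoning: a consistent proof-searching agent that can
    # fully simulate me will end up two-boxing, so leave the big box empty.
    predicted_action = "two-box"
    return 1_000_000 if predicted_action == "one-box" else 0

def agent():
    # The agent simulates the predictor, treats the box contents as a
    # settled fact, and then maximizes utility given that fixed fact.
    big_box = predictor()
    utility = {"one-box": big_box, "two-box": big_box + 1_000}
    return max(utility, key=utility.get)
```

Running this, the agent two-boxes and walks away with only $1000, exactly as the informal argument predicts.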

Also Wei Dai has a tentative new decision theory that solves the problem, but this margin (and my brain) is too small to contain it :-)

Can LW generate the kind of insights needed to make progress on problems like ASP? Or should we keep working as a small clique?


Yes, especially if matt adds a ‘decision theory’ subreddit. That way those from the clique with a narrow interest in decision theory can just follow that page instead. I know I would be more likely to engage with the topics on a reddit-style medium than via email. Email conversations are just a lot harder to follow. Especially given that, let’s be honest, the ‘small clique’ is barely active and can go for weeks or a month between emails.

Having lesswrong exposed to people who are actively thinking through and solving a difficult problem would be an overwhelmingly good influence on the lesswrong community, and would expose potential new thinkers to the problems, hopefully inspiring some of them to get involved in the problem solving themselves. There will, of course, be a lowering of the average standard of discussion, but I would expect a net increase in high-quality contributions as well. Heavy upvoting of clear thinking, and commensurate downvoting, would make new insights accessible.

(Mind you, there is an obvious advantage to having decision theory conversations that are *not* on a site run by SIAI too.)

I think for the same reason we had meetup posts on the front page, we should just post stuff to the main LW (this will maximize the use case of recruiting new people), and we should move on to creating a sub-reddit only when the volume becomes distracting. The decision theory list can continue to serve its current purpose as a place to discuss particularly arcane ideas (that are not *results*), which new recruits won’t typically understand anyway (and so why waste anyone’s time). Publicly opening the list’s archives (but not subscription; we don’t have voting over there to defend ourselves) can cover the rest.

But we might want to adopt a standard tag on LW so that people can subscribe to this part only, which I can commit to consistently setting on all posts that qualify, say “decision_theory” (just fixed for this post).

From context I suppose you’re talking about a subreddit of LessWrong, not a subreddit of Reddit. Is there a list of existing LessWrong subreddits somewhere, or is this proposal to create the first one?

Just to give due credit: Wei Dai and others had already discussed Prisoner’s Dilemma scenarios that exhibit a similar problem, which I then distilled into the ASP problem.

Speaking only for myself: Eliezer’s sequences first lured me to Less Wrong, but your posts on decision theory were what convinced me to stick around and keep checking the front page.

I confess I don’t understand all of the math. It’s been decades since I studied mathematics with any rigour; these days I can follow some fairly advanced theory, but have difficulty reproducing it, and cannot in general extend it. I have had nothing substantial to add, and so I haven’t previously commented on one of your posts. Somehow LW didn’t seem like the *best* place to go all fangirl…

It’s not only the contributors who are following your work on decision theory. I hope you’ll continue to share it with the rest of us.

I would also like to follow the discussion on decision theory.

This whole thing seems like an artifact of failing to draw a boundary between decision theories and strategies. In my own work, I have for deterministic problems:

DecisionTheory: WorldProgram -> Strategy

Strategy: Observation -> Decision

World: Strategy -> Outcome

Phrased this way, the predictor is part of the World function, which means it is only allowed to simulate a Strategy, not a DecisionTheory, and so the problem as stated is ill-posed. This structure is *required* for decision theory in general to have correct answers, because otherwise you could construct a problem for any decision theory, no matter how ridiculous, which only that decision theory can win at.

I disagree. An agent can use any considerations whatsoever in making its decisions, and these considerations can refer to the world, or to its own algorithm, or to the way the world depends on the agent’s algorithm, or to the way the dependence of the world on the agent’s algorithm depends on the agent’s decision in a counterfactual world.

You can object that it’s not fair to pose before the agent problems that ask for recognition of facts outside some predefined class of non-ridiculous facts, but asking which situations we are allowed to present to an agent is the wrong perspective. It is wrong because making an agent with certain characteristics automatically determines the facts of its success or failure in *all* of the possible scenarios: fair, unfair, plausible, and ridiculous.

So the only consideration that is allowed to dictate which considerations we may ignore is the agent’s own preference. If the agent doesn’t care about the influence of some fact, then it can ignore it. Typically, we won’t be able to formally point out any class of facts to which the agent is guaranteed to be even in principle indifferent. And so decision theory must not be typed.

(You can see a certain analogy with not demanding a particular kind of proof. The agent is not allowed to reject my argument that a certain action is desirable, or undesirable, on the basis of the considerations I refer to not belonging to a particular privileged class, unless it really doesn’t care about those considerations or any of their logical consequences (according to the agent’s normative theory of inference). See also explicit dependence bias.)

I think you’ve misunderstood just what restrictions this type schema imposes on problems. Could you provide a specific example of something you think it excludes, that it shouldn’t?

The ASP problem described in the OP is just such an example. Are you perhaps not convinced that it represents an aspect of important real-world problems?

What’s World? What do you mean by “World: Strategy->Outcome”? The problem is that if World is a function, it’s given by some syntactic specification, and the agent, being part of the world, can control *which* function this syntactic specification refers to, or which value this function has for a particular argument, in a way unrelated to its type, to the passing of this Strategy thing as its argument. This is a typical example of explicit dependence bias: you declare that World (as a mathematical structure) depends on nothing, but it really does.

See example “world3” in my post for an example with Newcomb’s problem. There, World is given by a program “world3” that takes the agent’s action as a parameter, but also depends on the agent’s action implicitly, in a way that isn’t captured by the interface of the function. The function in fact shows that two-boxing dominates one-boxing, and one can only see that it’s really worse by recognizing that the function’s value, when given the agent’s actual action as a parameter, is controlled by the agent’s action in a way that makes one-boxing preferable. And so only seeing program “world3” as specifying a function is unhelpful.

It means “World is the type of function that takes a Strategy as input and returns an Outcome.”

In the linked article, you’ve defined world, world2, and world3 as things that aren’t actually functions; they have unbound references to the agent, which are parameters in disguise. You then show that if you mix parameters-as-unbound-references with real parameters, you can get confused into thinking they’re independent. Which just means you shouldn’t use unbound references.

How do you know that a given syntactic specification of a function doesn’t “use unbound references”? Presence of a logical dependence on agent’s action looks like something algorithmically impossible to detect. You can of course focus on a class of syntactic specifications that are known to not depend on agent’s action (for example, GLUTs), but this is too restrictive for a FAI-grade decision theory that can handle the actual world, and ignores the problem of whether the process that specified that GLUT could itself logically depend on agent’s action. (The very setup of a thought experiment could be controlled by an agent acting inside it, for example, for strange enough or sensitive-to-detail enough thought experiments.)

Give it as input to a compiler and see if it gives an error message or not. Or apply the same trivial procedure compilers use: read it and look for a definition for every symbol that is not itself the left hand side of a definition.

It doesn’t refer to any symbols, see in particular the difference between “world” and “world2″, and notice that “world3” doesn’t refer to “action()”, but instead to “action2()”, which you can assume to be a copy-paste of “action()”’s source code, with all the symbols renamed.

Ok, I think I see where our formalizations differ. In the formalization I’m using, the decision theory produces a strategy, which is a function that’s given as an argument to the world program. The world program invokes the strategy zero or more times, each time passing in some arguments that give whatever information is available to the agent at some point, and getting back a (real or predicted) decision. The world program is completely self-contained; other than through the argument it receives, it may not contain references to the agent’s choices at all. The strategy is similarly self-contained; it receives no information about the world except through the arguments the world program passes to it. Then separately from that, a “decision theory” is a function that takes a symbolic representation of a world program, and returns a symbolic representation of a strategy.

Ultimately, this amounts to a refactoring; results that hold in one system still hold in the other, if you map the definitions appropriately. However, I’ve found that structuring problems this way makes the theory easier to build on, and makes underspecified problems easier to notice.
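Under this formalization, a classic Newcomb setup might be sketched as follows (the type names and the example are my own illustration, not jimrandomh’s exact formalism):

```python
from typing import Callable

# Hedged sketch of the typed schema: a Strategy maps the information the
# world passes in to a decision; a world program maps a Strategy to a
# utility. The predictor lives inside the world program and may only
# invoke the Strategy, never the decision theory that produced it.

Strategy = Callable[[str], str]           # observation -> decision
WorldProgram = Callable[[Strategy], int]  # strategy -> outcome (utility)

def newcomb(strategy: Strategy) -> int:
    # The predictor queries the strategy to form its prediction...
    prediction = strategy("you face two boxes")
    big_box = 1_000_000 if prediction == "one-box" else 0
    # ...and the same strategy is invoked again for the real choice.
    choice = strategy("you face two boxes")
    return big_box if choice == "one-box" else big_box + 1_000

one_boxer: Strategy = lambda obs: "one-box"
two_boxer: Strategy = lambda obs: "two-box"
```

Here `one_boxer` receives 1,000,000 and `two_boxer` receives 1,000, and an ASP-style predictor cannot be expressed inside `newcomb` at all, since the strategy’s provenance (its decision theory and computational resources) is invisible to the world program.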

Can you formalize this requirement? If I copy agent’s code, rename all symbols, obfuscate it, simulate its execution in a source code interpreter that runs in a hardware emulator running on an emulated linux box running on javascript inside a browser running on Windows running on a hardware simulator implemented (and then obfuscated again) in the same language as the world program, and insert this thing in the world program (along with a few billion people and a planet and a universe), how can you possibly make sure that there is no dependence?

You don’t get to do that, because when you’re writing World, the Strategy hasn’t been determined yet. Think of it as a challenge-response protocol; World is a challenge, and Strategy is a response. You can still do agent-copying, but you have to enlarge the scope of World to include the rules by which that copying was done, or else you get unrelated agents instead of copies.

To copy the agent’s code, you don’t need to know its strategy. The world naturally changes if you change it, and the strategy might change as well if you run the agent in a changed world, but the agent’s code is still the same, and you know this code. The new world will only depend on the new strategy, not the old one, but now we have a world that depends on its agent’s strategy, and you won’t be able to find how it does if you don’t already know.

In any case, all this copying is irrelevant, because the point is that there can exist very convoluted worlds that depend on agent’s action, but it’s not feasible to know that they do or how they do. And we don’t get to choose the real world.

Yeah, in ASP the predictor looks inside the agent. But the problem still seems “fair” in a certain sense, because the predictor’s prediction is logically tied to the agent’s decision. A stupid agent that one-boxed no matter what would get rewarded, and a stupid agent that two-boxed would get punished. In other words, I’d still like a decision theory that does well on some suitably defined class of ASP-like problems, even if that class is wider than the class of “TDT-fair” problems that Eliezer envisioned. Of course we need a *lot* of progress to precisely define such classes of problems, too.

It would be useful to have a list of problems that TDT can handle, a list that current specifications of UDT can handle, and a list that are still in the grey area of not quite resolved. Among other things, that would make the difference between TDT and UDT far more intuitively clear!

The class of problems which are well-posed in this type system is exactly the class of problems that would not change if you gave the agent a chance to self-modify in between receiving the problem statement and the start of the world program. Problems outside this class are neither fair nor solvable in principle nor interesting.

I work under the assumption that the problem statement *is* a world program or a prior over world programs. Maybe it’s a bad assumption. Can you suggest a better one?

I use that same assumption, with only the slight caveat that I keep the world program and the preference function separate for clarity’s sake (you need both, but I often see them combined into one function).

The big difference, I think, is that the way I do it, the world program doesn’t get a decision theory directly as input; instead, the world program’s source is given to the decision theory, the decision theory outputs a strategy, and then the strategy is given to the world program as input. This is a better match for how we normally talk about decision theory problems, and prevents a lot of shenanigans.

Of course the world program shouldn’t get the decision theory as input! In the formulation I always use, the world program doesn’t have any inputs, it’s a computation with no arguments that returns a utility value. You live in the world, so the world program contains your decision theory as a subroutine :-)
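On this view the world program takes no arguments at all; the agent’s code simply appears as a subroutine wherever the agent occurs in the world, including inside the predictor. A minimal sketch (names are mine):

```python
def agent_action():
    # Stands in for the agent's entire decision computation.
    return "one-box"

def world() -> int:
    # Zero arguments: the agent's code is called both where the predictor
    # models the agent and where the agent actually acts.
    prediction = agent_action()
    big_box = 1_000_000 if prediction == "one-box" else 0
    return big_box if agent_action() == "one-box" else big_box + 1_000
```

Note that editing `agent_action` changes both call sites at once; that implicit dependence of the world on the agent’s code is exactly what the explicit Strategy-as-argument formulation hides.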

That’s very wrong. You’ve given up the distinctions between question and answer, and between answer and solution procedure. This also lets in stupid problems, like “you get a dollar iff the source code to your decision theory is prime”, which I would classify as not a decision theory problem at all.

Yeah, you can’t control arbitrary world programs that way, but remember that our world really is like that: a human mind runs within the same universe that it’s trying to control. One idea would be to define a class of “fair” world programs (perhaps those that can be factorized in appropriate ways) and care only about solving those. But I guess I’d better wait for your writeup, because I don’t understand what kind of formalism you prefer.

In this case, the outcome is the same for every action you take. A decision theory can still consider this problem, and rightfully conclude that it’s irrelevant which action it takes (or maybe it infers that jumping on the right foot makes it slightly more likely that the Dark Lords of the Matrix wrote your source code so that it’s prime). I don’t see how this is a problem.

But the world really is like that. It is essential to be able to handle this.

I don’t know of any important aspect of the world that can’t be captured in question-answer-procedure format. Could you give an example?

But who does the capturing? (To push the analogy one step up, I know of no solvable question that doesn’t admit an answer written on a sheet of paper.)

See the section on “utility functions” in this post. UDT/ADT seeks exactly to infer functional dependencies that can then be used in the usual manner (at least, this is one way to look at what’s going on). It is intended to solve the very problem which you argue must always be solvable.

My examples of explicit dependence bias show that there are wrong and right ways of parsing the outcome as depending on agent’s decision. If we are guaranteed to be given a right parsing, then that part is indeed covered, and there is no need to worry about the wrong ones. Believing that a right parsing merely exists doesn’t particularly help in finding one.

So I guess the problem you are having with UDT is that you assume that the problem of finding a correct explicit dependence of outcome on action is already solved, and we have a function World ready specified in a way that doesn’t in any way implicitly depend on agent’s actions. But UDT is intended to solve the problem while not making this assumption, and instead tries to find an unbiased dependence on its own. Since you assume the goal of UDT as a prerequisite in your thinking about decision theory, you don’t see the motivation for UDT, which indeed there would be none had we assumed this problem solved.

We’re talking past each other, and this back-and-forth conversation isn’t going anywhere because we’re starting from very different definitions. Let’s restart this conversation after I’ve finished the post that builds up the definitions from scratch.

At least address this concern, which suggests that our difficulty is probably an easy one of technical confusion, and not of communicating intuitive understanding of relevance of studying a certain question.

I’m confused on this point, would you mind checking if my thinking on it is correct?

My initial objection was that this seems to assume that the predictor doesn’t take anything into account, and that the agent was trying to predict what the predictor would do without trying to figure out what the predictor would predict the agent would choose.

Then I noticed that the predictor isn’t actually waiting for the agent to finish making its decision, it was using a higher level of representation of how the agent thinks. Taking this into account, the agent’s ability to simulate the predictor implicitly includes the ability to compute what the predictor predicts the agent will do.

So then I was confused about why this is still a problem. My intuition was banging its head against the wall insisting that the predictor still has to take into account the agent’s decision, and that the agent couldn’t model the predictor’s prediction as not contingent on the agent’s decision.

Then I noticed that the real issue isn’t any particular prediction that the agent thinks the predictor would make, so much as the fact that the agent sees this happening with probability one, and happening regardless of what it chooses to do. Since the agent already “knows” what the predictor will predict, it is free to choose to two-box, which always has higher utility once you can’t causally affect the boxes.

Your thinking seems to be sort of correct, but informal. I’m not sure if my proof in the second part of the post was parseable, but at least it’s an actual proof of the thing you want to know, so maybe you could try parsing it :-)

Thanks for the response, that was fast.

I can parse it, but I don’t really think that I understand it in a mathematical way.

A is a statement that makes sense to me, and I can see why the predictor needs to know that the agent’s proof system is consistent.

What I don’t get about it is why you specify that the predictor computes proofs up to length N, and then just say how the predictor will do its proof.

Basically, I have no formal mathematics education in fields that aren’t a direct prerequisite of basic multivariable calculus, and my informal mathematics education consists of Godel, Escher, Bach. And a wikipedia education in game theory.

Do you have any suggestions on how to get started in this area? Like, introductory sources or even just terms that I should look up on wikipedia or google?

If the outlined proof is less than N symbols long (which is true if N is large enough), the predictor will find it because it enumerates all proofs up to that length. Since the predictor’s proof system is consistent, it won’t find any other proofs contradicting this one.

The N < M is necessary to guarantee that the agent predicts the predictor’s proof, right?

What happens if the outlined proof is more than N symbols long?

Yeah. Actually, N must be exponentially smaller than M, so the agent’s proofs can completely simulate the predictor’s execution.

No idea. :-) Maybe the predictor will fail to prove anything, and fall back to filling only one box, I guess? Anyway, the outlined proof is quite short, so the problem already arises for not very large values of N.

What the agent tries to infer is the predictor’s behavior for each of the agent’s possible actions, not just the predictor’s unconditional behavior. Being able to infer the predictor’s decision without assuming the agent’s action is equivalent to concluding that the predictor’s decision is a constant function of the agent’s action (in the agent’s opinion, given the kind of decision-maker our agent is, which is something it should be able to control better, but the current version of the theory doesn’t support that).

On an extremely similar note, as I’ve argued here, I’m pretty sure that AIXI two-boxes on Newcomb’s problem.

Basically my argument is that AIXI can fairly trivially work out whether the box is full or empty, and hence it two-boxes due to causal assumptions implicit in the AIXI algorithm. Moreover, that line of reasoning is very easy for Omega to come across, and so Omega predicts two-boxing and the box ends up being empty, and then AIXI takes two boxes and ends up with $1000.

It can be argued that the “epistemic advantage” of the predictor over the agent is an unfair one. After all, if the agent had an equivalent axiom for predictor’s consistency, both would be inconsistent.

In the absence of this advantage, the predictor won’t be able to find a proof of agent’s action (if it is consistent).

Yes, that’s a valid reason to discount problems like ASP. It’s awful how much we don’t know...

Well, yes, but… is there a feeling at SI that these kinds of problems are relevant to AGI, friendly or no? Or is this just something generally interesting (maybe with the hope that these problems may point to something tangential to them, but which would turn out to be highly relevant)?

I mean, generally I would say that ideas connected to these approaches fall into the “symbolic AI” paradigm, which is supposed to be discredited by some seriously revered people, like Hofstadter...

Well, to make a guaranteed friendly AI you probably need to prove theorems about your AI design. And our universe most likely contains many copies of everything. So figuring out the right decision theory in the presence of copies seems to be a necessary step on the road to FAI. I don’t speak for SingInst here, this is just how I feel.

Wouldn’t this be a level mismatch in a multi-level AI architecture? Like, proving things about low-level neural computational substrate instead of about the conceptual level where actual cognition would take place, and where the actual friendliness would be defined? [and this level can’t be isomorphic to any formal logical system, except in symbolic AI, which doesn’t work]

And again, a conceptual-level understanding should do the trick. Like, knowing that I play PD against myself would be sufficient to cooperate. Besides, as EY frequently says, it’s really hard to find oneself in a true PD. Usually, it’s iterated PD, or some Schelling’s conflict game [BTW, huge thanks for recommending his book in one of your posts!]

If a multilevel architecture (whatever it is) makes provable friendliness impossible, then FAI can’t use it.

I imagine the future FAI as closer to AIXI, which works fine without multilevel architecture, than to the Lisp programs of the 70s.

AFAICT, the general architecture that EY advocates (-ed?) in “Levels of Organization in GI” is multilevel. But this doesn’t automatically mean that it’s impossible to prove anything about it. Maybe it’s possible, just not using the formal logic methods. [And so maybe getting not a 100% certainty, but 100-1e-N%, which should be sufficient for large enough N].

AIXI doesn’t “work” much more than the symbolic AI Lisp programs of the 70s did. I mean, the General Problem Solver would also be superintelligent given infinite computing power.

Eliezer says here:

To make the General Problem Solver or any other powerful computing device do anything interesting in the real world, you need to give it a formal description that contains the real world as a special case. You could use the universal prior, which gives you AIXI. Or you could use the yet-unspecified prior of UDT, which gives you the yet-unspecified UDT-AIXI.

The central difficulty of decision theory doesn’t get easier if you have lots of computing power. Imagine you’re playing the PD against someone. You both know each other’s source code, but you have a halting oracle and your opponent doesn’t. With so much power, what do you do? I simulate the other guy and… whoops, didn’t think this through. Looks like I must avoid looking at the result. Hmmm.

Oh… LOGI’s totally relinquished then? They should mark the paper as *completely obsolete* in the list of SI publications, otherwise it’s confusing :) I was under the impression I read some relatively recent text of Eliezer’s saying that prospective FAI researchers must thoroughly understand LOGI before moving on to the current, even more advanced, undisclosed architecture…

Yes, this is an interesting problem. And it appears to produce some neat metaphors. Like: maintain the illusion of free will by deliberately avoiding knowing your own decision in advance [or become crazy]. And avoid de-humanizing your opponents [or get defected against].

But does it remain relevant given limits on the computing power? [= assuming neither simulation nor any kind of formal proof is feasible]

That sounds weird. Can you find a link?

That seems to be a much stronger assumption than just limiting computing power. It can be broken by one player strategically weakening themselves, if they can benefit from being simulated.

This.

Are you sure this is possible? I tried to do this with the “impersonate other agents” strategy, but it does not seem to work if the opponent has your source code. The other agent knows you’re not actually them, just impersonating :)

There is a possibility to send out a different simple program instead of yourself (or fully self-modify into the said program, there is no difference), but it would be a wholly different problem (and easily solvable) from the original one.

Ouch, that text sounds painful, it’s probably about as old as LOGI.

Well, not quite that old, but yes, not very recent. The Internet Archive says the page was created at the end of 2009, but it was probably not done by EY himself. The earliest reference Google gives is from 2007...

So, you’re saying, now the party line is on single-level formal system-style architectures? But does it even make sense to try to define FAI-meaningful concepts in such architecture? Isn’t it like trying to define ‘love’, ‘freedom’, and ‘justice’ in terms of atoms?

I remember EY saying somewhere (can’t find where now) that AIXI design was very commendable in the sense that here finally is a full AGI design that can be clearly shown to kill you :)

Here is a 2003 reference to the original SL4 wiki post, which is still online but for some reason not indexed by Google.

I only know what the decision theory folks are doing, don’t know about the SingInst party line.

Formally defining “love” may be easier than you think. For example, Paul Christiano’s blog has some posts about using “pointers” to our world: take a long bitstring, like the text of Finnegans Wake, and tell the AI to influence whatever algorithm was most likely to produce that string under the universal prior. Also I have played with the idea of using UDT to increase the measure of specified bitstrings. Such ideas don’t require knowing correct physics down to the level of atoms, and I can easily imagine that we may find a formal way of pointing the AI at any human-recognizable idea without going through atoms.

Thanks for the reference! I skimmed over the blog, and wow! The amount of seriously considered weirdness is staggering :) (like: acausal counterfactual takeover by a simulating UFAI!). It is of huge entertainment value, of course, but… most of it appears to be conditioned on blatantly impossible premises, so it’s hard to take the concerns seriously. Maybe it’s lack of imagination on my part...

Regarding the solution of defining complex concepts via low-level inputs: as far as I understood the idea, you do not remove the multi-leveledness, you just let it be inferred internally by the AI and refuse to look at how it is done. Besides, it does not appear to solve the problem: metaphorically speaking, we are not really interested in getting the precise text (Finnegans Wake) down to its last typo, but in a probability measure over all possible texts, concentrated on texts that are “sufficiently similar”. In fact, what we are most interested in is defining this similarity, which is extremely intricate and non-trivial (it may include, for example, translations into other languages).

Your comment reminded me of a post I’ve long wanted to write. The idea is that examining assumptions is unproductive. The only way to make intellectual progress, either individually or as a group, is to stop arguing about assumptions and instead explore their implications wherever they might lead. The #1 reason why I took so long to understand Newcomb’s Problem or Counterfactual Mugging was my insistence on denying the assumptions behind these problems. Instead I should have said to myself, okay, is this direction of inquiry interesting when taken on its own terms?

Many assumptions seemed divorced from real life at first, e.g. people dismissed the study of electromagnetism as an impractical toy, and considered number theory hopelessly abstract until cryptography arrived. People seem unable to judge the usefulness of assumptions before exploring their implications in detail, but they absolutely love arguing about assumptions instead of getting anything done. There, thanks for encouraging me to write the first draft :-)

Absolutely, I agree of course. If a line of inquiry is interesting in itself and progress is being made, why not pursue it? My question was only about its direct relevance to FAI. Or, rather, whether the arguments I made to myself about its non-relevance can be easily refuted.

And, you know, questioning of assumptions can sometimes be useful too. From a false assumption anything follows :)

In any case, I’m glad to be of service, however small. Your posts are generally excellent.

Interesting. Do you see any current arguments over assumptions that we should stop (on LW or elsewhere)?

Hmm, looks like I sometimes attack people for starting from (what I consider) wrong assumptions. Maybe I should rethink my position on those issues.

It seems to me that this problem assumes that the predictor both does and does not predict correctly.

When determining the predictor’s actions, we assume that it foresees the agent’s two-boxing.

When determining the agent’s actions, we assume that the simulated predictor behaves the same regardless of the agent’s decision.

The question thus seems to contradict itself.

In this problem the predictor predicts correctly. Can you explain why you think it predicts incorrectly?

In trying to explain why the simulation ought to show that the prediction outcome is in fact contingent, I realized that I was confused, so I’m going to disregard what I previously tried to think, and start over.

The results are messy and full of wrong ideas; I suggest skimming to get the gist of it. That is: the following text is a noisy signal of what I’m trying to think, so don’t read the details too closely.

--

I may have to reconsider whether I properly grokked the sequence on free will and determinism.

To the extent that the agent has not yet made up its mind as to whether it will one-box or two-box, the simulation ought to reflect that.

The effect of possessing the simulation should be that the agent has (1) unusually high confidence in the predictor’s accuracy, and (2) exceptional luminosity, which ought to be assumed of decision-theoretic agents anyway.

Being luminous about my decision-making process does not mean that I have made up my mind. Similarly, being able to run a program that contains a highly-accurate high-level description of myself does not mean that I already know what decision I’m going to eventually make.

All this adds up to circumstances that, if anything, more strongly support one-boxing than vanilla Newcomb does.

--

Alternatively, if we accept as given that the model’s mind is made up when the agent runs the simulation, we must equally infer that the agent’s decision is likewise determined. From the agent’s point of view, this means that its key choice occurred earlier, in which case we can just ask about its decision back then instead of now, and we again have a standard Newcomb/Parfit situation.

--

In short, two-boxing requires the agent to be stupid about causality and determinism, so the question is fundamentally nothing new.

I don’t even know a fraction of the math you people know, but this problem seems obvious to me: One-box fully expecting the box to be empty (or something impossible to happen).

More generally: if expecting A implies B, expecting B implies A or C, expecting C implies C, and U(B) > U(C) > U(A), then expect A, unless the cost of “being wrong” is larger than the difference. (Here A is one-boxing an empty box, C is two-boxing with one box empty, and B is one-boxing a full box; in this case B → C would go via D, expecting to two-box with both boxes full, which is not implied by any expectation and is therefore unreachable.)

In the model, I suppose the agent would prove that the utility of the various actions depends on whether the agent always chooses the action for which the greatest utility was proven, go into the branch that makes an exception here, and have this correctly predicted by the predictor.

Is there something important I’m missing?

Another way to phrase it is that this problem is isomorphic to the transparent box Newcomb problem, and you are trying to find a formalized decision theory that will one-box the empty box “knowing” it is empty. (Just that instead of “knowing” through updating on seeing the empty box, which UDT refuses to do, there is an equivalent trick with the dependence.) The only way you can do that is if you either don’t actually try to maximize the money at that point or expect a contradiction. Not trying to maximize the money in such situations probably is easier to deal with.

What happens if the UDT agent generates a proof that using any proof longer than N results in only $1000? Is that level of self-reference allowed?

Yes, but current versions of the decision theory can’t respond to this conclusion of the meta-proof by not generating the proof, and really the problem is not that the agent generates the proof, but that it uses it. It could generate the proof and then ignore the result (based on the meta-proof), which would result in success, but this level of decision rules is not currently supported.

I was thinking that if proofs are allowed to use their own length as a premise, then the expected payoff from each proof would depend on the length of that proof. Both agent and predictor can prove that any proof which is too long results in $1000 or $0, so the agent would choose a shorter proof that gives $1000000.

I have no idea whether that’s correct, though.

I’d really like to have a decision theory that would do such things automatically when the situation calls for it. Unfortunately, the one in my formalization doesn’t :-(

If your proof works, I would expect that Omega also knows the agent is consistent and can follow the same logic, and so the UDT agent two-boxes on Newcomb’s problem. Unless you use a version of UDT that (effectively) optimizes over decisions rather than actions (like TDT), which would solve both problems.

EDIT: On solving both problems: my understanding of UDT comes from AlephNeil’s post. If you look at his “generalization 2,” it is exactly what I mean by a problem where you need to optimize over decisions rather than actions—and he claims that a UDT agent does so 5 diagrams later—that is, treats the action as also “controlling telepathic robots.”

So going by that understanding of UDT, cousinIt’s proof is incorrect. If we can find proofs shorter than N, we can treat the predictor as a “robot,” and so two-boxing is regarded as worse than one-boxing if it “controls the robot” into not filling the box. So to predict the agent’s action we probably need to fully define the problem—what does the predictor do in cases with no proofs and in cases where the agent’s action depends on the predictor’s?

Omega knows everything but unfortunately he isn’t available right now. We are stuck with a far more limited predictor.

The reference to Omega comes from contrasting this post with prior claims that a UDT agent will one-box on Newcomb’s problem.

This problem is underspecified, in that it does not say what happens if the predictor fails to find a proof either way. Which is exactly what will happen if the agent simulates the predictor.

Do you understand the second part of the post where I gave concrete implementations of the predictor and the agent that simulates it? In that case the predictor successfully finds a proof that the agent two-boxes.
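The structure of that concrete version can be caricatured in a few lines (my own toy sketch, with hypothetical names; the real formalization uses bounded proof search, not string tags). The predictor reasons abstractly about the agent’s decision theory, the agent simulates the predictor exactly and treats the prediction as a settled fact, and the payoffs show why a hardcoded one-boxer does better:

```python
# Toy model of the ASP problem. The predictor does no detailed
# simulation; it reasons from a high-level description of the agent.

def predictor(agent_kind):
    if agent_kind == "UDT-with-full-simulation":
        # "Such an agent treats my prediction as fixed and two-boxes."
        return "two-box"   # so the big box is left empty
    if agent_kind == "always-one-box":
        return "one-box"   # so the big box is filled
    return "two-box"

def payoff(action, prediction):
    big_box = 1_000_000 if prediction == "one-box" else 0
    return big_box + (1000 if action == "two-box" else 0)

def udt_agent():
    # The agent simulates the predictor exactly, sees the prediction
    # is a deducible fact, and grabs the extra $1000 by two-boxing.
    prediction = predictor("UDT-with-full-simulation")
    return "two-box", prediction

action, prediction = udt_agent()
print(payoff(action, prediction))                      # 1000
print(payoff("one-box", predictor("always-one-box")))  # 1000000
```

The point of the problem, as stated in the post, is that the agent would be better off running the less powerful decision procedure that the predictor can deduce one-boxes.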

UDT may two-box in the above scenario if it simulates the predictor once, but what if the UDT agent simulates the predictor twice: simulating itself using the above reasoning and two-boxing in simulation 1, and simulating itself one-boxing for whatever reason in simulation 2? The UDT agent that one-boxes “for whatever reason” does better, and thus the real UDT agent will realize this upon running these two simulations and one-box, which the predictor will reason that it would.