What are the biggest issues that haven’t been solved for UDT or FDT?
UDT was a fairly simple and workable idea in classical Bayesian settings with logical omniscience (or with some simple logical uncertainty treated as if it were empirical uncertainty), but it was always intended to utilize logical uncertainty at its core. Logical induction, our current-best theory of logical uncertainty, doesn’t turn out to work very well with UDT so far. The basic problem seems to be that UDT required “updates” to be represented in a fairly explicit way: you have a prior which already contains all the potential things you can learn, and an update is just selecting certain possibilities. Logical induction, in contrast, starts out “really ignorant” and adds structure, not just content, to its beliefs over time. Optimizing via the early beliefs doesn’t look like a very good option, as a result.
FDT requires a notion of logical causality, which hasn’t appeared yet.
What is a co-ordination problem that hasn’t been solved?
Taking logical uncertainty into account, all games become iterated games in a significant sense, because players can reason about each other by looking at what happens in very close situations. If the players have T seconds to think, they can simulate the same game but given t<<T time to think, for many t. So, they can learn from the sequence of “smaller” games.
This might seem like a good thing. For example, the single-shot prisoner’s dilemma has only one Nash equilibrium: mutual defection. Iterated play has cooperative equilibria, such as tit-for-tat.
Unfortunately, the folk theorem of game theory implies that there are a whole lot of fairly bad equilibria for iterated games as well. It is possible that each player enforces a cooperative equilibrium via tit-for-tat-like strategies. However, it is just as possible for players to end up in a mutual blackmail double bind, as follows:
Both players initially have some suspicion that the other player is following strategy X: “cooperate 1% of the time if and only if the other player is playing consistently with strategy X; otherwise, defect 100% of the time.” As a result of this suspicion, both players play via strategy X in order to get the 1% cooperation rather than 0%.
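To make the trap concrete, here is a small sketch with hypothetical prisoner’s-dilemma payoffs (R=3 for mutual cooperation, P=1 for mutual defection, T=5 for temptation, S=0 for the sucker’s payoff; the numbers are mine, not from the original):

```python
# Toy check that the mutual-blackmail strategy X is self-enforcing.
# Hypothetical payoffs: R (both cooperate), P (both defect), T (temptation), S (sucker).
R, P, T, S = 3, 1, 5, 0

def expected_payoff(p_me, p_them):
    """My expected per-round payoff if I cooperate with probability p_me
    and the other player cooperates with probability p_them."""
    return (p_me * p_them * R
            + p_me * (1 - p_them) * S
            + (1 - p_me) * p_them * T
            + (1 - p_me) * (1 - p_them) * P)

# If both players follow strategy X, each cooperates 1% of the time.
both_follow_x = expected_payoff(0.01, 0.01)
# If I deviate, the other player (following X) defects 100% of the time.
deviate = expected_payoff(0.0, 0.0)

print(both_follow_x, deviate)  # ~1.03 vs 1.0: following X is (barely) better
```

Each player is stuck squeezing out a sliver of value above mutual defection, even though far better equilibria exist.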
Ridiculously bad “coordination” like that can be avoided via cooperative oracles, but that requires everyone to somehow have access to such a thing. Distributed oracles are more realistic in that each player can compute them just by reasoning about the others, but players using distributed oracles can be exploited.
So, how do you avoid supremely bad coordination in a way which isn’t too badly exploitable?
And what still isn’t known about counterfactuals?
The problem of specifying good counterfactuals sort of wraps up any and all other problems of decision theory into itself, which makes this a bit hard to answer. Different potential decision theories may lean more or less heavily on the counterfactuals. If you lean toward EDT-like decision theories, the problem with counterfactuals is mostly just the problem of making UDT-like solutions work. For CDT-like decision theories, it is the other way around; the problem of getting UDT to work is mostly about getting the right counterfactuals!
The mutual-blackmail problem I mentioned in my “coordination” answer is a good motivating example. How do you ensure that the agents don’t come to think “I have to play strategy X, because if I don’t, the other player will cooperate 0% of the time?”
I saw an earlier draft of this, and hope to write an extensive response at some point. For now, the short version:
As I understand it, FDT was intended as an umbrella term for MIRI-style decision theories, which illustrated the critical points without making too many commitments. So, the vagueness of FDT was partly by design.
I think UDT is a more concrete illustration of the most important points relevant to this discussion.
The optimality notion of UDT is clear. “UDT gets the most utility” means “UDT gets the highest expected value with respect to its own prior”. This seems quite well-defined, hopefully addressing your (VII).
There are problems applying UDT to realistic situations, but UDT makes perfect sense and is optimal in a straightforward sense for the case of single-player extensive form games. That doesn’t address multi-player games or logical uncertainty, but it is enough for much of Will’s discussion.
FDT focused on the weird logical cases, which is in fact a major part of the motivation for MIRI-style decision theory. However, UDT for single-player extensive-form games actually gets at a lot of what MIRI-style decision theory wants, without broaching the topic of logical counterfactuals or proving-your-own-action directly.
The problems which create a deep indeterminacy seem, to me, to be problems for other decision theories than FDT as well. FDT is trying to face them head-on. But there are big problems for applying EDT to agents who are physically instantiated as computer programs and can prove too much about their own actions.
This also hopefully clarifies the sense in which I don’t think the decisions pointed out in (III) are bizarre. The decisions are optimal according to the very probability distribution used to define the decision problem.
There’s a subtle point here, though, since Will describes the decision problem from an updated perspective: you already know the bomb is in front of you. So UDT “changes the problem” by evaluating “according to the prior”. From my perspective, because the very statement of the Bomb problem suggests that there were also other possible outcomes, we can rightly insist on evaluating expected utility in terms of those chances.
Perhaps this sounds like an unprincipled rejection of the Bomb problem as you state it. My principle is as follows: you should not state a decision problem without having in mind a well-specified way to predictably put agents into that scenario. Let’s call the way-you-put-agents-into-the-scenario the “construction”. We then evaluate agents on how well they deal with the construction.
For examples like Bomb, the construction gives us the overall probability distribution—this is then used for the expected value which UDT’s optimality notion is stated in terms of.
For other examples, as discussed in Decisions are for making bad outcomes inconsistent, the construction simply breaks when you try to put certain decision theories into it. This can also be a good thing; it means the decision theory makes certain scenarios altogether impossible.
The point about “constructions” is possibly a bit subtle (and hastily made); maybe a lot of the disagreement will turn out to be there. But I do hope that the basic idea of UDT’s optimality criterion is actually clear—“evaluate expected utility of policies according to the prior”—and clarifies the situation with FDT as well.
I didn’t like this post. At the time, I didn’t engage with it very much. I wrote a mildly critical comment (which is currently the top-voted comment, somewhat to my surprise) but I didn’t actually engage with the idea very much. So it seems like a good idea to say something now.
The main argument that this is valuable seems to be: this captures a common crux in AI safety. I don’t think it’s my crux, and I think other people who think it is their crux are probably mistaken. So from my perspective it’s a straw-man of the view it’s trying to point at.
The main problem is the word “realism”. It isn’t clear exactly what it means, but I suspect that being really anti-realist about rationality would not shift my views about the importance of MIRI-style research that much.
I agree that there’s something kind of like rationality realism. I just don’t think this post successfully points at it.
Ricraz starts out with the list: momentum, evolutionary fitness, intelligence. He says that the question (of rationality realism) is whether intelligence is more like momentum or more like fitness. Momentum is highly formalizable. Fitness is a useful abstraction, but no one can write down the fitness function for a given organism. If pressed, we have to admit that it does not exist: every individual organism has what amounts to its own different environment, since it has different starting conditions (nearer to different food sources, etc), and so, is selected on different criteria.
So as I understand it, the claim is that the MIRI cluster believes rationality is more like momentum, but many outside the MIRI cluster believe it’s more like fitness.
It seems to me like my position, and the MIRI-cluster position, is (1) closer to “rationality is like fitness” than “rationality is like momentum”, and (2) doesn’t depend that much on the difference. Realism about rationality is important to the theory of rationality (we should know what kind of theoretical object rationality is), but not so important for the question of whether we need to know about rationality. (This also seems supported by the analogy—evolutionary biologists still see fitness as a very important subject, and don’t seem to care that much about exactly how real the abstraction is.)
To the extent that this post has made a lot of people think that rationality realism is an important crux, it’s quite plausible to me that it’s made the discussion worse.
To expand more on (1), since it seems a lot of people found its negation plausible: it seems like if there’s an analogue of the theory of evolution which uses relatively unreal concepts like “fitness” to help us understand rational agency, we’d like to know about it. In this view, the MIRI cluster is essentially saying “biologists should want to invent evolution. Look at all the similarities across different animals. Don’t you want to explain that?” Whereas the non-MIRI cluster is saying “biologists don’t need to know about evolution.”
Rationality realism seems like a good thing to point out which might be a crux for a lot of people, but it doesn’t seem to be a crux for me.
I don’t think there’s a true rationality out there in the world, or a true decision theory out there in the world, or even a true notion of intelligence out there in the world. I work on agent foundations because there’s still something I’m confused about even after that, and furthermore, AI safety work seems fairly hopeless while still so radically confused about the-phenomena-which-we-use-intelligence-and-rationality-and-agency-and-decision-theory-to-describe. And, as you say, “from a historical point of view I’m quite optimistic about using maths to describe things in general”.
Here are some (very lightly edited) comments I left on Will’s draft of this post. (See also my top-level response.)
Responses to Sections II and III:
I’m not claiming that it’s clear what this means. E.g. see here, second bullet point, arguing there can be no such probability function, because any probability function requires certainty in logical facts and all their entailments.
This point shows the intertwining of logical counterfactuals (counterpossibles) and logical uncertainty. I take logical induction to represent significant progress in generalizing probability theory to the case of logical uncertainty: it produces objects which have many of the virtues of probability functions while not requiring certainty about the entailments of known facts. So, we can substantially reply to this objection.
However, replying to this objection does not necessarily mean we can define logical counterfactuals as we would want. So far we have only been able to use logical induction to specify a kind of “logically uncertain evidential conditional” (that is, something closer in spirit to EDT, which does behave more like FDT in some problems, but not in general).
I want to emphasize that I agree that specifying what logical counterfactuals are is a grave difficulty, so grave as to seem (to me, at present) to be damning, provided one can avoid the difficulty in some other approach. However, I don’t actually think that the difficulty can be avoided in any other approach! I think CDT ultimately has to grapple with the question as well, because physics is math, and so physical counterfactuals are ultimately mathematical counterfactuals. Even EDT has to grapple with this problem, ultimately, due to the need to handle cases where one’s own action can be logically known. (Or provide a convincing argument that such cases cannot arise, even for an agent which is computable.)
Guaranteed Payoffs: In conditions of certainty — that is, when the decision-maker has no uncertainty about what state of nature she is in, and no uncertainty about what the utility payoff of each action is — the decision-maker should choose the action that maximises utility.
(Obligatory remark that what maximizes utility is part of what’s at issue here, and for precisely this reason, an FDTist could respond that it’s CDT and EDT which fail in the Bomb example—by failing to maximize the a priori expected utility of the action taken.)
FDT would disagree with this principle in general, since full certainty implies certainty about one’s action, and the utility to be received, as well as everything else. However, I think we can set that aside and say there’s a version of FDT which would agree with this principle in terms of prior uncertainty. It seems cases like Bomb cannot be set up without either invoking prior uncertainty (taking the form of the predictor’s failure rate) or bringing the question of how to deal with logically impossible decisions to the forefront (if we consider the case of a perfect predictor).
Why should prior uncertainty be important, in cases of posterior certainty? Because of the prior-optimality notion (in which a decision theory is judged on a decision problem based on the utility received in expectation according to the prior probability which defines the decision problem).
Prior-optimality considers the guaranteed-payoff objection to be very similar to objecting to a gambling strategy by pointing out that the gambling strategy sometimes loses. In Bomb, the problem clearly stipulates that an agent who follows the FDT recommendation has a trillion trillion to one odds of doing better than an agent who follows the CDT/EDT recommendation. Complaining about the one-in-a-trillion-trillion chance that you get the bomb while being the sort of agent who takes the bomb is, to an FDT-theorist, like a gambler who has just lost a trillion-trillion-to-one bet complaining that the bet doesn’t look so rational now that the outcome is known with certainty to be the one-in-a-trillion-trillion case where the bet didn’t pay well.
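As a rough sketch of that comparison (with a made-up dollar-equivalent disutility $c$ for burning to death; $c$ is my assumption, not part of the problem statement):

$$
\begin{aligned}
\mathbb{E}[U \mid \text{policy: take Left}] &\approx (1 - 10^{-24})\cdot 0 \;+\; 10^{-24}\cdot(-c),\\
\mathbb{E}[U \mid \text{policy: take Right}] &\approx -100.
\end{aligned}
$$

According to the prior, the Left policy wins unless $c$ exceeds roughly $10^{26}$; that is the gambler’s-odds point above.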
The right action, according to FDT, is to take Left, in the full knowledge that as a result you will slowly burn to death. Why? Because, using Y&S’s counterfactuals, if your algorithm were to output ‘Left’, then it would also have outputted ‘Left’ when the predictor made the simulation of you, and there would be no bomb in the box, and you could save yourself $100 by taking Left.
And why, on your account, is this implausible? To my eye, this is right there in the decision problem, not a weird counterintuitive consequence of FDT: the decision problem stipulates that algorithms which output ‘left’ will not end up in the situation of taking a bomb, with very, very high probability.
Again, complaining that you now know with certainty that you’re in the unlucky position of seeing the bomb seems irrelevant in the way that a gambler complaining that they now know how the dice fell seems irrelevant—it’s still best to gamble according to the odds, taking the option which gives the best chance of success.
(But what I most want to convey here is that there is a coherent sense in which FDT does the optimal thing, whether or not one agrees with it.)
One way of thinking about this is to say that the FDT notion of “decision problem” is different from the CDT or EDT notion, in that FDT considers the prior to be of primary importance, whereas CDT and EDT consider it to be of no importance. If you had instead specified ‘bomb’ with just the certain information that ‘left’ is (causally and evidentially) very bad and ‘right’ is much less bad, then CDT and EDT would regard it as precisely the same decision problem, whereas FDT would consider it to be a radically different decision problem.
Another way to think about this is to say that FDT “rejects” decision problems which are improbable according to their own specification. In cases like Bomb where the situation as described is by its own description a one in a trillion trillion chance of occurring, FDT gives the outcome only one-trillion-trillion-th consideration in the expected utility calculation, when deciding on a strategy.
Also, I note that this analysis (on the part of FDT) does not hinge in this case on exotic counterfactuals. If you set Bomb up in the Savage framework, you would be forced to either give only the certain choice between bomb and not-bomb (so you don’t represent the interesting part of the problem, involving the predictor) or to give the decision in terms of the prior, in which case the Savage framework would endorse the FDT recommendation.
Another framework in which we could arrive at the same analysis would be that of single-player extensive-form games, in which the FDT recommendation corresponds to the simple notion of optimal strategy, whereas the CDT recommendation amounts to the stipulation of subgame-optimality.
This may be the most horrifying thing I have ever read.
I’m amused that this sentence is likely the highest praise for my writing I’ve ever received.
I note that Alkjash’s post
had a structured model with gears
told me something about why the world is the way it is
provided mental techniques to counter a problem
I don’t think this post did any of these things. At least I didn’t extract them if they were there.
I’m not saying the message here is wrong or that a post like this couldn’t provide those three things. I just think this post didn’t achieve that.
In what way is pain the unit of effort?
What are people missing about the world when they don’t see this?
What TAPs can we implement in light of these things?
We should really be calling it Rabbit Hunt rather than Stag Hunt.
The Schelling choice is rabbit. Calling it Stag Hunt makes the stag sound Schelling.
The problem with Stag Hunt is that the Schelling choice is rabbit. Saying of a situation “it’s a stag hunt” generally means that the situation sucks because everyone is hunting rabbit. When everyone is hunting stag, you don’t really bring it up. So, it would make way more sense if the phrase were “it’s a rabbit hunt”!
Well, maybe you’d say “it’s a rabbit hunt” when referring to the bad equilibrium you’re seeing in practice, and “it’s a stag hunt” when saying that a better equilibrium is a utopian dream.
So, yeah, calling the game “rabbit hunt” is a stag hunt.
I used to think a lot in terms of Prisoner’s Dilemma, and “Cooperate”/”Defect.” I’d see problems that could easily be solved if everyone just put a bit of effort in, which would benefit everyone. And people didn’t put the effort in, and this felt like a frustrating, obvious coordination failure. Why do people defect so much?
Eventually Duncan shifted towards using Stag Hunt rather than Prisoner’s Dilemma as the model here. If you haven’t read it before, it’s worth reading the description in full. If you’re familiar you can skip to my current thoughts below.
In the book The Stag Hunt, Skyrms similarly says that lots of people use Prisoner’s Dilemma to talk about social coordination, and he thinks people should often use Stag Hunt instead.
I think this is right. Most problems which initially seem like Prisoner’s Dilemma are actually Stag Hunt, because there are potential enforcement mechanisms available. The problems discussed in Meditations on Moloch are mostly Stag Hunt problems, not Prisoner’s Dilemma problems -- Scott even talks about enforcement, when he describes the dystopia where everyone has to kill anyone who doesn’t enforce the terrible social norms (including the norm of enforcing).
This might initially sound like good news. Defection in Prisoner’s Dilemma is an inevitable conclusion under common decision-theoretic assumptions. Trying to escape multipolar traps with exotic decision theories might seem hopeless. On the other hand, rabbit in Stag Hunt is not an inevitable conclusion, by any means.
Unfortunately, in reality, hunting stag is actually quite difficult. (“The schelling choice is Rabbit, not Stag… and that really sucks!”)
Rabbit in this case was “everyone just sort of pursues whatever conversational types seem best to them in an uncoordinated fashion”, and Stag is “we deliberately choose and enforce particular conversational norms.”
This sounds a lot like Pavlov-style coordination vs Tit for Tat style coordination. Both strategies can defeat Moloch in theory, but they have different pros and cons. TfT-style requires agreement on norms, whereas Pavlov-style doesn’t. Pavlov-style can waste a lot of time flailing around before eventually coordinating. Pavlov is somewhat worse at punishing exploitative behavior, but less likely to lose a lot of utility due to feuds between parties who each think they’ve been wronged and must distribute justice.
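For concreteness, here is a minimal sketch of the two strategies in an iterated prisoner’s dilemma (my own toy encoding; the post doesn’t spell them out this way):

```python
C, D = "C", "D"

def tit_for_tat(my_history, their_history):
    """Cooperate first, then copy the opponent's previous move."""
    return C if not their_history else their_history[-1]

def pavlov(my_history, their_history):
    """Win-stay, lose-shift: repeat my last move after a good outcome
    (opponent cooperated), switch after a bad one (opponent defected)."""
    if not my_history:
        return C
    if their_history[-1] == C:
        return my_history[-1]
    return D if my_history[-1] == C else C

def play(strategy_a, strategy_b, rounds=10):
    """Play two strategies against each other and return both move histories."""
    hist_a, hist_b = [], []
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        hist_a.append(move_a)
        hist_b.append(move_b)
    return hist_a, hist_b
```

One way the feud point shows up: after a single accidental defection, two tit-for-tat players fall into an alternating retaliation cycle, while two Pavlov players defect once in unison and then return to mutual cooperation.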
When discussing whether to embark on a stag hunt, it’s useful to have shorthand to communicate why you might ever want to put a lot of effort into a concerted, coordinated effort. And then you can discuss the tradeoffs seriously.
Much of the time, I feel like getting angry and frustrated… is something like “wasted motion” or “the wrong step in the dance.”
Not really strongly contradicting you, but I remember Critch once outlined something like the following steps for getting out of bad equilibria. (This is almost definitely not the exact list of steps he gave; I think there were 3 instead of 4 -- but step #1 was definitely in there.)
1. Be the sort of person who can get frustrated at inefficiencies.
2. Observe the world a bunch. Get really curious about the ins and outs of the frustrating inefficiencies you notice; understand how the system works, and why the inefficiencies exist.
3. Make a detailed plan for a better equilibrium. Justify why it is better, and why it is worth the effort/resources to do this. Spend time talking to the interested parties to get feedback on this plan.
4. Finally, formally propose the plan for approval. This could mean submitting a grant proposal to a relevant funding organization, or putting something up for a vote, or other things. This is the step where you are really trying to step into the better equilibrium, which means getting credible backing for taking the step (perhaps a letter signed by a bunch of people, or a formal vote), and creating common knowledge between relevant parties (making sure everyone can trust that the new equilibrium is established). It can also mean some kind of official deliberation has to happen, depending on context (such as a vote, or some kind of due-diligence investigation, or an external audit, etc).
Replying to one of Will’s edits on account of my comments to the earlier draft:
Finally, in a comment on a draft of this note, Abram Demski said that: “The notion of expected utility for which FDT is supposed to do well (at least, according to me) is expected utility with respect to the prior for the decision problem under consideration.” If that’s correct, it’s striking that this criterion isn’t mentioned in the paper. But it also doesn’t seem compelling as a principle by which to evaluate between decision theories, nor does it seem FDT even does well by it. To see both points: suppose I’m choosing between an avocado sandwich and a hummus sandwich, and my prior was that I prefer avocado, but I’ve since tasted them both and gotten evidence that I prefer hummus. The choice that does best in terms of expected utility with respect to my prior for the decision problem under consideration is the avocado sandwich (and FDT, as I understood it in the paper, would agree). But, uncontroversially, I should choose the hummus sandwich, because I prefer hummus to avocado.
Yeah, the thing is, the FDT paper focused on examples where “expected utility according to the prior” becomes an unclear notion due to logical uncertainty issues. It wouldn’t have made sense for the FDT paper to lean on that criterion, given the desire to put the most difficult issues into focus. However, FDT is supposed to accomplish similar things to UDT, and UDT provides the more concrete illustration.
The policy that does best in expected utility according to the prior is the policy of taking whatever you like. In games of partial information, decisions are defined as functions of information states; and in the situation as described, there are separate information states for liking hummus and liking avocado. Choosing the one you like achieves a higher expected utility according to the prior, in comparison to just choosing avocado no matter what. In this situation, optimizing the decision in this way is equivalent to updating on the information; but this is not always so (as in Transparent Newcomb, Bomb, and other such problems).
To re-state that a different way: in a given information state, UDT is choosing what to do as a function of the information available, and judging the utility of that choice according to the prior. So, in this scenario, we judge the expected utility of selecting avocado in response to liking hummus. This is worse (according to the prior!) than selecting hummus in response to liking hummus.
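To spell that out with toy numbers (the prior probabilities and utilities below are my own illustrative choices):

```python
# Hypothetical prior over which sandwich you turn out to prefer.
prior = {"likes_avocado": 0.6, "likes_hummus": 0.4}

# Utility of each (true preference, chosen sandwich) pair.
utility = {
    ("likes_avocado", "avocado"): 1, ("likes_avocado", "hummus"): 0,
    ("likes_hummus", "avocado"): 0, ("likes_hummus", "hummus"): 1,
}

def prior_expected_utility(policy):
    """Score a policy (a map from information state to sandwich) by the prior."""
    return sum(p * utility[(state, policy(state))] for state, p in prior.items())

def always_avocado(state):
    return "avocado"

def pick_what_you_like(state):
    return "avocado" if state == "likes_avocado" else "hummus"

print(prior_expected_utility(always_avocado))      # 0.6
print(prior_expected_utility(pick_what_you_like))  # 1.0 -- better according to the prior
```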
When you think about the problem this way, there are no counterfactuals, only state evolution. It can be applied to the past, to the present or to the future.
This doesn’t give very useful answers when the state evolution is nearly deterministic, as it is for an agent made of computer code.
For example, consider an agent trying to decide whether to turn left or turn right. Suppose for the sake of argument that it actually turns left, if you run physics forward. Also suppose that the logical uncertainty has figured that out, so that the best-estimate macrostate probabilities are mostly on that. Now, the agent considers whether to turn left or right.
Since the computation (as pure math) is deterministic, counterfactuals which result from supposing the state evolution went right instead of left mostly consist of computer glitches in which the hardware failed. This doesn’t seem like what the agent should be thinking about when it considers the alternative of going right instead of left. For example, the grocery store it is trying to get to could be on the right-hand path. The potential bad results of a hardware failure might outweigh the desire to turn toward the grocery store, so that the agent prefers to turn left.
For this story to make sense, the (logical) confidence that the abstract algorithm decides to turn left in this case has to be higher than the confidence that the hardware will not fail, so that turning right seems likely to imply hardware failure. This can happen due to Löb’s theorem: the whole above argument, as a hypothetical argument, suggests that the agent would turn left on a particular occasion if it happened to prove ahead of time that its abstract algorithm would turn left (since it would then be certain that turning right implied a hardware failure). But this means a proof of left-turning results in left-turning. So, by Löb’s theorem, left-turning is indeed provable.
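A minimal sketch of that Löbian step, writing $\mathrm{Left}$ for “the abstract algorithm outputs left” and $\Box$ for provability in the agent’s proof system: the hypothetical argument establishes the premise, and Löb’s theorem gives the conclusion.

$$
\vdash \Box\,\mathrm{Left} \rightarrow \mathrm{Left}
\quad\Longrightarrow\quad
\vdash \mathrm{Left}
$$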
The Newcomb’s-problem example you give also seems problematic. Again, if the agent’s algorithm is deterministic, it does basically one thing as long as the initial conditions are such that it is in Newcomb’s problem. So, essentially all of the uncertainty about the agent’s action is logical uncertainty. I’m not sure exactly what your intended notion of counterfactual is, but, I don’t see how reasoning about microstates helps the agent here.
I agree with the broad outline of your points, but I find many of the details incongruous or poorly stated. Some of this is just a general dislike of predictive processing, but assuming a predictive processing model, I don’t see why your further comments follow.
I don’t claim to understand predictive processing fully, but I read the SSC post you linked, and looked at some other sources. It doesn’t seem to me like predictive processing struggles to model goal-oriented behavior. A PP agent doesn’t try to hide in the dark all the time to make the world as easy to predict as possible, and it also doesn’t only do what it has learned to expect itself to do regardless of what leads to pleasure. My understanding is that this depends on details of the notion of free energy.
So, although I agree that there are serious problems with taking an agent and inferring its values, it isn’t clear to me that PP points to new problems of this kind. Jeffrey-Bolker rotation already illustrates that there’s a large problem within a very standard expected utility framework.
The point about viewing humans as multi-agent systems, which don’t behave like single-agent systems in general, also doesn’t seem best made within a PP framework. Friston’s claim (as I understand it) is that clumps of matter will under very general conditions eventually evolve to minimize free energy, behaving as agents. If clumps of dead matter can do it, I guess he would say that multi-agent systems can do it. Aside from that, PP clearly makes the claim that systems running on a currency of prediction error (as you put it) act like agents.
Again, this point seems fine to make outside of PP, it just seems like a non-sequitur in a PP context.
I also found the options given in the “what are we aligning with” section confusing. I was expecting to see a familiar litany of options (like aligning with system 1 vs system 2, revealed preferences vs explicitly stated preferences, etc). But I don’t know what “aligning with the output of the generative models” means—it seems to suggest aligning with a probability distribution rather than with preferences. Maybe you mean imitation learning, like what inverse reinforcement learning does? This is supported by the way you immediately contrast with CIRL in #2. But, then, #3, “aligning with the whole system”, sounds like imitation learning again—training a big black box NN to imitate humans. It’s also confusing that you mention options #1 and #2 collapsing into one—if I’m right that you’re pointing at IRL vs CIRL, it doesn’t seem like this is what happens. IRL learns to drink coffee if the human drinks coffee, whereas CIRL learns to help the human make coffee.
FWIW, I think if we can see the mind as a collection of many agents (each with their own utility function), that’s a win. Aligning with a collection of agents is not too hard, so long as you can figure out a reasonable way to settle on fair divisions of utility between them.
After tagging settles down a bit, it may be time to re-visit this question more.
I think LW hasn’t yet managed to approach Google Docs in terms of draft-feedback process. Since I compose all my posts directly on LW, this matters to me (of course I could try to copy/paste).
GDoc-Like Comments For Drafts
The primary thing here is the commenting-on-highlights interface. It’s just so much better for editing!
Probably this should work for published posts as well, facilitating easy private pings of authors for broken links, spelling mistakes, etc. Although there’s a question of whether this kind of direct editing feedback from anyone feels aversive and could discourage people.
GDoc-Like Comments for Public Review/Critique
I also think it would be nice if there were some way to associate public comments with specific points in a document, for the purpose of well-organized debate about the points raised in a post. However, I don’t know how to make this work without it being (a) pretty aversive, and (b) not too visually cluttered, while (c) still making sure people can see the point-by-point objections to a post fairly easily.
This is a bit of a stretch, but it sure would be nice if there were some natural argument-diagramming going on. To sketch a possible implementation:
Points in posts can be pulled out and associated with questions.
Questions can be associated with each other via arguments. E.g., an answer (perhaps a new type of answer, not a text answer) can state that a particular answer would be true if X and Y were true, where X and Y are different answers (to different questions).
The point is not so much that this is a good idea as stated. It’s just that some form of argument mapping might really help to map disagreements and ultimately clarify the evidential status of contentious points raised in a post.
Upvotes have ambiguous meaning. For a while (mainly due to Curi accusing LW of lacking any surface area for falsification, due to the way no one explicitly stands by any principles or any canonical texts as really seriously right) I have been thinking that it would be nice if LW encouraged users to state what they endorse on their homepages. But this would not do very much good without a system for discussing endorsements.
Let’s say for a moment that a post called “The Blindness of History” is really popular but has a big flaw in its argument—for concreteness let’s say it cites a major source as stating the exact opposite of what that source really concludes. People don’t notice right away, and like 50 people endorse that post.
Someone notices the problem. Now they need to approach like 50 people and question their endorsements. There needs to be a way to find all the people who endorse a post and do something like that. As things stand, you’d have to search users to find the ones mentioning that post as endorsed on their profile page, and then PM each one.
There could be something good about having a public system like that, which notifies users specifically of challenges to posts which they endorse, and encourages users to respond somehow, perhaps putting an explicit caveat into their endorsement or something.
And of course, these responses need to themselves have responses, etc., encouraging real engagement: if you make a bad argument, someone will call you out on it.
Not sure how all of this could possibly work.
It seems to me that there are roughly two types of “boundary” to think about: ceilings and floors.
Floors are the foundations. Maybe a system is running on a basically Bayesian framework, or (alternatively) logical induction. Maybe there are some axioms, like ZFC. Going meta on floors involves the kind of self-reference stuff which you hear about most often: Gödel’s theorem and so on. Floors are, basically, pretty hard to question and improve (though not impossible).
Ceilings are fast heuristics. You have all kinds of sophisticated beliefs in the interior, but there’s a question of which inferences you immediately make, without doing any meta to consider what direction to think in. (That is, you generally do some meta to think about what direction to think in; but this “tops out” at some level, at which point the analysis has to proceed without meta.) Ceilings are relatively easy to improve. For example, the AlphaGo policy network and value network: these have cheap updates which can be made frequently, by observing the results of reasoning. These incremental updates then help the more expensive tree-search reasoning to be even better.
Both floors and ceilings have a flavor of “the basic stuff that’s actually happening”: the interior is built out of a lot of boundary stuff, and small changes to the boundary will create large shifts in the interior. However, floors and ceilings are very different. Tweaking the floor is relatively dangerous, while tweaking the ceiling is relatively safe. Returning to the AlphaGo analogy, the floor is like the model of the game which allows tree search. The floor is what allows us to create a ceiling. Tweaks to the floor will tend to create large shifts in the ceiling; tweaks to the ceiling will not change the floor at all.
(Perhaps other examples won’t have as clear a floor/ceiling division as AlphaGo; or, perhaps they still will.)
What remains unanswered, though, is whether there is any useful way of talking about doing this (the whole thing, including the self-improvement R&D) well, doing it rationally, as opposed to doing it in a way that simply “seems to work” after the fact.
[...] Is there anything better than simply bumbling around in concept-space, in a manner that perhaps has many internal structures of self-justification but is not known to work as a whole? [...]
Can you represent your overall policy, your outermost strategy-over-strategies considered a response to your entire situation, in a way that is not a cartoon, a way real enough to defend itself?
My intuition is that the situation differs, somewhat, for floors and ceilings.
For floors, there are fundamental logical-paradox-flavored barriers. This relates to MIRI research on tiling agents.
For ceilings, there are computational-complexity-flavored barriers. You don’t expect to have a perfect set of heuristics for fast thinking. But, you can have strategies relating to heuristics which have universal-ish properties. Like, logical induction is an “uppermost ceiling” (takes the fixed point of recursive meta) such that, in some sense, you know you’re doing the best you can do in terms of tracking which heuristics are useful; you don’t have to spawn further meta-analysis on your heuristic-forming heuristics. HOWEVER, it is also very very slow and impractical for building real agents. It’s the agent that gets eaten in your parable. So, there’s more to be said with respect to ceilings as they exist in reality.
The human utility hypothesis is much more vague than the others, and seems ultimately context-dependent. To my knowledge, the main argument in its favor is the fact that most of economics is founded on it.
I would say, rather, that the arguments in its favor are the same ones which convinced economists.
Humans aren’t well-modeled as perfect utility maximizers, but utility theory is a theory of what we can reflectively/coherently value. Economists might have been wrong to focus only on rational preferences, and have moved toward prospect theory and the like to remedy this. But it may make sense to think of alignment in these terms nonetheless.
I am not saying that it does make sense—I’m just saying that there’s a much better argument for it than “the economists did it”, and I really don’t think prospect theory addresses issues which are of great interest to alignment.
If a system is trying to align with idealized reflectively-endorsed values (similar to CEV), then one might expect such values to be coherent. The argument for this position is the combination of the various arguments for expected utility theory: VNM; money-pump arguments; the various Dutch-book arguments; Savage’s theorem; the Jeffrey-Bolker theorem; the complete class theorem. One can take these various arguments and judge them on their own terms (perhaps finding them lacking).
Arguably, you can’t fully align with inconsistent preferences; if so, one might argue that there is no great loss in making a utility-theoretic approximation of human preferences: it would be impossible to perfectly satisfy inconsistent preferences anyway, so representing them by a utility function is a reasonable compromise.
In aligning with inconsistent preferences, the question seems to be what standards to hold a system to in attempting to do so. One might argue that the standards of utility theory are among the important ones; and thus, that the system should attempt to be consistent even if humans are inconsistent.
To the extent that human preferences are inconsistent, it may make more sense to treat humans as fragmented multi-agents, and combine the preferences of the sub-agents to get an overall utility function—essentially aligning with one inconsistent human the same way one would align with many humans. This approach might be justified by Harsanyi’s theorem.
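Roughly stated, Harsanyi’s aggregation theorem says that if each sub-agent’s preferences and the aggregate preferences all satisfy the expected-utility axioms, and the aggregate is indifferent whenever every sub-agent is indifferent, then the aggregate utility must be a weighted sum of the sub-agents’ utilities:

$$
U(x) \;=\; \sum_i w_i\, u_i(x).
$$

The choice of weights $w_i$ (how much each sub-agent counts) is left open, which is where the hard part of combining the sub-agents lives.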
On the other hand, there are no strong arguments for representing human utility via prospect theory. It holds up better in experiments than utility theory does, but not so well that we would want to make it a bedrock assumption of alignment. The various arguments for expected utility make me somewhat happy for my preferences to be represented utility-theoretically even though they are not really like this; but, there is no similar argument in favor of a prospect-theoretic representation of my preferences. Essentially, I think one should either stick to a more-or-less utility-theoretic framework, or resort to taking a much more empirical approach where human preferences are learned in all their inconsistent detail (without a background assumption such as prospect theory).
That’s still a false dichotomy, but I think it is an appropriate response to many critiques of utility theory.
I wrote something which is sort of a reply to this post (although I’m not really making a critique or any solid point about this post, just exploring some ideas which I see as related).
Rob Bensinger: Nate and I tend to talk about “understandability” instead of “transparency” exactly because we don’t want to sound like we’re talking about normal ML transparency work.
Eliezer Yudkowsky: Other possible synonyms: Clarity, legibility, cognitive readability.
Ajeya Cotra: Thanks all—I like the project of trying to come up with a good handle for the kind of language model transparency we’re excited about (and have talked to Nick, Evan, etc about it too) but I think I don’t want to push it in this blog post right now because I haven’t hit on something I believe in and I want to ship this.
I feel like maybe part of what’s wrong with all the suggested terms (wrt pointing at what Ajeya is excited about) is that transparency, understandability, legibility, and readability all invoke the image of a human standing over a bit of silicon with a magnifying glass and reading off what’s going on inside. Ajeya is excited about asking GPT nicely to apply its medical knowledge, and GPT complying, and us knowing that GPT is complying. Tools for figuring out what’s going on inside GPT are probably an important step to get to that point, especially for becoming confident that we’re at that point; but it’s not the end goal. The end goal is more like “GPT is frank with you” or “GPT does what you ask, rather than mimicking a human doing what you ask” or something like that.
Like, the property of understanding what it’s doing, rather than the tool that lets you examine it to reach that understanding.
This sits somewhere between the whole alignment problem and transparency.
I agree that
there’s something to the hierarchy thing;
if we want, we can always represent values in terms of minimizing prediction error (at least to a close approximation), so long as we choose the right predictions;
this might turn out to be the right thing to do, in order to represent the hierarchy thing elegantly (although I don’t currently see why, and am somewhat skeptical).
However, I don’t agree that we should think of values as being predictable from the concept of minimizing prediction error.
The tone of the following is a bit more adversarial than I’d like; sorry for that. My attitude toward predictive processing comes from repeated attempts to see why people like it, and all the reasons seeming to fall flat to me. If you respond, I’m curious about your reaction to these points, but it may be more useful for you to give the positive reasons why you think your position is true (or even just why it would be appealing), particularly if they’re unrelated to what I’m about to say.
Evolved Agents Probably Don’t Minimize Prediction Error
If we look at the field of reinforcement learning, it appears to be generally useful to add intrinsic motivation for exploration to an agent. This is the exact opposite of rewarding predictability: in one case we add reward for entering unpredictable states, whereas in the other case we add reward for entering predictable states. I’ve seen people try to defend minimizing prediction error by showing that the agent is still motivated to learn (in order to figure out how to avoid unpredictability). However, the fact remains: it is still motivated to learn strictly less than an unpredictability-loving agent. RL has, in practice, found it useful to add reward for unpredictability; this suggests that evolution might have done the same, and suggests that it would not have done the exact opposite. Agents operating under a prediction-error penalty would likely under-explore.
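A minimal sketch of the sign difference (the coefficient and mode names are hypothetical, just to make the contrast explicit):

```python
def shaped_reward(env_reward, prediction_error, beta=0.1, mode="curiosity"):
    """Fold the agent's prediction error into its reward signal."""
    if mode == "curiosity":
        # Standard intrinsic-motivation bonus: reward surprising states.
        return env_reward + beta * prediction_error
    if mode == "minimize_prediction_error":
        # The reading I'm arguing against: penalize surprising states.
        return env_reward - beta * prediction_error
    raise ValueError(f"unknown mode: {mode}")
```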
It’s Easy to Overestimate The Degree to which Agents Minimize Prediction Error
I often enjoy variety—in food, television, etc—and observe other humans doing so. Naively, it seems like humans sometimes prefer predictability and sometimes prefer variety.
However: any learning agent, almost no matter its values, will tend to look like it is seeking predictability once it has learned its environment well. It is taking actions it has taken before, and steering toward the environmental states similar to what it always steers for. So, one could understandably reach the conclusion that it is reliability itself which the agent likes.
In other words: if I seem to eat the same foods quite often (despite claiming to like variety), you might conclude that I like familiarity when it’s actually just that I like what I like. I’ve found a set of foods which I particularly enjoy (which I can rotate between for the sake of variety). That doesn’t mean it is familiarity itself which I enjoy.
I’m not denying that mere familiarity has some positive valence for humans; I’m just saying that for arbitrary agents, it seems easy to over-estimate the importance of familiarity in their values, so we should be a bit suspicious about it for humans too. And I’m saying that it seems like humans enjoy surprises sometimes, and there’s evolutionary/machine-learning reasoning to explain why this might be the case.
We Need To Explain Why Humans Differentiate Goals and Beliefs, Not Just Why We Conflate Them
You mention that good/bad seem like natural categories. I agree that people often seem to mix up “should” and “probably is”, “good” and “normal”, “bad” and “weird”, etc. These observations in themselves speak in favor of the minimize-prediction-error theory of values.
However, we also differentiate these concepts at other times. Why is that? Is it some kind of mistake? Or is the conflation of the two the mistake?
I think the mix-up between the two is partly explained by the effect I mentioned earlier: common practice is optimized to be good, so there will be a tendency for commonality and goodness to correlate. So, it’s sensible to cluster them together mentally, which can result in them getting confused. There’s likely another aspect as well, which has something to do with social enforcement (ie, people are strategically conflating the two some of the time?) -- but I’m not sure exactly how that works.
I feel like this has unintentionally brought us closer to Petrov’s actual experience.
I am probably not following this as closely as many commenters here, but I 100% assumed it was intentional. It’s just so good!
Your assessment here seems to (mostly) line up with what I was trying to communicate in the post.
This is a simple consequence of the fact that you have to look at observations to figure out what to do; this is no different from the fact that a DQN playing Pong will look at where the ball is in order to figure out what action to take.
This is something I hoped to communicate in the “Mesa-Learning Everywhere?” section, especially point #3.
If you mean the chance that the weights of a neural network are going to encode some search algorithm, then this paper should be ~zero evidence in favor of it.
This is a point I hoped to convey in the Search vs Control section.
If you mean the chance than a policy trained by RL will “learn” without gradient descent, I can’t imagine a way that could fail to be true for an intelligent system trained by deep RL
Ah, here is where the disagreement seems to lie. In another comment, you write:
Here on LW / AF, “mesa optimization” seems to only apply if there’s some sort of “general” learning algorithm, especially one that is “using search”, for reasons that have always been unclear to me.
I currently think this:
There is a spectrum between “just learning the task” vs “learning to learn”, which has to do with how “general” the learning is. DQN looking at the ball is very far on the “just learning the task” side.
This spectrum is very fuzzy. There is no clear distinction.
This spectrum is very relevant to inner alignment questions. If a system like GPT-3 is merely “locating the task”, then its behavior is highly constrained by the training set. On the other hand, if GPT-3 is “learning on the fly”, then its behavior is much less constrained by the training set, and has correspondingly more potential for misaligned behavior (behavior which capably achieves a goal other than the intended one). This is justified by an interpolation-vs-extrapolation type intuition.
The paper provides a small amount of evidence that things higher on the spectrum are likely to happen. (I’m going to revise the post to indicate that the paper only provides a small amount of evidence—I admit I didn’t read the paper to see exactly what they did, and should have anticipated that it would be something relatively unimpressive like multi-armed-bandit.)
Thinking about the spectrum, I see no reason not to expect things to continue climbing that spectrum. This updates me significantly toward expecting inner alignment problems to be probable, compared with the previous way I was thinking about it.
Response to Section IV:
FDT fails to get the answer Y&S want in most instances of the core example that’s supposed to motivate it
I am basically sympathetic to this concern: I think there’s a clear intuition that FDT is 2-boxing more than we would like (and a clear formal picture, in toy formalisms which show FDT-ish DTs failing on Agent Simulates Predictor problems).
Of course, it all depends on how logical counterfactuals are supposed to work. From a design perspective, I’m happy to take challenges like this as extra requirements for the behavior of logical counterfactuals, rather than objections to the whole project. I intuitively think there is a notion of logical counterfactual which fails in this respect, but, this does not mean there isn’t some other notion which succeeds. Perhaps we can solve the easy problem of one-boxing with a strong predictor first, and then look for ways to one-box more generally (and in fact, this is what we’ve done—one-boxing with a strong predictor is not so difficult).
However, I do want to add that when Omega uses very weak prediction methods such as the examples given, it is not so clear that we want to one-box. Will is presuming that Y&S simply want to one-box in any Newcomb problem. However, we could make a distinction between evidential Newcomb problems and functional Newcomb problems. Y&S already state that they consider some things to be functional Newcomb problems despite them not being evidential Newcomb problems (such as transparent Newcomb). It stands to reason that there would be some evidential Newcomb problems which are not functional Newcomb problems, as well, and that Y&S would prefer not to one-box in such cases.
However, the predictor needn’t be running your algorithm, or have anything like a representation of that algorithm, in order to predict whether you’ll one box or two-box. Perhaps the Scots tend to one-box, whereas the English tend to two-box.
In this example, it seems quite plausible that there’s a (logico-causal) reason for the regularity, so that in the logical counterfactual where you act differently, your reference class also acts somewhat differently. Say you’re Scottish, and 10% of Scots read a particular fairy tale growing up, and this is connected with why you two-box. Then in the counterfactual in which you one-box, it is quite possible that those 10% also one-box. Of course, this greatly weakens the connection between Omega’s prediction and your action; perhaps the change of 10% is not enough to tip the scales in Omega’s prediction.
But, without any account of Y&S’s notion of subjunctive counterfactuals, we just have no way of assessing whether that’s true or not. Y&S note that specifying an account of their notion of counterfactuals is an ‘open problem,’ but the problem is much deeper than that. Without such an account, it becomes completely indeterminate what follows from FDT, even in the core examples that are supposed to motivate it — and that makes FDT not a new decision theory so much as a promissory note.
In the TDT document, Eliezer addresses this concern by pointing out that CDT also takes a description of the causal structure of a problem as given, begging the question of how we learn causal counterfactuals. In this regard, FDT and CDT are on the same level of promissory-note-ness.
It might, of course, be taken as much more plausible that a technique of learning the physical-causal structure can be provided, in contrast to a technique which learns the logical-counterfactual structure.
I want to inject a little doubt about which is easier. If a robot is interacting with an exact simulation of itself (in an iterated prisoner’s dilemma, say), won’t it be easier to infer that it directly controls the copy than it is to figure out that the two are running on different computers and thus causally independent?
Put more generally: logical uncertainty has to be handled one way or another; it cannot be entirely put aside. Existing methods of testing causality are not designed to deal with it. It stands to reason that such methods applied naively to cases including logical uncertainty would treat such uncertainty like physical uncertainty, and therefore tend to produce logical-counterfactual structure. This would not necessarily be very good for FDT purposes, being the result of unprincipled accident—and the concern for FDT’s counterfactuals is that there may be no principled foundation. Still, I tend to think that other decision theories merely brush the problem under the rug, and actually have to deal with logical counterfactuals one way or another.
Indeed, on the most plausible ways of cashing this out, it doesn’t give the conclusions that Y&S would want. If I imagine the closest world in which 6288 + 1048 = 7336 is false (Y&S’s example), I imagine a world with laws of nature radically unlike ours — because the laws of nature rely, fundamentally, on the truths of mathematics, and if one mathematical truth is false then either (i) mathematics as a whole must be radically different, or (ii) all mathematical propositions are true because it is simple to prove a contradiction and every proposition follows from a contradiction.
To this I can only say again that FDT’s problem of defining counterfactuals seems not so different to me from CDT’s problem. A causal decision theorist should be able to work in a mathematical universe; indeed, this seems rather consistent with the ontology of modern science, though not forced by it. I find it implausible that a CDT advocate should have to deny Tegmark’s mathematical universe hypothesis, or should break down and be unable to make decisions under the supposition. So, physical counterfactuals seem like they have to be at least capable of being logical counterfactuals (perhaps a different sort of logical counterfactual than FDT would use, since physical counterfactuals are supposed to give certain different answers, but a sort of logical counterfactual nonetheless).
(But this conclusion is far from obvious, and I don’t expect ready agreement that CDT has to deal with this.)