I jumped off a small cliff into a lake once, and when I was standing on the rock, I couldn’t bring myself to jump. I stepped back to let another person go, and then I stepped onto the rock and jumped immediately. I might be able to do something similar.
But I wouldn’t be able to endorse such behavior while reflecting on it if I were in that situation, given my conviction that I am unable to change math. Indeed, I don’t think it would be wise of me to cooperate in that situation. What I really mean when I say that I would rather be someone who cooperated in a twin prisoner’s dilemma is “conditioned on the (somewhat odd) hypothetical that I will at some point end up in a high-stakes twin prisoner’s dilemma, I would rather it be the case that I am the sort of person who cooperates”, which is really saying that I would rather play a twin prisoner’s dilemma game against a cooperator than against a defector, which is just an obvious preference for a favorable event to befall me rather than an unfavorable one. In similar news, conditioned on my encountering a situation in the future where somebody checks to see if I am a good person, and if I am, they destroy the world, I would like to become a bad person. Conditioned on my encountering a situation in which someone saves the world if I am devout, I would like to become a devout person.
If I could turn off the part of my brain that forms the question “but why should I cooperate, when I can’t change math?” that would be a path to becoming a reliable cooperator, but I don’t see a path to silencing a valid argument in my brain without a lobotomy (short of possibly just cooperating really fast without thinking, and of course without forming the doubt “wait, why am I trying to do this really fast without thinking?”).
If that’s the case, then I assume that you defect in the twin prisoner’s dilemma.
I do. I would rather be someone who didn’t. But I don’t see a path to becoming that person without lobotomizing myself. And it’s not a huge concern of mine, since I don’t expect to encounter such a dilemma. (Rarely am I the one pointing out that a philosophical thought experiment is unrealistic. It’s not usually the point of thought experiments to be realistic—we usually only talk about them to evaluate the consequences of different positions. But it is worth noting here that I don’t see this as a major issue for me.) I haven’t written this up because I don’t think it’s particularly urgent to explain to people why I think CDT is correct over FDT. Indeed, on one view, it would be cruel of me to do so! And I don’t think it matters much for AI alignment.
Don’t you think that’s at least worth looking into?
This was partly why I decided to wade into the weeds, because absent a discussion of how plausible it is that we could affect things non-causally, yes, one’s first instinct would be that we should at least look into it. And maybe, like, 0.1% of resources directed toward AI Safety should go toward whether we can change Math, but honestly, even that seems high. Because what we’re talking about is changing logical facts. That might be number 1 on my list of intractable problems.
After all, CDT evaluates causal counterfactuals, which are just as much a fiction as logical counterfactuals.
This is getting subtle :) and it’s hard to make sure our words mean things, but I submit that causal counterfactuals are much less fictitious than logical counterfactuals! I submit that it is less extravagant to claim we can affect this world than it is to claim that we can affect hypothetical worlds with which we are not in causal contact. No matter what action I pick, math stays the same. But it’s not the case that no matter what action I pick, the world stays the same. (In the former case, which action I pick could in theory tell us something about what mathematical object the physical universe implements, but it doesn’t change math.) In both cases, yes, there is only one action that I do take, but assuming we can reason both about causal and logical counterfactuals, we can still talk sensibly about the causal and logical consequences of picking actions I won’t in fact end up picking. I don’t have a complete answer to “how should we define causal/logical counterfactuals” but I don’t think I need to for the sake of this conversation, as long as we both agree that we can use the terms in more or less the same way, which I think we are successfully doing.
I don’t yet see why creating a CDT agent avoids catastrophe better than FDT.
I think running an aligned FDT agent would probably be fine. I’m just arguing that it wouldn’t be any better than running a CDT agent (except for the interim phase before Son-of-CDT has been created). And indeed, I don’t think any new decision theories will perform any better than Son-of-CDT, so it doesn’t seem to me to be a priority for AGI safety. Finally, the fact that no FDT agent has actually been fully defined certainly weighs in favor of just going with a CDT agent.
Ah. I agree that this proposal would not optimize causally inaccessible areas of the multiverse, except by accident. I also think that nothing we do optimizes causally inaccessible areas of the multiverse, and we could probably have a long discussion about that, but putting a pin in that:
Let’s take things one at a time. First, let’s figure out how to not destroy the real world, and then if we manage that, we can start thinking about how to maximize utility in logically possible hypothetical worlds, which we are unable to have any causal influence on.
Regarding the longer discussion, and sorry if this is below my usual level of clarity: what do we have at our disposal to make counterfactual worlds with low utility inconsistent? Well, all that we humans have at our disposal is choices about actions. One can play with words and say that we can choose not just what to do, but also who to be, and choosing who to be (i.e., editing our decision procedure) is supposed by some to have logical consequences, but I think that’s a mistake. 1) Changing who we are is an action like any other. Actions don’t have logical consequences, just causal consequences. 2) We might be changing which algorithm our brain executes, but we are not changing the output of any algorithm itself, the latter possibility being the thing with supposedly far-reaching (logical) consequences on hypothetical worlds outside of causal contact. In general, I’m pretty bearish on the ability of humans to change math.
Consider the CDT person who adopts FDT. They are probably interested in the logical consequences of the fact that their brain in this world outputs certain actions. But no mathematical axioms have changed along the way, so no propositions have changed truth value. The fact that their brain now runs a new algorithm implies that (the math behind) physics ended up implementing that new algorithm. I don’t see how it implies much else, logically. And I think the fact that no mathematical axioms have changed supports that intuition quite well!
The question of which low-utility worlds are consistent/logically possible is a property of Math. All of math follows from axioms. Math doesn’t change without axioms changing. So if you have ambitions of rendering low-utility worlds inconsistent, I guess my question is this: which axioms of Math would you like to change, and how? I understand you don’t hope to causally affect this, but how could you even hope to affect this logically? (I’m struggling to even put words to that; the most charitable phrasing I can come up with, in case you don’t like “affect this logically”, is “manifest different logic”, but I worry that phrasing is Confused.) Also, I’m capitalizing Math there because this whole conversation involves being Platonists about math, where Math is something that really exists, so you can’t just invent a new axiomatization of math and say the world is different now.
You’re taking issue with my evaluating the causal consequences of our choice of what program to run in the agent rather than the logical consequences? These should be the same in practice when we make an AGI, since we’re in some weird decision problem at the moment, so far as I can tell. Or if you think I’m missing something, what are the non-causal, logical consequences of building a CDT AGI?
Side note: I think the term “self-modify” confuses us. We might as well say that agents don’t self-modify; all they can do is cause other agents to come into being and shut themselves off.
The CDT agent will obviously fall prey to the problems that CDT agents face while it is active (like twin prisoner’s dilemma), but after a short period of time, it won’t matter how it behaves. Some better agent will be created and take over from there.
Finally, if you think an FDT agent will perform very well in this world, then you should also expect Son-of-CDT to look a lot like an FDT agent.
Why do you say “probably”? If there exists an agent that doesn’t make those wrong choices you’re describing, and if the CDT agent is capable of making such an agent, why wouldn’t the CDT agent make an agent that makes the right choices?
My intuitions are mostly that if you can provide significant rewards and punishments basically for free in imitated humans (or more to the point, memories thereof), if you can control the flow of information throughout the whole apparatus, and if you get total surveillance automatically, this sort of thing is a dictator’s dream. Especially because it usually costs money to make people happy, and in this case, it hardly does—just a bit of computation time. In a world with all the technology in place that a dictator could want, but where it’s also pretty cheap to make everyone happy, it strikes me as promising that the system itself could be kept under control.
I don’t agree with your characterization, however. The concern is not that it would have roughly human-like planning, but rather super-human planning (since this is presumably simpler according to most reasonable priors).
Thanks for the clarification. Consider the sort of relatively simple, super-human planning algorithm that, for most goals, would lead the planner/agent to take over the world or do similarly elaborate and impactful things in the service of whatever goal is being pursued. A Bayesian predictor of the human’s behavior will consider the hypothesis Hg that the human does the sort of planning described above in the service of goal g. It will have a corresponding hypothesis for each such goal g. It seems to me, though, that these hypotheses will be immediately eliminated. The human’s observed behavior won’t include taking over the world or any other existentially dangerous behavior, as would have been implied by hypotheses of the form Hg. A hypothesis which says that the observed behavior is the output of human-like planning in the service of some goal which is slightly incorrect may maintain some weight in the posterior after a number of observations, but I don’t see how “dangerously powerful planning + goal” remains under consideration.
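To make that elimination dynamic concrete, here is a minimal numerical sketch (the hypotheses, priors, and likelihoods are all invented for illustration; this isn’t a model of any real predictor):

```python
import numpy as np

# Hypotheses about the demonstrator: "dangerous planner in service of goal g"
# vs. "human-like planner in service of goal g", with arbitrary priors.
hypotheses = ["dangerous_g1", "dangerous_g2", "humanlike_g1", "humanlike_g2"]
prior = np.array([0.3, 0.3, 0.2, 0.2])

# Probability each hypothesis assigns to the observed, mundane human behavior.
# A dangerous planner would have predicted world-takeover actions, so it puts
# ~0 probability on what we actually observe.
likelihood = np.array([1e-9, 1e-9, 0.5, 0.4])

posterior = prior * likelihood
posterior /= posterior.sum()
print(dict(zip(hypotheses, posterior.round(6))))
# The "dangerous planner" hypotheses are driven to ~0 after observation.
```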
The post can basically be read as arguing that human imitation seems especially likely to produce mesa-optimization.
I suppose the point of human imitation is to produce a weak, conservative, lazy, impact-sensitive mesa-optimizer, since humans are optimizers with those qualifiers. If it weren’t producing a mesa-optimizer, something would have gone very wrong. So this is a good point. As for whether this is dangerous, I think the discussion above is the place to focus.
Another complication here is that the people trying to build ~AIXI can probably build an economically useful ~AIXI using less compute than you need for ~HSIFAUH (for jobs that don’t need to model humans), and start doing their own doublings.
Regarding the other two points, my intuition was that a few dozen people could work out the details satisfactorily in a year. If you don’t share this intuition, I’ll adjust downward on that. But I don’t feel up to putting in those man-hours myself. It seems like there are lots of people without a technical background who are interested in helping avoid AI-based X-risk. Do you think this is a promising enough line of reasoning to be worth some people’s time?
It seems this would only be the case if it had a deeper utility function that placed great weight on it ‘discovering’ its other utility function.
This isn’t actually necessary. If it has a prior over utility functions and some way of observing evidence about which one is real, you can construct the policy which maximizes expected utility in the following sense: it imagines a utility function is sampled from the set of possibilities according to its prior probabilities, and it imagines that utility function is what it’s scored on. This naturally gives the instrumental goal of trying to learn about which utility function was sampled (i.e. which is the real utility function), since some observations will provide evidence about which one was sampled.
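A minimal sketch of that construction (the utility functions, actions, and the perfect/free observation are all invented to make the point vivid):

```python
import numpy as np

# Two candidate utility functions over three actions; the agent's prior over
# which one is "real". All numbers are illustrative.
U = np.array([[1.0, 0.0, 0.4],    # utility function u1 over actions a0, a1, a2
              [0.0, 1.0, 0.4]])   # utility function u2
prior = np.array([0.5, 0.5])

# Acting blindly: pick the action with the highest prior-expected utility.
blind_value = (prior @ U).max()

# Learning first: observe (here: perfectly and for free) which utility
# function was sampled, then pick the best action for it.
informed_value = (prior * U.max(axis=1)).sum()

print(blind_value, informed_value)  # 0.5 vs. 1.0: learning has instrumental value
```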
I think for most utility functions, kicking over the bucket and then recreating a bucket with identical salt content (but different atoms) gets you back to a similar value to what you were at before. If recreating that salt mixture is expensive vs. cheap, and if attainable utility preservation works exactly as our initial intuitions might suggest (and I’m very unsure about that, but supposing it does work in the intuitive way), then AUP should be more likely to avoid disturbing the expensive salt mixture, and less likely to avoid disturbing the cheap salt mixture. That’s because for those utility functions for which the contents of the bucket were instrumentally useful, the value with respect to those utility functions goes down roughly by the cost of recreating the bucket’s contents. Also, if a certain salt mixture is less economically useful, there will be fewer utility functions for which kicking over the bucket leads to a loss in value, so if AUP works intuitively, it should also agree with our intuition there.
If it’s true that for most utility functions, the particular collection of atoms doesn’t matter, then it seems to me like AUP manages to assign a higher penalty to the actions that we would agree are more impactful, all without any information regarding human preferences.
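To illustrate, here is a toy calculation, assuming (as I said, I’m very unsure about this) that the AUP penalty works out to roughly the average change in attainable value across auxiliary utility functions; the Q-values are made up:

```python
import numpy as np

# Attainable value Q_u(s, a) under a few auxiliary utility functions u, for
# three actions. All values invented for illustration.
#              noop   kick_expensive  kick_cheap
Q = np.array([[5.0,   2.0,            4.8],   # u1: bucket contents very useful
              [3.0,   2.5,            2.9],   # u2: mildly useful
              [1.0,   1.0,            1.0]])  # u3: doesn't care about the bucket

def aup_penalty(action: int) -> float:
    # Average absolute change in attainable utility relative to doing nothing.
    return float(np.abs(Q[:, action] - Q[:, 0]).mean())

print(aup_penalty(1))  # expensive mixture: larger penalty (~1.17)
print(aup_penalty(2))  # cheap mixture: smaller penalty (~0.10)
```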
Proposal: in the same way we might try to infer human values from the state of the world, might we be able to infer a high-level set of features such that existing agents like us seem to optimize simple functions of these features? Then we would penalize actions that cause irreversible changes with respect to these high-level features.
This might be entirely within the framework of similarity-based reachability. This might also be exactly what you were just suggesting.
Sure! The household of people could have another computer inside it that the humans can query, which runs a sequence prediction program trained on other things.
Why? What are those 7 billion HSIFAUH doing?
Well, the number comes from the idea of one-to-one monitoring. Obviously, there’s other stuff to do to establish a stable unipolar world order, but monitoring seems like the most resource-intensive part, so it’s an order-of-magnitude estimate. Also, realistically, one person could monitor ten people, so the estimate has some leeway built in.
But if there are 7 billion HSIFAUH which are collectively capable of taking over the world, how is that not a potential existential catastrophe if they have inhuman values?
I think they can be controlled. Whoever is providing the observations to any instance of HSIFAUH has an arsenal of carrots and sticks (just by having certain observations correlate with actual physical events that occur in the household(s) of humans that generate the data), and I think merely human-level intelligence can be kept in check by someone in a position of power over them. So I think real humans could stay at the wheel over 7 billion instances of HSIFAUH. (I mean, this is teetering at the edge of existential catastrophe already, given the existence of simulations of people who might have the experience of being imprisoned, but I think with careful design of the training data, this could be avoided.) But in terms of extinction threat to real-world humans, this starts to look more like the problem of maintaining a power structure over a vast number of humans and less like typical AI alignment difficulties; historically, the former seems to be a solvable problem.
>Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N.
How? And why would it grow fast enough to get to a large enough N before someone deploys ~AIXI?
Right, this analysis gets complicated because you have to analyze the growth rate of N. Given your lead time from having more computing power than the reckless team, one has to analyze how many doubling periods you have time for. I hear Robin Hanson is the person to read regarding questions like this; I don’t have any opinions here. But the basic structure regarding “How?” is: spend some fraction of computing resources making money, then buy more computing resources with that money.
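A toy version of that structure, with every parameter invented:

```python
# Each period, some fraction of the instances do paid work, and the proceeds
# buy compute for more instances. All parameters are made up.
N = 10.0                  # starting number of instances
earn_fraction = 0.5       # fraction of instances doing paid work
growth_per_earner = 0.3   # new instances financed per earner per period

periods = 0
while N < 1000:
    N += earn_fraction * N * growth_per_earner
    periods += 1
print(periods)  # periods to grow from N=10 to N=1000 (~33 here)
```

Whether the doubling period is short enough relative to the reckless team’s timeline is exactly the Hansonian question I’d defer on.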
>It should be possible to weaken the online version and get some of this speedup.
What do you have in mind here?
Well, nothing in particular when I wrote that, but thank you for pushing me. Maybe only update the posterior at some timesteps (and do it infinitely many times but with diminishing frequency). Or more generally, you divide resources between searching for programs that retrodict observed behavior and running copies of the best one so far, and you just shift resource allocation toward the latter over time.
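For instance, one schedule with that property (a sketch; the power-of-two schedule is just one choice):

```python
def should_update_posterior(t: int) -> bool:
    # Update only at powers of two: still infinitely many updates, but the
    # fraction of timesteps spent updating goes to zero.
    return t > 0 and t & (t - 1) == 0

# Resources at all other timesteps go to running the current best program.
print([t for t in range(1, 100) if should_update_posterior(t)])
# [1, 2, 4, 8, 16, 32, 64]
```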
You do have to solve some safety problems that the reckless team doesn’t though, don’t you? What do you think the main safety problems are?
If it turns out you have to do special things to avoid mesa-optimizers, then yes. Otherwise, I don’t think you have to deal with other safety problems if you’re just aiming to imitate human behavior.
I could imagine an efficient algorithm that could be said to be approximating a Bayesian agent with a prior including the truth, but I don’t say that with much confidence.
I agree with the second bullet point, but I’m not so convinced this is prohibitively hard. That said, not only would we have to make our (arbitrarily chosen) p(obs | utility fn) un-game-able; on one reading of my original post, we would also have to ensure that by the time the agent was no longer gaining much information, it already had a pretty good grasp on the true utility function. This requirement might reduce to a concept like identifiability of the optimal policy.
Oh yeah, sorry, that isn’t shown there. But I believe the sum over all timesteps of the m-step expected info gain at each timestep is finite w.p.1, which would make it o(1/t) w.p.1.
Actually, you can. You just can’t have the team of humans look at the Oracle’s answer. Instead, the humans look at the question and answer it (without looking at the Oracle’s answer), and then an automated system rewards the Oracle according to how close its answer is to the human team’s. As long as the automated system doesn’t have a security hole (and we can ensure that relatively easily if the “how close” metric is not too complex), the Oracle can’t “trick the scorers to implement unsafe AGI which takes over the world and fixes the answer to be whatever message was output by the AGI to instigate this”.
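A sketch of that protocol (the function names and the particular closeness metric are mine; the key property is just that no human reads the Oracle’s answer before the reward is computed):

```python
def closeness(oracle_answer: str, human_answer: str) -> float:
    # Deliberately simple metric (token overlap), since a simple metric is
    # what makes "no security hole" in the scorer plausible.
    o, h = set(oracle_answer.split()), set(human_answer.split())
    return len(o & h) / max(len(o | h), 1)

def run_episode(question, oracle, human_team) -> float:
    oracle_answer = oracle(question)     # sealed; never shown to the humans
    human_answer = human_team(question)  # humans answer independently
    return closeness(oracle_answer, human_answer)  # only this number leaves
```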
Good point. I’m not a huge fan of deferring this thinking to similarity metrics (the relative reachability proposal also does this), since this is a complicated thing even in theory, and I suspect a lot turns on how it ends up being defined, but with that caveat aside, this seems reasonable.
Can expected information gain be directly implemented using ML, or do you need to do some kind of approximation instead? If the latter, can that be a safety issue?
It can’t tractably be calculated exactly, but it only goes into calculating the probability of deferring to the humans. Approximating a theoretically well-founded probability of deferring to a human won’t make it unsafe—that will just make it less efficient/capable. For normal neural networks, there isn’t an obvious way to extract the entropy of the belief distribution, but if there were, you could approximate the expected information gain as the expected decrease in entropy. Note that the entropy of the belief distribution is not the entropy of the model’s distribution over outputs—a model could be very certain that the output is Bernoulli(1/2)-distributed, and this would entail a belief entropy of ~0, not an entropy of 1. I’m not familiar enough with Bayesian neural networks to know if the entropy would be easy to extract.
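For a small discrete Bayesian model, the quantity I have in mind looks like this (a sketch; it also shows the belief-entropy vs. output-entropy distinction):

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def expected_info_gain(prior, likelihoods):
    # Expected decrease in entropy of the belief over models from one more
    # observation. prior: shape (n_models,); likelihoods[i, o] = p(o | model i).
    p_outcome = prior @ likelihoods  # predictive distribution over outcomes
    eig = entropy(prior)
    for o in range(likelihoods.shape[1]):
        post = prior * likelihoods[:, o]
        post /= post.sum()
        eig -= p_outcome[o] * entropy(post)
    return eig

# Both models agree the output is Bernoulli(1/2): output entropy is 1 bit,
# but there is nothing left to learn, so expected info gain is 0.
print(expected_info_gain(np.array([0.5, 0.5]), np.array([[0.5, 0.5], [0.5, 0.5]])))
# Models disagree sharply: one observation is worth ~0.53 bits.
print(expected_info_gain(np.array([0.5, 0.5]), np.array([[0.9, 0.1], [0.1, 0.9]])))
```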
Oh, that aside, the actual question I wanted your feedback on was the idea of combining human imitations with more general oracles/predictors. :)
Right. So in this version of an oracle, where it is just outputting a prediction of the output of some future process, I don’t see what it offers that normal sequence prediction doesn’t offer. In our BoMAI discussion, I mentioned a type of oracle I considered that gave answers which it predicted would cause a (boxed) human to do well on a randomly sampled prediction task, and that kind of oracle could potentially be much more powerful than a counterfactual oracle, but I don’t really see the value of adding something like a counterfactual oracle to a sequence predictor that makes predictions about a sequence that is something like this:
(“Q1”, Q1), (“Q2”, Q2), (“Q3”, Q3), …, (“Q26”, Q26), (“A1”, A1), (“A2”, A2), (“Q27”, Q27), … (“A10”,
It’s also possible that this scheme runs into grain-of-truth problems, and the counterfactual oracle gives outputs a lot like what I’m imagining this sequence predictor would give, in which case I don’t think sequence prediction would have much to add to the counterfactual oracle proposal.
What do you think of the idea of combining oracles with human imitations, which was inspired in part by our conversation here, as a way to approach AIXI-like abilities while still remaining safe? See here for a specific proposal.
Regarding your particular proposal, I think you can only use a counterfactual oracle to predict the answers to automatically answerable questions. That is, you can’t show the question to a team of humans and have them answer it. In the counterfactual possibility where the question is scored, it isn’t supposed to be viewed by people; otherwise the oracle has an incentive to trick the scorers to implement unsafe AGI which takes over the world and fixes the answer to be whatever message was output by the AGI to instigate this.
...unless the team of humans is in a box :)
On the topic of counterfactual oracles, if you are trying to predict the answers to questions which can be automatically checked in the future, I am unsure why you would run a counterfactual oracle instead of running sequence prediction on the following sequence, for example:
(“Q1”, Q1), (“Q2”, Q2), (“Q3”, Q3), …, (“Q26”, Q26), (“A1”, A1), (“A2”, A2), (“Q27”, Q27), … (“A10”,
This should give an estimate of the answer A10 to question Q10, and this can be done before the answer is available. In fact, unlike with the counterfactual oracle, you could do this even if people had to be involved in submitting the answer.
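Concretely, the bookkeeping could look something like this (a sketch; the tagging scheme and the predictor API are hypothetical):

```python
def training_sequence(log):
    # log: time-ordered events, ("Q", i, text) when question i is posed,
    # ("A", i, text) when its answer later becomes checkable. Answers simply
    # appear in the stream whenever they arrive.
    return [(f"{kind}{i}", text) for kind, i, text in log]

seq = training_sequence([("Q", 1, "q1 text"), ("Q", 2, "q2 text"), ("A", 1, "a1 text")])
# To estimate A2 before it exists, feed the predictor the sequence so far
# followed by the tag "A2" and take its predicted continuation:
# estimate = predictor.predict_continuation(seq + [("A2",)])  # hypothetical API
```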
It seems like you’re imagining using a large number of ~HSIFAUH to take over the world and prevent unaligned AGI from arising. Is that right? How many ~HSIFAUH are you thinking and why do you think that’s enough? For example, what kind of strategies are you thinking of, that would be sufficient to overcome other people’s defenses (before they deploy ~AIXI), using only human-level phishing and other abilities (as opposed to superhuman AIXI-like abilities)?
Well, that was the question I originally posed here, but the sense I got from commenters was that people thought this was easy to pull off and the only question was whether it was safe. So I’m not sure for what N it’s the case that N machines running agents doing human-level stuff would be enough to take over the world. I’m pretty sure N = 7 billion is enough. And I think it’s plausible that after a discussion about this, I could become confident that N = 1000 was enough. Or maybe the right way to look at it is whether N = 10 could finance a rapidly exponentially growing N. So it seemed worth having a discussion, but I am not yet prepared to defend a low enough N that makes this obviously viable.
Forgetting about the possibility of exponentially growing N for a moment, and turning to
>Why is d << h the relevant question for evaluating this?
Yeah, I wrote that post too quickly—this is wrong. (I was thinking of the leading team running HSIFAUH needing to go through d+h timesteps to get to good performance, but they just need to run through d, which makes things easier.) Sorry about that. Let f be the amount of compute that the leading project has divided by the compute that the leading reckless project has. Suppose d > 0. (That’s all we need, actually.) Then it takes the leading reckless team at least f times as long to get to AIXI taking over the world as it takes the leading team to get to SolomonoffPredict predicting a human trying to do X; using similar tractable approximation strategies (whatever those turn out to be), we can expect it to take f times as long for the leading reckless team to get to ~AIXI as it takes the leading team to get to ~SolomonoffPredict. ~HSIFAUH is more complicated because of the added resource requirement of employing the humans you learn to imitate, but this requirement goes down by the time you’re deploying it toward useful things. Naively (and you might be able to do better than this), you could run f copies of ~HSIFAUH and get to human-level performance on some relevant tasks around the same time the reckless team takes over the world. So the question is whether N = f is a big enough N. In the train-then-deploy framework, it seems today like training takes much more compute than deploying, so that makes it easier for the leading team to get N >> f once all the resources dedicated to training get freed up. It should be possible to weaken the online version and get some of this speedup.
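A worked numeric version of that comparison, with f, h, and d invented:

```python
f = 10.0       # leading team's compute / reckless team's compute
h = 1_000_000  # timesteps to predict human behavior
d = 50_000     # extra timesteps for takeover-level planning (d > 0)

rate = 1.0                    # reckless team's timesteps per unit wall-clock
t_leading = h / (f * rate)    # leading team reaches ~SolomonoffPredict
t_reckless = (h + d) / rate   # reckless team reaches ~AIXI takeover
print(t_reckless / t_leading) # 10.5, i.e. >= f whenever d >= 0

# Wall-clock lead during which the leading team can run ~f copies of ~HSIFAUH:
print(t_reckless - t_leading)
```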
By ~HSIFAUH I guess you mean a practical implementation/approximation of HSIFAUH. Can you describe how you would do that using ML, so I can more easily compare with other proposals for doing human imitations using ML?
I don’t know how to do this. But it’s the same stuff the reckless team is doing to make standard RL powerful.
Timesteps required for AIXI to predict human behavior: h
Timesteps required for AIXI to take over the world: h + d
I think d << h.
Timesteps required for Solomonoff induction trained on human policy to predict human behavior: h
Timesteps required for Solomonoff induction trained on human policy to phish at human level: h
Timesteps required for HSIFAUH to phish at human level: ~h
In general, I agree AIXI will perform much more strongly than HSIFAUH at an arbitrary task like phishing (and ~AIXI will be stronger than ~HSIFAUH), but the question at stake is how plausible it is that a single AI team with some compute/data advantage relative to incautious AI teams could train ~HSIFAUH to phish well while other teams are still unable to train ~AIXI to take over the world. And the relevant question for evaluating that is whether d << h. So even if ~AIXI could be trained to phish with less data than h, I don’t think that’s the relevant comparison. I also don’t think it’s particularly relevant how superhuman AIXI is at phishing when HSIFAUH can do it at a human level.