I think you are looking at this wrong. Yes, they had help from local rebellions and malcontents. So would an AGI. An AGI taking over the world wouldn’t necessarily look like robots vs. humans; it might look like the outbreak of World War 3 between various human factions, except that the AGI was manipulating things behind the scenes and/or acting as a “strategic advisor” to one of the factions. And when the dust settles, somehow the AGI is in charge...
So yeah, I think it really is fair to say that the Spanish managed to conquer empires of millions with just a few hundred men. Twice.
Hmm, I like that. I wonder what A&M would say in response. And I agree this is an important and relevant difference between the case of preferences and the case of science.
I still don’t think A&M show that the simplest explanation is a degenerate decomposition. They show that if it is, then Occam’s Razor won’t be sufficient, and moreover that there are some degenerate decompositions pretty close to maximally simple. But they don’t do much to rule out the possibility that the simplest explanation is the intended one.
I agree. I don’t think agents will outcompete tools in every domain; indeed in most domains perhaps specialized tools will eventually win (already, we see humans being replaced by expensive specialized machinery, or expensive human specialists, in lots of places). But I still think that there will be strong competitive pressure to create agent AGI, because there are many important domains where agency is an advantage.
The ultimate test will be seeing whether the predictions it makes come true—whether agenty mesa-optimizers arise often, whether humans with tools get outcompeted by agent AGI.
In the meantime, it’s not too hard to look for confirming or disconfirming evidence. For example, the fact that militaries and corporations that make a plan and then task their subordinates with strictly following the plan invariably do worse than those who make a plan and then give their subordinates initiative and flexibility to learn and adapt on the fly… seems like confirming evidence. (See: agile development model, the importance of iteration and feedback loops in startup culture, etc.) Whereas perhaps the fact that AlphaZero is so good despite lacking a learning module is disconfirming evidence.
As for a test, well, we’d need something whose outcome proponents and opponents actually disagree about, and that might be hard to find. Most tests I can think of now don’t work because everyone would agree on what the probable outcome is. I think the best I can do is: Someday soon we might be able to test an agenty architecture and a non-agenty architecture in some big complex novel game environment, and this conjecture would predict that for sufficiently complex and novel environments the agenty architecture would win.
I feel like there’s a big difference between “similar complexity” and “the same complexity.” Like, if we have theory T and then we have theory T* which adds some simple unobtrusive twist to it, we get another theory which is of similar complexity… yet realistically an Occam’s-Razor-driven search process is not going to settle on T*, because you only get T* by modifying T. And if I’m wrong about this then it seems like Occam’s Razor is broken in general; in any domain there are going to be ways to turn T’s into T*’s. But Occam’s Razor is not broken in general (I feel).
Maybe this is the argument you anticipate above with “...we aren’t actually choosing randomly.” Occam’s Razor isn’t random. Again, I might agree with you that intuitively Occam’s Razor seems more useful in physics than in preference-learning. But intuitions are not arguments, and anyhow they aren’t arguments that appeared in the text of A&M’s paper.
For the past year I’ve been thinking about the Agent vs. Tool debate (e.g. thanks to reading CAIS/Reframing Superintelligence) and also about embedded agency and mesa-optimizers and all of these topics seem very related now… I keep finding myself attracted to the following argument skeleton:
Rule 1: If you want anything unusual to happen, you gotta execute a good plan.
Rule 2: If you want a good plan, you gotta have a good planner and a good world-model.
Rule 3: If you want a good world-model, you gotta have a good learner and good data.
Rule 4: Having good data is itself an unusual happenstance, so by Rule 1 if you want good data you gotta execute a good plan.
Putting it all together: Agents are things which have good planner and learner capacities and are hooked up to actuators in some way. Perhaps they also are “seeded” with a decent world-model to start off with. Then, they get a nifty feedback loop going: They make decent plans, which allow them to get decent data, which allows them to get better world-models, which allows them to make better plans and get better data so they can get great world-models and make great plans and… etc. (The best agents will also be improving on their learning and planning algorithms! Humans do this, for example.)
Empirical conjecture: Tools suck; agents rock, and that’s why. It’s also why agenty mesa-optimizers will arise, and it’s also why humans with tools will eventually be outcompeted by agent AGI.
My baby daughter was born two weeks ago, and in honor of her existence I’m building a list of about 100 technology-related forecasting questions, which will resolve in 5, 10, and 20 years. Questions like “By the time my daughter is 5/10/20 years old, the average US citizen will be able to hail a driverless taxi in most major US cities.” (The idea is, tying it to my daughter’s age will make it more fun and also increase the likelihood that I actually go back and look at it 10 years later.)
I’d love it if the questions were online somewhere so other people could record their answers too. Does this seem like a good idea? Hive mind, I beseech you: Help me spot ways in which this could end badly!
On a more positive note, any suggestions for how to do it? Any expressions of interest in making predictions with me?
Thanks! OK, so I agree that normally in doing science we are fine with just predicting what will happen; there’s no need to decompose into Laws and Conditions. Whereas with value learning we are trying to do more than just predict behavior; we are trying to decompose into Planner and Reward so we can maximize Reward.
However the science case can be made analogous in two ways. First, as Eigil says below, realistically we don’t have access to ALL behavior or ALL events, so we will have to accept that the predictor which predicted well so far might not predict well in the future. Thus if Occam’s Razor settles on weird degenerate predictors, it might also settle on one that predicts well up until time T but then predicts poorly after that.
Second (this is the way I went, with counterfactuals), science isn’t all about prediction. Part of science is about answering counterfactual questions like “what would have happened if...” And typically the way to answer these questions is by decomposing into Laws + Conditions, doing a surgical intervention on the Conditions, and then applying the same Laws to the new Conditions.
So, for example, if we use Occam’s Razor to find Laws+Conditions for our universe, and somehow it settles on the degenerate pair “Conditions := null, Laws := sequence of events E happens” then all our counterfactual queries will give bogus answers—for example, “what would have happened if we had posted the nuclear launch codes on the Internet?” Answer: “Varying the Conditions but holding the Laws fixed… it looks like E would have happened. So yeah, posting launch codes on the Internet would have been fine, wouldn’t have changed anything.”
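Here is a toy sketch of that point in Python. Everything in it is a made-up illustration, not anything from A&M’s paper: the “universe” is just a starting number that gets doubled a few times, and the names `intended_laws`, `degenerate_laws`, and `counterfactual` are hypothetical. Both decompositions reproduce the observed events exactly, but only the intended one gives a sensible answer when we intervene on the Conditions.

```python
from typing import Callable, List

Events = List[int]


def intended_laws(conditions: int) -> Events:
    """Intended decomposition: the events genuinely depend on the conditions.

    Toy assumption: the "law" doubles the starting value at each step.
    """
    return [conditions * 2 ** step for step in range(4)]


ACTUAL_CONDITIONS = 3
# The sequence of events E that we actually observed.
OBSERVED_EVENTS = intended_laws(ACTUAL_CONDITIONS)


def degenerate_laws(conditions: object) -> Events:
    """Degenerate decomposition: "E just happens", whatever the conditions are."""
    return list(OBSERVED_EVENTS)


def counterfactual(laws: Callable[..., Events], new_conditions: object) -> Events:
    """Answer "what would have happened if...": hold the Laws fixed, vary the Conditions."""
    return laws(new_conditions)


# Both decompositions predict the actual history equally well.
fits_intended = intended_laws(ACTUAL_CONDITIONS) == OBSERVED_EVENTS
fits_degenerate = degenerate_laws(None) == OBSERVED_EVENTS

# They come apart only on counterfactual queries.
what_if_intended = counterfactual(intended_laws, 5)
what_if_degenerate = counterfactual(degenerate_laws, 5)
```

Under these assumptions, `what_if_degenerate` is identical to the observed history no matter what intervention we try, which is the toy version of the “posting launch codes would have been fine” answer.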
I don’t follow?
The trick is that you can use the simplest method for constructing E in your statement “L=0 and E just happens.” So e.g. if you have some simple Laws l and Conditions c such that l(c) = E, your statement can be “L=0 and l(c) just happens.”
Thanks! I’m not sure I follow you. Here’s what I think you are saying:
--Occam’s Razor will be sufficient for predicting human behavior of course; it just isn’t sufficient for finding the intended planner-reward pair. Because (A) the simplest way to predict human behavior has nothing to do with planners and rewards, and so (B) the simplest planner-reward pair will be degenerate or weird as A&M argue.
--You agree that this argument also works for Laws+Initial Conditions; Occam’s Razor is generally insufficient, not just insufficient for inferring preferences of irrational agents!
--You think the argument is more likely to work for inferring preferences than for Laws+Initial Conditions though.
If this is what you are saying, then I agree with the second and third points but disagree with the first—or at least, I don’t see any argument for it in A&M’s paper. It may still be true, but further argument is needed. In particular their arguments for (A) are pretty weak, methinks—that’s what my section “Objections to the arguments for step 2” is about.
Edit to clarify: By “I agree with the second point” I mean I agree that if the argument works at all, it probably works for Laws+Initial Conditions as well. I don’t think the argument works though. But I do think that Occam’s Razor is probably insufficient.
Yep, agreed. I want all my friends and family to read the series… and then have a conversation with me about the ways in which it oversimplifies and misleads, in particular the higher mind vs. primitive mind bit.
On balance though I think it’s great that it exists and I predict it will be the gateway drug for a bunch of new rationalists in years to come.
Would it be fair to say that moral indefinability is basically what Yudkowsky was talking about with his slogan “Value is complex”?
What about the stance of Particularism in moral philosophy? On the face of it it seems very different, but I think it may be getting at a similar phenomenon.
Wow, now I take the “But what if a bug puts a negation on the utility function” AGI failure mode more seriously:
One of our code refactors introduced a bug which flipped the sign of the reward. Flipping the reward would usually produce incoherent text, but the same bug also flipped the sign of the KL penalty. The result was a model which optimized for negative sentiment while preserving natural language. Since our instructions told humans to give very low ratings to continuations with sexually explicit text, the model quickly learned to output only content of this form. This bug was remarkable since the result was not gibberish but maximally bad output. The authors were asleep during the training process, so the problem was noticed only once training had finished. (From OpenAI https://openai.com/blog/fine-tuning-gpt-2/)
Might be worth adding a link to this episode in the text?
Thanks, this is really cool!
I’m a bit concerned about this sort of thing: “The subagents argument offers a theoretical basis for the idea that humans have lots of internal subagents, with competing wants and needs, all constantly negotiating with each other to decide on externally-visible behavior.”
A worry I have about the standard representation theorems is that they prove too much; if everything can be represented as having a utility function, then maybe it’s not so useful to talk about utility functions. Similarly now I worry: I thought when people talked about subagent theories of mind, they meant something substantial by this—not merely that the mind has incomplete (though still acyclic) preferences!
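To make the “proves too much” worry concrete, here is a minimal Python sketch. It is not any particular representation theorem, just the trivial construction I have in mind: for any behavior at all, define a utility function that gives 1 to whatever the system actually does and 0 to everything else. The names `trivial_utility`, `best_action`, and the example policy are hypothetical.

```python
from typing import Callable, Hashable, Iterable

Policy = Callable[[Hashable], Hashable]
Utility = Callable[[Hashable, Hashable], float]


def trivial_utility(policy: Policy) -> Utility:
    """Build a utility function that the given policy maximizes by construction."""
    def utility(state: Hashable, action: Hashable) -> float:
        # Utility 1 for whatever the policy does in this state, 0 otherwise.
        return 1.0 if action == policy(state) else 0.0
    return utility


def best_action(utility: Utility, state: Hashable, actions: Iterable[Hashable]) -> Hashable:
    """Pick the utility-maximizing action in a state."""
    return max(actions, key=lambda action: utility(state, action))


# An arbitrary, unprincipled-looking behavior (illustrative assumption).
ARBITRARY_BEHAVIOR = {"monday": "sleep", "tuesday": "shout", "wednesday": "sleep"}
ACTIONS = ["sleep", "shout", "dance"]


def erratic_policy(state: Hashable) -> Hashable:
    return ARBITRARY_BEHAVIOR[state]


u = trivial_utility(erratic_policy)
# The "utility maximizer" reproduces the arbitrary behavior exactly.
reproduced = {state: best_action(u, state, ACTIONS) for state in ARBITRARY_BEHAVIOR}
```

Since `reproduced` matches the arbitrary behavior exactly, saying “this system maximizes a utility function” tells us nothing by itself; the interesting content has to come from constraints on what the utility function is allowed to look like.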
I’m glad you are interested, and I’d love to hear your thoughts on the paper if you read it. I’d love to talk with you too; just send me an email when you’d like and we can skype or something.
What do you mean by “the more technical version of the problem” exactly?
My take right now is that algorithmic similarity (and instantiation), at least the versions of it relevant for consciousness, decision theory, and epistemology, will have to be either a brute empirical fact about the world or a subjective fact about the mind of the agent reasoning about it (like priors and utility functions). What it will not be is some reasonably non-arbitrary property or relation with interesting and useful properties (like Nash equilibria, centers of mass, and temperature).
Thanks, this is a good write-up!
Many years ago I wrote my undergraduate thesis on the waterfall problem (though it went by another name to me). Basically, I painstakingly and laboriously transformed an arbitrary human into an arbitrary rock of sufficient size, via a series of imperceptibly tiny steps, none of which can be felt by the human. (I did this in imagination, not in reality, to be clear.) The point was to see if any of the steps seemed like good places to draw a line and say “Here, consciousness is starting to go out; the system is starting to be less of a person.” As a result I became fairly convinced that there aren’t any good places to draw the line. So I guess I’m a waterfall apologist now!
I particularly like your “Logical vs. physical risk aversion” distinction, and agree that we should prioritize reducing logical risk. I think acausal trade makes this particularly concrete. If we make a misaligned superintelligence that “plays nice” in the acausal bargaining community I’d think that’s better than making an aligned superintelligence that doesn’t, because overall it matters far more that the community is nice than that it have a high population of people with our values.
I also really like your point about how providing evidence that AI safety is difficult may be one of the most important reasons to do AI safety research. I guess I’d like to see some empirically grounded analysis of how likely it is that the relevant policymakers and so forth will be swayed by such things. So far it seems like they’ve been swayed by direct arguments that the problem is hard, and not so much by our failures to make progress. If anything, the failure of AI safety researchers to make progress seems to encourage their critics.