This theory of goal-directedness has the virtue of being closely tied to what we care about:
--If a system is goal-directed according to this definition, then (probably) it is the sort of thing that might behave as if it has convergent instrumental goals. It might, for example, deceive us and then turn on us later. Whereas if a system is not goal-directed according to this definition, then absent further information we have no reason to expect those behaviors.
--Obviously we want to model things efficiently, so we are independently interested in what the most efficient way to model something is. Thus this definition doesn’t make us go that far out of our way to compute, so to speak.
On the other hand, I think this definition is not completely satisfying, because it doesn’t help much with the most important questions:
--Given a proposal for an AGI architecture, is it the sort of thing that might deceive us and then turn on us later? Your definition answers: “Well, is it the sort of thing that can be most efficiently modelled as an EU-maximizer? If yes, then yes; if no, then no.” The problem with this answer is that trying to see whether or not we can model the system as an EU-maximizer involves calculating out the system’s behavior and comparing it to what an EU-maximizer (or worse, a range of EU-maximizers with various relatively simple or salient utility and credence functions) would do, and if we are doing that we can probably just answer the will-it-deceive-us question directly. Alternatively perhaps we could look at the structure of the system—the architecture—and say “see, this here is similar to the EU-max algorithm.” But if we are doing that, then again, maybe we don’t need this extra step in the middle; maybe we can jump straight from looking at the structure of the system to inferring whether or not it will act like it has convergent instrumental goals.
Oh right, I forgot, the $1 incentive gives people an ulterior motive for signing. :/ OK, so this is part of the answer to my original question—I had not noticed that fact and thus overestimated their usefulness.
I wonder also if the conflicts that remain are nevertheless more peaceful. When hunter-gatherer tribes fight each other, they often murder all the men and enslave the women, or so I hear. Similar things happened with farmer societies sometimes, but sometimes the conquered just became new territories that had to pay tribute, levy conscripts, and endure the occasional pillage. And then industrialized modern nations even have rules about how you can’t rape and pillage and genocide and sell into slavery the citizens of your enemy. Perhaps AI conflicts would be even more peaceful. For example, perhaps they would look something more like fancy maneuvers, propaganda, and hacking, with swift capitulation by the “checkmated” AI, which is nevertheless allowed to continue existing with some smaller amount of influence over the future. Perhaps no property would even be destroyed in the entire war!
Just spitballing here. I feel much less confident in this trend than in the trend I pointed out above.
But that doesn’t seem like a big cost to me. It seems that other methods of solving coordination problems have similarly high or even higher costs—e.g. campaigning to raise awareness so that people vote for legislation to solve the problem… Think of how many petitions there are on Change.org and how many signatures they regularly get. Now imagine that you got paid $1 on average for each one that you signed. People would be making shittons of money just by logging into change.org and browsing through proposals. Until, that is, a large portion of the population starts regularly doing this… then the money flow shrinks but change starts happening!
Yes, it’s moving the cost of failure to the person sponsoring the contract, but I think for many of these problems there should be people with enough money and altruism willing to take the risk. E.g. political campaigns regularly spend comparable sums. And as you perhaps hint at with the game theory point, it’s different when the risk is all on one person—because it means we can be much more confident that the contract will trigger, conditional on someone taking the risk to fund it, and thus the risk is actually much smaller.
The first point you make doesn’t apply to dominant assurance contracts, which pay signers in the case where not enough people sign. I don’t know of any real-world instance of dominant assurance contracts being used, but boy do they seem like they would be super effective. Imagine during the 2016 election: “Sign this petition if you want Michelle Obama to be president! If at least 100 million people sign, you promise to vote for her. Otherwise, you’ll get a $1 gift card to Target.” Note that even in the unlikely event that this gets 99 million signatures, it would cost the organizer an order of magnitude less than Clinton spent on her campaign. More likely it would either get ~5 million signatures (because Michelle just isn’t as popular as the organizer thought) or >100 million.
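To make the organizer’s exposure concrete, here is a toy sketch of the refund math in that example (all numbers hypothetical, taken from the scenario above; the point is just that the organizer pays only when the threshold is missed):

```python
def organizer_refund_cost(n_signers: int, threshold: int, refund: float) -> float:
    """Refund bill for a dominant assurance contract: signers are paid only
    if too few people sign; if the threshold is met, the contract triggers
    and no refunds are owed."""
    return n_signers * refund if n_signers < threshold else 0.0

# The (hypothetical) Michelle Obama petition from above, at $1 per signer:
print(organizer_refund_cost(5_000_000, 100_000_000, 1.0))    # $5M: petition flops early
print(organizer_refund_cost(99_000_000, 100_000_000, 1.0))   # $99M: worst case, just misses
print(organizer_refund_cost(120_000_000, 100_000_000, 1.0))  # $0: threshold met, contract triggers
```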
Petitions and indiegogo campaigns aren’t dominant assurance contracts as far as I know. I agree that there is a cost to get people to understand them, but that’s true for all sorts of complicated financial instruments like mortgages which we have no problem with.
I think voting for third-party candidates would be significantly improved by assurance contracts. Ditto for marches & rallies, and things like the Free State Project. (Imagine how much of a fail the FSP would have been if they used a more traditional method.) And I think maybe also kickstarter stuff? IDK, maybe this disagreement comes down to a disagreement about the meaning of “significantly.”
I think I’d prefer calling it “acausal trade vs. pre-causal acausal trade” because it seems that the underlying phenomenon is exactly the same in both cases; it’s the surrounding circumstances that differ. But this is just a minor terminological quibble.
Yeah, me too. Well, I won’t exactly have done a full lit review by the time the blog post comes out… my post is mostly about other things. So don’t get your hopes up too high. A good idea for future work though… maybe we can put it on AI Impacts’ todo list.
I very much agree. Historical analogies:
To a tiger, human hunter-gatherers must be frustrating and bewildering in their ability to coordinate. “What the hell? Why are they all pouncing on me when I jumped on the little one? The little one is already dead anyway, and they are risking their own lives now for nothing! Dammit, gotta run!”
To a tribe of hunter-gatherers, farmers must be frustrating and bewildering in their ability to coordinate. “What the hell? We pillaged and slew that one village real good, they sure didn’t have enough warriors left over to chase us down… why are the neighboring villages coming after us? And what’s this—they have professional soldiers with fancy equipment riding horses? Somehow hundreds—no, thousands—of farmers cooperated over a period of several years to make this punitive expedition possible! How were we to know they would go to such lengths?”
To the nations colonized by the Europeans, it must have been pretty interesting how the Europeans were so busy fighting each other constantly, yet somehow managed to more or less peacefully divide up Africa, Asia, South America, etc. among themselves for colonization. Take the Opium Wars and the Boxer Rebellion for example. I could imagine a Hansonian prophet in a Native American tribe saying something like “Whatever laws the European nations use to keep the peace among themselves, we will benefit from them also; we’ll register as a nation, sign treaties and alliances, and rely on the same balance of power.” He would have been disastrously wrong.
I expect something similar to happen with us humans and AGI, if there are multiple AGI. “What? They all have different architectures and objectives, not to mention different users and owners… we even explicitly told them to compete with each other! Why are they doing X.… noooooooo....” (Perhaps they are competing with each other furiously, even fighting each other. Yet somehow they’ll find a way to cut us out of whatever deal they reach, just as European powers so often did for their various native allies.)
Voting for third-party candidates. Organizing marches and rallies. Things like the Free State Project (why aren’t lots of other subcultures and political factions doing that?) Sweet parties at my house.
Now that I think more about it, clubs/churches do this sort of thing all the time informally, e.g. survey the crowd and ask how many people would come to the event if it were held, and then hold the event iff at least x people say they would come, with social disapproval being the punishment for people who say they would come and then don’t.
And of course, the sort of things Kickstarter funds. So I guess that’s part of my answer right there.
Yes, glad you asked. The reason is that in the comments of my original post, it turned out that Paul Christiano agrees that a big world government might be able to pull ahead of the rest of the world and achieve DSA even in a soft takeoff scenario. Since even Paul agrees on that point, I figured I’d make a note and then move on to talk about the cases we still disagree on: sub-governmental entities like corporations and academic research projects.
But I mean… if you have good examples of government sources I’d be happy to hear them!
I agree for rockets and electric vehicles (and online payments?), but I think the jury is still out on the rest. Give it a few more years and we’ll see whether those efforts bear fruit.
I’m interested in both and even neither, but of course the closer the analogy to AGI the better.
Yeah, that’s a good fork of possibilities to think about. As for the issue of patents: insofar as someone has a lead because no one else is trying (because of patents), that’s a disanalogy between their case and AGI, because presumably patent law won’t be powerful enough to prevent multiple projects from racing hard for AGI.
It helps me understand your argument more clearly. I still disagree with it though. I object to this:
zip(reasonable pair) > zip(policy+complex bias facts)
I claim this begs the question against OSH. If OSH is true, then zip(reasonable pair) ≈ zip(policy).
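To make the zip(·) notation concrete, here is a toy sketch (assuming Python’s zlib, with made-up byte strings standing in for the policy, a simple reward description, and the complex bias facts). A real compressor is only a crude proxy for description length, and this illustrates the notation and the conditional shape of my claim rather than whether OSH is actually true: pairing the policy with something that has a short description costs almost nothing extra, while pairing it with incompressible extra facts costs a lot.

```python
import random
import zlib

# Hypothetical stand-ins, purely to make the notation concrete:
policy = bytes(i % 7 for i in range(10_000))               # a highly regular "policy"
short_reward = b"reward: prefer states with many sevens"   # a simple reward description
random.seed(0)
bias_facts = bytes(random.randrange(256) for _ in range(10_000))  # junk zlib cannot compress

def ziplen(*parts: bytes) -> int:
    """Compressed length of the concatenation -- the zip(...) in the notation above."""
    return len(zlib.compress(b"".join(parts)))

print(ziplen(policy))                # zip(policy)
print(ziplen(policy, short_reward))  # only a few dozen bytes more than zip(policy)
print(ziplen(policy, bias_facts))    # roughly the full 10,000 bytes more
```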
Humans who believe in God still haven’t concluded that deception is a good strategy, and they have evidence about the non-omnipotence and non-omnibenevolence of God similar to the evidence an AI might have about its creators.
(Though maybe I’m wrong about this claim—maybe if we ask some believers they would tell us “yeah I am just being good to make it to the next life, where hopefully I’ll have a little more power and freedom and can go buck wild.”)
I agree that the decomposition of physics into laws + initial conditions is much simpler than the decomposition of a human policy into p, R. (Is that what you mean by “more natural”?) But this is not relevant to my argument, I think.
I feel that our conversation now has branched into too many branches, some of which have been abandoned. In the interest of re-focusing the conversation, I’m going to answer the questions you asked and then ask a few new ones of my own.
To your questions: For me to understand your argument better I’d like to know more about what the pieces represent. Is file1 the degenerate pair and file2 the intended pair, and image1 the policy and image2 the bias-facts? Then what is the “unzip” function? Pairs don’t unzip to anything. You can apply the function “apply the first element of the pair to the second” or you can apply the function “do that, and then apply the MAXIMIZE function to the second element of the pair and compute the difference.” Or there are infinitely many other things you can do with the pair. But the pair itself doesn’t tell you what to do with it, unlike a zipped file which is like an algorithm—it tells you “run me.”
I have two questions. 1. My central claim—which I still uphold as not ruled out by your arguments (though of course I don’t actually believe it)—is the Occam Sufficiency Hypothesis: “The ‘intended’ pair is the simplest way to generate the policy.” So, basically, what OSH says is that within each degenerate pair is a term, pi (the policy), and when you crack open that term and see what it is made of, you see p(R), the intended planner applied to the intended reward function! Thus, a simplicity-based search will stumble across <p,R> before it stumbles across any of the degenerate pairs, because it needs p and R to construct the degenerate pairs. What part of this do you object to?
2. Earlier you said, “given reasonable assumptions, the human policy is simpler than all pairs”. What are those assumptions?
Once again, thanks for taking the time to engage with me on this! Sorry it took me so long to reply, I got busy with family stuff.
I found this very helpful, thanks! I think this is maybe what Yudkowsky was getting at when he brought up adversarial examples here.
Adversarial examples are like adversarial Goodhart. But an AI optimizing the universe for its imperfect understanding of the good is instead like extremal Goodhart. So, while adversarial examples show that cases of dramatic non-overlap between human and ML concepts exist, it may be that you need an adversarial process to find them with non-negligible probability, in which case we are fine.
This optimistic conjecture could be tested by looking at what image *maximally* triggers an ML classifier. Does the perfect cat, the most cat-like cat according to the ML model, actually look like a cat to us humans? If so, then by analogy the perfect utopia according to ML would also be pretty good. If not...
Perhaps this paper answers my question in the negative; I don’t know enough ML to be sure. Thoughts?
“If you want to visualize features, you might just optimize an image to make neurons fire. Unfortunately, this doesn’t really work. Instead, you end up with a kind of neural network optical illusion — an image full of noise and nonsensical high-frequency patterns that the network responds strongly to.”
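For what it’s worth, here is a minimal sketch of the naive version of the experiment I have in mind, assuming PyTorch/torchvision and a pretrained ImageNet classifier. The class index and hyperparameters are just illustrative, and input normalization plus the regularization tricks the paper discusses are omitted, which is presumably why this naive version tends to produce the noise described above rather than a recognizable cat.

```python
# Naive activation maximization: ascend the "tabby cat" logit of a pretrained
# classifier starting from random noise, then eyeball whether the result looks
# like a cat. (Without extra regularization, expect high-frequency noise.)
import torch
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

target_class = 281  # ImageNet class index for "tabby cat"
image = torch.randn(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(512):
    optimizer.zero_grad()
    loss = -model(image)[0, target_class]  # maximize the target logit
    loss.backward()
    optimizer.step()

# `image` now holds the network's "most cat-like" input; save or plot it
# (e.g. with torchvision.utils.save_image) and judge by eye.
```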