I think a better framing is “does this model pay rent in anticipated experiences?” (using the local lingo), not “legitimate or illusory”.
I agree that even an AGI would have shifting goals. But at least at every single instant of time one assumes that there is a goal it optimizes for. Or a set of rules it follows. Or a set of acceptable behaviors. Or maybe some combination of those. Humans are not like that. There is no inner coherence, ever; we just do stuff we are compelled to do in the moment.
I don’t think your listed points are the crux of the difference, though maybe AI (self-)interpretability is an important one. My personal feeling is that what matters is that humans are not coherent agents with goals; we just do things, often sphexing, being random or, conversely, routine, not acting to advance any of our stated goals.
I find the arguments extremely unconvincing; they are very much cherry-picked. If you think for 5 minutes, you can find equally good examples of good intentions leading to unexpected disastrous consequences in the long or medium term. Give it a try. In addition, there is nothing to compare these “positive influence” actions against. They tend to be implicitly compared against a hypothetical counterfactual world where no action is taken, even though we have no way of knowing how such a world would develop.
Here are a couple of counter-examples where doing medium- and long-term good backfires, after 30 sec of thinking:
Colonization, an obvious long-term good for Europeans, ended up wiping out most of the American indigenous population.
Dissolving the Soviet Union led to several bloody wars in Europe.
Spreading the word of Jesus or Mohammad resulted in the extermination of millions of people over the millennia.
And one fictional but illuminating example is Asimov’s The End of Eternity.
Basically, the claim that “We can make their lives go better” long-term holds no water. Your predictability horizon dies off pretty quickly with time.
That misses my point, which is that trusting the judgment of someone who proclaims opaquely calculated but purportedly accurate estimates of low-probability events, without an extremely good calibration track record, is a bad idea.
I’d focus on noticing the Soldier mindset (“arguments as soldiers”). Find a personal “cherished belief” that would be really hard to let go of, and have your partner poke at it. Not a token belief, but a real one. Then notice your emotional state and the inner need to fight for it. That’s the mental state one wants to remember (“but I am right and this is true!”) and to notice when it comes up in other situations.
Small probabilities are hard to calculate accurately. How did the boy know that it was a 5% chance and not a 0.001% chance?
“Don’t be a straw vulcan”.
Since you switched the moderation to “easy-going”...
I have hinted at a definition in an old post https://www.lesswrong.com/posts/NptifNqFw4wT4MuY8/agency-is-bugs-and-uncertainty. Basically we use agency as a black-box description of something.
Of course, as generally agreed, agency is a convenient intentional stance model. There is no agency in a physical gears-level description of a system.
But this is circular. An abstraction for whom? What even is an abstraction, when you’re in the process of defining an agent? Is there some agent-free definition of an abstraction implicitly being invoked here?
To build it up from first principles, we must start with a compressible (not fully random) universe, at a minimum, because “embedded agents”, whatever they might turn out to be, are defined by having a somewhat accurate (i.e. lossily compressed) internal model of the world, so some degree of compressibility is required; a toy illustration of this point follows the outline below. (Though maybe useful lossy compression of a random stream is a thing, I don’t know.)
Next, one would identify some persistent features of the world that look like they convert free energy into entropy (note that a lot of “natural” systems behave like that, say, stars).
Finally, merging the two: a feature of the world that contains what appears to be a miniature model of the (relevant part of the) world, and which also converts free energy into entropy to persist both the model and “itself”, would be sort of close to an “agent”.
There are plenty of holes in this outline, but at least there is no circularity, as far as I can tell.
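To make the “compressible (not fully random)” requirement concrete, here is the toy illustration promised above. It is a pure illustration with made-up “worlds” (nothing from the post), showing only that structured data compresses while uniformly random data does not:

```python
import os
import zlib

# Toy check of the "compressible universe" premise: a world with regularities
# compresses well; a fully random one essentially does not, so only the former
# admits an internal model much smaller than the world itself.
structured = b"day night " * 10_000         # a "world" with obvious regularities
random_world = os.urandom(len(structured))  # a "fully random" world of the same size

for name, world in [("structured", structured), ("random", random_world)]:
    ratio = len(zlib.compress(world, 9)) / len(world)
    print(f"{name:10} world compresses to {ratio:.1%} of its original size")

# Typically the structured world shrinks to well under 1% of its size,
# while the random one stays at roughly 100%.
```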
Just commented on IRC the other day that
The mode of human extinction will not be “I must tile the universe with paperclips and humans are in the way”, but more like “Oops, I stepped on that bug”
This is unrelated to quantum mechanics in any way. The limit as the predictor’s accuracy approaches perfection is smooth. That is, it makes sense to 1-box even for a 51% accurate predictor, and the expected utility of 1-boxing goes up with accuracy, reaching its maximum for a perfect predictor.
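To spell the smoothness claim out, here is a quick sketch, assuming the standard Newcomb payoffs of $1,000,000 in the opaque box and $1,000 in the transparent one (the numbers are my assumption, not anything from the post):

```python
# Expected value of 1-boxing vs 2-boxing as a function of predictor accuracy p,
# with the standard payoffs: $1M in the opaque box, $1k in the transparent one.
BIG, SMALL = 1_000_000, 1_000

def ev_one_box(p):
    # the opaque box is full whenever the predictor correctly foresaw 1-boxing
    return p * BIG

def ev_two_box(p):
    # the opaque box is full only when the predictor got it wrong
    return (1 - p) * BIG + SMALL

for p in (0.51, 0.6, 0.9, 0.99, 1.0):
    print(f"p={p:.2f}  1-box: {ev_one_box(p):>9,.0f}  2-box: {ev_two_box(p):>9,.0f}")

# 1-boxing already wins at p = 0.51 (510,000 vs 491,000), and its advantage grows
# smoothly with accuracy; nothing special happens at the perfect-predictor limit.
```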
Somewhat unrelated and probably silly… Why reward the agent directly instead of letting it watch humans act in their natural environment and leaving it to build a predictive model of humans?
Again, it’s good to have a box a human could not get in or out of, as a matter of course. Such a box should not appreciably change any serious safety considerations.
Making credible promises (and occasionally breaking them) is a part of living in a society, and humans are social animals. No need for moral absolutism.
Your takeaways are great, your justifications are trash :) How do I know? Nearly every ethical system ends up there, regardless of where it starts from, so your premises have no bearing on the outcome.
To quote Aella from https://aella.substack.com/p/my-attempts-to-sensemake-ai-risk (emphasis mine).
if you’re granting a superintelligent AGI and you still think it won’t be able to get out of the researcher’s box (like, it’s on a computer disconnected from the internet and wants you to connect it to the internet, or something), then I don’t think you’re properly imagining superintelligence. Maybe this is a bit silly, but for my own calibration I’ve often imagined a bunch of five-year-olds who’ve been strictly instructed not to pass the key through your prison door slot, and you have to convince them to do it. The intelligence gap between you and five year olds is probably much smaller than the gap between you and an AGI, but probably you could convince the five year olds to let you out. People arguing they just wouldn’t let an AGI take any sort of control of anything strikes me as silly as the five year olds swearing they won’t let the adult out no matter what. Most other arguments around human beings controlling the AGI in any way once it happens feels equally as silly. You just can’t properly comprehend a thing vastly smarter than you!
If an AGI with a goal of escaping emerges, there is nothing you can do about it. It may take a bit longer if it is disconnected from everything by some “failsafe”, but a human idea of “disconnected from everything” is pathetically misguided compared to something many times smarter. Just… drop the reliance on boxing an AI at all. As johnswentworth said, might as well do it, anyway, but it should not factor in your safety calculations.
I have a similar confusion. I thought the definition of winning is objective (and frequentist): after a large number of identically set up experiments, the winning decision is the one that gains the most value. In Newcomb’s it’s one-boxing, in the twin prisoner dilemma it’s cooperating, in other PDs it depends on the details of your opponent and on your knowledge of them, and in counterfactual mugging it depends on the details of how trustworthy the mugger is, on whom it chooses to pay or charge, etc.; the problem is underspecified as presented. If you have an “unfair” Omega who punishes a specific DT agent, the winning strategy is to be the Omega-favored agent.
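For the twin prisoner dilemma, the frequentist reading can be made very literal. A toy sketch, with the textbook 3/1/5/0 payoffs assumed on my part:

```python
# Twin prisoner's dilemma under the "repeat the experiment many times" reading.
# Payoffs (3 for mutual cooperation, 1 for mutual defection, 5/0 for the mixed
# cases) are the textbook numbers, assumed here purely for illustration.
PAYOFF = {("C", "C"): 3, ("D", "D"): 1, ("C", "D"): 0, ("D", "C"): 5}

def twin_game(my_move):
    twin_move = my_move  # the twin, by construction, mirrors your decision
    return PAYOFF[(my_move, twin_move)]

print("cooperate:", twin_game("C"))  # 3 per game
print("defect:   ", twin_game("D"))  # 1 per game
# Against a true twin only the diagonal outcomes are reachable, so over many
# repetitions the cooperator simply accumulates more value than the defector.
```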
There is no need for counterfactuals, by the way; just calculate which strategy nets the highest EV. Just like with Newcomb’s, in some counterfactual mugging setups only the agents who pay when they lose get a chance to win. If you are the type of agent who doesn’t pay, the CFM predictor will not give you a chance to win. This is like a lottery, only you pay after losing, which does not matter if the predictor knows what you would do. Not paying when you lose is equivalent to not buying a lottery ticket when the expected winnings exceed the ticket price. I don’t know if this counts as decision theory, probably not.
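For concreteness, here is that lottery comparison with the conventional counterfactual mugging stakes ($100 demanded on tails, $10,000 paid on heads), assuming a fully trustworthy predictor; both the numbers and that assumption are mine, not part of the setup above:

```python
# Counterfactual mugging with the conventional stakes and a reliable predictor:
# pay $100 on tails, receive $10,000 on heads iff the predictor expects you to pay.
PAY, PRIZE = 100, 10_000

ev_payer = 0.5 * PRIZE - 0.5 * PAY  # +4950 per flip on average
ev_refuser = 0.0                    # the predictor never rewards the refusing type

print(f"committed payer:   {ev_payer:+,.0f}")
print(f"committed refuser: {ev_refuser:+,.0f}")
# Being the paying type amounts to buying a heavily positive-EV lottery ticket
# whose price is collected only after a loss; refusing is declining the ticket.
```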
Yes, if Paul thinks that he might not be a psychopath who dies, and has a probability associated with it, he would include this possible world in the calculation… obviously? Though this requires further specification of how much he values his life vs. a world with/without psychopaths around. If he values it infinitely, as most psychopaths presumably do, then he would not press the button, on the off chance that he is wrong. If the value is finite, then there is a break-even probability at which he is indifferent to pressing the button. I don’t understand how this is related to a decision theory; it’s just world counting and EV calculation. I must be missing something, I assume.
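That break-even point is just arithmetic. A tiny sketch with made-up numbers (B and L are labels I am introducing here, not anything from the problem statement, and the model is deliberately crude):

```python
# Break-even credence for pressing the button, with made-up numbers:
# B = how much Paul values a psychopath-free world,
# L = how much he values his own life,
# p = his credence that he is a psychopath (and so dies if he presses).
# Rough model: EV(press) = B - p * L, so press only if p < B / L.
B, L = 1_000_000, 50_000_000

p_break_even = B / L
print(f"press only if p < {p_break_even:.0%}")  # 2% with these numbers

# As L grows without bound the threshold goes to zero: valuing one's own life
# "infinitely" means never pressing on any nonzero chance of being wrong.
```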
OK, I read the last one (again, after all these years), and I have no idea how it is applicable. It seems to be about the definition of probability, Dutch-booking and such… nothing to do with the question at hand. The one before that is about how a “wrapper-mind”, i.e. a fixed-goal AGI, is bad… which is indeed correct, but… irrelevant? It has the best EV by its own metric?
Of course Paul could be wrong, and then you need to calculate probabilities, which is a trivial calculation that does not depend on a chosen decision theory. But the problem statement as is does not specify any of it, only that he is sure that only a psychopath would press the button, so take it as 100% confidence and 100% accuracy, for simplicity. The point does not change: you need a good specification of the problem, and once you have it, the calculation is evaluating probabilities of each world, multiplying by utilities, and declaring the agent that picks the world with the highest EV “rational”.