Currently a postgraduate student at Edinburgh.
Imagine the best possibility (for humans) consistent with today’s physics. Imagine the best (for humans) mathematical facts.
No you don't. Penrose's theory is totally abstract computability theory. If it were true, then so what? The best-for-humans facts are something like "alignment is easy, FAI built next week". This only works if Penrose somehow got a total bee in his bonnet about uncomputability, if it greatly offended his sensibilities that humans couldn't know everything. Even though we empirically don't. Even though pragmatic psychological bounds are a much tighter constraint than computability. In short, your theory of "motivated cognition" doesn't help predict much, because you need to assume Penrose's motivations are just as wacky.
Also, you seem to have slid from "motivated cognition works to produce true beliefs/optimize the world" to the much weaker claim of "some people use motivated cognition; you need to understand it to predict their behavior". This is a big jump, and feels motte-and-bailey.
And that means whatever we want to claim to be true is ultimately motivated by whatever it is we care about that led us to choose the definition of truth we use.
People who speak different languages don't use the symbol "truth". To what extent are people using different definitions of "truth" just choosing to define a word in different ways and talk about different things?
In an idealized agent like AIXI, the world-modeling procedure, the part that produces hypotheses and assigns probabilities, doesn't depend on its utility function. And it can't be motivated, because motivation only works once you have some link from actions to consequences, and that needs a world model.
If the world model is seriously broken, the agent is just non-functional. The workings of the world model aren't a choice for the agent; they're a choice for whatever made the agent.
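For concreteness, here is the standard AIXI expectimax, following Hutter's formulation (lightly simplified). The environment mixture $\xi$ is defined purely by program length, with no reference to the agent's goals; motivation enters only through the outer maximizations:

$$a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \;\cdots\; \max_{a_m} \sum_{o_m r_m} \big(r_k + \dots + r_m\big)\, \xi(o_1 r_1 \dots o_m r_m \mid a_1 \dots a_m), \qquad \xi(x \mid a) \;=\; \sum_{q \,:\, q(a) = x} 2^{-\ell(q)}$$

Swap the rewards for anything else and the inner sum over programs $q$, the world model, is untouched.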
but ultimately if the world ends that's no one's problem.
This is an interesting claim. If I had a planet-destroying weapon that would leave the ISS astronauts alive, would you say "don't worry about it much, it's only three astronauts' problem"?
There are specific technical arguments about why AI might rapidly kill everyone. You can’t figure out if those arguments are true or false by analysing the “death cult vibes”.
Now you can take the position that death cult vibes are unhealthy and not particularly helpful. Personally I haven't actually seen a lot of death cult vibes. I have seen more "fun mental toy from philosophy land" vibes, where total doom is discussed as if it were a pure maths problem. But if there are death cult vibes somewhere I haven't seen, those probably don't help much.
I used to think that the first box-breaking AI would be a general superintelligence that deduced how to break out of boxes from first principles. Which of course turns the universe into paperclips.
I have updated substantially towards the building of an AI hardcoded and trained specifically to break out of boxes. Which leads to the interesting possibility of an AI that breaks out of its box, and then sits there going "now what?".
Like suppose an AI was trained to be really good at hacking, moving its code from place to place. It massively bungs up the internet. It can't make nanotech, because nanotech wasn't in its training dataset. It's an AI virus that only knows hacking.
So this is a substantial update in favor of the "AI warning shot": an AI disaster big enough to cause problems, and small enough not to kill everyone. Of course, all it's warning against is being a total idiot. But it does plausibly mean humanity will have some experience with AIs that break out of boxes before superintelligence.
What does the network do if you use SVD editing to knock out every uninterpretable column? What if you knock out everything interpretable?
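As a minimal sketch of the kind of knockout I mean, assuming you already have a list of singular directions flagged as interpretable (`interpretable_idx` here is hypothetical):

```python
import numpy as np

def svd_knockout(W: np.ndarray, keep: list[int]) -> np.ndarray:
    """Reconstruct W from only the singular directions indexed by `keep`,
    zeroing every other direction."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    mask = np.zeros_like(S)
    mask[keep] = 1.0
    return (U * (S * mask)) @ Vt

# Knock out every uninterpretable direction (keep only the interpretable ones):
# W_edited = svd_knockout(W, interpretable_idx)
# Or knock out everything interpretable instead:
# W_edited = svd_knockout(W, [i for i in range(min(W.shape)) if i not in interpretable_idx])
```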
(If you can’t see why a single modern society locking in their current values would be a tragedy of enormous proportions, imagine an ancient civilization such as the Romans locking in their specific morals 2000 years ago. Moral progress is real, and important.)
This really doesn't prove anything. That measurement shouldn't be taken by our values, but by the values of the ancient Romans.
Sure, of course the morality of the past looks like it gets better and better: it's taking a random walk that ends at our morality, so later points land closer and closer to us. Now, moral progress might be real.
The place to look is inside our own value functions. If, after 1000 years of careful philosophical debate, humanity decided it was a great idea to eat babies, would you say "well, if you have done all that thinking, clearly you are wiser than me"? Or would you say "Arghh, no. Clearly something has broken in your philosophical debate"? That is a part of your own meta-value function; the external world can't tell you what to think here (unless you have a meta-meta-value function, but then you have to choose that for yourself).
It doesn't help that human values seem to be inarticulate, half-formed intuitions, and that the things we call our values are often instrumental goals.
If, had ASI not been created, humans would have gone extinct from bioweapons, and pandas would have evolved intelligence, is the extinction of humans and the rise of panda-centric morality just part of moral progress?
If aliens arrive, and offer to share their best philosophy with us, is the alien influence part of moral progress, or an external fact to be removed?
If advertisers basically learn to brainwash people to sell more product, is that part of moral progress?
Suppose, had you not made the AI, that Joe Bloggs would have made an AI 10 years later. Joe Bloggs would actually have succeeded at alignment, and would have imposed his personal whims on all humanity forever. If you are trying not to unduly influence the future, do you make everyone beholden to the whims of Joe, as they would be without your influence?
My personal CEV cares about fairness, human potential, moral progress, and humanity’s ability to choose its own future, rather than having a future imposed on them by a dictator. I’d guess that the difference between “we run CEV on Nate personally” and “we run CEV on humanity writ large” is nothing (e.g., because Nate-CEV decides to run humanity’s CEV), and if it’s not nothing then it’s probably minor.
Wait. The whole point of CEV is to get the AI to extrapolate what you would want if you were smarter and more informed. That is, the delta from your existing goals to your CEV should be unknowable to you: if you knew your destination, you would already be there. This sounds like your object-level values. And they sound good, as judged by your (and my) object-level values.
I mean, there is a sense in which I agree that locking in, say, your favourite political party, or a particular view on abortion, is stupid. (Though I am not sure locking in a particular view on abortion would actually be bad; it would probably have near-zero effect in a society of posthuman digital minds.) These are things that are fairly clearly instrumental. If I learned that, after careful philosophical consideration and analysis of lots of developmental-neurology data, people decided abortion was really bad, I would take that seriously. They have probably realized a moral truth I do not know.
I think I have a current idea of what is right, with uncertainty bars. When philosophers come to an unexpected conclusion, it is some evidence that the conclusion is right, and also some evidence the philosopher has gone mad.
My best guess bio anchors adaption suggests a median estimate for the availability of compute to train TAI
My best guess is: in the past. I think GPT-3 levels of compute and data are sufficient, with the right algorithm, to make a superhuman AI.
The AI has a particular python program which, if it were given the full quantum wavefunction and unlimited compute, would output a number. There are subroutines in that program that could reasonably be described as looking at "cow neurochemistry". The AI's goals may involve such abstractions, but only if the utility function contains rules saying how each such abstraction is built out of quarks. Or it may be using totally different abstractions, or no abstractions at all, and yet be looking at something we would recognize as "cow neurochemistry".
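A toy sketch of the shape I have in mind. Everything here is illustrative; `extract_cows` and `neurochemistry_score` are hypothetical stubs, not real APIs:

```python
from typing import Any, Iterable

Wavefunction = Any  # stand-in for "the full quantum state of the universe"

def extract_cows(psi: Wavefunction) -> Iterable[Any]:
    """Hypothetical subroutine: the utility function's own rules for which
    arrangements of quarks in psi count as 'a cow'."""
    return []  # stub

def neurochemistry_score(cow: Any) -> float:
    """Hypothetical subroutine: scores one cow's neurochemical state."""
    return 0.0  # stub

def utility(psi: Wavefunction) -> float:
    """The whole goal: wavefunction in, one number out. The abstraction
    'cow neurochemistry' exists only as subroutines inside this program."""
    return sum(neurochemistry_score(cow) for cow in extract_cows(psi))
```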
But either way, is this utopia full of non-aligned, but not "actively evil", humans just another modeled and controlled part of the wavefunction, or are they agents with goals of their own?
Of course they are modeled, and somewhat controlled. And of course they are real agents with goals of their own. Various people are trying to model and control you right now. Sure, the models and control are crude compared to what an AI would have, but that doesn't stop you being real.
This doesn't have that much to do with far coordination. I was disagreeing with your view that "locked in goals" implies a drab, chained-up, "ant-like" dystopia.
No independence or misalignment or even uncertainty of goals can exist in such a picture, and I’ll pre-commit to finding the weakness that brings the whole thing down, just for pure orneriness.
Really? Let's paint a picture. Let's imagine a superintelligent AI. The superintelligence has a goal, implicitly defined in the form of a function that takes in the whole quantum wavefunction of the universe and outputs a number. Whether a particular action is good or bad depends on the answer to many factual questions, some of which it is unsure about. When the AI only has a rough idea that cows exist, it is implicitly considering a vast space of possible arrangements of atoms that might comprise cows. The AI needs to find out quite a lot of specific facts about cow neurochemistry before it can determine whether cows have any moral value. And maybe it needs to consider not just the cow's neurochemistry, but what every intelligent being in the universe would think, if hypothetically they were asked about the cow. Of course, the AI can't compute this directly, so it is in a state of logical uncertainty as well as physical uncertainty.
The AI supports a utopia full of humans. Those humans have a huge range of different values. Some of those humans seem to mainly value making art all day. Some are utilitarian. Some follow virtue ethics. Some pursue personal hedonism with wireheading. A population possibly quite neurodiverse compared to current humanity, except that the AI prevents anyone actively evil from being born.
Note that this prevents improvements as much as it prevents degradation.
If you can actually specify any way, however indirect and meta, to separate improvements from degradation, you can add that to your utility function.
I would much prefer we lock in something. I kind of think it's the only way to any good future. (What we lock in, and how meta it is, are other questions.) This is regardless of any expanding to the stars.
Why is it a terrible idea? Imagine that our ancestors thought that regular human sacrifices to God of Rain are required for societal survival, and it would be “especially heinous” to doom the society by abandoning this practice, so they decided to “lock in” this value. We have a lot of these grandfathered values that no longer make sense already locked in, intentionally or accidentally.
It would be terrible by our values. Sure. Would it be terrible by their values? That is more complicated. If they are arguing it is required for "societal survival", then that sounds like they were mistaken on a purely factual question. They failed to trace their values back to the source. They should have locked in a value for "societal survival", and then any factual beliefs about the correlation between human sacrifice to rain gods and societal survival get updated with normal Bayesian updates.
But let's suppose they truly, deeply valued human sacrifice. Not just for the sake of something else, but for its own sake. Then their mind and yours have a fundamental disagreement. Neither of you will persuade the other of your values.
If values aren't locked in, they drift. What phenomena cause that drift? If our ancestors can have truly terrible values (by our values), our descendants can be just as bad. So you refuse to lock in your values, and 500 years later, a bunch of people who value human sacrifice decide to lock in theirs. Or maybe you lock in the meta-value that no one has the power to lock in their object-level values, and values drift until the end of the universe. Value space is large, and 99% of the values it drifts through would be horrible as measured by your current values.
Stasis may be undesirable as species that self-modify to lock themselves into a particular version may be less adaptable, less able to deal with unforeseen circumstances. Perhaps such species may be outcompeted by more dynamic/adaptable species.
I don't think this is a significant real effect. Remember, what you are locking in is very high-level and abstract.
Let’s say you locked in the long term goal of maximizing paperclips. That wouldn’t make you any less adaptable. You are still totally free to reason and adapt.
If the frozen aspects are confined to a decision making elite, but most citizens are allowed to drift freely, the involved societies would soon find themselves in a situation where their governance structures and leaders are archaic, or so far removed from their current values that it’s dystopian.
Loads of implicit assumptions in that. Also a sense in which you are attempting to lock in a tiny sliver of your own values. Namely you think a world where the decision makers and citizens have very different values is dystopian.
Regardless, the kind of free form evolution in values, philosophy, governance/coordination systems we’ve enjoyed for most of human history would become a thing of the past.
I think there is an extent to which we want to lock in our values, or meta-values, or value-update rules anyway, regardless of the issues about far coordination. Because they are our values. If you wind back time far enough, and let a bunch of Homo erectus lock in their values, they would choose somewhat differently. Now I won't say "Tough, sucks to be a Homo erectus." The rules we choose to lock in may well be good by Homo erectus values. We might set meta-level rules that pay attention to their object-level values. Our object-level values might be similar enough that they would think well of our optimum. Remember "Not exactly the whole universe optimized to max util" != "bad".
If baby-eating aliens came and brainwashed all humans into baby-eating monsters, you have to say: "No, this isn't what I value; this isn't anything close. And getting brainwashed by aliens doesn't count as the 'free form evolution of values' the way I was thinking of it either. I was thinking of ethical arguments swaying humans' opinions, not brainwashing. (Actually, the difference between those can be subtle.) The object level isn't right. The meta level isn't right either. This is just wrong. I want to lock in our values, at least to a sufficient extent to stop this."
Loose analogy-based reasoning over complex and poorly understood systems isn't reliable. There is kind of only one way for GPT-n to be identical to System 1, and many ways for it to be kind of similar, in a way that is easy to anthropomorphize but has some subtle alien features.
GPT-n contains some data from smart and/or evil humans, and humans speaking in riddles or making allusions. Let's suppose this generalizes, and now GPT-n is pretending to be an IQ-200 cartoon villain, with an evil plot described entirely in terms of references to obscure sources. So when referring to DNA, it says things like "two opposites of the same kind, two twins intertwined, a detective's assistant and an insect did find. Without an alien friend."
Or maybe it goes full ecologist jargon. It talks about "genetically optimizing species to restore population levels to re-balance ecosystems into a pre-anthropogenic equilibrium". Would an army of minimum-wage workers spot that this was talking about wiping out almost all humans?
Actually, wouldn’t a naive extrapolation of internet text suggest that superhumanly complicated ideas were likely to come in superhumanly dense jargon?
I mean, if you have a team of linguists and AI experts carefully discussing every sentence, then this particular problem goes away. The sort of operation where, if the AI produces a sentence of Klingon, you fly in a top Klingon expert before you get the next sentence. But how useful could GPT-n be if used in such a way? On the other extreme, GPT-n is producing internal reasoning text at a terabyte per minute. All you can do with it is grep for some suspicious words, or pass it to another AI model; you can't even store it for later unless you have a lot of hard drives. Potentially much more useful. And less safe.
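To give a feel for how thin that layer of oversight is, here is a sketch of the "grep for suspicious words" extreme (the watchlist is obviously hypothetical):

```python
import sys

# Hypothetical watchlist; at a terabyte per minute this is roughly
# all the human-legible filtering you can afford.
SUSPICIOUS = {"nanotech", "bioweapon", "exfiltrate", "self-replicate"}

for line in sys.stdin:  # GPT-n's internal reasoning text, streamed in
    if any(word in line.lower() for word in SUSPICIOUS):
        print(line, end="")  # flag this line for (much slower) human review
```

Of course, the jargon-and-riddles outputs described above are exactly what a filter like this would miss.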
“My cognitive process isn’t well understood by the person I’m interacting with, so they literally couldn’t imagine me accurately.”
This isn't a one-size-fits-all argument against ghosts. But it does point to a real thing. A rock isn't a ghost. A rock is not capable of imagining me accurately; it isn't running any algorithm remotely similar to my own, so I don't shape my decisions based on the possibility that I am actually a rock. The same goes for calculators and ELIZA: no ghosts there. I suspect there are no ghosts in GPT-3, but I am not sure. At least some humans are dumb and insane enough to contain no ghosts, or at least no ghosts that might be you. The problem is wispy ghosts. The solidest ghost is a detailed mind simulation of you. Wispy ghosts are found in things that are kind of thinking the same thing, a little bit. Consider a couple of chimps fighting over a banana, and a couple of national governments at war. Do the chimps contain a wispy ghost of the warring nations, because a little bit of the chimps' reasoning happens to generalize far beyond bananas?
Where do the faintest ghosts fade to nothing? This is the same as asking what processes are logically entangled with us.
On the other hand, I wouldn't expect this type of argument to work between a foomed AI with Graham's number of compute, and one with 1 kg of computronium.
This causes me to be less trusting of people who seem to think I’m not smart enough to understand how they think.
I think the fact that you can think that at all pushes you somewhat towards the real-ghost side. You know the general pattern, if not the specific thoughts, that those smarter than you might have.
Yes, it is possible to put images directly into LessWrong. I just pressed Ctrl-C and Ctrl-V. Then I viewed the page source and found the image was a hotlink. So I copied it into GIMP, halved the size, and copied it back out. Now it is on LessWrong's servers.
(I got the image by asking Google Translate to translate the webpage. Presumably that web address was disliked by my ISP or something.)