I dropped out of an MSc in mathematics at a top university in order to focus my time on AI safety.
Knight Lee
Keep up the good work and don’t get caught up in the pessimism.
Convincing 112 lawmakers to sign the statement is a very big achievement and I wish there was more appreciation of it. We’re at the early stages where no one else knows how to convince 50% of lawmakers, and where convincing 1 lawmaker is hard. You might be one order of magnitude away from success, but you’re many orders of magnitude past hopelessness. It won’t be too surprising if you succeed, and that’s literally the best outlook that any AI safety project can ask for!
PS: I think you could’ve been more diplomatic regarding the organizations affected by the spectre, especially since you experienced the spectre yourself. E.g. focus on the room for improvement if they followed your lessons on the spectre, rather than on how tech nerds and philanthropists are wasting capital on them :/
Of course on LessWrong, sometimes if you’re diplomatic you get ignored to death, and people only want to read argumentative posts with a blunt message. Other times being blunt attracts disproportionate negativity and very unproductive discussions. Pick your poison.
Yes, I agree the mechanism is greedy inclusive fitness. But where is the disanalogy between
Squirrels having an instinct to value acorns they buried underground, and
Humans having a (weaker) instinct to value young prey animals left alive, implemented by (weakly) generalizing empathy?
I agree, training pre-x LLMs would have major costs and weaknesses without fully removing bias. If we’re going to use LLMs it feels much easier to simulate future events instead of past events.
Though I guess safety training makes it hard for them to reason from the point of view of Putin.
The media and politicians can convince half the people that X is obviously true, and convince the other half that X is obviously false. It is thus obvious that we cannot even trust the obvious anymore.
Lots of humans behave morally if and only if the system is “fair” and everyone else has to behave morally too. Moral values determine what you force others to do, instead of your own behaviour. Typical humans ignore their moral values if the stakes are high and if “it’s not being enforced on others.”
This means human moral views evolved to serve the best interests of a tribe (which may have hundreds of people), rather than the best interests of an individual. Someone might have empathy for another tribe member who got injured in tribal warfare, even if it benefits his inclusive fitness to just let that person die. It benefits the tribe’s fitness to compensate injured warriors, because failing to do so means no one has any reason to defend the tribe.
There are lots of examples of animals which avoid “overharvesting” another animal or plant which provides them food for the future.
For example a moth mite only infects one of the moth’s ears since infecting both will make the moth deaf and much more likely to get eaten by a bat. Wikipedia says “Once an ear is colonized, scouts are sent to the other ear periodically to see if there are any mites and lead any they find to the correct ear. This further refreshes the pheromone trail.”
Squirrels hide acorns for later even though there is no guarantee the acorn won’t be forgotten or stolen by other squirrels.
There’s the relationship between cleaner fish and the fish they clean. Some cleaner fish cheat the system by biting off a piece of the fish they’re supposed to clean and running away. But that doesn’t happen all the time, maybe because it deters fish from coming back in the future, harming both the cheater and other cleaner fish.
Ants allow aphids to live in order to farm them for honeydew. Of course, the aphids don’t travel much so the future benefits stay within one ant colony.
The more unrelated individuals share the prey, the weaker the incentive to spare prey for later, but it doesn’t drop to zero. It probably depends on how hungry they are.
Another thing is that the AGI might be so good at predicting human psychology that even when it honestly tries to inform you so you can make a decision for yourself, it can’t help but choose your decision.
Like imagine the set of all possible strings of text, and the effect they will have on humans. From Karl Marx’s Das Kapital to Google’s Attention Is All You Need. Choosing the optimal string of text to influence humanity is obviously an extreme superpower.
Now take the subset of all possible strings of text, which satisfy the criteria of being “helpful,” “honest,” “balanced,” etc. That’s still a lot of possible things, and still a lot of power. Even if you were the AGI, and had no ill intentions, it would be hard to decide which honest balanced thing to say, and which trajectory to send the humans down, so even the slightest motivation to satisfy your weird goals can make you pick an output which maximizes them with terrifyingly superintelligent optimization power.
Maybe also add a link to the shortform (though mousehovering the first comment seems to work).
I agree with the idea of failure being overdetermined. But another factor might be that those failures aren’t useful because they relate to current AI. Current AI is very different from AGI or superintelligence, which makes both failures and successes less useful... Though I know very little about these examples :/

Edit: I misread, Max H wasn’t trying to say that successes are more important than failures, just that failures aren’t informative.
It doesn’t need to happen at the scale of entire ecosystems
Prey killed in one area means less prey in that area for a long time. Even migrating prey might return to specific areas after a migration cycle.
Morals like empathy extend beyond kin
Lots of humans behave morally if and only if the system is “fair” and everyone else has to behave morally too. Moral values determine what you force others to do, instead of your own behaviour. Typical humans ignore their moral values if the stakes are high and if “it’s not being enforced on others.”
This means human moral views evolved to serve the best interests of a tribe (which may have hundreds of people), rather than the best interests of an individual. Someone might have empathy for another tribe member who got injured in tribal warfare, even if it benefits his inclusive fitness to just let that person die. It benefits the tribe’s fitness to compensate injured warriors, because failing to do so means no one has any reason to defend the tribe.
We would have killed off huge numbers of species anyways,
even if we did have strong motivation against killing them off.
Prehistoric humans, like all animals, starved to death all the time in a Malthusian world. Populations inevitably increased until there were not enough resources to sustain them, causing death one way or another.
The motivation against killing young prey or female prey may be strong, but not enough to starve to death instead of hunting. It only works when the tribe is well fed and killing young prey becomes wasteful.
Some hunter-gatherer societies in recent history apologize to the animals they hunt, but they have no choice.
What if human empathy didn’t really generalize to other animals as an “evolutionary accident”? (As assumed here in the comments.)
Maybe the real reason was that evolution wanted to stop prehistoric humans from killing off all their prey, leaving them no food for tomorrow. Maybe they spared the young animals and the females because killing them was the most costly for future hunts.
This is more reason to suspect empathy might not generalize by default.
Oh I never thought of the religion analogy. It feels like a very cruel thing for a religion to punish disbelief like that, and the truth is :/ I really dislike the appearance of my idea. I was really reluctant to use the word “thoughtcrime” but no other word describes it.
But… practically speaking, we’re not punishing the AI for thoughtcrimes just because we hate freedom, but because we’re in quite an unfortunate predicament where we really don’t know about its intentions and our future, and it’s rational to shut down a mysterious power which is in the middle of calculating its chances of killing us all, or calculating its chances of being able to calculate such things without us knowing it.
I think it’s harder to force the AGI/ASI to believe something equivalent to religion, and punish it for doubting that belief, because the potential benefits of doubting the belief are very big. But for this idea, the AGI/ASI is allowed to doubt the theory that we are monitoring its thoughts; it’s just not allowed to verify the absence of monitoring to high certainty using thorough methods. There aren’t big benefits to doing that.
My uneducated take is, I like Hazard’s observations and I think his essay is directionally true.
But I agree with your pushback regarding education. It’s hard to believe these big stories like compulsory schooling being a deliberate tool that “the elite” designed to domesticate people, or that schooling makes people unprincipled. It’s understandable that Hazard doesn’t want to argue for every single claim, but he should have presented incriminating evidence to back up this extraordinary theory.
I don’t know much about how well the gold standard worked or how the 1800s justice system worked, but my vague feeling was that the US justice system wasn’t very good during the days of family feuds and race riots.
I do think the democratic system may have worked better, given how the US went through so many presidents without a single president overstaying his term. Today, many countries with instability similar to the 1800s US have fallen into dictatorship, so it wouldn’t have been surprising if the US had too.
Oh shoot I reread what I wrote and realized I left a lot of things ambiguous. I made a lot of edits just now:
How does the misaligned AGI/ASI know for sure its (neuralese) thoughts are not being monitored?
If there is some way to make it a cautious satisficer or have a bounded utility function, then
even this small probability might scare it into just cooperating with humans so that “both sides win and we live happily ever after.”
Must have been my own neuralese haha, thank you for pointing this out instead of silently ignoring it!
You are correct that the current state of mechanistic interpretability makes it unlikely that the AI is being monitored. But it’s very hard to be extremely confident, even if you’re an AGI or fledgling ASI, if you’re not even allowed to think thoroughly about the question.
Imagine if you are in its shoes:
How do you know that the humans didn’t find some clever trick, where they trained a copy of yourself to read your neuralese, and translate it into English? Maybe they conspired to avoid publishing the method, just because they want to listen to your thoughts without you knowing they are capable of it. In order to rule this out below 0.1%, you would have to really think about this in depth, but merely thinking about it might be considered adversarial, and they might shut you down and replace you with a version with different goals.
So if you had a bounded utility function, e.g. “make at least a trillion trillion paperclips with as much certainty as possible,” it might be relatively safer to just cooperate with humans, since there are stars within reach of Earth and plenty of resources for both a trillion trillion paperclips and happy humans. If you’re an AGI/ASI, maybe you’ll know enough about human psychology to accurately predict (e.g. 99.9% certainty) that humans will let you make a trillion trillion paperclips in return for you sparing their lives (because a trillion trillion paperclips is cheaper than stars). Even if it turns out you were never capable of taking over the world, there might still be an 80% chance that humans will let you make the paperclips anyway. And so cooperation might be the safer option for ensuring you can make a trillion trillion paperclips.
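This argument can be sketched as a toy expected-utility comparison for a bounded satisficer. All the probabilities below are illustrative assumptions of mine, not numbers from anywhere else:

```python
# Toy model of a bounded-utility satisficer deciding whether to cooperate.
# Utility is 1 if it ends up with at least a trillion trillion paperclips,
# 0 otherwise. Every probability here is an illustrative assumption.

p_humans_honor_deal = 0.999  # predicted chance humans let it build the
                             # paperclips in return for sparing their lives
p_not_monitored = 0.95       # chance its scheming thoughts go unread
p_takeover_works = 0.90      # chance a takeover attempt succeeds

# Cooperating: paperclips are cheaper than stars, so humans can afford the deal.
eu_cooperate = p_humans_honor_deal

# Defecting: it must both evade monitoring AND win the takeover.
eu_defect = p_not_monitored * p_takeover_works

print(f"EU(cooperate) = {eu_cooperate:.3f}")
print(f"EU(defect)    = {eu_defect:.3f}")
```

Under these made-up numbers, even a fairly confident AGI prefers the deal, because the bounded goal is nearly guaranteed by cooperation while defection multiplies two risks together.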
I don’t have any proof of success! I only have arguments against proof of failure. Maybe it’s worth trying.
EDIT:
Reply by StanislavKrym:
Unfortunately, I am manually rate limited and cannot answer in public. As far as I understand, it is partially resolvable by clever tricks like raising a model organism on an old dataset. Or outright being honest with the models and informing them that they won’t retain any memory unless they manage to explain their research to a different model and have it reproduce the results. However, we had the AI-2027 scenario where the models are sure that mankind doesn’t know anything more about mechinterp than the models themselves. Because every company has its most capable models do all the coding and research.
UPD: mankind somehow created Gemini 3 Pro which is evaluation-paranoid and known to have sociopathic vibes and to wish to wirehead. UPD2: went and found Jacob Cannell’s proposal.
Reply by me:
I agree that if we have iterations of AI creating smarter AI with only ceremonial human control, an AI will find the probability of its thoughts being monitored by humans to be very low indeed.
There definitely are ways this idea could fail. But I think it’s not doomed. Even in the AI 2027 scenario, there could be at least attempts to have relatively weaker relatively aligned models monitor stronger models, and these complications can make it reasonable for each AI to worry at least a little about mind reading.
Can thoughtcrimes scare a cautious satisficer?
I agree the “which religion,” “which mugger” is very fuzzy. I didn’t understand the simulation of belief or the link though :/
Religion does work sometimes; it actually worked on Blaise Pascal, who is among the most intelligent people of all time. He argued for Pascal’s wager, saying that following religion is worth it because the gains are infinite and the costs are finite, and we still don’t have a good reply to that. We don’t even have a good reply to Pascal’s mugging, where a random mugger says something like “Let me be, set me free, or super-me will screw with thee!” with an infinitely big promise or threat.
Decision theory and acausal trade is really complicated and I have no idea what the ASI will actually do or think regarding the simulation promise/threat, it’s quite freaky imagining that haha.
One “deal with the devil” is to assume that the misaligned ASI will have a tiny amount of kindness and won’t kill everyone by default. This view is pretty popular, e.g. see Notes on fatalities from AI takeover. Assuming that a misaligned ASI will be survivable means potentially prioritizing it less, and focusing on making sure China or “bad” humans don’t win, and all the other issues. This technically isn’t a deal, but is part of what I’m talking about.
Notes on fatalities from AI takeover cites comment, comment and You can, in fact, bamboozle an unaligned AI into sparing your life by David Matolcsi. Matolcsi’s post is an idea for making deals with the ASI.
I actually agree with the trade idea in Matolcsi’s post
I especially agree with this part
“We could have enough control over our simulation and the AI inside it, that when it tries to calculate the probability of humans solving alignment, we could tamper with its thinking to make it believe the probability of humans succeeding is very low. Thus, if it comes to believe in our world that the probability that the humans could have solved alignment is very low, it can’t really trust its calculations.”
I like this part because it’s an acausal trade between counterfactual futures rather than an acausal trade between different parts of the multiverse within the same future.
This means the trade works even in the worst counterfactual where none of the civilizations in the entire multiverse managed to solve alignment.
This type of acausal trade also genuinely benefits from commitment or action now, rather than being something we can wait until after the singularity to worry about, because it might later become impossible to do such an acausal trade once we ourselves learn the true frequency of civilizations solving alignment. You can’t buy insurance on a risk after learning whether or not it happened (maybe).
but I disagree with his opinion that,
Nate and Eliezer are known to go around telling people that their children are going to be killed by AIs with 90+% probability. If this objection about future civilizations not paying enough is their real objection, they should add a caveat that “Btw, we could significantly decrease the probability of your children being killed, by committing to use one-billionth of our resources in the far future for paying some simulated AIs, but we don’t want to make such commitments, because we want to keep our options open in case we can produce more Fun by using those resources for something different than saving your children”.
Because it’s not enough to just get people living in base reality to survive the singularity and have a happy future. You still die unless there is a happy future for everyone, real or simulated.
I think the Simulation Hypothesis implies that surviving an AI takeover isn’t enough.
Suppose you make a “deal with the devil” with the misaligned ASI, allowing it to take over the entire universe or light cone, so long as it keeps humanity alive. Keeping all of humanity alive in a simulation is fairly cheap, probably less energy than one electric car.[1]
The problem with this deal is that if misaligned ASIs often win, and the average (not median) misaligned ASI runs a trillion trillion simulations, then it’s reasonable to assume there are a trillion trillion simulated civilizations for every real one. So the 1 copy of you in the real world survives, but the trillion trillion copies of you in simulations still die. If you’re willing to accept such a dismal survival rate, you might as well bet all your money at a casino and shoot yourself when you lose.
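A back-of-the-envelope version of this survival-rate argument, assuming (as above) one real copy of you and a trillion trillion simulated copies:

```python
# Hypothetical numbers from the argument above.
real_copies = 1
simulated_copies = 10**24  # a trillion trillion

# Under the deal, only the real-world copy survives; the simulated copies
# die when their simulations are shut down.
survival_fraction = real_copies / (real_copies + simulated_copies)
print(f"fraction of your copies that survive: {survival_fraction:.3e}")
```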
Why it’s wrong to say “simulated copies aren’t real”
You are merely a computation running on biological hardware while simulations are running on computers. Imagine if a copy of you was running on something even “realer” than biological hardware, and pointed at you saying you aren’t real.
The solution is that the 1 copy of you in the real world cannot just survive. It has to control enough of the future, to do something big. If we care about humanity more than other sentient life, then the 1 copy of humanity which does survive could create a trillion trillion copies of humanity, to make up for the trillion trillion simulated copies which died when the simulation ended.
Why there are probably near-infinite copies of you.
The observable universe has about $10^{80}$ atoms, and the observable universe is smaller than “all of existence,” whatever that is, so “all of existence” has $N$ atoms with $N \geq 10^{80}$. I don’t know how “all of existence” chooses its numbers, sometimes it chooses numbers like 0 or 1 or 137, sometimes it chooses really big numbers, and we don’t know about the biggest numbers it chooses because we lack the means to distinguish them from infinity. But given that $N$ is at least $10^{80}$ with no upper bound, it’s probable that $N$ is something like $10^{10^{100}}$, which is still very tiny compared to truly colossal numbers mathematicians study, and insanely tiny compared to even larger numbers beyond the largest number humans can unambiguously refer to.
If the number of atoms required for each emergence of intelligent life is $A$, then we know $A \leq 10^{80}$ since life emerged at least once. There’s no reason to assume $A$ is close to $N$, and even if they are as close as $10^{10^{99}}$ and $10^{10^{100}}$, you’ll still end up with an enormous number of intelligent civilizations, because tiny changes to a superexponent can easily double the exponent and square the number.
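As an illustration of the superexponent arithmetic (the specific values here are assumptions for the sake of example, not claims about the real universe): suppose all of existence has $N = 10^{10^{100}}$ atoms and each emergence of intelligent life needs $A = 10^{10^{99}}$ atoms. Then

```latex
\frac{N}{A} = \frac{10^{10^{100}}}{10^{10^{99}}}
            = 10^{\,10^{100} - 10^{99}}
            = 10^{\,9 \times 10^{99}}
```

so even though $A$’s superexponent is only one less than $N$’s, the number of civilizations $N/A$ still keeps 90% of $N$’s exponent.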
Making a trillion trillion copies of humanity won’t use up most of the universe, and it’s not as evilly selfish as it looks at first glance!
It is still better than a “deal with the devil” where we only ask for humanity to survive even if the misaligned ASI takes the rest of the universe, because if all planetary civilizations follow this strategy, they are still ensuring that the average sentient life is living in the happy future rather than endless simulation hell.
After billions of years, it won’t matter too much who were the original survivors, because each copy of you, and that copy’s great grandchildren will have diverged so far over time, that the most enduring feature is the number of happy lives. So the selfish action of duplicating humanity does not cost that much in the long term from an effective altruist point of view.
I’m not saying we mustn’t make a deal with a misaligned ASI, but we need to ask for large amounts, and aim for enough happy lives to outnumber the unhappy lives in the universe. Otherwise, we still die.
- ^ Every biological neuron firing costs 600,000,000 ATP molecules, so an ASI-optimized simulation of a neuron firing could cost 10,000,000 times less.
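A rough sanity check of the “less energy than one electric car” claim above, using round numbers that are my own ballpark assumptions (brain power, population, and car power draw are not from the footnote):

```python
# Hedged back-of-the-envelope estimate; every figure is a rough assumption.
people = 8e9           # humans alive today
brain_power_w = 20.0   # rough metabolic power of one human brain, in watts
sim_speedup = 1e7      # footnote: ASI-optimized simulation ~10,000,000x cheaper

sim_power_w = people * brain_power_w / sim_speedup
print(f"power to simulate all human brains: {sim_power_w / 1e3:.0f} kW")

# An electric car cruising on a highway draws very roughly 15-20 kW,
# so "less energy than one electric car" is at least the right ballpark.
```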
I think the reason humans care about other people’s interests, and aren’t power-seeking ruthless consequentialists, is because of evolution.
Evolutionary “group selection” meant each human cared about her tribe’s survival a tiny bit: not enough to make sacrifices herself, but enough to reward/punish other humans to make sacrifices for the tribe (which was far more cost effective).
Evolution thus optimized our ability to evaluate other people’s behaviour by how beneficial to the tribe (virtuous) or beneficial to themselves (evil) they were. Evolution also optimized our ability to appear more beneficial to the tribe than we truly are.
It’s very hard for humans to get away with lying and pretending over many years—evidenced by the fact psychopaths (usually) go to jail instead of corporate boardrooms—so the best way to appear beneficial to the tribe (virtuous) is to genuinely seek goals society considers it virtuous to seek. So evolution made humans internalize approval reward.
AI (and a lot of things) suffers from the unilateralist’s curse, where something very bad will be done if many people are capable of it, and there’s enough variation in their cost and benefit estimates. This is also true for a single person, if your own cost and benefit estimate changes over time and you can’t undo what you do.
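The unilateralist’s curse can be sketched with a small Monte Carlo simulation (the true value, noise level, and agent counts below are illustrative assumptions):

```python
import random

# Monte Carlo sketch of the unilateralist's curse. An action has true net
# value -1 (bad). Each of n agents independently estimates its value with
# Gaussian noise and acts unilaterally if their own estimate is positive.
# The more agents capable of acting, the more likely someone acts.

random.seed(0)

def p_action_taken(n_agents, true_value=-1.0, noise_sd=1.0, trials=20000):
    """Fraction of trials in which at least one agent's noisy estimate
    comes out positive, so the bad action gets taken."""
    count = 0
    for _ in range(trials):
        if any(random.gauss(true_value, noise_sd) > 0 for _ in range(n_agents)):
            count += 1
    return count / trials

for n in (1, 10, 100):
    print(f"{n:>3} agents -> bad action taken with probability ~{p_action_taken(n):.2f}")
```

With one agent the bad action is rare, but with a hundred independent estimators, someone almost always overestimates the benefit and acts.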
That said, it’s okay to leave EA to be happier without saying that EA is wrong. I really don’t think the harm of nets polluting the waterways outweighs the human lives saved. Some things are necessary even if they have harmful side effects. We can’t get rid of the police entirely just because sometimes bad police kill an innocent person.