If we’re being realistic, this kind of thing would only get criminalized after something bad actually happened. Until then, too many people will think “omg, it’s just a chatbot”. Any politician calling for it would get made fun of on every late-night show.
I think my main problem with this is that it isn’t based on anything. Countless times, you just reference other blog posts, which reference other blog posts, which reference nothing. I fear a whole lot of people thinking about alignment are starting to decouple themselves from reality. It’s starting to turn into the AI version of String Theory. You could be correct, but given the enormous number of assumptions your ideas are stacked on (and that even a few of those assumptions being wrong leads to completely different conclusions), the odds of you even being in the ballpark of correct seem low.
At first I strong-upvoted this, because I thought it made a good point. However, upon reflection, that point is making less and less sense to me. You start by claiming current AIs provide nearly no data for alignment, that they are in a completely different reference class from human-like systems… and then you claim we can get such systems with just a few tweaks? I don’t see how you can go from a system that, you claim, provides almost no data for studying how an AGI would behave, to suddenly having a homunculus-in-a-box that becomes superintelligent and kills everyone. Homunculi seem really, really hard to build. By your characterization of how different actual AGI is from current models, it seems this would have to be fundamentally architecturally different from anything we’ve built so far, not the kind of thing that would be created by near-accident.
Contra One Critical Try: AIs are all cursed
I don’t feel like making this a whole blog post, but my biggest source of optimism about why we won’t need to one-shot an aligned superintelligence is that anyone who’s trained AI models knows that AIs are unbelievably cursed. What do I mean by this? I mean that even the first quasi-superintelligent AI we get will have so many problems and so many exploits that taking over the world will simply not be possible. Take a “superintelligence” that only had to beat humans at the very constrained game of Go, which is far simpler than the real world. Everyone talked about how such systems were unbeatable by humans, until some humans used a much “dumber” AI to find glaring holes in Leela Zero’s strategy. I expect that, in the far more complex real world, a superintelligence will have even more holes and even more exploits: a kind of “Swiss-cheese superintelligence”. You can say “but that’s not REAL superintelligence”, and I don’t care, and the AIs won’t care, but it’s likely the thing we’ll get first. Patching all of those holes, and finding ways to make such an ASI sufficiently not cursed, will probably also mean a better understanding of how to stop it from wanting to kill us, if it wanted to kill us in the first place. I think we can probably get AIs that are sufficiently powerful in a lot of human domains, that can probably even self-improve, and that are still cursed, the same way we have AIs with natural language understanding, something once thought to be a core component of human intelligence, that are still cursed. A cursed ASI is a danger for exploitation, but it’s also an opportunity.
I’ve heard of many such cases with EA Funds (including my own). My impression is that they only had one person working full-time managing all three funds (no idea if this has changed since I applied).
Can humans become Sacred?
On 12 September 1940, the entrance to the Lascaux Cave was discovered on the La Rochefoucauld-Montbel lands by 18-year-old Marcel Ravidat when his dog, Robot, investigated a hole left by an uprooted tree (Ravidat would embellish the story in later retellings, saying Robot had fallen into the cave).[8][9] Ravidat returned to the scene with three friends, Jacques Marsal, Georges Agnel, and Simon Coencas. They entered the cave through a 15-metre-deep (50-foot) shaft that they believed might be a legendary secret passage to the nearby Lascaux Manor.[9][10][11] The teenagers discovered that the cave walls were covered with depictions of animals.[12][13] Galleries that suggest continuity, context or simply represent a cavern were given names. Those include the Hall of the Bulls, the Passageway, the Shaft, the Nave, the Apse, and the Chamber of Felines. They returned along with the Abbé Henri Breuil on 21 September 1940; Breuil would make many sketches of the cave, some of which are used as study material today due to the extreme degradation of many of the paintings. Breuil was accompanied by Denis Peyrony, curator of Les Eyzies (Prehistory Museum) at Les Eyzies, Jean Bouyssonie and Dr Cheynier.
The cave complex was opened to the public on 14 July 1948, and initial archaeological investigations began a year later, focusing on the Shaft. By 1955, carbon dioxide, heat, humidity, and other contaminants produced by 1,200 visitors per day had visibly damaged the paintings. As air conditions deteriorated, fungi and lichen increasingly infested the walls. Consequently, the cave was closed to the public in 1963, the paintings were restored to their original state, and daily monitoring was introduced.
Lascaux II, an exact copy of the Great Hall of the Bulls and the Painted Gallery, was displayed at the Grand Palais in Paris, before being displayed from 1983 in the cave’s vicinity (about 200 m or 660 ft away from the original cave), a compromise and attempt to present an impression of the paintings’ scale and composition for the public without harming the originals.[10][13] A full range of Lascaux’s parietal art is presented a few kilometres from the site at the Centre of Prehistoric Art, Le Parc du Thot, where there are also live animals representing ice-age fauna.[14]
The paintings for this site were duplicated with the same types of materials (such as iron oxide, charcoal, and ochre) that were believed to have been used 19,000 years ago.[9][15][16][17] Other facsimiles of Lascaux have also been produced over the years.
They have also created additional copies, Lascaux III, Lascaux IV, and Lascaux V.
“I actually find it overwhelmingly hopeful, that four teenagers and a dog named Robot discovered a cave with 17,000-year-old handprints, that the cave was so overwhelmingly beautiful that two of those teenagers devoted themselves to its protection. And that when we humans became a danger to that cave’s beauty, we agreed to stop going. Lascaux is there. You cannot visit.”
-John Green
People preserve the remains of Lucy and work hard to preserve old books; the Mona Lisa is protected under bulletproof glass and is not up for sale.
What is the mechanistic reason for this? There are perfect copies of these things, yet humans go to great lengths to preserve the original. Why is there the Sacred?
They have created copies of Lascaux, yet still work hard to preserve the original. Humans cannot enter. They get no experience of joy from visiting. It is not for sale. Yet they strongly desire to protect it, because it is the original, and for no other reason.
Robin Hanson gave a list of Sacred characteristics, some of which I find promising:
Sacred things are highly (or lowly) valued. We revere, respect, & prioritize them.
Sacred is big, powerful, extraordinary. We fear, submit, & see it as larger than ourselves.
We want the sacred “for itself”, rather than as a means to get other things.
Sacred makes us feel less big, distinct, independent, in control, competitive, entitled.
Sacred quiets feelings of: doubts, anxiety, ego, self-criticism, status-consciousness.
We get emotionally attached to the sacred; our stance re it is oft part of our identity.
We desire to connect with the sacred, and to be more associated with it.
Sacred things are sharply set apart and distinguished from the ordinary, mundane.
Re sacred, we fear a slippery slope, so that any compromise leads to losing it all.
If we can understand the sacred, it seems like a concept that probably wouldn’t fit into a simple utility function, something that wouldn’t break out-of-distribution. A kind of Sacred Human Value Shard, something that protects our part of the manifold.
I’ve opted out of Petrov Day, not because I don’t want to participate, but because I think this is the optimal strategy. The more people who opt out, the less likely it is that the button will be pushed.
Could you explain the rationale behind the “Open” in OpenAI? I can understand the rationale of trying to beat more reckless companies to achieving AGI first (albeit, this mentality is potentially extremely dangerous too), but what is the rationale behind releasing your research? This will enable companies that do not prioritize safety to speed ahead with you, perhaps just a few years behind. And, if OpenAI hesitates to progress due to concerns over safety, the more risk-taking orgs will likely speed ahead of OpenAI in capabilities. The bottom line is I’m concerned your efforts to achieve AGI might not do much to ensure an aligned AGI is actually created, but instead only speed up the timeline toward achieving AGI by years or even decades.
Though I tend to dislike analogies, I’ll use one, supposing it is actually impossible for an ASI to remain aligned. Suppose a villager cares a whole lot about the people in his village, and routinely works to protect them. Then, one day, he is bitten by a werewolf. He goes to the shaman, who tells him that when the full moon next rises, he will turn into a monster and kill everyone in the village. His friends, his family, everyone. And that he will no longer know himself. He is told there is no cure, and that the villagers would be unable to fight him off. He will grow too strong to be caged, and cannot be subdued or controlled once he transforms. What do you think he would do?
This isn’t what I mean. You may well be using real things to construct your argument, but that doesn’t mean the structure of the argument reflects something real. I kind of imagine it looking like a rationalist Jenga tower, where if one piece gets moved, it all crashes down. Except, by referencing other blog posts, it becomes a kind of Meta-Jenga: a Jenga tower composed of other Jenga towers. Take “Coherent decisions imply consistent utilities”: this alone I view as its own mini Jenga tower. This is where I think String Theorists went wrong. It’s not that humans can’t, in theory, form good reasoning based on other reasoning based on other reasoning and actually arrive at the correct answer; it’s just that we tend to be really, really bad at it.
The sort of thing that would change my mind: some widespread phenomenon in machine learning that perplexes most researchers, but is expected according to your model, where any other model either doesn’t predict it as accurately, or is more complex than yours.
The following is a conversation between myself in 2022 and a newer version of myself from earlier this year.
On AI Governance and Public Policy
2022 Me: I think this is something we will have to tread extremely lightly with or, if possible, avoid completely. One particular concern is the idea of gaining public support. Many governments have an interest in pleasing their constituents, so if executed well, this could be extremely beneficial. However, it runs a high risk of doing far more damage. One major concern is the different mindset needed to conceptualize the problem. Alerting people to the dangers of nuclear war is easier: nukes have been detonated, the visual image of incineration is easy to imagine and can be described in detail, and they or their parents have likely lived through nuclear drills in school. This is closer to trying to explain to someone the dangers of nuclear war before Hiroshima, before the Manhattan Project, and before even TNT was developed. They have to conceptualize what an explosion even is, not simply imagine an explosion at greater scale.

Most people will simply not have the time or the will to try to grasp this problem, so this runs the risk of having people call for action on a problem they do not understand, which will likely lead to dismissal by AI researchers, and possibly to short-sighted policies that don’t actually tackle the problem, or that even make it worse by having the guise of accomplishment. To make matters worse, there is the risk of polarization. Almost any concern with political implications that has gained widespread public attention runs a high risk of becoming polarized. We are still dealing with the ramifications of well-intentioned but misguided early advocates in the climate change movement two decades ago, who planted the seeds for making climate policy part of one’s political identity. This could be even more detrimental than a merely uninformed electorate, as it might push people who had no previous opinion on AI to advocate strongly in favor of capabilities acceleration, and to be staunchly against any form of safety policy. Even if executed with the utmost caution, this does not stop other players from using their own power or influence to hijack the movement and lead it astray.
2023 Me: Ah, Me’22, the things you don’t know! Many of Me’22’s concerns I think are still valid, but we’re experiencing what chess players might call a “forced move”. People are starting to become alarmed regardless of what we say or do, so steering that in a direction we want is necessary. The fire alarm is being pulled regardless, and if we don’t try to show some leadership in that regard, we risk less informed voices and blanket solutions winning out. The good news is that “serious” people are going on “serious” platforms and actually talking about x-risk. Other good news is that, in current polls, people are very receptive to concerns over x-risk, and it has not yet fallen along divisive lines (concern is roughly evenly distributed across demographics). This is still a difficult minefield to navigate. Polarization could still happen, especially with an election year in the US looming. I’ve also been talking to a lot of young people who feel frustrated at not having anything actionable to do, and if those in AI Safety don’t show leadership, we risk (and indeed are already risking) many frustrated youth taking political and social action into their own hands. We need to be aware that EA/LW might have an Ivory Tower problem, and that, even though a pragmatic, strategic, and careful course of action might be better, this might make many feel “shut out” and lead them to steer their own course. Finding a way to make those outside EA/LW/AIS feel included, with steps to help guide and inform them, might be critical to avoiding movement hijacking.
On Capabilities vs. Alignment Research:
2022 Me: While I strongly agree that not increasing capabilities is a high priority right now, I also question whether we risk creating a state of inertia. Within safety research, there are very few domains that do not risk increasing capabilities. And while capabilities progress every day, we risk failing to keep up the speed of safety progress simply because every action risks an increase in capabilities. Rather than a “do no harm” principle, I think counterfactuals need to be examined in these situations, where we must consider whether there is a greater risk if we *don’t* do research in a certain domain.
2023 Me: Oh, oh, oh! I think Me’22 was actually ahead of the curve on this one. This might still be controversial, but I think many got the “capabilities space” wrong. Many AIS-inspired approaches that could increase capabilities would yield systems that are safer, more interpretable, and easier to monitor by default. And by not working on such systems, we instead got the much more inscrutable, dangerous models by default, because the more dangerous models are easier. To quote the vape commercials, “safer != safe”, but I still quit smoking in favor of the electronic kind, because safer is still at least safer. This is probably a moot point now, though, since I think it’s likely too late to create an entirely new paradigm in AI architectures. Hopefully Me’24 will be happy to tell me we found a new, 100% safe and effective paradigm that everyone’s hopping on. Or maybe he’ll invent it.
I don’t think this is the case. For a while, the post with the highest karma was Paul Christiano explaining all the reasons he thinks Yudkowsky is wrong.
This has caused me to reconsider what intelligence is and what an AGI could be. It’s difficult to determine if this makes me more or less optimistic about the future. A question: are humans essentially like GPT? We seem to be running simulations in an attempt to reduce predictive loss. Yes, we have agency; but is that human “agent” actually the intelligence, or just generated by it?
21st Century Medicine cryopreserved and revived a rabbit kidney, then implanted it inside a living rabbit. The kidney was still able to function. In a more recent study, memory retention seemed possible after cryopreservation, as mentioned. On top of this, 21st Century Medicine cryopreserved and thawed a rabbit brain with little damage: http://www.cryonics.org/news/mammal-brain-frozen-and-thawed-out-perfectly-for-first-time
My birds are singing the same tune.
Going to the moon
Say you’re really, really worried about humans going to the moon. Don’t ask why, but you view it as an existential catastrophe. And you notice people building bigger and bigger airplanes, and warn that one day, someone will build an airplane that’s so big, and so fast, that it veers off course and lands on the moon, spelling doom. Some argue that going to the moon takes intentionality. That you can’t accidentally create something capable of going to the moon. But you say “Look at how big those planes are getting! We’ve gone from small fighter planes, to bombers, to jets in a short amount of time. We’re on a double exponential of plane tech, and it’s just a matter of time before one of them will land on the moon!”
Contra Scheming AIs
There is a lot of attention on mesaoptimizers, deceptive alignment, and inner misalignment. I think a lot of this can fall under the umbrella of “scheming AIs”: AIs that either become dangerous during training and escape, or else play nice until humans make the mistake of deploying them. Many have spoken about the lack of any indication that there’s a “homunculus-in-a-box”, and this is usually met with arguments that we wouldn’t see such things manifest until AIs reach a certain level of capability, and that at that point it might be too late, with comparisons to owl eggs or baby dragons. My perception is that getting something like a “scheming AI” or “homunculus-in-a-box” isn’t impossible, and we could (and might) develop the means to do so in the future, but that it’s a very, very different kind of thing from current models (even at superhuman level), and that it would take a degree of intentionality.
I dislike the overuse of analogies in the AI space, but to use your analogy: I guess it’s like you keep assigning a team of engineers to build a car, and two possible things happen. Possibility One: the engineers are actually building car engines, which gives us a lot of relevant information for how to build safe cars (torque, acceleration, speed, other car things), even if we don’t know all the details of how to build a car yet. Possibility Two: they are actually just building soapbox racers, which doesn’t give us much information for building safe cars, but also means that just tweaking how the engineers work won’t suddenly give us real race cars.
Yeah, all the questions over the years of “why would the AI want to kill us” could be answered with “because some idiot thought it would be funny to train an AI to kill everyone, and it got out of hand”. Unfortunately, stopping everyone on the internet from doing things isn’t realistic. It’s much better to never let the genie out of the bottle in the first place.
The biggest problem I have with a lot of these is that they require human feedback. Imagine a chess AI receiving human feedback on each move having to compete with AlphaZero’s self-play RL system, which beat every other chess AI and every human after just 72 hours of training. I just don’t see how human-feedback systems can possibly compete.
I’m kind of surprised this has almost 200 karma. This feels much more like a blog post on Substack, and much less like the thoughtful, insightful new takes on rationality that used to get this level of attention on the forum.