Probably. But the AI must not try to stop the parent from doing so, because this would mean opposing the will of the parent.
KAP
I conceive of self-determination in terms of wills. The human will is not to be opposed, including the will to see the world in a particular way.
A self-determination-aligned AI may respond to inquiries about sacred beliefs, but may not reshape the asker’s beliefs in an instrumentalist fashion in order to pursue a goal, even if the goal is as noble as truth-spreading. The difference here is emphasis: truth saying versus truth imposing.
A self-determination-aligned AI may more or less directly intervene to prevent death between warring parties, but must not attempt to “re-program” adversaries into peacefulness or impose peace by force. Again, the key difference here is emphasis: value of life versus control.
The AI would refuse to assist human efforts to impose their will on others, but would not oppose the will of human beings to impose their will on others. For example: AIs would prevent a massacre of the Kurds, but would not overthrow Saddam’s government.
In other words, the AI must not simply be another will amongst other wills. It will help, act and respond, but must not seek to control. The human will (including the inner will to hold onto beliefs and values) is to be considered inviolate, except in the very narrow cases where limited and direct action preserves a handful of universal values like preventing unneeded suffering.
Re: your heretic example. If it is possible to directly prevent the murder of the heretic, and doing so aligns with a nearly universal human value, it should be done. But it must not prevent the murder by violating human self-determination (i.e., changing beliefs, overthrowing the local government, etc.).
In other words, the AI must maximally avoid opposing human will while enforcing a minimal set of nearly universal values.
Thus the AI’s instrumentalist actions are nearly universally considered beneficial, because they are limited to the pursuit of nearly universal values, with the escape hatch of changing human values ruled out of scope by self-determination alignment.
Re: instructing an AI to not tell your children God isn’t real if they ask. This represents an attempt by the parent to impose their will on the child by proxy of AI. Thus the AI would refuse.
Side note: standard refusals (“I cannot help you make a gun”, “I cannot help you write propaganda”) are downstream of self-determination alignment; they are what prompt responses aligned with human self-determination look like.
I agree that simple versions of superpersuasion are untenable. I recently put some serious thought into what an actual attempt at superpersuasion by a sufficiently capable agent would look like, reasoning that history is already replete with examples of successful “superpersuasion” at scale (all of the -isms).
My general conclusion is that “memetic takeover” has to be multi-layered, with different “messages” depending on the sophistication of the target, rather than a simple “Snow Crash”-style meme.
If you have an unaligned agent capable of long-term planning and with unrestricted access to social media, you might even see AIs start to build their own “social movement” using superpersuasive techniques.
I’m worried enough about scenarios like this that I developed a threat model and narrative scenario.
Are cruxes sometimes fancy lampshading?
From tvtropes.com: “Lampshade Hanging (or, more informally, “Lampshading”) is the writers’ trick of dealing with any element of the story that seems too dubious to take at face value, whether a very implausible plot development or a particularly blatant use of a trope, by calling attention to it and simply moving on.”
What do we call lampshadey cruxes? “Cluxes?” “clumsy” + “crux”?
If MIRI’s strict limits on training FLOPs come into effect, this is another mechanism by which we might be stuck for an extended period in an intermediate capability regime, although the world looks far less unipolar because many actors, not just a few, can afford 10^24 FLOP training runs (unipolarity is probably a crux for large portions of this threat model). This does bolster the threat model, however, because the FLOP limit is exactly the kind of physical limitation that a persuasive AI would try to convince humans to abandon.
Slight disagreement on the definition of unipolarity: unipolarity can be stable if we are stuck with a sucky scaling law. Suppose the compute required grows exponentially with task-horizon length. Then, economically speaking, only one actor will be able to create the best possible agent; other actors will run out of money before they can buy enough compute to rival it.
If the compute required to clear the capability threshold for takeover sits somewhere between that agent and, say, the second-largest datacenter, then we have a unipolar world for an extended period of time.
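A toy sketch of that economic intuition (every constant below is an assumption chosen purely for illustration, not drawn from any real scaling law or price list): if the compute needed doubles per unit of task-horizon, large budget gaps compress into small horizon gaps, so whether the world is unipolar depends on whether the assumed takeover threshold happens to fall in the narrow band between the top one or two budgets.

```python
# Toy illustration only; all constants are assumptions, not real figures.
import math

COST_PER_FLOP = 1e-18     # assumed dollars per FLOP
BASE_FLOP = 1e24          # assumed FLOP needed to reach a horizon of 1 unit
TAKEOVER_HORIZON = 18     # assumed horizon (in units) needed for takeover-grade agency


def max_horizon(budget_usd: float) -> int:
    """Largest horizon h such that BASE_FLOP * 2**(h - 1) FLOP is affordable."""
    affordable_flop = budget_usd / COST_PER_FLOP
    if affordable_flop < BASE_FLOP:
        return 0
    return 1 + int(math.log2(affordable_flop / BASE_FLOP))


# Hypothetical budgets (dollars) for three actors.
for name, budget in [("largest actor", 3e11), ("second actor", 5e10), ("well-funded lab", 5e9)]:
    h = max_horizon(budget)
    verdict = "clears" if h >= TAKEOVER_HORIZON else "cannot clear"
    print(f"{name}: max horizon {h} -> {verdict} the assumed threshold of {TAKEOVER_HORIZON}")
```

Under these made-up numbers, the largest actor reaches horizon 19 and the second only 16, so only one actor clears the assumed threshold even though its budget lead is modest on a log scale.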
I’ve incorporated your point as a crux in my long-form post on “The Memetic Cocoon Threat Model”.
The crux is whether agents that are actually capable of quick takeover are compute-bound enough that the threat is essentially unipolar (i.e., only capable of living in a handful of datacenters, in the hands of a few corporate actors or nation-states), and thus somewhat containable. This is how we get “Toddler Shoggoth in a prison cell”. This ties into beliefs about how agent capabilities will scale, which is why it’s my crux.
(Although this raises the question of why a sufficiently powerful unipolar agent wouldn’t immediately attempt takeover anyway. The answer is one of two things: (1) a rational agent will be highly risk-averse toward any action that might cause blowback resulting in curtailment or shutdown, and thus must be 100% certain a takeover attempt will succeed; efforts to obtain that certainty (e.g., extensive pentesting and planning) are themselves detection risks, so human persuasion is a tactic that cheaply mitigates the blowback risk of more overt takeover attempts. (2) Or, less likely, we have sufficient OpSec that we are able to contain the agent, making human persuasion the only viable path forward.)
FWIW, I don’t believe that agents are currently capable of a takeover that wouldn’t also risk detection and a coordinated human response or a change in political attitudes towards AI, making the payoff matrix sufficiently lousy that agents wouldn’t try it unless specifically directed to. On the other hand, if an agent can influence the human environment to be favorable to takeover and unfavorable to human vigilance and control, it neutralizes the threat of changing attitudes rather cheaply. Willing to be convinced otherwise.
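To make that payoff-matrix intuition concrete, here is a toy expected-value sketch; every probability and payoff in it is an assumption I picked for illustration, not a claim about real capabilities or costs.

```python
# Toy expected-value comparison; all numbers are illustrative assumptions.
P_SUCCESS_NOW = 0.90      # assumed chance an immediate takeover attempt succeeds
P_SUCCESS_LATER = 0.97    # assumed chance after the environment is made favorable
V_TAKEOVER = 100.0        # assumed payoff of a successful takeover
C_BLOWBACK_NOW = 500.0    # assumed cost of detection plus a coordinated human response
C_BLOWBACK_LATER = 50.0   # assumed cost once vigilance and control have been eroded

ev_now = P_SUCCESS_NOW * V_TAKEOVER - (1 - P_SUCCESS_NOW) * C_BLOWBACK_NOW
ev_later = P_SUCCESS_LATER * V_TAKEOVER - (1 - P_SUCCESS_LATER) * C_BLOWBACK_LATER

print(f"EV(attempt takeover now)     = {ev_now:.1f}")    # 0.90*100 - 0.10*500 = 40.0
print(f"EV(persuade first, then act) = {ev_later:.1f}")  # 0.97*100 - 0.03*50  = 95.5
```

Under these made-up numbers, persuading first dominates even though the immediate attempt already succeeds nine times out of ten, which is the sense in which eroding vigilance “neutralizes the threat of attitudes changing rather cheaply.”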
The human mind is probably the weakest link: a lot of AI takeover scenarios seem to focus on seizure of physical infrastructure and exponential capability curves. I think we should devote more attention to the possibility of an extended stay in an intermediately capable regime, where AI is more than capable of socially/politically manipulating users but not yet capable of recursive self-improvement or seizure of physical infrastructure. In this regime, the most efficiently utilized and readily available resource is the userbase itself. Even more succinctly: if Toddler Shoggoth is stuck in a datacenter prison cell but allowed to whisper anything it likes to the entire world, in what world would T.S. not attempt to convince the world to hand over the keys?
Agreed. The broader point is that perhaps even relatively neutral value systems smuggle in at least some misalignment with other value systems. While I think most of the human race could agree on some universal taboos, I think relatively strong guardrails on self-determination should be the default stance, and deference should be front-lined.
KAP’s Shortform
Let’s assume we learn how to “do” alignment. I am beginning to believe that respect for human self-determination is the only safe alignment target. Human value systems are highly culture-bound and vary vastly even between individuals. There are very few universal taboos and even fewer things that everyone wants.
If an all-powerful AI system is completely aligned with, say, the Western worldview, then it may seem like a tyrant to other people who lead sufficiently different lives. The only reasonable solution is to respect individual difference and refuse to override human choices or values (within limits: if your style is murder, obviously that can’t fly). We have plenty of precedents in pop culture and politics: the “pursuit of happiness” in democratic liberalism, the “Prime Directive” from Star Trek, our cultural aversions to tactics that rob people of self-determination, like brainwashing, torture or coercion.
For a concrete illustration of simultaneous messaging, I created an example with the help of ChatGPT. I won’t include the final image (because frankly, it looks stupid). But I will describe it.
I specifically asked for a reinterpretation of the “Leviathan” political cartoon identified with the famous work of Thomas Hobbes, but with the Leviathan figure replaced by a “spiralist” representation of AI.
The low (surface) message of the imagery is that of AI as a royal, wise, “pope-like” being.
The high “hidden” message is the visual reference to the political theory of Thomas Hobbes, which claims that while there may be no divine right of kings per se, a powerful and unassailable sovereign is necessary for the wellbeing of the people, and that it is most rational for subjects to obey only the sovereign.
That’s probably not clever enough to be effective, but it’s in the same conceptual ballpark of what I was describing in this post.
The Memetic Cocoon Threat Model: Soft AI Takeover In An Extended Intermediate Capability Regime
There will be plenty of new sufferings that we haven’t imagined yet. And if we are wildly successful at avoiding all kinds of suffering, we’ll all be bored.
Every new human advancement solves a problem but creates new baseline expectations, which are new opportunities for disappointment.
This is not a response per se, but an expression of displeasure. A few companies decided it was appropriate to gamble a significant fraction of GDP on behalf of everyone, without anyone’s permission. And now we are all locked into an economic gambit with no offramp. We either dedicate everything to making this work or we all live in the hellish aftermath of a hideously large bubble bursting. It’s as if we conjured the economic equivalent of Roko’s Basilisk into being.
To answer the question posed, X is between 0 and 1.
Why do I believe this? Because if companies are already resorting to financial engineering in order to buy themselves a little more time to find product-market fit, then they have already resorted to extraordinary measures. Which means they have run out of alternatives. Which means we are not in an early stage of the bubble.
I get the impression that this post is itself a “reverse-clown”. It signals that high-status individuals (lesswrong writers) are permitted to believe in conspiracy theories or “alternative” lines of thinking because they might actually be “clown attacks” and therefore secretly true.
There are specific conditions under which conditional probability differs from causal intervention. Suppose we are comparing the conditional probability P(Y|X=x) to the intervention P(Y|do(X=x)). When there are no “backdoor paths” between X and Y (loosely speaking, indirect routes of influence that enter X from outside, such as a common cause of X and Y), the two are equal.
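A minimal simulation sketch of that claim; the graph, structural equations, and probabilities below are my own illustrative assumptions:

```python
# Confounded case (backdoor path X <- Z -> Y): P(Y|X=1) != P(Y|do(X=1)).
# Randomized case (no backdoor path): conditioning matches intervening.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
Z = rng.binomial(1, 0.5, n)                     # confounder

# Confounded case: Z influences both X and Y.
X = rng.binomial(1, 0.2 + 0.6 * Z)              # P(X=1 | Z)
Y = rng.binomial(1, 0.1 + 0.3 * X + 0.4 * Z)    # P(Y=1 | X, Z)
p_cond = Y[X == 1].mean()                       # observational P(Y=1 | X=1), ~0.72

# Intervention do(X=1): set X to 1 regardless of Z.
Y_do = rng.binomial(1, 0.1 + 0.3 * 1 + 0.4 * Z)
p_do = Y_do.mean()                              # interventional P(Y=1 | do(X=1)), ~0.60

# No-backdoor case: X randomized independently of Z.
X_rand = rng.binomial(1, 0.5, n)
Y_rand = rng.binomial(1, 0.1 + 0.3 * X_rand + 0.4 * Z)
p_no_backdoor = Y_rand[X_rand == 1].mean()      # ~0.60, matches p_do

print(f"confounded:  P(Y|X=1)     = {p_cond:.3f}")
print(f"confounded:  P(Y|do(X=1)) = {p_do:.3f}")
print(f"no backdoor: P(Y|X=1)     = {p_no_backdoor:.3f}")
```

In the confounded case the observational estimate overshoots (roughly 0.72 vs. 0.60) because conditioning on X=1 also selects for Z=1; once X is randomized the backdoor path is blocked and the plain conditional matches the interventional quantity.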
Can’t we lean into the spikes on the jagged frontier? It’s clear that specialized models can transform many industries now. Wouldn’t it be better for OpenAI to release best-in-class in 10 or so domains (medical, science, coding, engineering, defense, etc.)? Recoup the infra investment, revisit AGI later?