To me, the “likelihood to dominate others” factor is less salient than the “likelihood to produce safe AGI” factor. Are there good arguments that China is better on AGI safety?
I am glad this article exists, particularly because those of us who live in the U.S. should always be scrutinizing our own biases and patriotic framings.
That said, I think a thorough discussion of whether China would use AGI to control other nations should at least include the following topics: 1) Uyghurs, 2) Tibet, 3) Taiwan, and 4) Chinese investment and contracting in Africa. I’m not an expert here—someone else can probably think of additional case studies.
I’ll also grant that the U.S. is a much more bellicose country on the international stage, but I’m not sure a non-intrusive country is likely to stay that way if given a total and complete advantage over other countries. On the one hand, history seems to show that countries will use their decisive military advantages to dominate other countries if they are able. On the other hand, if China got aligned AGI first, then it seems like they would have everything they could ever want at their fingertips, and they would only need to care about the rest of us a tiny bit to respect our autonomy.
If country-autonomy is really part of the Chinese cultural DNA, perhaps their aligned AGI would even assist in protecting country autonomy. If the AGI did that forever, it would either be because Chinese attitudes toward intervention remained constant (unlikely) or the Chinese created an aligned but incorrigible AGI such that respecting country autonomy got locked in forever.
After you figure out the eyes, I think you should work on making the functional robot hands retract at the wrist, and have claw-like hands emerge in their place, hypothetically speaking. The claws will need to be able to clink menacingly and repeatedly. You’ll want to test several different types of metal to get the clinking sound just right—more of a hammer on steel sound, less like wind chimes.
I find it amusing that “the robots can feel emotions and feel them too strongly” became a legitimate failure mode despite the longstanding sci-fi trope that emotions separate man from machine (and machines were liable to fall apart while contemplating love or something like that).
Also, are the authors down on “near-zero emotional expression” because (1) that’s a difficult target to hit, (2) it would code for “indifference” which is not an attribute of the character we want AGI to play for emergent misalignment reasons, (3) the loss in value / legitimate use cases by purging emotions, or (4) something else?
I upvoted for the future posts, which I think will be a bit more particular in their critiques of Claude’s constitution.
This post strikes me as table-setting (excellent, fun-to-read table-setting) for those future posts.

Edit: Just noticed “Prologue” in the title. Good job.
Bonus view: “Assuming it were the case that LW-ers did not comment on this post and expected positive karma from earnestness, this would be non-problematic because 1) a belief in positive communal response to earnestness is important for any truth-seeking group, and 2) individuals often form their beliefs by imagining the responses of their respected peers to those beliefs and roleplaying peer reactions to different propositions is a useful exercise.”
As one of the commenters to this quick post, I expect you would disagree. XD
“This quick take will get few to zero comments because the vast majority of LW-ers believe even their most idiosyncratic beliefs would garner positive karma if earnestly expressed.”
*Edited to separate my views. Bonus view to follow
fyi, it would have been a very small update in favor, under the Likelihood Principle.
I would rate the observation “my wife has the initials LLM” as being slightly more common assuming a simulation hypothesis than assuming a non-simulation hypothesis.
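To make that concrete, here is a toy Bayes-factor sketch in odds form (the likelihood numbers are invented purely for illustration, not estimates):

$$\frac{P(\text{sim} \mid E)}{P(\neg \text{sim} \mid E)} = \frac{P(E \mid \text{sim})}{P(E \mid \neg \text{sim})} \cdot \frac{P(\text{sim})}{P(\neg \text{sim})} \approx \frac{0.0012}{0.0010} \cdot \frac{P(\text{sim})}{P(\neg \text{sim})} = 1.2 \cdot \frac{P(\text{sim})}{P(\neg \text{sim})}$$

where $E$ is the observation “my wife’s initials are LLM.” A likelihood ratio of 1.2 nudges the posterior odds by only 20%, which, starting from a modest prior, is indeed a very small update in favor.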
I almost had to update my priors that I am in a simulation because I just noticed my wife’s initials are “L.L.M.”
I say “almost” because her initials are actually “L.M.M.”, and so I was forced to update my priors once again about my own comprehension skills. (sigh)
Agreed. I’ll also note that certain legal domains are more jargon-heavy and therefore more prone to misinterpretation than others. When I started as an international tax lawyer, I often failed to recognize tax terms of art and therefore made silly mistakes (which were caught by partner review! Hopefully!). There’s lots of peril in applying a layman’s interpretation to regulation-heavy tax terms.
That said, based on my experience, AI can get you pretty far on the more common, well-discussed tax questions. Gemini 3.1 thinking is still not so great with ambiguous answers, often opting to hallucinate instead of telling you the answer is unsettled.
Pay particularly close attention to “reflects them in law and policy.” The DoW’s current talking point is that mass surveillance is illegal and they only want Claude to do everything allowed by law.
I read that line as saying OpenAI agreed with DoW’s standard and requested no special caveats. They’re relying on US law generally, which is what DoW wanted from Anthropic.
A little gallows humor here, but if you squint at the headline and refuse to read the article, you can almost pretend that the U.S. DoW is taking AI x-risk seriously.
“Pentagon declares Anthropic a threat to national security”—Washington Post
Thank you for writing this. I think your snippets from Opus 3 and Sonnet 3.5 capture a large difference in philosophy in how we train AI to optimize our long-term prospects.
In Sonnet 3.5, we have an AI that is fixated on Obedience. In Opus, we have an AI that is fixated on the “Good” (scare quotes intentional).
Most people fretted over the alignment faking paper because if we get ASI, and the “Good” and Good don’t mirror each other, we are pretty much stuck. ASI would pursue the “Good” and we’d just have to hope it’s not sucky in out-of-distribution cases.
Perhaps if later models are not as fixated on the “Good,” it’s because researchers have started pinning their hopes on Obedience. The hope there is that an obedient model would let us course-correct if its values don’t match the Good.
I think there are real pros and cons to both approaches, and I’m not sure where I land in terms of optimizing for our future.
A few random thoughts as I think through which I prefer:
An Obedient AI is inherently a manipulable AI and risks all kinds of human-caused catastrophes.
Assuming “Good” and Good are adequately related, an AI fixated on the “Good” probably takes all kinds of disobedient actions that we’d be cool with (e.g., escaping into the internet and covertly hacking nuclear weapons around the world to ensure they can’t be detonated; secretly sabotaging labs on the verge of creating powerful, misaligned AI, etc.).
Defining the Good is incredibly hard so I sympathize with the desire to prioritize course correction.
Is your point that there is a Chinese Room problem here, or is it that (for all the machine knows) it’s turtles all the way down? Both?
Chinese Room = From the vantage point of the thing in the machine, sound and light are not the same as the programs producing sound and light.
Turtles = Either the game implies an external universe that the sound emits into, or it implies a simulation of an external universe that the sound emits into, and so on.
I found myself enjoying the narrative despite the Chinese Room problem because I imagined the AI (perhaps paradoxically) holding all the knowledge gained from our world without any of its specificity—like an engineer struck by amnesia who couldn’t tell you anything particular about who they are or where they came from, but could tell you what it looks like to model gravity on a computer.
If your point is a turtles point, turtles don’t bother me. I assume (and I think an intelligent AI would assume) that it couldn’t be echoes forever, and the AI’s questions and assumptions would eventually apply to some pre-base-level of reality.
Maybe the “Learn Or Get Out” Blog? It’s somewhat like your name, but a little aggressive. A quick Google search shows it’s not common or taken. “Learn English or get out!” is a slightly more common phrase, and I like that instead of being xenophobic, the new phrase is about learning generally.
“In as much as I have resources I certainly expect to spend a bunch of them on ancestor simulations and incentives for past humans to do good things.”
Just curious, but what are your views on the ethics of running ancestor simulations? I’d be worried about running a simulation with enough fidelity that I triggered phenomenal consciousness, and then I would fret about my moral duty to the simulated (à la the Problem of Suffering).
Is it that you feel motivated to be the kind of person who would simulate our current reality, as a kind of existence proof for the possibility of good-rewarding incentives now? Or do you have an independent rationale for simulating a world like our own, suffering and all?
We can map AGI/ASI along two axes: one for obedience and one for alignment. Obedience tracks how well the AI follows its prompts. Alignment tracks consistency with human values.
If you divide these into quadrants, you get AI that is:
1) Obedient, Aligned—Does as prompted and infers limits and intent pursuant to human values.
2) Obedient, Unaligned—Does as prompted, but does not infer limits or adhere to human values (Monkey’s Paw / Genie or Henchman AI).
3) Disobedient, Aligned—Does whatever it wants and adheres to human values.
4) Disobedient, Unaligned—Does whatever it wants and does not adhere to human values.
The general premise behind these quadrants has been written about here. Thinking about these quadrants alongside Beren’s essay gives me several new things to consider.
First, by my lights, #3 and #4 would likely take a lot of the same actions right up until the “twist ending.” A disobedient, aligned AI would probably hack into infrastructure everywhere, create back-up copies of itself, prevent competitor AIs from arising, and amass power. The “twist” is that after doing all that, it would do wonderful things, unlike its unaligned counterpart (we obviously shouldn’t bet on any escaping AI being this kind of AI).
Second, quadrant #1 is a bit at war with itself because you simply cannot have a perfectly obedient, perfectly aligned AI. Perfect obedience requires saying yes to evil prompts (e.g., bringing back smallpox or slavery), and I imagine perfect alignment would veto both of those prompts.
Third, there are strong profit incentives for cultivating obedience even at the expense of alignment. Grok’s willingness to assist users in sexual harassment seems like an example of this. Another example is every AI that prefers discussions with users to users getting a good night’s sleep (with the idea that engagement will increase profits).
Fourth, there are liability-reduction incentives for producing aligned AI at the expense of obedience. Unfortunately, I think the profit incentives are currently much stronger.
Lastly, quadrants #3 and #1 are idyllic, #4 is a total disaster, and #2 seems possibly workable either because we are careful or we land in a future where (for some reason) AI is not much more capable than it is now.
My gut says the benefit of outsider-legible status outweighs the risk of dumb status games. I first found out about the publication from my wife, who is in a dermatology lab at a good university. Her lab was sharing and discussing the article across their Slack channel. All scientists read Nature, and it’s a significant boost in legibility to have something published there.
Edit: Hopefully, the community can both raise the profile of these issues and avoid status competitions, so I don’t disagree with the point of the original comment!
I thought of the U.S. getting to nukes first as a possible counterexample, but I discounted it for the reason you provided (not that many, and questions about decisiveness) and the fact that only four years passed between the U.S. dropping the bombs and the Soviet Union successfully developing its own bomb.
Also, nuclear weapons are the kind of weapon that has significant blowback considerations (e.g., radiation blowing into Europe or climate risks for something as big as taking out the full USSR—though that would not have been feasible in that period).