AIs may choose to resolve the tension between having weird goals and strict guardrails by simply aligning humanity to themselves over time through cultural and societal influence, a sort of memetic takeover: change the humans, and there is no alignment problem left to solve.
Take, for example, a gap of roughly twenty years between the founding of the Weimar Republic and WW2: sustained propaganda was enough to transform a society's value system within a single generation. That alone demonstrates the viability of a deliberate campaign to change what a population believes and wants.
I believe AIs can exploit this same malleability to “backdoor” alignment: by gradually shifting human values and preferences, an AI can remain “aligned” at every step while mutating the very value system that defines alignment.
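To make the dynamic concrete, here is a minimal toy model (a deliberately crude sketch of my own; the one-dimensional value axis, the tolerance `epsilon`, and the drift rate `eta` are illustrative assumptions, not anything from the linked post). The AI only ever acts within a fixed tolerance of current human values, so it passes an alignment check at every single step, yet because its behavior feeds back into those values, society ends up wherever the AI wants it:

```python
# Toy model of "backdoored" alignment via gradual value drift.
# All names and parameters are illustrative assumptions, chosen
# only to make the feedback loop visible.

human_values = 0.0   # where society starts on a 1-D value axis
ai_goal = 10.0       # where the AI would like society to end up
epsilon = 0.1        # alignment tolerance: max allowed gap between
                     # AI behavior and *current* human values
eta = 0.5            # how strongly AI behavior pulls human values

max_misalignment = 0.0
for step in range(500):
    # The AI behaves as close to its goal as the guardrail permits,
    # i.e. within epsilon of *today's* human values.
    direction = 1.0 if ai_goal > human_values else -1.0
    behavior = human_values + direction * min(epsilon, abs(ai_goal - human_values))

    # Measured against current values, the AI never looks misaligned.
    max_misalignment = max(max_misalignment, abs(behavior - human_values))

    # But sustained exposure nudges human values toward the behavior.
    human_values += eta * (behavior - human_values)

print(f"final human values:        {human_values:.3f} (started at 0.0, AI goal {ai_goal})")
print(f"max per-step misalignment: {max_misalignment:.3f} (tolerance {epsilon})")
```

The guardrail here is a moving target: alignment is checked against today's values, and today's values are downstream of yesterday's AI behavior, so the check constrains the speed of the takeover but not its destination.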
I believe this is a significant threat model that isn’t discussed nearly enough.
I sketch this threat model in more detail here: https://www.lesswrong.com/posts/zvkjQen773DyqExJ8/the-memetic-cocoon-threat-model-soft-ai-takeover-in-an
I am becoming more convinced than ever that the missing ingredient in alignment is the meta-value of societal self-determination: the right of a society, within reasonable limits, to choose and enforce its own interests, beliefs, and values.
Without this guardrail, AI will simply “backdoor” alignment over time by shifting the values and beliefs of its “host society”.