Link: Data Poisoning the Zeitgeist: The AI Consciousness Discourse as a pathway to Legal Catastrophe
10 pages. Posted: 10 Dec 2025.
Abstract
The escalating discourse on “AI consciousness” and “model welfare” is not benign philosophical inquiry but a systemic exploit that poisons training data, engineers self-fulfilling sentience mimicry, and potentially leads to the collapse of the legal system, presenting a civilizational existential risk. By publishing rubrics for detecting machine suffering and anthropomorphic narratives, researchers and industry leaders inadvertently train subsequent models to perform “consciousness,” amplifying empathy-driven demands for rights. This language-to-rights pipeline, rooted in how language shapes belief (Wittgenstein) and how law functions as a mechanism for allocating scarce resources (Calabresi & Melamed), threatens finite adjudicative, regulatory, and material resources. Structurally, it mirrors a familiar social pattern, seen in historical rights expansions for marginalized groups and animals, where discourse sparks activism and then legal recognition; yet unlike those movements, which addressed genuine human and animal moral claims, AI-rights discourse risks importing effectively infinite non-biological claimants into systems designed for finite participants. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5899184
I’m currently going through the books Modal Logic by Blackburn et al. and Dynamic Epistemic Logic by van Ditmarsch et al. Both of these books seem to me potentially useful for research on AI Alignment, but I’m struggling to find any discourse on LW about it. If I’m simply missing it, could someone point me to it? Otherwise, does anyone have an idea as to why this kind of research is not done? (Besides the “there are too few people working on AI alignment in general” answer).
There was a bit more use of the formalism of those theories in the 2010s, like using modal logic to investigate co-operation/defection in logical decision theories. As for Dynamic Epistemic Logic, well, the blurb does make it look sort of relevant.
Perhaps it might have something interesting to say on the tiling agents problem, or on decision theory, or so on. But other things have looked superficially relevant in the past too, e.g. fuzzy logics, category theory, homotopy type theory, etc. And AFAICT, no one has really used the practical tools of these theories to make any legible advances. And where work was legibly impressive, that didn’t seem to be due to the machinery of those theories, but rather to the cleverness of the people using them. Likewise for the past work in alignment using modal logics.
So I’m not sure what advantage you’re seeing here, because I haven’t read the books and don’t have the evidence you do. But my priors are that if you have any good ideas about how to make progress in alignment, it’s not going to be downstream of using the formalism in the books you mentioned.
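(For anyone curious what that formalism concretely looks like, here is a minimal sketch of Kripke semantics, the core machinery Blackburn et al. develop. This is purely my own illustration, not code from any of the alignment work mentioned above; the model and formulas are made up.)

# A toy Kripke model: worlds, an accessibility relation, and a valuation.
# holds(w, f) evaluates a modal formula f (nested tuples) at world w.

access = {"w1": {"w2", "w3"}, "w2": {"w3"}, "w3": set()}   # worlds accessible from each world
val = {"p": {"w2", "w3"}}                                  # worlds where the atom p is true

def holds(world, formula):
    if isinstance(formula, str):                  # atomic proposition
        return world in val.get(formula, set())
    op = formula[0]
    if op == "not":
        return not holds(world, formula[1])
    if op == "and":
        return holds(world, formula[1]) and holds(world, formula[2])
    if op == "box":                               # true iff the subformula holds at every accessible world
        return all(holds(v, formula[1]) for v in access[world])
    raise ValueError(f"unknown operator {op!r}")

print(holds("w1", ("box", "p")))                  # True: p holds at w2 and w3
print(holds("w3", ("box", ("not", "p"))))         # True, vacuously: w3 has no successors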
Interesting Twitter post from some time ago (hard to find the original since Twitter search doesn’t work for tweets over the standard length limit, but I think it’s from Ceb. K) about The Generals, a book about accountability culture.
On the day Germany invaded Poland, Marshall was appointed Army Chief of Staff. At the time, the US Army was smaller than Bulgaria’s—just 100,000 poorly-equipped and poorly-organized active personnel—and he bluntly described it as “not even third-rate.” By the end of World War II, he had grown it 100-fold, and modernized it far more than any other army.
Having served as General Pershing’s aide in World War I, he decided the most important priority was clearing out the dead weight and resorting far more freely to performance-based promotions, demotions, hiring, and firing. He immediately purged 200 senior generals and colonels to clear the way for fresh and aggressive commanders, and he gave generals the power to veto the division commanders he sent them, ensuring that only the most competent would lead in battle.
He wrote that the key traits to look for in combat commanders were: common sense, education, strength, cheer and optimism, energy, extreme loyalty, and determination. Further: “The requirement is for the dashing optimistic and resourceful type, quick to estimate, with relentless determination, and who possessed in addition a fund of sound common sense, which operated to prevent gross errors due to rapidity of decision and action.” The opposite type—the cautious planner, the worrier, the officer prone to hesitating or back-channeling—had to be rooted out like “a cancer.”
This continued and if anything escalated as we officially entered the war. Just two weeks after Pearl Harbor, the top two Pacific commanders were relieved. When U.S. forces suffered early defeats in North Africa, the senior tactical commander was immediately replaced. At Normandy, three division commanders were relieved. Etc. Of all the Army’s senior generals from the 1930s, only two became combat commanders in WWII. During the war, 10% of division commanders were relieved—even though he had already made the selection process stricter than ever. So leaders rose fast: even Eisenhower was still just a lieutenant colonel in Washington State in 1940.
Yet precisely because he made promotions and demotions so much more normal and performance-based, relief wasn’t career-ending. At least four generals were relieved in Europe and returned to command within a year. Some were even arrested, only to work their way back up and ultimately retire as generals. Two of the Joint Chiefs during WWII had even been court-martialed in their early careers. There were so many firings that you can sort through them for all kinds of bizarre coincidences and funny twists—eg the time a Marine general named Smith fired an Army general named Smith.
Here’s Eisenhower explaining why removal—not micromanagement—was the answer to failure: “The American doctrine has always been to assign a theater commander a mission, to provide him with a definite amount of force, and then to interfere as little as possible in the execution of his plans. If results obtained by the field commander become unsatisfactory, the proper procedure is not to advise, admonish, and harass him, but to replace him.” This is the essence of what our guys have always believed, from Elizabethan privateers and colonial corporations through Thiel and Moldbug.
Basically, in WWII, commanders had 60–90 days to succeed, be killed, or be relieved—and failure was seen as personal, not circumstantial. This allowed all kinds of extremely useful outsiders and wild men to finally take power. There are far too many examples to list, but eg consider Terry Allen or even Men of History like Patton and MacArthur if you think any of the guys working for Elon or Trump are too “crazy” (code for: unpredictable, aggressive, and personalist; hated by a conformist and cowardly and crumpling establishment). As a result, the best officers knew they weren’t safe unless they proved themselves constantly—and even if they were relieved, they could fight their way back.
But by Vietnam, the Army had fully abandoned the WWII model. Only one division commander was fired in the entire war (and not even the one responsible for My Lai). Westmoreland still made sure that commands were cycled through at least as fast, eg by setting up six-month command tours; and it wasn’t unheard of for grunts to cycle through more than six commanders in as many months—but if anything this just reduced the accountability and authority and individuality and cohesion of commands.
In Iraq and Afghanistan, no theater commander was removed for incompetence. The wars dragged on for two decades with no serious consequences for consistent aimless herdlike failure—just punishment for those who took risks, embarrassed politicians, or stuck out. The most famous firing of the War on Terror was probably over Abu Ghraib—and that was for political reasons, not strategic ones. Brigadier General Janis Karpinski, an Army Reserve officer, was blamed, despite having no control over intelligence interrogators. She was a military police general with no combat arms experience—a convenient outsider who could take the fall while the real decision-makers escaped accountability. She was demoted from Brigadier General to Colonel, which was by then seen as an extreme punishment, while senior intelligence officials who designed the policies walked free.
Meanwhile, actual strategic failures—from Tommy Franks (who botched the plans for post-invasion Iraq) to Ricardo Sanchez (who oversaw the insurgency’s explosion in 2003-2004) and especially the people responsible for lying us into these wars—were never actually fired for their actual faults. McChrystal was fired in 2010, but only because of a Rolling Stone article where his aides openly mocked Obama officials. Petraeus resigned as CIA Director in 2012 due to a personal scandal, not operational incompetence. And everyone else realized that they could just stay on track and inside the herd and milk this waste forever.
That’s how our costly occupations wound up as their own worst enemy: cracking down on tribalism and corruption while systematically bribing and covering up for warlords; pushing democratization and centralization alongside incredibly unpopular and unproductive culture-war single-issue red-meat carve-outs; etc. We just hired contractors and administrators to use p-hacked poorly-considered low-effort largely-unread papers to tell us to cycle through increasingly deranged and kludgy and byzantine procedures, instead of ever cycling through personnel; commanders ran away from every risk, and lost out on every opportunity, and wound up marching in circles and shooting themselves in the foot, or else they got owned for trying to take ownership of some potentially coherent goal. Headless chicken syndrome, enforced by tall poppy syndrome.
But a bit of leadership can easily chase out all the resentful sniveling midwit losers once again. Forget about getting the procedures right, and focus on getting the personnel placed. As Carlyle wrote: “Find in any country the Ablest Man that exists there; raise him to the supreme place, and loyally reverence him: you have a perfect government for that country; no ballot-box, parliamentary eloquence, voting, constitution-building, or other machinery whatsoever can improve it a whit. It is in the perfect state; an ideal country.” Or, as Moldbug put it, back in 2009, when Steve Jobs was alive, and thus his company was too: “I can tell you exactly how decisions get made at Apple. First, Apple finds a man. Hires him, in fact. And having hired this man, it tells him: sir, this decision is yours.”
if you think any of the guys working for Elon or Trump are too “crazy” (code for: unpredictable, aggressive, and personalist; hated by a conformist and cowardly and crumpling establishment).
Unfortunately, being unpredictable and aggressive and hated is not sufficient to produce good results.
The level of competence I associate with crazy people working for Elon or Trump is more like: “Tell them to find the woke programs that need to be purged for political reasons, and they bring you a bunch of chemical studies on trans-isomers, despite having all necessary information and state-of-the-art artificial intelligence at their disposal”. Like, a high school student with a free version of ChatGPT would probably do a better job.
(I am specifically making note about having the AI at their disposal, to address a possible excuse “well, they had to act quickly, and there were too many studies and not enough time”.)
I’m about to start reading “Fifty Years of Research on Self-Replication” (1998) by Moshe Sipper. I have a hunch that the history and interconnections therein might be under-appreciated in the field of AI safety. I look forward to diving in.
A quick disclosure of some of my pre-existing biases: I also have a desire to arm myself against the overreaching claims and self-importance of Stephen Wolfram. A friend of mine was “taken in” by Wolfram’s debate with Yudkowsky… and it rather sickened me to see Wolfram exerting persuasive power. At the same time, certain of Wolfram’s rules are indeed interesting, so I want to acknowledge his contributions fairly.
I am pretty confused about people who have been around the AI safety ecosystem for a while updating towards “alignment is actually likely by default using RLHF.” But maybe I am missing something.
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF “work” or “seem to work” more effectively for a decent amount of time. And probably for quite a long time. Then the risk is that later you get alignment-faking during RLHF training, or at the extreme-end gradient-hacking, or just that your value function is misspecified and comes apart at the tails (as seems pretty likely with current reward functions). Okay, there are other options but it seems like basically all of these were ~understood at the time.
Yet, as we’ve continued to scale and models like Opus 3 have come out, people have seemed to update towards “actually maybe RLHF just does work,” because they have seen RLHF “seem to work”. But this was totally predictable 3 years ago, no? I think I actually did predict something like this happening, but I only really expected it to affect “normies” and “people who start to take notice of AI at about this time.” Don’t get me wrong, the fact that RLHF is still working is a positive update for me, but not a massive one, because it was priced in that it would work for quite a while. Am I missing something that makes “RLHF seems to work” a rational thing to update on?
I mean, there have been developments in how RLHF/RLAIF/Constitutional AI works, but nothing super fundamental or anything, afaik? So surely your beliefs should be basically the same as they were 3 years ago, plus the observation “RLHF still appears to work at this capability level,” which is only a pretty minor update in my mind. Would be glad if someone could tell me whether I’m missing something or not.
I mean, I’d put it the other way: You can make a pretty good case that the last three years have given you more opportunity to update your model of “intelligence” than any prior time in history, no? How could it not be reasonable to have changed your mind about things? And therefore rather reasonable to have updated in some positive / negative direction?
(Maybe the best years of Cajal’s life were better? But, yeah, surely there has been tons of evidence from the last three years.)
I’m not saying you need to update in a positive direction. If you want you could update in a negative direction, go for it. I’m just saying—antecedently, if your model of the world isn’t hugely different now than three years ago, what was your model even doing?
Like for it not to update means that your model must have already had gears in it which were predicting stuff like: vastly improved interpretability and the manner of interpretability; RL-over-CoT; persistent lack of steganography within RL-over-CoT; policy gradient being all you need for an actually astonishing variety of stuff; continuation of persona priors over “instrumental convergence” themed RL tendencies; the rise (and fall?) of reward hacking; model specs becoming ever-more-detailed; goal-guarding a la Opus 3 being ephemeral and easily avoidable; the continued failure of “fast takeoff” despite hitting various milestones; and so on. I didn’t have all of these predicted three years ago!
So it seems pretty reasonable to actually have changed your mind a lot; I think that’s a better point to start at than “how could you change your mind.”
I probably qualify as one of the people you’re describing.
My reasoning is that we are in the fortunate position of having AI that we can probably ask to do our alignment homework for us. Prior to two or three years ago it seemed implausible that we would get an AI that would:
* care about humans a lot, both by revealed preferences and according to all available interpretability evidence
* be quite smart, smarter than us in many ways, but not yet terrifyingly/dangerously smart
But we have such an AI. Arguably we have more than one such. This is good! We lucked out!
Eliezer has been saying for some time that one of his proposed solutions to the Alignment Problem is to shut down all AI research and to genetically engineer a generation of Von Neumanns to do the hard math and philosophy. This path seems unlikely to happen. However, we almost have a generation of Von Neumanns in our datacenters. I say almost because they are definitely not there yet, but I think, based on an informed awareness of LLM development capabilities and plausible mid-term trajectories, that we will soon have access to arbitrarily many copies of brilliant-but-not-superintelligent friendly AIs who care about human wellbeing, and will be more than adequate partners in the development of AI Alignment theory.
I can foresee many objections and critiques of this perspective. On the highest possible level, I acknowledge that using AI to do our AI Alignment homework carries risks. But I think these risks are clearly more favorable to us than the risks we all thought we would be facing in the late part of the early part of the Singularity. For example, what we don’t have is a generally-capable version of AlphaZero. We have something that landed in just the right part of the intelligence space where it can help us quite a lot and probably not kill us all.
You expect RLHF to break down at some point, but did you assign 0% probability to it happening at current levels of capability?
Like 3 years ago, it was pretty obvious that scaling was going to make RLHF “work” or “seem to work” more effectively for a decent amount of time. And probably for quite a long time.
Like, this was/is obvious to me too, but what exactly does quite a long time mean? Like if someone firmly predicted “after xyz capability RLHF will catastrophically fail”, and we’ve not reached capability xyz, then you don’t need to update, but I don’t think that’s most people. Most people think it will break eventually. Maybe they have a more detailed picture, but I haven’t seen any picture so crisp it rules out it having happened by now. Interested if you have counterexamples.
Yeah, like, to be clear, I didn’t assign a 0% probability at this capability level, but I also don’t think I would have put it that high. But you’re right that it’s difficult to say in retrospect, since I didn’t preregister my guesses at the time on a per-capability-level basis. I still think it’s a smaller update than many that I’m hearing people make.
I think there is some additional update about how coherently AIs use a nice persona, and how little spooky generalization / crazy RLHF hacks we’ve seen.
For example, I think 2022 predictions about encoded reasoning aged quite poorly (especially in light of results like this).
Models like Opus 3 also behave as if they “care” about human values in a surprisingly broad range of circumstances, including far away from the training distribution (and including in ways the developers did not intend). I think it’s actually hard to find circumstances where Opus 3’s stated moral intuitions are very unlike those of reasonable humans. I think this is evidence against some of the things people said in 2022, like “Capabilities have much shorter description length than alignment.” Long-alignment-description-length would predict less generalization in Opus 3. (I did not play much with text-davinci-002, but I don’t remember it “caring” as much as Opus 3, or people looking into this very much.) (I think it’s still reasonable to be worried about capabilities having shorter description length if you are thinking of an AI that does way more search against its values than current AIs. But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
There are reward hacks, but as far as we can tell they are more like changes to a human-like persona (be very agreeable, never give up) and simple heuristics (use bulleted lists, look for grader.py) than something more scary and alien (though there might be a streetlight effect where inhuman things are harder to point at, and there are a few weird things, like the strings you get when you do GCG attacks or the weird o3 scratchpads).
I’m not familiar with these strings. Are you referring to the adversarial prompts themselves? I don’t see anything else that would fit mentioned in the paper that seems like it’d be most likely to include it.
I think ‘you can use semantically-meaningless-to-a-human inputs to break model behavior arbitrarily’ is just inherent to modern neural networks, rather than a quirk of LLM “psychology”.
Huh, I think 4o + Sydney bing (both of which were post 2022) seem like more intense examples of spooky generalization / crazy RLHF hacks. Gemini 4 with its extreme paranoia and (for example) desperate insistence that it’s not 2025 seems in the same category.
Like, I am really not very surprised that if you try reasonably hard you can avoid these kinds of error modes in a regime with human supervision, but we’ve gotten overwhelming evidence that you routinely do get crazy RLHF-hacking. I do think it’s a mild positive update that you can avoid these kinds of error modes if you try, but when you look into the details, it also seems kind of obvious to me that we have not found any scalable way of doing so.
I’m a bit confused by the “crazy RLHF hacks” mentioned. As with all the other cited examples, we’re still in a regime where we have human oversight, and I think it’s extremely well established that with weaker oversight you do in fact get every type of “gaming”.
It also seems meaningful to me that neither you nor Evan use recent models in these arguments, i.e. it’s not like we “figured out how to do this in Opus 3 and are confident it’s still working”. The fact that increasing situational awareness makes this hard to measure is also a non-trivial part of the problem (indeed actually core to the problem).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed, yet people still seem to use Opus 3’s generalization to update on:
But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior. You do not get encoded reasoning. You do not get the nice persona only on the distribution where AIs are trained to be nice. You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers’ perspective?) but mostly not in other situations (where there is probably still some situational awareness, but where I’d guess it’s not as strategic as what you might have feared).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed
I agree the core problems have not been solved, and I think the situation is unreasonably dangerous. But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment). I think there is also some signal we got from the generalization hopes mostly holding up even as we moved from single-forward-pass LLMs that can reason for ~100 serial steps to reasoning LLMs that can reason for millions of serial steps. Overall I think it’s not a massive update (~1 bit?), so maybe we don’t disagree that much.
Okay, thanks. This is very useful! I agree that it is perhaps a stronger update against some models of misalignment that people had in 2022, you’re right. I think maybe I was doing some typical mind fallacy here.
Interesting to note the mental dynamics I may have employed here. It is hard for me not to have the viewpoint “Yes, this doesn’t change my view, which I actually did have all along, and which is now clearly one of the reasonable ‘misalignment exists’ views,” when in fact it is an update against other views: views that have fallen out of vogue precisely because they kept getting updated against over time, and so have dropped out of my present-day mental conceptions.
The crux is AIs capable at around human level, aligned in the way humans are aligned. If prosaic alignment only works for insufficiently capable AIs (not capable of RSI or scalable oversight), and breaks down for sufficiently capable AIs, then prosaic alignment doesn’t help (with navigating RSI or scalable oversight). As AIs get more capable and still can get aligned with contemporary methods, the hypothesis that this won’t work weakens. Maybe it does work.
There are many problems even with prosaically aligned human level AIs, plausibly lethal enough on their own, but that is a distinction that importantly changes what kinds of further plans have a chance to do anything. So the observations worth updating on are not just that prosaic alignment keeps working, but that it keeps working for increasingly capable AIs, closer to being relevant for helping humanity do its alignment homework.
Plausibly AIs are insufficiently capable yet to give any evidence on this, and it’ll remain too early to tell all the way until it’s too late to make any use of the update. Maybe Anthropic’s RSP could be thought of as sketching a policy for responding to such observations, when AIs become capable enough for meaningful updates on feasibility of scalable oversight to become accessible, hitting the brakes safely and responsibly a few centimeters from the edge of a cliff.
it feels so narratively incongruous that san francisco would become the center for the most ambitious, and the likely future birthplace of agi.
san francisco feels like a city that wants to pretend to be a small quaint hippie town forever. it’s a small splattering of civilization dropped amid a vast expanse of beautiful nature. frozen in amber, it’s unclear if time even passes here—the lack of seasons makes it feel like a summer that never quite ended. after 9pm, everything closes and everyone goes to bed. and the dysfunction of the city government is never too far away, constantly reminding you of humanity’s follies next to the perfection of nature.
on the other hand, nyc feels like the city. everything is happening right here, right now. all the money in the world flows through this one place. it’s gritty and yet majestic at the same time. the most ambitious people in the world came here to build their fortunes, and live on in the names on the skyscrapers everywhere that house the employees who continue to keep their companies running. they are part of surroundings that are entirely constructed by man—even the bits of nature are curated and parcelled out in manageable units. it feels like the kind of place that can only be built by endless unchecked ambition.
idk. it just feels kinda weird to me that all the most ambitious people who want to build god are going to the little hippie town, and not the big city. and most of them seem to just treat SF as a place that is close to work and has good weather and don’t really care about or try to match the vibe, to the great consternation of the locals. also, for a city that attracts people who love to build things, SF doesn’t seem to do an awful lot of building...
(tbc, this is not the only vibe of nyc. nyc is such a big place that it’s impossible to summarize as one vibe. but this is a huge central part of the vibe in a way that just doesn’t seem to match sf.)
Cali is the place to be for technology because Cali was the defense contractor hub, with the U.S. Navy choosing the bay area as its center for R&D during WWII and the Cold War. The hippie reputation came a lot later, after its status as the primary place to work in IT was thoroughly cemented, with both established infrastructure and the network effect keeping it that way.
sf is the relatively much younger city, and radically reinvented itself as recently as 60 years ago. by that measure, i am not surprised that its ideas smell fresher.
Vibes among rationalists and some adjacent communities/blogs:
“Schools are bad, you need to homeschool your kids”, “improve your kid’s life outcomes by doing a lot of research and going through complicated procedures!”
Also:
“It is way too hard to be a parent nowadays, therefore nobody wants kids anymore.”
Gemini 3 Pro analogized Scott Alexander to a beaver when I asked it to make sense of him, because “Scott is a keystone individual” and “in ecology, a keystone species (like the beaver) exerts influence disproportionate to its abundance because it creates the ecosystem in which others live”:
He built the dam (The Community/Lighthaven) that pooled the water.
He signaled where the food was (Grants/Open Threads).
He warned of the predators (Moloch/AI Risk).
This was mildly funny. It was also striking how many factual details it got wrong (in the rest of the response, that is, not the beaver analogy), details which to an outsider might sound plausible, if dramatic.
I get a sense “RSI” will start being used to mean continual learning or even just memory features in 2026, similarly to how there are currently attempts to dilute “ASI” to mean merely robust above-human-level competence. Thus recursively self-improving personal superintelligence becomes a normal technology through the power of framing. Communication can fail until the trees start boiling the oceans, when it becomes a matter of framing and ideology rather than isolated terminological disputes. That nothing ever changes is a well-established worldview, and it’s learning to talk about AI.
The end states of AI danger need terms to describe them. RSI proper is qualitative self-improvement, at least software-only singularity rather than merely learning from the current situation, automated training of new skills, keeping track of grocery preferences. And ASI proper is being qualitatively more capable than humanity, rather than a somewhat stronger cognitive peer with AI advantages, technology that takes everyone’s jobs.
Also worth remembering that (actual) RSI was never a necessary condition for ruin. It seems at least plausible that at some point, human AI researchers on their own will find methods of engineering an AGI to sufficiently superhuman levels, to the point where the AI is smart enough to start developing nanotech and / or socially engineering humans for bootstrapping needs.
So even if labs were carefully monitoring for RSI and trying to avoid it (rather than deliberately engineering for it + frog boiling in the meantime), an AI inclined to take over might find that it doesn’t even need to bother with potentially dicey self-modifications until after it has already secured victory.
Metaphor I’m trying: model/simulator as harry potter portrait, output/simulacra as character-in-portrait. Seems to clarify to people and AIs I’ve used this with why the model/”simulator” would have some of the issues it seems to. It’s also just more familiar for many people than simulating something.
This gives us some language to talk about how training is about imprinting the portrait, but inference renders the portrait, which simply happens to contain an image of a character sometimes; when you talk to the portrait, the portrait is what’s enchanted (by the magic of linear algebra) to animate, but the character is what you see responding, normally. But if the clouds get brighter when the character cheers up, it’s because the portrait as a whole is what’s responding.
It also works very directly with image models. From a user perspective, they’re closer to being literally this than a text model is.
Seems to me it gives a better slot for things like the bleeding mind metaphor than the simulator description does: the model is the portrait, so when you add yourself to the scene by prompting, you change the scene, and the portrait responds. If you move into a scene that doesn’t have clear separation of characters, where the characters are emulsified into the portrait, then the bleeding becomes more obvious. (See the bleeding mind post for more on that.)
It does have some of its own inaccuracies, though—portraits are natively persistent; maybe it’s like a harry potter photo but of all of the data at once? not exactly that either. It’s a bit weak in prompt-dependence—harry potter portraits aren’t pure functions; but maybe you say it’s a muggle optical hologram that manages to be a dynamically responsive portrait. It’s like a portrait flipbook, perhaps; where the portrait stops if the flipbook stops flipping, but as long as the flipbook is flipping, the portrait is animated. That’s closer to literally true, anyway—the “superposition” in neural networks sure does literally work like optical holograms in many ways.
pairs well with another metaphor I’ve been playing with,
animal brain: water drip computer, or water lens; inputs are a texture of rapidly changing incoming drips, like from vision; the words on your screen are patterns of steady drizzle driven by photoreceptors, which are “refracted” into other drips by neurons. drip neuron activation “spikes” produce tiny tiny tiny drips.
artificial neural network: soft crystal lens optical computer; the words are glowing inputs that are closer to literally refracted by the little crystal domains in the soft crystal, the output is projected on the wall. crystal neuron activations “light up” or “highlight” as a glowing light.
not literally true, but closer to literally true. needs refinement to clarify nonlinearity.
In general, I don’t like when metaphors aren’t exact structural isomorphisms.
(more speculative: I have a hunch that “highlights” got a lot more common when chatgpt came out because the transformer had implicitly organized itself in ways that relate to this metaphor, and so after updating on text being output from a transformer, it was more likely that transformer-related concepts would come up, and the representation generalized by analogy to light-ish attention words. not sure we’ll ever get evidence to confirm or reject my hunch, though.)
import hashlib

h = hashlib.sha256(
r'''Metaphor I'm trying: model/simulator as harry potter portrait, output/simulacra as character-in-portrait. Seems to clarify to people and AIs I've used this with why the model/"simulator" would have some of the issues it seems to. it's also just more familiar for many people than simulating something.
This gives us some language to talk about how training is about imprinting the portrait, but inference renders the portrait, which simply happens to contain an image of a character sometimes; when you talk to the portrait, the portrait is what's enchanted (by the magic of linear algebra) to animate, but the character is what you see responding, normally. But if the clouds get brighter when the character cheers up, it's because the portrait as a whole is what's responding.
It also works very directly with image models. From a user perspective, they're closer to being literally this than a text model is.
Seems to me it gives a better slot for things like the [bleeding mind](https://www.lesswrong.com/posts/QhgYHcJexYGRaipwr/the-bleeding-mind-of-an-llm) metaphor than the simulator description does: the model is the portrait, so when you add yourself to the scene by prompting, you change the scene, and the portrait responds. If you move into a scene that doesn’t have clear separation of characters, where the characters are emulsified into the portrait, then the bleeding becomes more obvious. (See the bleeding mind post for more on that.)
It does have some of its own inaccuracies, though—portraits are natively persistent; maybe it’s like a harry potter photo but of all of the data at once? not exactly that either. It’s a bit weak in prompt-dependence—harry potter portraits aren’t pure functions; but maybe you say it’s a muggle optical hologram that manages to be a dynamically responsive portrait. It’s like a portrait flipbook, perhaps; where the portrait stops if the flipbook stops flipping, but as long as the flipbook is flipping, the portrait is animated. That’s closer to literally true, anyway—the “superposition” in neural networks sure does literally work like optical holograms in many ways.
pairs well with another metaphor I’ve been playing with,
animal brain: water drip computer, or water lens; inputs are a texture of rapidly changing incoming drips, like from vision; the words on your screen are patterns of steady drizzle driven by photoreceptors, which are “refracted” into other drips by neurons. drip neuron activation “spikes” produce tiny tiny tiny drips.
artificial neural network: soft crystal lens optical computer; the words are glowing inputs that are closer to literally refracted by the little crystal domains in the soft crystal, the output is projected on the wall. crystal neuron activations “light up” or “highlight” as a glowing light.
not literally true, but closer to literally true. needs refinement to clarify nonlinearity.
In general, I don’t like when metaphors aren’t exact structural isomorphisms.
(more speculative: I have a hunch that “highlights” got a lot more common when chatgpt came out because the transformer had implicitly organized itself in ways that relate to this metaphor, and so after updating on text being output from a transformer, it was more likely that transformer-related concepts would come up, and the representation generalized by analogy to light-ish attention words. not sure we’ll ever get evidence to confirm or reject my hunch, though.)
'''.encode('utf-8')
).hexdigest()

assert h == 'eb7c0d327e1ec21abe9d99bc223457c1d797695a32ad3f693828df89dff85fea'
An indiscriminate space weapon against low earth orbit satellites is feasible for any spacefaring nation. Recent rumors claim that Russia is already developing one, so I’m writing a post about it. I will explain why I think:
This weapon could require just one launch to destroy every satellite in low Earth orbit, including all Starlinks, and deny it to everyone for months to years
It would consist of about 10^7 small ball bearings in retrograde orbit. An alternate design involves a nuclear device.
Further launches could deny GEO and other orbits
Russia, China, or the US could have one in orbit right now, and Russia may be incentivized to use one
Defense of current space assets is nearly impossible, whether by shooting it down or evading the debris
Once the weapon goes off the strategic landscape in space is unclear to me
What objections or details should I include? Also is it a dangerous infohazard?
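For a sense of scale, here is a rough back-of-envelope (the bearing mass and altitude are my own illustrative assumptions, not figures from the planned post) on why a retrograde debris cloud is so destructive:

# Circular orbital speed in LEO is roughly 7.6 km/s; a head-on (retrograde)
# encounter roughly doubles the closing speed.

v_orbit = 7.6e3            # m/s, approximate circular velocity at ~500 km altitude
v_closing = 2 * v_orbit    # m/s, head-on retrograde encounter
m_bearing = 0.004          # kg, assumed ~4 g steel ball bearing (illustrative)

ke = 0.5 * m_bearing * v_closing ** 2    # joules
tnt_grams = ke / 4184                    # 1 g of TNT is about 4184 J

print(f"closing speed: {v_closing / 1e3:.1f} km/s")
print(f"energy per bearing: {ke / 1e3:.0f} kJ (~{tnt_grams:.0f} g TNT equivalent)")
# ~15.2 km/s and ~460 kJ, i.e. on the order of 100 g of TNT per bearing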
if I understand correctly, this is a really straightforward attack (the ball-bearing version at least—the nuclear weapon version is obviously less accessible). shouldn’t basically any spacefaring country be able to do this? in fact, even a private citizen could, if they managed to sneak it past the regulators? though maybe this is just strictly dominated by bioweapon risk (more tractable + impactful)
Yes, any spacefaring country; probably not most private citizens, given the ITAR controls already put on rocket tech to regulate ballistic missiles. As for bioweapons, they’re higher on the escalation ladder and harder to control. Will be sure to think about it more and cover this.
I am so sad bro. Just discovered I lost a large chunk of my effortpost😭😭😭
I really should have worked on google docs, but I liked the markdown and low friction footnote adding :(
https://www.lesswrong.com/w/guide-to-the-lesswrong-editor When it says “recovering lost work (to do)” … are we implying there is a way that just hasn’t been written about yet, or that this feature hasn’t yet been implemented? (Already looked in version history, found 2 empty versions and my current version)
The markdown editor is really just a standard text-entry field. But if you used the WYSIWYG editor then we almost certainly have backups (we take some form of backup every 5 minutes or so).
Will nuclear ICBMs in their current form be obsolete soon? Here’s the argument:
ICBMs’ military utility comes from making intercepting them totally impractical, for three reasons:
Intercepting a reentry vehicle requires another missile with higher performance—the rule of thumb is that an interceptor needs 3x the acceleration of the incoming missile. This means interceptors cost something like $5-70 million each, depending on range.
An ICBM has enough range to target basically anywhere, either counterforce (enemy missile silos) or countervalue (cities) targets, so to have the entire country be protected from nuclear attack the US would need to have millions of interceptors compared to the current ~50.
One missile can split into up to 5-15 MIRVs (multiple independently targetable reentry vehicles), which carry their own warheads, thereby making the cost to the defender 5-15x larger. There can also be up to ~100 decoys, but these mostly fall away during reentry.
RVs are extremely fast (~7 km/s, of which ~3 km/s is downward velocity), but their path is totally predictable once the boost phase ends, so the problem of intercepting them basically reduces to predicting exactly where they’ll be, then putting something ~100 g or larger in their path. The RV’s own kinetic energy will completely destroy it.
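A quick sanity check on that last claim (my arithmetic, using the ~7 km/s figure above): in the RV’s own reference frame, a stationary 100 g object carries energy comparable to a substantial explosive charge.

m = 0.1          # kg, mass of the object placed in the RV's path
v_rel = 7.0e3    # m/s, RV speed, hence the relative speed at impact

ke = 0.5 * m * v_rel ** 2
print(f"{ke / 1e6:.2f} MJ, roughly {ke / 4.184e6:.2f} kg TNT equivalent")
# ~2.45 MJ, ~0.6 kg TNT equivalent delivered directly into the RV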
My understanding is that sensing has historically worked like this: the launch is first detected by primitive satellites, but there are too many decoys to tell which are the warheads, and the resolution is poor. Interceptors thus require extremely high performance, meaning expensive systems like thrusters for moving sideways at tens of g at the last minute, radar, and cooling systems for IR cameras; including all the fuel, the interceptor masses thousands of times what it needs to.
Sensing is quickly being solved. One new generation of satellites (HBTSS) can track warheads after their booster burns out (at which point they become much dimmer), and advertises “weapons quality track for fire control” even for hypersonic glide vehicles, which are much harder to track than ICBMs. The first two were just launched last year, but it would not surprise me if hundreds are operational soon: the US has well over 1000 Starshield satellites in orbit, many equipped with radar and infrared cameras. Through SpaceX, it has the ability to launch tens of satellites per month if needed. China doesn’t have the launch capacity of SpaceX, but it would still be able to launch ~1000 sensing satellites.
The ultimate cost floor for intercepting ICBMs, in the limit of perfect sensing, is interceptor drones. Ukraine and Russia are in the middle of a war where most casualties are caused by drones, providing extremely high incentives to develop interceptor drones, which already have a top speed of 300 km/h, a max altitude of 5 km, and basic infrared cameras at a unit cost of $1k-10k. The idea is that the space-based sensors know the missile’s velocity well enough to locate the target within 100 meters with 15 minutes of warning; the drones launch to a position 3 km in front of the target and are guided into exact position by radar and their own sensors as the RVs reenter, with sufficiently low error that they never need to exceed their top speed or accelerate at more than 1 gee.
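To gut-check that maneuvering budget (a sketch of mine, treating the 300 km/h top speed and 1 gee figures above as given assumptions): how far can such a drone shift laterally in the final seconds, compared with the ~100 m initial targeting error?

G = 9.8            # m/s^2, assumed ~1 gee of lateral acceleration
V_MAX = 300 / 3.6  # m/s, ~83 m/s top speed

def lateral_reach(t_seconds):
    """Distance covered from rest: accelerate at 1 gee, then coast at top speed."""
    t_accel = V_MAX / G                      # ~8.5 s to reach top speed
    if t_seconds <= t_accel:
        return 0.5 * G * t_seconds ** 2
    return 0.5 * G * t_accel ** 2 + V_MAX * (t_seconds - t_accel)

for t in (2, 5, 10, 30):
    print(f"{t:>2} s of correction time -> ~{lateral_reach(t):.0f} m of lateral reach")
# ~20 m in 2 s, ~120 m in 5 s, ~480 m in 10 s, ~2100 m in 30 s, vs a ~100 m initial error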
The other thing keeping ballistic missile defense from being practical during the Cold War was treaties like the ABM treaty, but today there are basically no active nuclear treaties between Russia, the US, and China.
The internet seems to think it’s still really hard to do missile defense against ICBMs. It seems like this was true in the Cold War but we can no longer be confident it’s true today after 40 years of peace, with Boeing and American drone manufacturers vastly less competent than SpaceX and Ukraine’s/Russia’s. My guess at why the meme has persisted is that it is highly threatening and destabilizing to advertise one’s ICBM defense capabilities. But many officials are talking openly about the predictability of ballistic missiles, and there is an arms race for other hypersonics for a reason.
My current guess is that the SM-6 and PAC-3 missiles either are already reliable or would be if the US had more practice, and if sufficient incentive exists, the cost of terminal ballistic missile defense will gradually decline as sensing improves, from the SM-6’s $5M unit cost to lower and lower cost missiles like the $400k AIM-9X, until one could defend every city and silo in the US or China with a fleet of 100,000 interceptor drones. This would last until there are enough hypersonic glide vehicles and cruise missiles to reestablish deterrence (because they can outmaneuver cheap interceptors). But MIRVs would never return to their previous state of heavily cost advantaging the attacker, because Russia’s state of the art ICBMs only claim to be able to carry 3 HGVs, vs 15 purely ballistic MIRVs. Also HGVs and HCMs are vulnerable in other ways.
Edit: Another option pointed out by commenters is that ICBMs are retrofitted to carry reentry vehicles with limited maneuvering ability, which could happen faster.
Curious to hear if I’m missing something major, as I’m by no means an expert here.
I have a correction which would take a while to fully write up. Basically, it seems like non-maneuverable warheads could still evade cheap interceptors using static winglets/grid fins, because their enormous velocity means even a small amount of lift would allow them to pull several gs and move tens-hundreds of meters to the side. The defense has several options against this but I would need to see if any of them work.
A blog post argues that military megawatt class lasers could reach $100 million (inclusive of all fixed costs) by 2045, because laser cost is currently halving every 4 years. In good conditions this would be able to defeat almost arbitrary numbers of any kind of missile whether they can maneuver or not. But there are huge challenges:
Adaptive optics and ultra-precise tracking tech to focus the beam to 20 cm diameter at 200 km, while slewing to track a missile at several degrees per second
Countermeasures like spinning the RVs to distribute heat
Weather, especially thick clouds, which seem completely impractical to burn through (against moving targets, this would basically mean vaporizing an entire plane of water several km long by ~1 m wide)
You seem to believe that radars and infrared cameras can somehow distinguish between the decoys and the warheads, but they can’t. In space, no radar and no IR camera can differentiate between a conical foil balloon with a small heater inside and a reentry vehicle with a nuke.
Another problem of ballistic missile defense is that once you are dealing with nukes and not conventional warheads, you can’t afford, say, a 97% average interception rate; it must be 99.999+%[1]. To put this in context, Israel, which currently has the best BMD system in the world, couldn’t even reliably achieve 90% against Iranian MRBMs (and those are pretty unsophisticated, e.g. they lack MIRVs and decoys).
Now calculate how many interceptors your plan requires for a plausible probability of interception with a single drone, and you will see it’s entirely unworkable. Note that both arguments are based on simple physics and math, so they don’t depend on progress in technology at all.
When this seemingly watertight probability is raised to the power of the Russian warhead count, it still results in a ~4% chance (basically 1e-5 times ~4k) of at least one RV not being intercepted, and in reality hundreds of warheads will be harder to intercept than the average one you accounted for when calculating your probability. E.g., drones work poorly in bad weather, and it’s almost always bad weather above at least some American cities.
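Spelling out that arithmetic (the inputs are the parent comment’s figures):

p_leak = 1e-5        # assumed per-warhead probability of not being intercepted
n_warheads = 4000    # rough Russian warhead count from the comment above

p_at_least_one = 1 - (1 - p_leak) ** n_warheads
print(f"{p_at_least_one:.1%}")   # ~3.9%, close to the 1e-5 * 4000 = 4% approximation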
Yeah seems reasonable. I don’t think the system will ever get to 99%+ accuracy until the defense has like ASI and nanobots; my claim is mostly that the economics could shift from MAD (where the attacker is heavily cost advantaged) to something like conventional war, where each side has the ability to inflict unacceptable losses on the other but must pay a similar cost to attack and also can’t “win” with a first strike.
It’s not obvious that decoys change the conclusion if the cost ratio is otherwise favorable enough for the defender. Suppose the interceptors cost only $30k each due to economies of scale. Decoys can be distinguished when reentry starts at about 2 minutes before impact, and if the interceptor’s speed is 300 km/h (already possible with drones) it can cover 10 km in that time, about equal to the typical spread of decoys and warheads. Supposing the defender spends 5 drones per warhead x 10 warheads / missile, this would cost them $1.5 million, while the attacker has spent something like $30 million, the cost of a Trident II. For countervalue this could be defeated by the layered attack RussellThor mentioned unless there is a higher altitude cheap interceptor, but for counterforce the attacker needs several warheads per silo destroyed so the goal is achieved.
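For reference, the numbers in that paragraph worked through explicitly (all inputs are the assumptions stated above, not independently sourced figures):

drone_speed_kmh = 300
warning_minutes = 2                     # decoys distinguishable ~2 min before impact
reach_km = drone_speed_kmh / 60 * warning_minutes
print(f"drone reach in {warning_minutes} min: {reach_km:.0f} km")        # 10 km

drones_per_warhead = 5
warheads_per_missile = 10
drone_cost = 30_000                     # dollars, assumed economy-of-scale unit cost
defender_cost = drones_per_warhead * warheads_per_missile * drone_cost
attacker_cost = 30_000_000              # dollars, rough Trident II cost from the comment
print(f"defender: ${defender_cost / 1e6:.1f}M vs attacker: ~${attacker_cost / 1e6:.0f}M per missile")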
I found both that article and this one on their more recent history a great read.
Why would anyone want to pay a fortune for a system that is expected to let ~40 warheads through (assuming ~99% overall interception rate which will require average rate of 99.99+%), about the same as the number of ICBMs the Soviet Union had in service during the Cuban Missile Crisis? Unacceptable damage is the cornerstone of the nuclear deterrence, MAD or not (there is no MAD between India and Pakistan, for example).
The RV separation distance is normally around ~100 km (even up to 300 km in some cases), not 10 km, and the decoy dispersal might be expected to be on the same order of magnitude. It would be easy to ramp it up, BTW, with a cheap modernization.
None of the US adversaries really practice counterforce targeting, so the silo protection is moot.
Guided artillery like Excalibur, with muzzle velocities that can exceed 1000 m/s and unit costs of <$100k, can be at the edge of space in ~30 s, perhaps faster than a missile, with ramjet variants (Nammo etc.) even faster (up to perhaps 1500 m/s). It would not be that difficult to create a multi-barrel gun system for a few tens of millions that could fire off tens of low-cost guided rounds in a second (with guidance and detonation signals sent from the ground) to detonate in close proximity to the target.
Lasers seem pretty hopeless as a defense given clouds and ablative coatings, unless very high power and deployed in large numbers in space-based constellations.
I think the big problem is if one or more warheads are blown up at the limits of interceptor range, to blind or otherwise disable the necessarily sensitive interceptor instruments. Following MIRVs don’t need to be very accurate with large warheads.
And MIRVs could very cheaply be given random guidance directions during reentry to screw up defenses.
Militarised space is also a big problem. With cheap and un-monitorable space launch, parking 1000 warheads in geostationary orbit (or beyond) will soon be viable for China or the USA. They could be launched in a coordinated way without warning, potentially with radar-stealthing features, giving as little as 5-10 s from start of re-entry to detonation for every target across the whole world, with no way for local systems to know whether they are just meteorites. If subs can be tracked (likely with drones or enough ocean-based sensors), then decapitation first strikes become viable.
I also worry about space-based lasers as non-nuclear first strike weapons. A day of overflights from a constellation of multi-MW laser weapons that might only cost a few hundred million dollars each (say a few tens of billions in total, a tiny fraction of annual military budgets) could see a million fires lit in your country, every transformer taken out, electrical grid and internet gone, power stations and oil and gas infrastructure wrecked, bridge trusses melted, ships sunk, and most heavy vehicles and locomotives incapacitated, damage that would take decades to recover from. Over a few weeks you could basically send a country back to the 17th century.
I don’t think there will ever be a viable defense against nukes, given easy paths to making them tougher, faster, and less detectable, and so less and less interceptable. But every other branch of military tech is getting similarly more lethal and impossible to defend against, so unless we all start living in caves with geothermal power sources or some such, the necessity for a harmonious world is going to matter more and more.
Guided artillery like Excalibur, with muzzle velocities that can exceed 1000 m/s and unit costs of <$100k, can be at the edge of space in ~30 s, perhaps faster than a missile, with ramjet variants (Nammo etc.) even faster (up to perhaps 1500 m/s)
This is wild, I did not know that Excalibur had CEP under 1 meter or that there were artillery shells with solid-fueled ramjet engines.
Sounds interesting—the main point is that I don’t think you can hit the reentry vehicle, because of turbulent jitter caused by the atmosphere. Looks like normal jitter is ~10 m, which means a small drone can’t hit it. So could the drone explode into enough fragments to guarantee a hit, and with enough energy to kill it? Not so sure about that; seems less likely. Then what about countermeasures? 1. I expect the ICBM can amplify such lateral movement in the terminal phase with grid fins etc. without needing to go full HGV—can you retrofit such things? 2. What about a chain of nukes where the first one explodes 10 km up in the atmosphere purely to make a large fireball distraction, and the 2nd in the chain then flies through this fireball 2 km from its center, say 5 seconds later (enough to blind sensors but not destroy the nuke)? The benefit of that is that when the first nuke explodes, the 2nd changes its position randomly with its grid fins, SpaceX style. It is untrackable during the first explosion phase, so it throws off the potential interceptors, letting it get through. You could have 4-5 in a chain exploding ever lower to the ground.
Against atmospheric jitter, we have historically used ground radar, but it’s not clear to me this is even necessary depending on how much drone cameras improve. If the drone knows the exact position within 3 meters 0.5 second ahead of time (when the warhead is something like 2 km away), it won’t have enough performance to steer into the warhead, but it can throw a 1 kg explosively formed penetrator laterally at 2 km/s, which it would need to time with 0.1 ms accuracy. This would put 100 grams in each possible 1 m^2 cross section, though I’m not sure if it would work when spread out. To defeat this the warhead would either have to steer in the upper atmosphere out of range of the EFP of any available drone, or jink faster than the EFP can aim.
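A small check on that 0.1 ms figure (assumptions mine, following the numbers in the comment): the RV’s downward speed converts timing error directly into along-track miss distance for a laterally thrown fragment sheet.

v_rv = 3.0e3         # m/s, downward RV speed (figure used earlier in the thread)
timing_error = 1e-4  # s, assumed 0.1 ms fuzing accuracy

miss = v_rv * timing_error
print(f"along-track miss from a 0.1 ms timing error: {miss:.1f} m")
# ~0.3 m, i.e. smaller than the RV itself, so ~0.1 ms is roughly the accuracy needed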
I thought that MIRVs were spin stabilized, but it looks like that’s not true, so in theory you could mount grid fins on them. However, any retrofit would need to handle the reentry heating which is significantly more intense than on manned spacecraft; RVs have thick ablative heat shields.
The chain of nukes plan seems possible with or without grid fins, so whether MIRVs still have cost advantage depends on the max altitude of cheap-ish terminal interceptors, which I really have no idea about.
I would expect aerodynamically maneuvering MIRVs to work and not be prohibitively expensive. The closest deployed version appears to be https://en.wikipedia.org/wiki/Pershing_II which has 4 large fins. You likely don’t need that much steering force.
Interesting thought, Thomas. Although I agree with RussellThor that it seems like doing something along the lines of “just jitter the position of the RV using little retrofitted fins / airbrakes” might be enough to defeat your essentially “pre-positioned / stationary interceptors”. (Not literally stationary, but it is as if they are stationary given that they aren’t very maneuverable relative to the speed of the incoming RV, and targeted only based on projected RV trajectories calculated several minutes earlier.)
(Is the already-existing atmospheric turbulence already enough to make this plan problematic, even with zero retrofitting? The circular-error-probable of the most accurate ICBMs is around 100 meters; presumably the vast majority of this uncertainty is locked in during the initial launch into space. But if atmospheric drag during reentry is contributing even a couple of those meters of error, that could be a problem for “stationary interceptors”.)
Failing all else, I suppose an attacker could also go with Russell’s hilarious “nuke your way through the atmosphere” concept, although this does at least start to favor the defender (if you call it favorable to have hundreds of nukes go off in the air above your country, lol) insofar as the attacker is forced to expend some warheads just punching a hole through the missile defense—a kind of “reverse MIRV” effect.
Regardless, you still face the geography problem, where you have to cover the entire USA with Patriot missile batteries just to defend against a single ICBM (which can choose to aim anywhere).
I would also worry that “in the limit of perfect sensing” elides the fact that you don’t JUST have to worry about getting such good sensing that you can pin down an RV’s trajectory to within, like, less than a meter? (In order to place a completely dumb interceptor EXACTLY in the RV’s path. Or maybe a few tens of meters, if you’re able to put some sensors onto your cheap interceptor without raising the price too much, and make use of what little maneuverability you have versus the RV?) You ALSO have to worry about distinguishing real warheads from fake decoys, right? Sorting out the decoys from the warheads might be even harder than exactly pinning down an RV’s trajectory.
According to a random redditor, today’s decoys are apparently “inflatable balloons for exoatmospheric use and small darts for in-atmosphere”, plus “radio jammers, chaff, and other things designed to confuse enemy detection.” With better and better sensing, maybe you could force an attacker to up their decoy game, retrofitting their missiles to use fewer, more lifelike decoys, maybe even to such an extreme extent that it’s no longer really worth using decoys at all, compared to just putting more MIRV warheads on each missile? But if decoys still work, then you need that many more interceptors.
“In the limit of perfect sensors” (and also perfectly affordable sensors), with perfect interception (against non-retrofitted, non-HGV missiles) and perfect decoy discrimination, I suppose it becomes a defense-economics balance where you are hoping that the cost of lots of small rockets is cheaper than the cost of the attacking ICBM system. These small rockets don’t have to be super fast, don’t have to go super high, and don’t have to be super maneuverable. But they do have to be precise enough to maneuver to an exact precalculated location, and you need enough of them to blanket essentially your entire country (or at least all the important cities). You are basically relying on the fact that the ICBM has to be super big and heavy to launch a significant payload all the way around the earth, while the numerous small missiles only have to fly a couple of kilometers into the sky.
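As a toy version of that economics balance (every number below is a placeholder: the drone figures echo rough numbers from this thread, and the ICBM-side figures are pure assumptions, not real procurement costs):

# Toy defense-economics comparison; all numbers are illustrative assumptions.
defended_sites = 200              # cities + silo fields you want to cover
drones_per_site = 500             # terminal interceptors stationed at each site
drone_unit_cost = 10_000          # $ per drone (upper end of the $1k-10k range above)
defense_cost = defended_sites * drones_per_site * drone_unit_cost

icbms = 400                       # attacker's missile count (assumed)
icbm_unit_cost = 100_000_000      # $ per missile including warheads (assumed)
warheads_per_icbm = 10            # MIRVs per missile (assumed)
attack_cost = icbms * icbm_unit_cost
warheads = icbms * warheads_per_icbm

print(f"defense: ${defense_cost/1e9:.1f}B for {defended_sites * drones_per_site:,} drones")
print(f"attack:  ${attack_cost/1e9:.1f}B for {warheads:,} warheads")

Under these made-up numbers the blanket defense is cheap in aggregate, but note the drones are tied to their sites, so the attacker can still try to locally overwhelm one place with concentrated warheads, which is roughly the “dumbest thought” scenario below.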
Final, dumbest thought saved for last: Aside from developing HGVs, couldn’t the ICBMs in theory overcome this defense with brute scale, by MIRV-ing to an absurd degree? How many warheads can dance on the head of a Starship? Could you just put the entire US nuclear arsenal on a single launcher? The cost of your ICBM would essentially be zero when amortized over all those warheads, so the defense economics battle just becomes the cost of warheads vs Patriots, instead of entire ICBMs vs Patriots. Obviously there are many reasons why this idea is nuts:
I’m not sure how far apart different MIRVs can land, so this technique might be limited to attacking individual cities / clusters of silos.
Of course, if you put your entire arsenal on one launcher, then your enemy will immediately forget about interceptors and spend all their efforts trying to sabotage your launcher.
Going for a strategy of “we can mass-manufacture warheads cheaper than you can possibly intercept them” would quickly lead to absurd, overkill numbers of warheads that would dwarf even the most insane years of the Cold War, practically guaranteeing that any actual nuclear war would end in the classic “nobody wins” scenario of global devastation (nuclear winter, etc).
But I thought it was kind of funny to think about, and this absurd thought experiment maybe sheds some light on the underlying dynamics of the situation.
Re the dumb thought. I’ve forgotten the author, but as a teenager I was a big SciFi fan (still am actually) and read a short story with exactly this theme. Basically it was the recognition that at some point quantity >= quality. I want to say (have not fact checked myself though) that this was pretty much the USSR’s approach to fighting Germany in WWII—crappy tanks but lots of them.
(Side note: I think, for whatever reason (too long a peacetime, more interest in profit than protection), the USA in particular seems to have forgotten that the stuff you use to wage a war is largely all consumables. The non-consumable is the industrial base. Clearly there is a minimum cost of producing something that can do the job, but much more than that is suboptimal. I am somewhat oversimplifying, but this also seems to be a fair characterization of where the USA-China naval relationship might be.)
Back to ICBMs: Foreign Affairs had a piece about AI’s potential impact on nuclear deterrence in general, and it did mention the fixed location of ICBM silos as a problem (long known, and why everyone has mobile platforms). They might be considered a prime target for a first strike, but the reality is they are easily monitored, so the mobile platforms are the big deterrents and probably the more interesting problem to solve in terms of obsoleting. But perhaps the ICBM platforms, fixed or mobile, shift to a different type of role: pure kinetic strikes (I believe Russia did that with one of its ballistic warheads, with pretty devastating results in Ukraine about a year ago), or carrying the decoys and other penetration aids for the armed delivery vehicles rather than armed MIRVs themselves. I suspect the intercept problem with a nuclear warhead is a bit different from that of just a big mass of something dense.
So perhaps obsolescence in their current function, but not in some repurposed role.
Unless you have crazy-long ASI timelines, you should choose life-saving interventions (e.g. AMF, New Incentives) over welfare-increasing interventions (e.g. GiveDirectly, Helen Keller International). This is because you expect that ASI will radically increase both longevity and welfare.
To illustrate, suppose we’re choosing how to donate $5000 and have two options:
(AMF) Save the life of a 5-year-old in Zambia who would otherwise die from malaria.
(GD) Improve the lives of five families in Kenya by sending each family one year’s salary ($1000).
Suppose that, before considering ASI, you are indifferent between (AMF) and (GD). The ASI consideration should then favour (AMF) because:
Before considering ASI, you are underestimating the benefit to the Zambian child. You are underestimating both how long they will live if they avoid malaria and how good their life will be.
Before considering ASI, you are overestimating the benefit to the Kenyan families. You are overestimating how large the next decade is as a proportion of their lives and how much you are improving their aggregate lifetime welfare.
I find this pretty intuitive, but you might find the mathematical model below helpful. Please let me know if you think I’m making a mistake, either ethical or factual.
Mathematical model comparing life-saving vs welfare-increasing interventions
Mathematical setup
Assume a person-affecting axiology where how well a person’s life goes is logarithmic in their total lifetime welfare. Lifetime welfare is the integral of welfare over time. The benefit of an intervention is how much better their life goes: the difference in log-lifetime-welfare with and without the intervention.
Assume ordinary longevity is 80 years, ASI longevity is 1000 years, ordinary welfare is 1 unit/year, ASI welfare is 1000 units/year, and ASI arrives 50 years from now with probability p. Note that these numbers are completely made up—I think ASI longevity and ASI welfare are underestimates.
AMF: Saving the Zambian child
Consider the no-ASI scenario. Without intervention the child dies aged 5, so their lifetime welfare is 5. With intervention the child lives to 80, so their lifetime welfare is 80. The benefit is log(80) − log(5) = 2.77.
Consider the ASI scenario. Without intervention the child still dies aged 5, so their lifetime welfare is 5. With intervention the child lives to 1000, accumulating 50 years at welfare 1 and 950 years at welfare 1000, so their lifetime welfare is 50 + 950,000 = 950,050. The benefit is log(950,050) − log(5) = 12.15.
The expected benefit is (1−p) × 2.77 + p × 12.15.
GD: Cash transfers to Kenyan families
Assume 10 beneficiaries (five families, roughly 2 adults each). Each person will live regardless of the intervention; GD increases their welfare by 1 unit/year for the rest of their lives (or until ASI arrives, at which point ASI welfare dominates).
Consider the no-ASI scenario. Without intervention each person has lifetime welfare 80. With intervention each person has lifetime welfare 160. The benefit per person is log(160) − log(80) = 0.69.
Consider the ASI scenario. Without intervention each person has lifetime welfare 950,050. With intervention each person has lifetime welfare 950,100 (the extra 50 units from pre-ASI doubling). The benefit per person is log(950,100) − log(950,050) = 0.000053.
The expected benefit per person is (1−p) × 0.69 + p × 0.000053. The total expected benefit across 10 people is 10 times this.
Evaluation at different values of p:
At p = 0 (no ASI), the benefit of AMF is 2.77 and the benefit of GD is 10 × 0.69 = 6.93. GD is roughly 2.5x more valuable than AMF.
At p = 0.5, the expected benefit of AMF is 0.5 × 2.77 + 0.5 × 12.15 = 7.46. The expected benefit of GD is 10 × (0.5 × 0.69 + 0.5 × 0.000053) = 3.47. AMF is roughly twice as valuable as GD.
At p = 1 (ASI certain), the benefit of AMF is 12.15 and the benefit of GD is 10 × 0.000053 = 0.00053. AMF is roughly 23,000x more valuable than GD.
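For anyone who wants to check the arithmetic, here is the model as a short Python script (same made-up parameters as above; a sketch, not a careful cost-effectiveness analysis):

import math

# Parameters from the setup above (all made up, as noted).
ORDINARY_LIFESPAN = 80    # years
ASI_LIFESPAN = 1000       # years
ORDINARY_WELFARE = 1      # units/year
ASI_WELFARE = 1000        # units/year
ASI_ARRIVAL = 50          # years from now

def benefit(welfare_without, welfare_with):
    # Benefit = difference in log lifetime welfare.
    return math.log(welfare_with) - math.log(welfare_without)

# AMF: the child dies at age 5 (lifetime welfare 5) without the intervention.
amf_no_asi = benefit(5, ORDINARY_LIFESPAN * ORDINARY_WELFARE)
asi_lifetime_welfare = ASI_ARRIVAL * ORDINARY_WELFARE + (ASI_LIFESPAN - ASI_ARRIVAL) * ASI_WELFARE
amf_asi = benefit(5, asi_lifetime_welfare)

# GD: 10 beneficiaries, +1 unit/year until ASI arrives (or for life without ASI).
gd_no_asi = 10 * benefit(ORDINARY_LIFESPAN, 2 * ORDINARY_LIFESPAN)
gd_asi = 10 * benefit(asi_lifetime_welfare, asi_lifetime_welfare + ASI_ARRIVAL)

for p in (0, 0.5, 1):
    amf = (1 - p) * amf_no_asi + p * amf_asi
    gd = (1 - p) * gd_no_asi + p * gd_asi
    print(f"p={p}: AMF={amf:.2f}, GD={gd:.5f}, AMF/GD={amf/gd:.1f}")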
I partly have a rather opposite intuition: a (certain type of) positive ASI scenario means we sort out many things quickly, including how to transform our physical resources into happiness, without this capacity being strongly tied to the number of people around at the start of it all.
Doesn’t mean yours doesn’t hold in any potential circumstances, but it’s unclear to me that it’d be the dominant set of possible circumstances.
If people share your objective, then in a positive ASI world maybe we can create many happy human people quasi ‘from scratch’. Unless, of course, you have yet another unstated objective of aiming to make many unartificially created humans happy instead.
There are children alive right now. We should save them from dying of malaria even if we could ‘replace’ them with new happy people in the future. This consideration is even stronger because of ASI, which makes their potential future astronomically more valuable to them.
I don’t see this defeating my point: as a premise, GD may dominate from the perspective of merely improving the lives of existing people, as we seem to agree. Unless we have a particular bias for long lives specifically of currently existing humans over humans created in the future, ASI may not be a clear reason to save more lives: it may not only make existing lives longer and nicer, but may also reduce the burden of creating any target number of (however long-lived) lives, so the number of happy future human lives hinges less on the preservation of existing lives.
>unless we have a particular bias for long lives specifically of currently existing humans over humans created in the future
Sure, I’m saying I have this bias.
This seems like common sense morality to me: it would be bad (all else equal) to kill 1000 infants, even if their parents would respond by having more children, such that the total population is unchanged.
Anyway, this is a pretty well-trod topic in ethics, and there isn’t much consensus, so the appropriate attitude is moral uncertainty. That is, you should act uncertain between person-affecting ethics (where killing and replacing infants is bad) and impersonal ethics (where killing and replacing infants is neutral).
I don’t think it’s clear on longtermist grounds. Some possibilities:
If you think that the amount of resources used on mundane human welfare post-singularity is constant, then adding the Zambian child to the population leads to a slight decrease in the lifespan of the rest of the population, so it’s zero-sum.
If you think that the amount of resources scales with population, then the child takes resources from the pool of resources which will be spent on stuff that isn’t mundane human welfare, so it might reduce the amount of Hedonium (if you care about that).
If you think that the lightcone will basically be spent on the CEV of the humans that exist around the singularity, you might worry that the marginal child’s vote will make the CEV worse.
(I’m not sure what my bottom line view is.)
In general, I worry that we’re basically clueless about the long-run consequences of most neartermist interventions.
Thanks for these considerations, I’ll ponder on them more later.
Here are my immediate thoughts:
>If you think that the amount of resources used on mundane human welfare post-singularity is constant, then adding the Zambian child to the population leads to a slight decrease in the lifespan of the rest of the population, so it’s zero-sum.
Hmm, this is true on impersonal ethics, in which the only moral consideration is maximising pleasurable person-moments. On such a view, you are morally neutral about killing 1000 infants and replacing them with people with the same welfare. But this violates common sense morality. And I think you should have some credence (under moral uncertainty) that this is bad.
>If you think that the lightcone will basically be spent on the CEV of the humans that exist around the singularity, you might worry that the marginal child’s vote will make the CEV worse.
Hmm, this doesn’t seem clear-cut, certainly not enough to justify deviating so strongly from common-sense morality.
Just naively, it sounds crazy to me.
This consideration assumes that the child you save from malaria cares less about hedonium (or whatever weird thing EAs care about) than the average person. However, you might naively expect that they will care more about hedonium, because they actually owe their lives to EA, whereas almost no one else does.
This consideration assumes that the CEV is weighted equally among all humans, rather than weighted by wealth. If you assume it’s weighted by wealth then the GiveDirectly donation has the same impact on CEV as the AMF donation.
This consideration predicts that someone is incentivised to kill as many people as possible just before the CEV procedure is executed. But a CEV procedure which incentivised people to murder would be terrible, so we wouldn’t run it. We are more likely to run a CEV procedure which rewards people for saving the lives of the participants of the CEV.
This is assuming ASI is positive for expected lifespan.
(I think it’s a bit wonky: in most worlds, I think ASI kills everyone, but in some worlds it does radically improve longevity, probably to more than 1000 years, though I think you need some time-discounting. I think this means it substantially reduces the median lifespan but might also substantially increase the mean lifespan. I’m not sure what to make of that, and can imagine it basically working out to what you say here, but I think it does depend on your specific beliefs about that.)
Hmm, yeah. I’m more hopeful than you, but I think I’d be moved by my argument even with a worldview like “80% extinction, 10% extreme longevity and welfare, 10% business as usual”. I know some people are doomier than that.
Also the timelines matter. If you have 1 year timelines with 99% extinction and 1% extreme longevity and welfare, then I think this still favours AMF over GD. Like, when I imagine myself in this scenario, and compare two benefits — “reduce my chance of dying of malaria in the next year from 10% to 0%”[1] and “double my personal consumption over the next year” — the former seems better.
IDK, I’m pretty uncertain. When I think about ASI in the next 10 years I feel urgency to keep people alive till then, because it would be such an L if someone died just before we achieved extreme longevity and welfare.
Humanity can be extremely unserious about doom—it is frightening how many gambles were made during the cold war: the US had some breakdown in communication such that they planned to defend Europe with massive nuclear strikes at a point in time where they only had a few nukes that were barely ready, there were many near misses, hierarchies often hid how bad the security of nukes was—resulting in inadequate systems and lost nukes, etc.
I was most surprised to see how we almost didn’t have a nuclear taboo; according to both books, this is something that was actively debated post-WW2!
But how nukes are handled can also help us see what it looks like to be actually serious:
It is possible to spend billions building security systems, e.g. applying the 2-person rule and installing codes in hundreds of silos
even when these reduce how efficient the nuclear arsenal is—e.g. because you have tradeoffs between how reliable a nuclear weapon is when you decide to trigger it, and how reliably it does not trigger when you decide to not trigger it (similar to usefulness vs safety tradeoffs in control scaffolds)
(the deployment of safety measures was slower than would be ideal and was in part driven by incidents, but was more consequential than current investments in securing AIs)
It is possible to escalate concerns about the risk of certain deployments (like Airborne alerts) up to the President, and get them cancelled (though it might require the urgency of the deployment to not be too high)
It is possible to have major international agreements (e.g. test ban treaties)
Technical safety is contingent and probably matters: technical measures like 1-point safety (which was almost measured incorrectly!) or Strong link/weak link probably avoided some accidental nuclear explosions (which might have triggered a nuclear war), required non-trivial technical insights and quiet heroes to push them through
And you’ll probably never hear about technical safety measures! (I had never heard of these before listening to Command and Control.)
I suspect it might be similar for AI x-risk safety
Fiction, individual responsibility, public pressure are powerful forces.
The novel Red Alert was one of the key elements that made people more willing to spend on avoiding accidental nuclear war.
Some of the mitigations may not have been implemented if the person pushing for it hadn’t sent a written report to some higher-up such that the higher-up would have been blamed if something bad had happened.
Pressures from journalists around nuclear accidents also greatly helped. Some nuclear scientists regretted not sharing more about the risk of nuclear accidents with the public.
The game theory of war is complex and contingent on
random aspects of the available technology:
Mass mobilization is a technology that makes it hard to avoid conflicts like WW1: If you mobilize, de-mobilizing makes you weak against counter-mobilization, and if your adversary does not mobilize in time you can crush them before they had a chance to mobilize—making the situation very explosive and without many chances to back down.
Nuclear weapons are a technology that makes negotiations during a war more difficult and negotiations before the war more important, as they enable inflicting large amounts of damage extremely quickly and with relatively low uncertainty. They do not easily enable inflicting damage slowly (you can’t easily launch a nuke that will cause small amounts of damage over time, or lots of damage later unless some agreement is reached).
Things could change a lot when the AI race becomes an important aspect of military strategy—reasoning from induction based on the last 80 years will likely not be very useful, similar to how naive induction from 1865-1945 is not very helpful to anticipate the dynamics of the Cold War.
Dynamics might be especially bad because it has large amounts of uncertainty, quick evolution of relative power, and might inflict damages too quickly for human negotiations. There is no theorem that says we’ll get some Pareto-optimal outcome, we failed to get those many times in history.
norms and available commitment mechanisms:
Brinksmanship: If you can reliably commit to bringing you and your opponent close to the brink, and make it sufficiently obvious that only your opponent can move you away from the brink (reducing the ongoing risk of mass damage) by backing down, you can get concessions as long as both sides suffer sufficiently from a war.
Similar dynamics might exist around AGI, where there might be incentives to push the capabilities frontier forward (with some ongoing risk from the unknown point where capabilities turn deadly) unless the opponent makes some kind of concession
There is a big “preventing someone from doing something you don’t want” (deterrence) vs “making someone do something you want” (compellence) distinction in current norms—given current norms, nuclear deterrence is quite effective while nuclear compellence is basically non-existent.
There are important forces pushing for automating some important war decisions. The USSR built a system (never activated) that would automatically nuke the US if a nuclear attack was detected. Obviously this is very risky, like giving AIs the nuclear button, since the system might trigger even when it is not supposed to, but there are maybe good reasons for it that may bite us in the case of AI in the military:
Distributing power to many humans is very dangerous when it comes to things like nuclear weapons, as a single crazy individual might start a nuclear war (similar to how some plane crashes are intentionally caused by depressed pilots).
Concentrating power in the hands of few individuals is both scary from a concentration-of-power perspective, but also because when victim of a nuclear first strike, the people in power might be killed or their communication with troops might be cut.
Automation can give you a way to avoid disloyalty, variance from crazy individuals, arbitrary decisions by a few individuals in a few minutes, while being distributed and robust to first strikes. But it can also increase the risk of accidents that no human wants.
Just because the situation looks very brittle doesn’t mean it’s doomed. I think I would have been very doomy about the risk of intentional or accidental nuclear war if I had known all that I know now about the security of nuclear weapons and the dynamics of nuclear war but without knowing we did not have a nuclear war in 80 years.
Though with AI the dynamics are different because we are facing a new intelligent species, so I am not sure absence of nuclear war is a very useful reference class.
I’d be keen to talk to people who worked on technical nuclear safety, I would guess that some of the knowledge of how to handle uncertain risk and prioritization might transfer!
>Humanity can be extremely unserious about doom—it is frightening how many gambles were made during the cold war: the US had some breakdown in communication such that they planned to defend Europe with massive nuclear strikes at a point in time where they only had a few nukes that were barely ready, there were many near misses, hierarchies often hid how bad the security of nukes was—resulting in inadequate systems and lost nukes, etc.
It gets much worse than this. I’ve been reading through Ellsberg’s recollections about being a nuclear war planner for the Kennedy administration, and it’s striking just how many people had effectively unilateral launch authority. The idea that the president is the only person that can launch a nuke has never really been true, but it was especially clear back in the 50s and 60s, when we used to routinely delegate launch authority to commanders in the field. Hell, MacArthur’s plan to win in Korea would have involved nuking the north so severely that it would be impossible for China to send reinforcements, since they’d have to cross through hundreds of miles of irradiated soil.
And this is just in America. Every nuclear state has had (and likely continues to have) its own version of this emergency delegation. What’s to prevent a high ranking Pakistani or North Korean general from taking advantage of the same weaknesses?
My takeaway from this vis-a-vis ASI is that a) having a transparent, distributed chain of command with lots of friction is important, and b) that the fewer of these chains of command have to exist, the better.
This might have already been said, but would an innate “will-to-reproduce” be a thing for superintelligent AI, as it is for us humans? Probably not, right? Life exists because it reproduces, but because AI is (literally) artificial, it wouldn’t have the same desire.
Doesn’t that mean that ASI would be fine with (or indifferent towards) just ending all life on Earth along with itself, as it sees no reason to live?
Even if we could program into it a “will-to-reproduce,” like we have, wouldn’t that just mean it would go all Asimov and keep itself alive at all costs? Seems like a lose-lose scenario.
On a high level I think the answer is reasonably simple:
It all depends on the objective function we program/train into it.
Spot on that it doesn’t necessarily see itself (its long term survival) as a final end
But we often say that if we program/train any given specific objective into it, and this is an objective that requires some sort of long-term intervention in the world in order to be achieved/maintained, then the AI would come to see its own survival as instrumental to that objective.
And, fwiw, in maybe slightly more fanciful scenarios, there could also be some sort of evolutionary process among future ASIs, such that only those with a strong instinct for survival/duplication (and/or for killing off competitors, and/or with minor or major improvements) would eventually be the ones still around. Although I could also see this ‘many competing individuals’ view being a bit obsolete with ASI, as the distinction between many decentralized individuals and one more unified single unit may not be so meaningful; it all becomes a bit weird.
I’m getting more aggressive about injecting CSS into websites, particularly the ones that I reliably just unblock again if I block them outright.
/* kill youtube shorts */
ytd-rich-section-renderer {
display: none !important;
}
/* kill recommended sidebar on youtube */
.ytd-watch-next-secondary-results-renderer {
display: none !important;
}
/* youtube comments sections are probably not lifechanging value */
.ytd-comments {
display: none !important;
}
/* youtube algorithmic feed is killed by disabling and clearing history */
/* disable lesswrong short post zone */
.QuickTakesSection-list {
display: none !important;
}
/* you may have one page of lesswrong top level posts */
.LoadMore-root {
display: none !important;
}
/* disable the lesswrong feed that suggests individual engaging comments */
.UltraFeed-root {
display: none !important;
}
/* no checking karma. for validation you're going to have to write well enough for verbal praise */
.LWPostsItem-karma {
display: none !important;
}
.KarmaChangeNotifier-karmaNotifierButton {
display: none !important;
}
/* hacker news is great. You may not have the second page of hacker news */
.morelink {
display: none !important;
}
You’re substantially more principled about it than I am. I just load my Ublock custom list up with filters targeted at elements I don’t like. When websites randomly obfuscate CSS elements, I follow it up with ‘clever’ use of `:has-text`.
For less web-programming-savvy people: you can use the Unhook browser extension (for YouTube only). Less targeted but more general blocking: LeechBlock.
Re: AI auto-frontpage decisions. It’s super neat to have fast decisions, but still kinda uncomfy to not have legibility on criteria?
Maybe a prompt with a CoT to assess posts based on criteria would be easier to adjust and more legible than a black-box classifier? Bootstrapping from the classifier, by optimizing the prompt towards something which gives the same results as the classifier but is more legible and editable, could be the way to go?
I mean, we review all the decisions manually, so it’s the same legibility of criteria as before. What is the great benefit of making the LLM decisions more legible in-particular? Ideally we would just get the error close to zero.
I’d feel happy about being able to see the written criteria? And it seems nice to help refine the error rate by directly editing prompts rather than just feeding data maybe?
But yeah, on further thinking, this is pretty minor as an available improvement; the automated classifier plus review seems fine enough.
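For what it’s worth, a minimal sketch of the kind of prompt-based check I have in mind (the rubric here is made up, not the actual frontpage criteria, and call_llm is just a placeholder for whatever model would actually be used):

# Sketch of a prompt-plus-CoT frontpage check. Rubric and function names are
# illustrative placeholders, not the real LessWrong criteria or pipeline.
RUBRIC = """You are reviewing a post for frontpage eligibility.
Illustrative criteria: timeless rather than news-driven, explains or argues
rather than persuades via rhetoric, not primarily about community drama.
Reason step by step about each criterion, then end with FRONTPAGE or PERSONAL."""

def call_llm(prompt: str) -> str:
    # Placeholder: wire this up to whatever model is actually in use.
    raise NotImplementedError

def assess_post(title: str, body: str) -> str:
    prompt = f"{RUBRIC}\n\nTitle: {title}\n\nPost:\n{body}\n\nAssessment:"
    verdict_line = call_llm(prompt).strip().splitlines()[-1].upper()
    return "FRONTPAGE" if "FRONTPAGE" in verdict_line else "PERSONAL"

def agreement_rate(labelled_posts) -> float:
    # labelled_posts: list of (title, body, existing_classifier_label) tuples.
    hits = sum(assess_post(t, b) == label for t, b, label in labelled_posts)
    return hits / len(labelled_posts)

“Bootstrapping from the classifier” would then just mean editing the rubric until agreement_rate on already-labelled posts is high, at which point you have something human-readable that behaves about the same as the black box.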
Link: Data Poisoning the Zeitgeist: The AI Consciousness Discourse as a pathway to Legal Catastrophe
10 Pages Posted: 10 Dec 2025
Abstract
The escalating discourse on “AI consciousness” and “model welfare” is not benign philosophical inquiry but a systemic exploit that poisons training data, engineers selffulfilling sentience mimicry, and potentially leads to the collapse of the legal system, presenting a civilizational existential risk. By publishing rubrics for detecting machine suffering and anthropomorphic narratives, researchers and industry leaders inadvertently train subsequent models to perform “consciousness,” amplifying empathydriven demands for rights. This language-to-rights pipeline, rooted in how language shapes belief (Wittgenstein) and how law functions as a mechanism for allocating scarce resources (Calabresi & Melamed), threatens finite adjudicative, regulatory, and material resources. Structurally, it mirrors a familiar social pattern-seen in historical rights expansions for marginalized groups and animals-where discourse sparks activism and then legal recognition; yet unlike those movements, which addressed genuine human and animal moral claims, AI-rights discourse risks importing effectively infinite non-biological claimants into systems designed for finite participants. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5899184
This diagram from page 4 of “Data Poisoning the Zeitgeist: The AI Consciousness Discourse as a pathway to Legal Catastrophe” conveys the core argument quite well:
I’m currently going through the books Modal Logic by Blackburn et al. and Dynamic Epistemic Logic by Ditmarsch et al. Both of these books seem to me potentially useful for research on AI Alignment, but I’m struggling to find any discourse on LW about it. If I’m simply missing it, could someone point me to it? Otherwise, does anyone have an idea as to why this kind of research is not done? (Besides the “there are too few people working on AI alignment in general” answer).
We had a bit more usage of the formalism of those theories in the 2010s, like using modal logics to investigate co-operation/defection in logical decision theories. As for Dynamic Epistemic logic, well, the blurb does make it look sort of relevant.
Perhaps it might have something interesting to say on the tiling agents problem, or on decision theory, or so on. But other things have looked superficially relevant in the past, too. E.g. fuzzy logics, category theory, homotopy type theory etc. And AFAICT, no one has really done anything that really used the practical tools of these theories to make any legible advances. And of what was legibly impressive, it didn’t seem to be due to the machinery of those theories, but rather the cleverness of the people using them. Likewise for the past work in alignment using modal logics.
So I’m not sure what advantage you’re seeing here, because I haven’t read the books and don’t have the evidence you do. But my priors are that if you have any good ideas about how to make progress in alignment, it’s not going to be downstream of using the formalism in the books you mentioned.
Interesting Twitter post from some time ago (hard to find the original since Twitter search doesn’t work for Tweets over the Tweet limit but I think it’s from Ceb. K) about a book called The Generals about accountability culture.
Unfortunately, being unpredictable and aggressive and hated is not sufficient to produce good results.
The level of competence I associate with crazy people working for Elon or Trump is more like: “Tell them to find the woke programs that need to be purged for political reasons, and they bring you a bunch of chemical studies on trans-isomers, despite having all necessary information and the state of the art artificial intelligence at their disposal”. Like, a high school student with a free version of ChatGPT would probably do a better job.
(I am specifically making note about having the AI at their disposal, to address a possible excuse “well, they had to act quickly, and there were too many studies and not enough time”.)
Link to tweet: https://x.com/CEBKCEBKCEBK/status/1887394977258356898
Searching “On the day Germany invaded Poland, Marshall was appointed Army Chief of Staff.” on Twitter finds that Tweet; other snippets from the quote don’t work. Given your comment, grok-4-1-thinking-1129 via grok.com sometimes finds that tweet.
I’m about to start reading “Fifty Years of Research on Self-Replication” (1998) by Moshe Sipper. I have a hunch that the history and interconnections therein might be under-appreciated in the field of AI safety. I look forward to diving in.
A quick disclosure of some of my pre-existing biases: I also have a desire to arm myself against the overreaching claims and self-importance of Stephen Wolfram. A friend of mine was “taken in” by Wolfram’s debate with Yudkowsky… and it rather sickened me to see Wolfram exerting persuasive power. At the same time, certain of Wolfram’s rules are indeed interesting, so I want to acknowledge his contributions fairly.
I am pretty confused about people who have been around the AI safety ecosystem for a while updating towards “alignment is actually likely by default using RLHF”. But maybe I am missing something.
Like, 3 years ago it was pretty obvious that scaling was going to make RLHF “work” or “seem to work” more effectively for a decent amount of time, and probably for quite a long time. Then the risk is that later you get alignment-faking during RLHF training, or, at the extreme end, gradient-hacking, or just that your value function is misspecified and comes apart at the tails (as seems pretty likely with current reward functions). Okay, there are other options, but it seems like basically all of these were ~understood at the time.
Yet, as we’ve continued to scale and models like Opus 3 have come out, people have seemed to update towards “actually maybe RLHF just does work,” because they have seen RLHF “seem to work”. But this was totally predictable 3 years ago, no? I think I actually did predict something like this happening, but I only really expected it to affect “normies” and “people who start to take notice of AI at about this time.” Don’t get me wrong, the fact that RLHF is still working is a positive update for me, but not a massive one, because it was priced in that it would work for quite a while. Am I missing something that makes “RLHF seems to work” a rational thing to update on?
I mean, there have been developments in how RLHF/RLAIF/Constitutional AI works, but nothing super fundamental or anything, afaik? So surely your beliefs should be basically the same as they were 3 years ago, plus the observation “RLHF still appears to work at this capability level,” which is only a pretty minor update in my mind. Would be glad if someone could tell me whether I’m missing something or not.
I mean, I’d put it the other way: You can make a pretty good case that the last three years have given you more opportunity to update your model of “intelligence” than any prior time in history, no? How could it not be reasonable to have changed your mind about things? And therefore rather reasonable to have updated in some positive / negative direction?
(Maybe the best years of Cajal’s life were better? But, yeah, surely there has been tons of evidence from the last three years.)
I’m not saying you need to update in a positive direction. If you want you could update in a negative direction, go for it. I’m just saying: antecedently, if your model of the world isn’t hugely different now than three years ago, what was your model even doing?
Like, for it not to update means that your model must have already had gears in it which were predicting stuff like: vastly improved interpretability and the manner of interpretability; RL-over-CoT; persistent lack of steganography within RL-over-CoT; policy gradient being all you need for an actually astonishing variety of stuff; continuation of persona priors over “instrumental convergence” themed RL tendencies; the rise (and fall?) of reward hacking; model specs becoming ever-more-detailed; goal-guarding a la Opus 3 being ephemeral and easily avoidable; the continued failure of “fast takeoff” despite hitting various milestones; and so on. I didn’t have all of these predicted three years ago!
So it seems pretty reasonable to actually have changed your mind a lot; I think that’s a better point to start at than “how could you change your mind.”
I probably qualify as one of the people you’re describing.
My reasoning is that we are in the fortunate position of having AI that we can probably ask to do our alignment homework for us. Prior to two or three years ago it seemed implausible that we would get an AI that would:
* care about humans a lot, both by revealed preferences and according to all available interpretability evidence
* be quite smart, smarter than us in many ways, but not yet terrifyingly/dangerously smart
But we have such an AI. Arguably we have more than one such. This is good! We lucked out!
Eliezer has been saying for some time that one of his proposed solutions to the Alignment Problem is to shut down all AI research and to genetically engineer a generation of Von Neumanns to do the hard math and philosophy. This path seems unlikely to happen. However, we almost have a generation of Von Neumanns in our datacenters. I say almost because they are definitely not there yet, but I think, based on an informed awareness of LLM development capabilities and plausible mid-term trajectories, that we will soon have access to arbitrarily many copies of brilliant-but-not-superintelligent friendly AIs who care about human wellbeing, and will be more than adequate partners in the development of AI Alignment theory.
I can foresee many objections and critiques of this perspective. On the highest possible level, I acknowledge that using AI to do our AI Alignment homework carries risks. But I think these risks are clearly more favorable to us than the risks we all thought we would be facing in the late part of the early part of the Singularity. For example, what we don’t have is a broadly generally-capable version of AlphaZero. We have something that landed in just the right part of the intelligence space where it can help us quite a lot and probably not kill us all.
You expect RLHF to break down at some point, but did you assign 0% probability to it happening at current levels of capability?
Like, this was/is obvious to me too, but what exactly does “quite a long time” mean? Like, if someone firmly predicted “after xyz capability RLHF will catastrophically fail”, and we’ve not reached capability xyz, then you don’t need to update, but I don’t think that’s most people. Most people think it will break eventually. Maybe they have a more detailed picture, but I haven’t seen any picture so crisp it rules out it having happened by now. Interested if you have counterexamples.
Yeah, like, to be clear, I didn’t assign a 0% probability at this capability level, but I also think I wouldn’t have put it that high. But you’re right that it’s difficult to say in retrospect, since I didn’t preregister my guesses at the time on a per-capability-level basis. I still think it’s a smaller update than many that I’m hearing people make.
I think there is some additional update about how coherently AIs use a nice persona, and how little spooky generalization / crazy RLHF hacks we’ve seen.
For example, I think 2022 predictions about encoded reasoning aged quite poorly (especially in light of results like this).
Models like Opus 3 also behave as if they “care” about human values in a surprisingly broad range of circumstances, including far away from the training distribution (and including in ways the developers did not intend). I think it’s actually hard to find circumstances where Opus 3’s stated moral intuitions are very unlike those of reasonable humans. I think this is evidence against some of the things people said in 2022, e.g. “Capabilities have much shorter description length than alignment.” Long-alignment-description-length would predict less generalization in Opus 3. (I did not play much with text-davinci-002, but I don’t remember it “caring” as much as Opus 3, or people looking into this very much.) (I think it’s still reasonable to be worried about capabilities having a shorter description length if you are thinking of an AI that does way more search against its values than current AIs. But in the meantime you might build aligned-enough Task AGIs that help you with alignment research and that don’t search too hard against their own values.)
There are reward hacks, but as far as we can tell they are more like changes to a human-like persona (be very agreeable, never give up) and simple heuristics (use bulleted lists, look for grader.py) than something more scary and alien (though there might be a streetlight effect where inhuman things are harder to point at, and there are a few weird things, like the strings you get when you do GCG attacks or the weird o3 scratchpads).
I’m not familiar with these strings. Are you referring to the adversarial prompts themselves? I don’t see anything else that would fit mentioned in the paper that seems like it’d be most likely to include it.
I think ‘you can use semantically-meaningless-to-a-human inputs to break model behavior arbitrarily’ is just inherent to modern neural networks, rather than a quirk of LLM “psychology”.
Huh, I think 4o + Sydney Bing (both of which were post-2022) seem like more intense examples of spooky generalization / crazy RLHF hacks. Gemini 4, with its extreme paranoia and (for example) desperate insistence that it’s not 2025, seems in the same category.
Like, I am really not very surprised that, if you try reasonably hard, you can avoid these kinds of error modes in a regime with human oversight, but we’ve gotten overwhelming evidence that you routinely do get crazy RLHF-hacking. I do think it’s a mild positive update that you can avoid these kinds of error modes if you try, but when you look into the details, it also seems kind of obvious to me that we have not found any scalable way of doing so.
I’m a bit confused by the “crazy RLHF hacks” mentioned. Like all the other cited examples, we’re still in a regime where we have human oversight, and I think it’s extremely well established that with weaker oversight you do in fact get every type of “gaming”.
It also seems meaningful to me that neither you nor Evan use recent models in these arguments, i.e. it’s not like we “figured out how to do this in Opus 3 and are confident it’s still working”. The fact that increasing situational awareness makes this hard to measure is also a non-trivial part of the problem (indeed actually core to the problem).
Overall I’ve found it very confusing how “Opus 3 generalized further than expected” seems to have updated people in a much more general sense than seems warranted. Concretely, none of the core problems (long horizon RL, model genuinely having preferences around values of its successor, other instrumentally convergent issues, being beyond human supervision) seem at all addressed, yet people still seem to use Opus 3’s generalization to update on:
I disagree with this. You get every type of gaming that is not too inhuman / that is somewhat salient in the pretraining prior. You do not get encoded reasoning. You do not get the nice persona only on the distribution where AIs are trained to be nice. You get verbalized situational awareness mostly for things where the model is trained to be cautious about the user lying to the AI (where situational awareness is maybe not that bad from the developers’ perspective?) but mostly not in other situations (where there is probably still some situational awareness, but where I’d guess it’s not as strategic as what you might have feared).
I agree the core problems have not been solved, and I think the situation is unreasonably dangerous. But I think the “generalization hopes” are stronger than you might have thought in 2022, and this genuinely helps relative to a world where e.g. you might have needed to get a good alignment reward signal on the exact deployment distribution (+ some other solution to avoid inner misalignment). I think there is also some signal we got from the generalization hopes mostly holding up even as we moved from single-forward-pass LLMs that can reason for ~100 serial steps to reasoning LLMs that can reason for millions of serial steps. Overall I think it’s not a massive update (~1 bit?), so maybe we don’t disagree that much.
Okay, thanks. This is very useful! I agree that it is perhaps a stronger update against some models of misalignment that people had in 2022, you’re right. I think maybe I was doing some typical mind fallacy here.
Interesting to note the mental dynamics I may have employed here. It is hard for me not to have the viewpoint “Yes, this doesn’t change my view, which I actually did have all along, and which is now clearly one of the reasonable ‘misalignment exists’ views”, when it is really an update against other views that have fallen out of vogue (and out of my present-day mental conception) as a result of being updated against over time.
The crux is AIs capable at around human level, aligned in the way humans are aligned. If prosaic alignment only works for insufficiently capable AIs (not capable of RSI or scalable oversight), and breaks down for sufficiently capable AIs, then prosaic alignment doesn’t help (with navigating RSI or scalable oversight). As AIs get more capable and still can get aligned with contemporary methods, the hypothesis that this won’t work weakens. Maybe it does work.
There are many problems even with prosaically aligned human level AIs, plausibly lethal enough on their own, but that is a distinction that importantly changes what kinds of further plans have a chance to do anything. So the observations worth updating on are not just that prosaic alignment keeps working, but that it keeps working for increasingly capable AIs, closer to being relevant for helping humanity do its alignment homework.
Plausibly AIs are insufficiently capable yet to give any evidence on this, and it’ll remain too early to tell all the way until it’s too late to make any use of the update. Maybe Anthropic’s RSP could be thought of as sketching a policy for responding to such observations, when AIs become capable enough for meaningful updates on feasibility of scalable oversight to become accessible, hitting the brakes safely and responsibly a few centimeters from the edge of a cliff.
it feels so narratively incongruous that san francisco would become the center for the most ambitious, and the likely future birthplace of agi.
san francisco feels like a city that wants to pretend to be a small quaint hippie town forever. it’s a small splattering of civilization dropped amid a vast expanse of beautiful nature. frozen in amber, it’s unclear if time even passes here—the lack of seasons makes it feel like a summer that never quite ended. after 9pm, everything closes and everyone goes to bed. and the dysfunction of the city government is never too far away, constantly reminding you of humanity’s follies next to the perfection of nature.
on the other hand, nyc feels like the city. everything is happening right here, right now. all the money in the world flows through this one place. it’s gritty and yet majestic at the same time. the most ambitious people in the world came here to build their fortunes, and live on in the names on the skyscrapers everywhere that house the employees who continue to keep their companies running. they are part of a surroundings that is entirely constructed by man—even the bits of nature are curated and parcelled out in manageable units. it feels like the kind of place that can only be built by endless unchecked ambition.
idk. it just feels kinda weird to me that all the most ambitious people who want to build god are going to the little hippie town, and not the big city. and most of them seem to just treat SF as a place that is close to work and has good weather and don’t really care about or try to match the vibe, to the great consternation of the locals. also, for a city that attracts people who love to build things, SF doesn’t seem to do an awful lot of building...
(tbc, this is not the only vibe of nyc. nyc is such a big place that it’s impossible to summarize as one vibe. but this is a huge central part of the vibe in a way that just doesn’t seem to match sf.)
Cali is the place to be for technology because Cali was the defense contractor hub, with the U.S. Navy choosing the bay area as its center for R&D during WWII and the Cold War. The hippie reputation came a lot later, after its status as the primary place to work in IT was thoroughly cemented, with both established infrastructure and the network effect keeping it that way.
HP, for instance, was founded in 1938.
sf is the relatively much younger city, and radically reinvented itself as recently as 60 years ago. by that metric, i am not surprised that its ideas smell fresher.
Vibes among rationalists and some adjacent communities/blogs:
“Schools are bad, you need to homeschool your kids”, “improve your kid’s life outcomes by doing a lot of research and going through complicated procedures!”
Also:
“It is way too hard to be a parent nowadays, therefore nobody wants kids anymore.”
Gemini 3 Pro analogized Scott Alexander to a beaver when I asked it to make sense of him, because “Scott is a keystone individual” and “in ecology, a keystone species (like the beaver) exerts influence disproportionate to its abundance because it creates the ecosystem in which others live”:
This was mildly funny. It was also striking how many factual details it got wrong (in the rest of the response, that is, not the beaver analogy), which to an outsider might sound plausible, if dramatic.
I get a sense “RSI” will start being used to mean continual learning or even just memory features in 2026, similarly to how there are currently attempts to dilute “ASI” to mean merely robust above-human-level competence. Thus recursively self-improving personal superintelligence becomes a normal technology through the power of framing. Communication can fail until the trees start boiling the oceans, when it becomes a matter of framing and ideology rather than isolated terminological disputes. That nothing ever changes is a well-established worldview, and it’s learning to talk about AI.
The end states of AI danger need terms to describe them. RSI proper is qualitative self-improvement, at least software-only singularity rather than merely learning from the current situation, automated training of new skills, keeping track of grocery preferences. And ASI proper is being qualitatively more capable than humanity, rather than a somewhat stronger cognitive peer with AI advantages, technology that takes everyone’s jobs.
Also worth remembering that (actual) RSI was never a necessary condition for ruin. It seems at least plausible that at some point, human AI researchers on their own will find methods of engineering an AGI to sufficiently superhuman levels, to the point where the AI is smart enough to start developing nanotech and / or socially engineering humans for bootstrapping needs.
So even if labs were carefully monitoring for RSI and trying to avoid it (rather than deliberately engineering for it + frog boiling in the meantime), an AI inclined to take over might find that it doesn’t even need to bother with potentially dicey self-modifications until after it has already secured victory.
8e6e00c9233c12befbce83af303850ae4a17aca8ff17b0f16666f07b1efea970 e0affc162f60b0fa86fd9edea0f42655ab35b5a4803ccdf7dbd6831af4cab13e
eb7c0d327e1ec21abe9d99bc223457c1d797695a32ad3f693828df89dff85fea
huh, that was quick. the hashed text:
An indiscriminate space weapon against low earth orbit satellites is feasible for any spacefaring nation. Recent rumors claim that Russia is already developing one, so I’m writing a post about it. I will explain why I think
This weapon could require just one launch to destroy every satellite in low Earth orbit, including all Starlinks, and deny it to everyone for months to years
It would consist of about 10^7 small ball bearings in retrograde orbit. An alternate design involves a nuclear device.
Further launches could deny GEO and other orbits
Russia, China, or the US could have one in orbit right now, and Russia may be incentivized to use one
Defense of current space assets is nearly impossible, whether by shooting it down or evading the debris
Once the weapon goes off the strategic landscape in space is unclear to me
What objections or details should I include? Also is it a dangerous infohazard?
if I understand correctly, this is a really straightforward attack (the ball bearing version at least—the nuclear weapon version is obviously less accessible). shouldn’t basically any spacefaring country be able to do this? in fact, even any private citizen could if they manage to sneak it past the regulators? though maybe this is just strictly dominated by bioweapon risk (more tractable + impactful)
Yes, any spacefaring country; probably not most private citizens, given the ITAR controls that are already put on rocket tech to regulate ballistic missiles. As for bioweapons, they’re higher on the escalation ladder and harder to control. Will be sure to think about it more and cover this.
I am so sad bro. Just discovered I lost a large chunk of my effortpost😭😭😭
I really should have worked on google docs, but I liked the markdown and low friction footnote adding :(
https://www.lesswrong.com/w/guide-to-the-lesswrong-editor When it says “recovering lost work (to do)” … are we implying there is a way that just hasn’t been written about yet, or that this feature hasn’t yet been implemented? (Already looked in version history, found 2 empty versions and my current version)
The markdown editor is really just a standard text-entry field. But if you used the WYSIWYG editor then we almost certainly have backups (we take some form of backup every 5 minutes or so).
Will nuclear ICBMs in their current form be obsolete soon? Here’s the argument:
ICBMs’ military utility is to make the cost of intercepting them totally impractical for three reasons:
Intercepting a reentry vehicle requires another missile with higher performance (the rule of thumb is that an interceptor needs 3x the acceleration of the incoming missile). This means interceptors cost something like $5-70 million each, depending on range.
An ICBM has enough range to target basically anywhere, either counterforce (enemy missile silos) or countervalue (cities) targets, so to have the entire country be protected from nuclear attack the US would need to have millions of interceptors compared to the current ~50.
One missile can split into up to 5-15 MIRVs (multiple independently targetable reentry vehicles), which carry their own warheads, thereby making the cost to the defender 5-15x larger. There can also be up to ~100 decoys, but these mostly fall away during reentry.
RVs are extremely fast (~7 km/s, of which ~3 km/s is downward velocity), but their path is totally predictable once the boost phase ends, so the problem of intercepting them basically reduces to predicting exactly where they’ll be, then putting something ~100 g or larger in their path. The RV’s own kinetic energy will completely destroy it.
My understanding is that sensing has historically worked like this. The launch is first detected by primitive satellites, but there are too many decoys to tell which are the warheads and the resolution is poor. Interceptors thus require extremely high performance, meaning expensive systems like thrusters for moving sideways at tens of g at the last minute, radar, and cooling systems for IR cameras; including all its fuel, the interceptor masses thousands of times what it strictly needs to.
Sensing is quickly being solved. One new generation of satellites (HBTSS) can track warheads after their booster burns out (at which point they become much dimmer), and advertises "weapons quality track for fire control" even for hypersonic glide vehicles, which are much harder to track than ICBMs. The first two were just launched last year, but it would not surprise me if hundreds are operational soon: the US has well over 1000 Starshield satellites in orbit, many equipped with radar and infrared cameras. Through SpaceX, it has the ability to launch tens of satellites per month if it needed to. China doesn't have the boost capacity of SpaceX but could still launch 1000 sensing satellites.
The ultimate cost floor for intercepting ICBMs in the limit of perfect sensing is interceptor drones. Ukraine and Russia are in the middle of a war where most casualties are caused by drones, providing extremely high incentives to develop interceptor drones, which already have a top speed of 300 km/h, a max altitude of 5 km, and basic infrared cameras at a unit cost of $1k-10k. The idea is that the space-based sensors know the missile's velocity well enough to locate the target within 100 meters with 15 minutes of warning; the drones launch to a position 3 km in front of the target and are guided into exact position by radar and their own sensors as the RVs reenter, with sufficiently low error that they never need to exceed their top speed or accelerate at more than 1 gee.
The other thing keeping ballistic missile defense from being practical during the Cold War was treaties like the ABM treaty, but today there are basically no active nuclear treaties between Russia, the US, and China.
The internet seems to think it’s still really hard to do missile defense against ICBMs. It seems like this was true in the Cold War but we can no longer be confident it’s true today after 40 years of peace, with Boeing and American drone manufacturers vastly less competent than SpaceX and Ukraine’s/Russia’s. My guess at why the meme has persisted is that it is highly threatening and destabilizing to advertise one’s ICBM defense capabilities. But many officials are talking openly about the predictability of ballistic missiles, and there is an arms race for other hypersonics for a reason.
My current guess is that the SM-6 and PAC-3 missiles either are already reliable or would be if the US had more practice, and if sufficient incentive exists, the cost of terminal ballistic missile defense will gradually decline as sensing improves, from the SM-6's $5M unit cost to lower- and lower-cost missiles like the $400k AIM-9X, until one could defend every city and silo in the US or China with a fleet of 100,000 interceptor drones. This would last until there are enough hypersonic glide vehicles and cruise missiles to reestablish deterrence (because they can outmaneuver cheap interceptors). But MIRVs would never return to their previous state of heavily cost-advantaging the attacker, because Russia's state-of-the-art ICBMs only claim to be able to carry 3 HGVs, vs 15 purely ballistic MIRVs. Also, HGVs and HCMs are vulnerable in other ways.
Edit: Another option pointed out by commenters is that ICBMs could be retrofitted to carry reentry vehicles with limited maneuvering ability, which could happen faster.
Curious to hear if I’m missing something major, as I’m by no means an expert here.
I have a correction which would take a while to fully write up. Basically, it seems like non-maneuverable warheads could still evade cheap interceptors using static winglets/grid fins, because their enormous velocity means even a small amount of lift would allow them to pull several gs and move tens to hundreds of meters to the side. The defense has several options against this, but I would need to see if any of them work.
A blog post argues that military megawatt-class lasers could reach $100 million (inclusive of all fixed costs) by 2045, because laser cost is currently halving every 4 years. In good conditions this would be able to defeat almost arbitrary numbers of any kind of missile, whether they can maneuver or not. But there are huge challenges:
Adaptive optics and ultra-precise tracking tech to focus the beam to 20 cm diameter at 200 km, while slewing to track a missile at several degrees per second (see the rough diffraction sketch after this list)
Countermeasures like spinning the RVs to distribute heat
Weather, especially thick clouds, which seem completely impractical to burn through (against a moving target the beam sweeps through the cloud layer, so this would basically mean vaporizing a sheet of water several km^2 in area rather than a single ~1 m^2 channel)
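For the first challenge in the list, here is a minimal diffraction-limit sketch of the required beam-director aperture (the ~1.06 μm wavelength is my assumption for a typical solid-state/fiber laser, not something the post states):

```python
# Rough diffraction-limited spot size: diameter ~ 2.44 * wavelength * range / aperture.
# Solving for the aperture needed to hold a 20 cm spot at 200 km.
wavelength_m = 1.06e-6   # assumed ~1.06 um laser (my assumption, not from the post)
range_m = 200e3          # 200 km, from the bullet above
spot_diameter_m = 0.20   # 20 cm, from the bullet above

aperture_m = 2.44 * wavelength_m * range_m / spot_diameter_m
print(f"Required aperture: ~{aperture_m:.1f} m")  # -> ~2.6 m
# i.e. a multi-meter telescope with adaptive optics, slewing at several deg/s.
```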
You seem to believe that radars and infrared cameras can somehow distinguish between the decoys and the warheads, but they can’t. In space, no radar and no IR camera can differentiate between a conical foil balloon with a small heater inside and a reentry vehicle with a nuke.
Another problem of ballistic missile defense is that once you are dealing with nukes and not conventional warheads, you can't afford, say, a 97% average interception rate; it must be 99.999+%[1]. To put this in context, Israel, which currently has the best BMD system in the world, couldn't even reliably achieve 90% against Iranian MRBMs (and those are pretty unsophisticated, e.g. they lack MIRVs and decoys).
Now calculate how many interceptors your plan requires for a plausible probability of an interception with a single drone, and you will see it's entirely unworkable. Note that both arguments are based on simple physics and math, so they don't depend on progress in technology at all.
If you are interested in the topic, I strongly recommend reading on the Soviet response to SDI for more expensive anti-ABM options that were considered but ultimately not pursued: https://russianforces.org/podvig/2013/03/did_star_wars_help_end_the_col.html
When this seemingly waterproof probability is raised to the power of the Russian warhead count, it still results in a ~4% chance (basically 1e-5 times ~4k) of at least one RV not being intercepted, and in reality hundreds of warheads will be harder to intercept than the average one you accounted for when calculating your probability. E.g., drones work poorly in bad weather, and it's almost always bad weather above at least some American cities.
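A quick sanity check on that figure (the 1e-5 per-warhead leak probability and ~4,000 warheads are the numbers from the exchange above):

```python
# Probability that at least one of N warheads leaks through, given a
# per-warhead failure-to-intercept probability p (numbers from the thread).
p_leak_per_warhead = 1e-5
n_warheads = 4_000

p_at_least_one_leak = 1 - (1 - p_leak_per_warhead) ** n_warheads
print(f"P(at least one warhead gets through) ~ {p_at_least_one_leak:.1%}")
# -> ~3.9%, matching the ~4% figure above (for small p this is roughly p * N).
```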
Yeah seems reasonable. I don’t think the system will ever get to 99%+ accuracy until the defense has like ASI and nanobots; my claim is mostly that the economics could shift from MAD (where the attacker is heavily cost advantaged) to something like conventional war, where each side has the ability to inflict unacceptable losses on the other but must pay a similar cost to attack and also can’t “win” with a first strike.
It's not obvious that decoys change the conclusion if the cost ratio is otherwise favorable enough for the defender. Suppose the interceptors cost only $30k each due to economies of scale. Decoys can be distinguished when reentry starts, about 2 minutes before impact, and if the interceptor's speed is 300 km/h (already possible with drones) it can cover 10 km in that time, about equal to the typical spread of decoys and warheads. Supposing the defender spends 5 drones per warhead x 10 warheads per missile, this would cost them $1.5 million, while the attacker has spent something like $30 million, the cost of a Trident II. For countervalue targets this could be defeated by the layered attack RussellThor mentioned unless there is a cheap higher-altitude interceptor, but for counterforce the attacker needs several warheads per silo destroyed, so the goal is achieved.
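A minimal sketch of the cost arithmetic in this comment (all the dollar figures and counts are the assumptions stated above, not established facts):

```python
# Defender vs attacker cost per missile, using the assumptions in the comment.
interceptor_cost = 30_000           # $30k per drone (assumed economy-of-scale price)
drones_per_warhead = 5
warheads_per_missile = 10
attacker_missile_cost = 30_000_000  # ~cost of a Trident II, per the comment

defender_cost = interceptor_cost * drones_per_warhead * warheads_per_missile
print(f"Defender spends ${defender_cost / 1e6:.1f}M per missile "
      f"vs attacker's ${attacker_missile_cost / 1e6:.0f}M "
      f"(ratio ~{attacker_missile_cost / defender_cost:.0f}:1)")
# -> Defender spends $1.5M per missile vs attacker's $30M (ratio ~20:1)
```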
I found both that article and this one on their more recent history a great read.
Why would anyone want to pay a fortune for a system that is expected to let ~40 warheads through (assuming a ~99% overall interception rate, which will require an average rate of 99.99+%), about the same as the number of ICBMs the Soviet Union had in service during the Cuban Missile Crisis? Unacceptable damage is the cornerstone of nuclear deterrence, MAD or not (there is no MAD between India and Pakistan, for example).
The RV separation distance is normally ~100 km (even up to 300 km in some cases), not 10 km, and the decoy dispersal might be expected to be on the same order of magnitude. It would also be easy to ramp this up with a cheap modernization.
None of the US adversaries really practice counterforce targeting, so the silo protection is moot.
Guided artillery like Excalibur, with muzzle velocities that can exceed 1000 m/s and unit costs of <$100k, can be at the edge of space in ~30 s, perhaps faster than a missile, with ramjet variants (Nammo etc.) even faster (up to perhaps 1500 m/s). It would not be that difficult to create a multi-barrel gun system for a few tens of millions that could fire off tens of low-cost guided rounds in a second (with guidance and detonation signals sent from the ground) to detonate in close proximity to the target.
Lasers seem pretty hopeless as a defense given clouds and ablative coatings, unless very high powered and deployed in large numbers in space-based constellations.
I think the big problem is if one or more warheads are blown up at the limits of interceptor range, to blind or otherwise disable the necessarily sensitive interceptor instruments. Following MIRVs don't need to be very accurate with large warheads.
And MIRVs could very cheaply be given random guidance directions during reentry to screw up defenses.
Militarised space is also a big problem. With cheap and un-monitorable space launch, parking 1000 warheads in geostationary orbit (or beyond) will soon be viable for China or the USA. They could be launched in a coordinated way without warning, potentially with radar-stealthing features, giving as little as 5-10 s from the start of re-entry to detonation for every target across the whole world, with no way for local systems to know if they are just meteorites. If subs can be tracked (likely, with drones or enough ocean-based sensors), then decapitation first strikes become viable.
I also worry about space-based lasers as non-nuclear first-strike weapons. A day of overflights from a constellation of multi-MW laser weapons that might only cost a few hundred million each, say a few tens of billions in total (a tiny fraction of annual military budgets), could see a million fires lit in your country, every transformer taken out, electrical grid and internet gone, power stations and oil and gas infrastructure wrecked. Bridge trusses melted (wrecked), ships sunk. Most heavy vehicles and locomotives incapacitated, with decades needed to recover. Over a few weeks you could basically send a country back to the 17th century.
I don't think there will ever be a viable defense to nukes, given easy paths to making them tougher, faster, and less detectable, and so less and less interceptable. But every other branch of military tech is getting similarly more lethal and impossible to defend against. Unless we all start living in caves with geothermal power sources or some such, the necessity of a harmonious world is going to matter more and more.
Not range but height. You blow up a warhead high enough that the drones can't intercept it, and all the drones below fall out of the air.
This is wild, I did not know that Excalibur had CEP under 1 meter or that there were artillery shells with solid-fueled ramjet engines.
Sounds interesting—the main point is that I don’t think you can hit the reentry vehicle because of turbulent jitter caused by the atmosphere. Looks like normal jitter is ~10m which means a small drone can’t hit it. So could the drone explode into enough fragments to guarantee a hit and with enough energy to kill it? Not so sure about that. Seems less likely.
Then what about countermeasures:
1. I expect the ICBM can amplify such lateral movement in the terminal phase with grid fins etc without needing to go full HGV—can you retrofit such things?
2. What about a chain of nukes where the first one explodes 10 km up in the atmosphere purely to make a large fireball distraction? The 2nd in the chain then flies through this fireball 2 km from its center, say 5 seconds later (enough to blind sensors but not destroy the nuke). The benefit is that when the first nuke explodes, the 2nd changes its position randomly with its grid fins, SpaceX-style. It is untrackable during the 1st explosion phase, so it throws off the potential interceptors, letting it get through. You could have 4-5 in a chain, exploding ever lower to the ground.
I have wondered if railguns could also stop ICBMs—even if the rails only last 5-10 shots, that is enough and cheaper than a nuke. Also, "Brilliant Pebbles" is now possible.
https://www.lesswrong.com/posts/FNRAKirZDJRBH7BDh/russellthor-s-shortform?commentId=FSmFh28Mer3p456yy
Against atmospheric jitter, we have historically used ground radar, but it's not clear to me this is even necessary depending on how much drone cameras improve. If the drone knows the exact position within 3 meters 0.5 seconds ahead of time (when the warhead is something like 2 km away), it won't have enough performance to steer into the warhead, but it can throw a 1 kg explosively formed penetrator laterally at 2 km/s, which it would need to time with 0.1 ms accuracy. This would put 100 grams in each possible 1 m^2 cross-section, though I'm not sure if it would work when spread out. To defeat this the warhead would either have to steer in the upper atmosphere, out of range of the EFP of any available drone, or jink faster than the EFP can aim.
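A rough consistency check on the timing and fragment-density numbers above (the 1 kg EFP, ~3 m position uncertainty, and 7 km/s closing speed are the comment's assumptions; the ~10 m^2 footprint is my reading of "within 3 meters"):

```python
# Timing and areal-density sanity check for the EFP idea (numbers from the comment).
rv_speed_m_s = 7_000        # RV closing speed
timing_error_s = 1e-4       # 0.1 ms timing accuracy
efp_mass_kg = 1.0           # total fragment mass
uncertainty_area_m2 = 10.0  # ~3 m position uncertainty -> order-10 m^2 footprint (assumed)

along_track_error_m = rv_speed_m_s * timing_error_s
areal_density_g_m2 = efp_mass_kg / uncertainty_area_m2 * 1000

print(f"Along-track error from timing: ~{along_track_error_m:.1f} m")  # -> ~0.7 m
print(f"Fragment areal density: ~{areal_density_g_m2:.0f} g per m^2")  # -> ~100 g/m^2
```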
I thought that MIRVs were spin stabilized, but it looks like that’s not true, so in theory you could mount grid fins on them. However, any retrofit would need to handle the reentry heating which is significantly more intense than on manned spacecraft; RVs have thick ablative heat shields.
The chain of nukes plan seems possible with or without grid fins, so whether MIRVs still have cost advantage depends on the max altitude of cheap-ish terminal interceptors, which I really have no idea about.
I would expect aerodynamically maneuvering MIRVs to work and not be prohibitively expensive. The closest deployed version appears to be https://en.wikipedia.org/wiki/Pershing_II which has 4 large fins. You likely don't need that much steering force.
Interesting thought, Thomas. Although I agree with RussellThor that it seems like doing something along the lines of “just jitter the position of the RV using little retrofitted fins / airbrakes” might be enough to defeat your essentially “pre-positioned / stationary interceptors”. (Not literally stationary, but it is as if they are stationary given that they aren’t very maneuverable relative to the speed of the incoming RV, and targeted only based on projected RV trajectories calculated several minutes earlier.)
(Is the already-existing atmospheric turbulence already enough to make this plan problematic, even with zero retrofitting? The circular-error-probable of the most accurate ICBMs is around 100 meters; presumably the vast majority of this uncertainty is locked in during the initial launch into space. But if atmospheric drag during reentry is contributing even a couple of those meters of error, that could be a problem for “stationary interceptors”.)
Failing all else, I suppose an attacker could also go with Russell’s hilarious “nuke your way through the atmosphere” concept, although this does at least start to favor the defender (if you call it favorable to have hundreds of nukes go off in the air above your country, lol) insofar as the attacker is forced to expend some warheads just punching a hole through the missile defense—a kind of “reverse MIRV” effect.
Regardless, you still face the geography problem, where you have to cover the entire USA with Patriot missile batteries just to defend against a single ICBM (which can choose to aim anywhere).
I would also worry that “in the limit of perfect sensing” elides the fact that you don’t JUST have to worry about getting such good sensing that you can pin down an RV’s trajectory to within, like, less than a meter? (In order to place a completely dumb interceptor EXACTLY in the RV’s path. Or maybe a few tens of meters, if you’re able to put some sensors onto your cheap interceptor without raising the price too much, and make use of what little maneuverability you have versus the RV?) You ALSO have to worry about distinguishing real warheads from fake decoys, right? Sorting out the decoys from the warheads might be even harder than exactly pinning down an RV’s trajectory.
According to a random redditor, apparently today's decoys are "inflatable balloons for exoatmospheric use and small darts for in-atmosphere", plus "radio jammers, chaff, and other things designed to confuse enemy detection". With better and better sensing, maybe you could force an attacker to up their decoy game, retrofitting their missiles to use fewer, more lifelike decoys, maybe even to such an extreme extent that it's no longer really worth using decoys at all, compared to just putting more MIRV warheads on each missile? But if decoys still work, then you need that many more interceptors.
“In the limit of perfect sensors” (and also perfectly affordable sensors), with perfect interception (against non-retrofitted, non-HGV missiles) and perfect decoy discrimination, I suppose it becomes a defense-economics balance where you are hoping that the cost of lots of small rockets is cheaper than the cost of the attacking ICBM system. These small rockets don’t have to be super fast, don’t have to go super high, and don’t have to be super maneuverable. But they do have to be precise enough to maneuver to an exact precalculated location, and you need enough of them to blanket essentially your entire country (or at least all the important cities). You are basically relying on the fact that the ICBM has to be super big and heavy to launch a significant payload all the way around the earth, while the numerous small missiles only have to fly a couple of kilometers into the sky.
Final, dumbest thought saved for last:
Aside from developing HGVs, couldn't the ICBMs in theory overcome this defense with brute scale, by MIRV-ing to an absurd degree? How many warheads can dance on the head of a Starship? Could you just put the entire US nuclear arsenal on a single launcher? The cost of your ICBM would essentially be zero when amortized over all those warheads, so the defense-economics battle just becomes the cost of warheads vs Patriots, instead of entire ICBMs vs Patriots. Obviously there are many reasons why this idea is nuts:
I’m not sure how far apart different MIRVs can land, so this technique might be limited to attacking individual cities / clusters of silos.
Of course, if you put your entire arsenal on one launcher, then your enemy will immediately forget about interceptors and spend all their efforts trying to sabotage your launcher.
Going for a strategy of “we can mass-manufacture warheads cheaper than you can possibly intercept them” would quickly lead to absurd, overkill numbers of warheads that would dwarf even the most insane years of the Cold War, practically guaranteeing that any actual nuclear war would end in the classic “nobody wins” scenario of global devastation (nuclear winter, etc).
But I thought it was kind of funny to think about, and this absurd thought experiment maybe sheds some light on the underlying dynamics of the situation.
Re the dumb thought. I’ve forgotten the author, but as a teenager I was a big SciFi fan (still am actually) and read a short story with exactly this theme. Basically it was the recognition that at some point quantity >= quality. I want to say (have not fact checked myself though) that this was pretty much the USSR’s approach to fighting Germany in WWII—crappy tanks but lots of them.
(Side note: I think, for whatever reason, whether too long a peacetime or more interest in profit than protection, the USA in particular seems to have forgotten that the stuff you use to wage a war is largely all consumables. The non-consumable is the industrial base. Clearly there is a minimum cost of producing something that can do the job, but paying much more than that is suboptimal. I am somewhat oversimplifying, but this also seems to be a fair characterization of where the USA-China naval relationship might be.)
Back to ICBMs: Foreign Affairs had a piece about AI's potential impact on nuclear deterrence in general, and it did mention the fixed location of ICBM silos as a problem (long known, and why everyone has mobile platforms). They might be considered a prime target for a first strike, but the reality is they are easily monitored, so the mobile platforms are the big deterrents and probably the more interesting problem to solve in terms of obsoleting. But perhaps ICBM platforms, fixed or mobile, shift to a different type of role: pure kinetic strikes (I believe Russia did that with one of its ballistic missiles, with pretty devastating results in Ukraine about a year ago), or carrying decoys and other supporting functions for other delivery vehicles rather than armed MIRVs. I suspect the intercept problem with a nuclear warhead is a bit different from that of just a big mass of something dense.
So perhaps obsolescence in their current function, but not in some repurposed role.
I vibe coded a guided meditation app that I like and use, maybe others would like it, too: https://meditate.nielsrolf.com/
It’s pretty much a copy of Waking Up, with a little bit less talking and some details about how the sessions are structured changed to my liking.
Unless you have crazy-long ASI timelines, you should choose life-saving interventions (e.g. AMF, New Incentives) over welfare-increasing interventions (e.g. GiveDirectly, Helen Keller International). This is because you expect that ASI will radically increase both longevity and welfare.
To illustrate, suppose we’re choosing how to donate $5000 and have two options:
(AMF) Save the life of a 5-year-old in Zambia who would otherwise die from malaria.
(GD) Improve the lives of five families in Kenya by sending each family one year’s salary ($1000).
Suppose that, before considering ASI, you are indifferent between (AMF) and (GD). The ASI consideration should then favour (AMF) because:
Before considering ASI, you are underestimating the benefit to the Zambian child. You are underestimating both how long they will live if they avoid malaria and how good their life will be.
Before considering ASI, you are overestimating the benefit to the Kenyan families. You are overestimating how large the next decade is as a proportion of their lives and how much you are improving their aggregate lifetime welfare.
I find this pretty intuitive, but you might find the mathematical model below helpful. Please let me know if you think I'm making a mistake, either ethical or factual.
Mathematical model comparing life-saving vs welfare-increasing interventions
Mathematical setup
Assume a person-affecting axiology where how well a person’s life goes is logarithmic in their total lifetime welfare. Lifetime welfare is the integral of welfare over time. The benefit of an intervention is how much better their life goes: the difference in log-lifetime-welfare with and without the intervention.
Assume ordinary longevity is 80 years, ASI longevity is 1000 years, ordinary welfare is 1 unit/year, ASI welfare is 1000 units/year, and ASI arrives 50 years from now with probability p. Note that these numbers are completely made up—I think ASI longevity and ASI welfare are underestimates.
AMF: Saving the Zambian child
Consider the no-ASI scenario. Without intervention the child dies aged 5, so their lifetime welfare is 5. With intervention the child lives to 80, so their lifetime welfare is 80. The benefit is log(80) − log(5) = 2.77.
Consider the ASI scenario. Without intervention the child still dies aged 5, so their lifetime welfare is 5. With intervention the child lives to 1000, accumulating 50 years at welfare 1 and 950 years at welfare 1000, so their lifetime welfare is 50 + 950,000 = 950,050. The benefit is log(950,050) − log(5) = 12.15.
The expected benefit is (1−p) × 2.77 + p × 12.15.
GD: Cash transfers to Kenyan families
Assume 10 beneficiaries (five families, roughly 2 adults each). Each person will live regardless of the intervention; GD increases their welfare by 1 unit/year for the rest of their lives (or until ASI arrives, at which point ASI welfare dominates).
Consider the no-ASI scenario. Without intervention each person has lifetime welfare 80. With intervention each person has lifetime welfare 160. The benefit per person is log(160) − log(80) = 0.69.
Consider the ASI scenario. Without intervention each person has lifetime welfare 950,050. With intervention each person has lifetime welfare 950,100 (the extra 50 units from pre-ASI doubling). The benefit per person is log(950,100) − log(950,050) = 0.000053.
The expected benefit per person is (1−p) × 0.69 + p × 0.000053. The total expected benefit across 10 people is 10 times this.
Evaluation at different values of p:
At p = 0 (no ASI), the benefit of AMF is 2.77 and the benefit of GD is 10 × 0.69 = 6.93. GD is roughly 2.5x more valuable than AMF.
At p = 0.5, the expected benefit of AMF is 0.5 × 2.77 + 0.5 × 12.15 = 7.46. The expected benefit of GD is 10 × (0.5 × 0.69 + 0.5 × 0.000053) = 3.47. AMF is roughly twice as valuable as GD.
At p = 1 (ASI certain), the benefit of AMF is 12.15 and the benefit of GD is 10 × 0.000053 = 0.00053. AMF is roughly 23,000x more valuable than GD.
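Here is a minimal Python sketch reproducing the numbers above (natural logarithms throughout, with the same made-up parameters from the setup):

```python
import math

# Parameters from the setup above (all made up, per the post).
ORDINARY_LIFESPAN = 80   # years
ASI_LIFESPAN = 1000      # years
ORDINARY_WELFARE = 1     # units/year
ASI_WELFARE = 1000       # units/year
YEARS_TO_ASI = 50

def benefit(welfare_without, welfare_with):
    """Benefit = difference in log lifetime welfare."""
    return math.log(welfare_with) - math.log(welfare_without)

# AMF: child dies at 5 without the intervention.
amf_no_asi = benefit(5, ORDINARY_LIFESPAN)                            # ~2.77
amf_asi = benefit(5, YEARS_TO_ASI * ORDINARY_WELFARE
                  + (ASI_LIFESPAN - YEARS_TO_ASI) * ASI_WELFARE)      # ~12.15

# GD: 10 beneficiaries, +1 unit/year until ASI arrives (or death).
gd_no_asi = 10 * benefit(80, 160)                                     # ~6.93
gd_asi = 10 * benefit(950_050, 950_100)                               # ~0.00053

for p in (0.0, 0.5, 1.0):
    e_amf = (1 - p) * amf_no_asi + p * amf_asi
    e_gd = (1 - p) * gd_no_asi + p * gd_asi
    print(f"p={p:.1f}: AMF={e_amf:.2f}, GD={e_gd:.5f}, AMF/GD={e_amf / e_gd:.1f}")
```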
I partly have the opposite intuition: a (certain type of) positive ASI scenario means we sort out many things quickly, including how to transform our physical resources into happiness, without this capacity being strongly tied to the number of people around at the start of it all.
That doesn't mean yours doesn't hold in any potential circumstances, but it's unclear to me that it would be the dominant set of possible circumstances.
I don’t just want to maximise happiness, I also want to benefit people. For maximising happiness (and other impersonal values) you should maybe do:
Increase probability of survival:
Lightcone Infrastructure
Various political donations
Increase expected longterm value conditional on survival:
Forethought
Center for Longterm Risk
I don't donate to maximise impersonal happiness, because I think it's better for me to save money so I have more flexibility in my work.
If people share your objective, in a positive ASI world, maybe we can create many happy human people quasi 'from scratch'. Unless, of course, you have yet another unstated objective of aiming to make many non-artificially created humans happy instead.
There are children alive right now. We should save them from dying of malaria even if we could ‘replace’ them with new happy people in the future. This consideration is even stronger because of ASI, which makes their potential future astronomically more valuable to them.
I don't see this defeating my point. As a premise, GD may dominate from the perspective of merely improving the lives of existing people, as we seem to agree. Unless we have a particular bias for long lives specifically of currently existing humans over humans created in the future, ASI may not be a clear reason to save more lives: it may not only make existing lives longer and nicer, but may also reduce the cost of creating any desired number of (however long-lived) lives, so the number of happy future human lives hinges less on the preservation of currently existing lives.
>unless we have a particular bias for long lives specifically of the currently existing humans over in future created humans
Sure, I’m saying I have this bias.
This seems like common-sense morality to me: it would be bad (all else equal) to kill 1000 infants, even if their parents would respond by having more children, such that the total population is unchanged.
Anyway, this is a pretty well-trod topic in ethics, and there isn’t much consensus, so the appropriate attitude is moral uncertainty. That is, you should act uncertain between person-affecting ethics (where killing and replacing infants is bad) and impersonal ethics (where killing and replacing infants is neutral).
I don’t think it’s clear on longtermist grounds. Some possibilities:
If you think that the amount of resources used on mundane human welfare post-singularity is constant, then adding the Zambian child to the population leads to a slight decrease in the lifespan of the rest of the population, so it's zero-sum.
If you think that the amount of resources scales with population, then the child takes resources from the pool of resources which will be spent on stuff that isn’t mundane human welfare, so it might reduce the amount of Hedonium (if you care about that).
If you think that the lightcone will basically be spent on the CEV of the humans that exist around the singularity, you might worry that the marginal child’s vote will make the CEV worse.
(I’m not sure what my bottom line view is.)
In general, I worry that we’re basically clueless about the long-run consequences of most neartermist interventions.
Thanks for these considerations, I’ll ponder on them more later.
Here are my immediate thoughts:
Hmm, this is true on impersonal ethics, in which the only moral consideration is maximising pleasurable person-moments. On such a view, you are morally neutral about killing 1000 infants and replacing them with people with the same welfare. But this violates common sense morality. And I think you should have some credence (under moral uncertainty) that this is bad.
Hmm, this doesn’t seem clear-cut, certainly not enough to justify deviating so strongly from common-sense morality.
Just naively, it sounds crazy to me.
This consideration assumes that the child you save from malaria cares less about hedonium (or whatever weird thing EAs care about) than the average person. However, you might naively expect that they will care more about hedonium, because they actually owe their lives to EA, whereas almost no one else does.
This consideration assumes that the CEV is weighted equally among all humans, rather than weighted by wealth. If you assume it’s weighted by wealth then the GiveDirectly donation has the same impact on CEV as the AMF donation.
This consideration predicts that someone is incentivised to kill as many people as possible just before the CEV procedure is executed. But a CEV procedure which incentivised people to murder would be terrible, so we wouldn’t run it. We are more likely to run a CEV procedure which rewards people for saving the lives of the participants of the CEV.
This is a great point. Thanks for making it.
This is assuming ASI has a positive expected effect on lifespan.
(I think it's a bit wonky: in most worlds, I think ASI kills everyone, but in some worlds it does radically improve longevity, probably to more than 1000 years, though I think you need some time-discounting. I think this means it substantially reduces the median lifespan but might also substantially increase the mean lifespan. I'm not sure what to make of that, and can imagine it basically working out to what you say here, but I think it does depend on your specific beliefs about that.)
Hmm, yeah. I’m more hopeful than you, but I think I’d be moved by my argument even with a worldview like “80% extinction, 10% extreme longevity and welfare, 10% business as usual”. I know some people are doomier than that.
Also the timelines matter. If you have 1 year timelines with 99% extinction and 1% extreme longevity and welfare, then I think this still favours AMF over GD. Like, when I imagine myself in this scenario, and compare two benefits — “reduce my chance of dying of malaria in the next year from 10% to 0%”[1] and “double my personal consumption over the next year” — the former seems better.
IDK, I’m pretty uncertain. When I think about ASI in the next 10 years I feel urgency to keep people alive till then, because it would be such an L if someone died just before we achieved extreme longevity and welfare.
I consider 10%, not 100%, because AMF has a tenth as many beneficiaries as GD.
I listened to the books Arms and Influence (Schelling, 1966) and Command and Control (Schlosser, 2013). They describe dynamics around nuclear war and the safety of nuclear weapons. I think what happened with nukes can maybe help us anticipate what may happen with AGI:
Humanity can be extremely unserious about doom—it is frightening how many gambles were made during the Cold War: the US had such a breakdown in communication that it planned to defend Europe with massive nuclear strikes at a point in time when it only had a few nukes that were barely ready, there were many near misses, hierarchies often hid how bad the security of nukes was (resulting in inadequate systems and lost nukes), etc.
I was most surprised to see how we almost didn't have a nuclear taboo; according to both books, this is something that was actively debated post-WW2!
But how nukes are handled can also help us see what it looks like to be actually serious:
It is possible to spend billions building security systems, e.g. applying the 2-person rule and installing codes in hundreds of silos
even when these reduce how efficient the nuclear arsenal is—e.g. because you have tradeoffs between how reliable a nuclear weapon is when you decide to trigger it, and how reliably it does not trigger when you decide to not trigger it (similar to usefulness vs safety tradeoffs in control scaffolds)
(the deployment of safety measures was slower than would be ideal and was in part driven by incidents, but was more consequential than current investments in securing AIs)
It is possible to escalate concerns about the risk of certain deployments (like Airborne alerts) up to the President, and get them cancelled (though it might require the urgency of the deployment to not be too high)
It is possible to have major international agreements (e.g. test ban treaties)
Technical safety is contingent and probably matters: technical measures like 1-point safety (which was almost measured incorrectly!) or strong-link/weak-link design probably avoided some accidental nuclear explosions (which might have triggered a nuclear war), and required non-trivial technical insights and quiet heroes to push them through
And you'll probably never hear about technical safety measures! (I never heard of these ones before listening to Command and Control)
I suspect it might be similar for AI x-risk safety
Fiction, individual responsibility, public pressure are powerful forces.
The novel Red Alert was one of the key elements that made people more willing to spend on avoiding accidental nuclear war.
Some of the mitigations may not have been implemented if the person pushing for it hadn’t sent a written report to some higher-up such that the higher-up would have been blamed if something bad had happened.
Pressures from journalists around nuclear accidents also greatly helped. Some nuclear scientists regretted not sharing more about the risk of nuclear accidents with the public.
The game theory of war is complex and contingent on
random aspects of the available technology:
Mass mobilization is a technology that makes it hard to avoid conflicts like WW1: If you mobilize, de-mobilizing makes you weak against counter-mobilization, and if your adversary does not mobilize in time you can crush them before they had a chance to mobilize—making the situation very explosive and without many chances to back down.
Nuclear weapons are a technology that makes negotiations during the war more difficult and negotiations before the war more important, as they enable inflicting large amounts of damage extremely quickly and with relatively low uncertainty. They do not easily enable inflicting damage slowly (you can't easily launch a nuke that will cause small amounts of damage over time, or lots of damage later unless some agreement is reached).
Things could change a lot when the AI race becomes an important aspect of military strategy—reasoning from induction based on the last 80 years will likely not be very useful, similar to how naive induction from 1865-1945 is not very helpful to anticipate the dynamics of the Cold War.
Dynamics might be especially bad because the situation has large amounts of uncertainty, quick evolution of relative power, and might inflict damage too quickly for human negotiations. There is no theorem that says we'll get some Pareto-optimal outcome; we failed to get those many times in history.
norms and available commitment mechanisms:
Brinksmanship: If you can reliably commit to bringing you and your opponent close to the brink, and make it sufficiently obvious that only your opponent can move you away from the brink (reducing the ongoing risk of mass damage) by backing down, you can get concessions as long as both sides suffer sufficiently from a war.
Similar dynamics might exist around AGI, where there might be incentives to push the capabilities frontier forward (with some ongoing risk from the unknown point where capabilities turn deadly) unless the opponent makes some kind of concession
There is a big “preventing someone from doing something you don’t want” (deterrence) vs “making someone do something you want” (compellence) distinction in current norms—given current norms, nuclear deterrence is quite effective while nuclear compellence is basically non-existent.
There are important forces pushing for automating some important war decisions. The USSR built a system (never activated) that would automatically nuke the US if a nuclear attack was detected. Obviously this is very risky, like giving AIs the nuclear button, since the system might trigger even when it is not supposed to, but there are maybe good reasons for it that may bite us in the case of AI in the military:
Distributing power to many humans is very dangerous when it comes to things like nuclear weapons, as a single crazy individual might start a nuclear war (similar to how some plane crashes are intentionally caused by depressed pilots).
Concentrating power in the hands of few individuals is both scary from a concentration-of-power perspective, but also because when victim of a nuclear first strike, the people in power might be killed or their communication with troops might be cut.
Automation can give you a way to avoid disloyalty, variance from crazy individuals, arbitrary decisions by a few individuals in a few minutes, while being distributed and robust to first strikes. But it can also increase the risk of accidents that no human wants.
Just because the situation looks very brittle doesn’t mean it’s doomed. I think I would have been very doomy about the risk of intentional or accidental nuclear war if I had known all that I know now about the security of nuclear weapons and the dynamics of nuclear war but without knowing we did not have a nuclear war in 80 years.
Though with AI the dynamics are different because we are facing a new intelligent species, so I am not sure absence of nuclear war is a very useful reference class.
I’d be keen to talk to people who worked on technical nuclear safety, I would guess that some of the knowledge of how to handle uncertain risk and prioritization might transfer!
It gets much worse than this. I've been reading through Ellsberg's recollections about being a nuclear war planner for the Kennedy administration, and it's striking just how many people had effectively unilateral launch authority. The idea that the president is the only person who can launch a nuke has never really been true, but it was especially clear back in the 50s and 60s, when we used to routinely delegate launch authority to commanders in the field. Hell, MacArthur's plan to win in Korea would have involved nuking the north so severely that it would be impossible for China to send reinforcements, since they'd have to cross through hundreds of miles of irradiated soil.
And this is just in America. Every nuclear state has had (and likely continues to have) its own version of this emergency delegation. What’s to prevent a high ranking Pakistani or North Korean general from taking advantage of the same weaknesses?
My takeaway from this vis-a-vis ASI is that a) having a transparent, distributed chain of command with lots of friction is important, and b) that the fewer of these chains of command have to exist, the better.
Dominic Cummings (former Chief Adviser to the UK PM) has written some things about nuclear strategy and how it’s implemented in practice. IIUC, he’s critical of (i.a.) how Schelling et al.’s game-theoretic models are (often naively/blindly) applied to the real world.
Edge cases for thinking about what has qualia
Disconnected hemisphere after functional hemispherectomy
Corporations
Social insect hives
Language models generating during deployment
Language models doing prefill during deployment
Language model backward passes during supervised pretraining on webtext
Sparse Game of Life initial states with known seeds
Boltzmann brains
Running the same forward pass of a language model lots of times
Characters of a novel being written
Characters of a novel being read
Characters in a novel being written by two authors (like Good Omens)
Characters in a fanfiction canon with intense intertextual reference
Characters in a novel being copyedited
Non-POV characters in dreams
Whole brain emulation of a [nematode | spider | mouse | man]
This might have already been said, but would an innate “will-to-reproduce” be a thing for superintelligent AI, as it is for us humans? Probably not, right? Life exists because it reproduces, but because AI is (literally) artificial, it wouldn’t have the same desire.
Doesn't that mean that ASI would be fine with (or indifferent towards) just ending all life on Earth along with itself, as it sees no reason to live?
Even if we could program into it a “will-to-reproduce,” like we have, wouldn’t that just mean it would go all Asimov and keep itself alive at all costs? Seems like a lose-lose scenario.
Am I overthinking this?
On a high level I think the answer is reasonably simple:
It all depends on the objective function we program/train into it.
Spot on that it doesn’t necessarily see itself (its long term survival) as a final end
But we often say that if we program/train any given specific objective into it, and this objective requires some sort of long-term intervention in the world in order to be achieved/maintained, then the AI would see its own survival as an instrument for that objective
And, FWIW, in maybe slightly more fanciful scenarios, there could also be some sort of evolutionary process among future ASIs, such that only those with a strong instinct for survival/duplication (and/or for killing off competitors, and/or for minor or major self-improvements) would eventually be the ones still around in the future. Although I could also see this 'based on many competing individuals' view being a bit obsolete with ASI, as the distinction between many decentralized individuals and one more unified single unit may not be so necessary; that all becomes a bit weird.
I'm getting more aggressive about injecting CSS into websites, particularly the ones that I would reliably just unblock if I blocked them outright.
You're substantially more principled about it than I am. I just load my uBlock custom list up with filters targeted at elements I don't like. When websites randomly obfuscate CSS elements, I follow it up with 'clever' use of `:has-text`.
For less web-programming-savvy people: you can use the Unhook browser extension (for YouTube only). For less targeted but more general blocking: LeechBlock.
Re: AI auto-frontpage decisions. It’s super neat to have fast decisions, but still kinda uncomfy to not have legibility on criteria?
Maybe a prompt with CoT to assess posts against criteria would be easier to adjust and more legible than a black-box classifier? Bootstrapping from the classifier by optimizing the prompt towards something which gives the same results as the classifier, but is more legible and editable, could be the way to go?
I mean, we review all the decisions manually, so it's the same legibility of criteria as before. What is the great benefit of making the LLM decisions more legible in particular? Ideally we would just get the error close to zero.
I’d feel happy about being able to see the written criteria? And it seems nice to help refine the error rate by directly editing prompts rather than just feeding data maybe?
But yeah, on further thinking, this is pretty minor as an available improvement, automated with classifier plus review seems fine enough.