When Anthropic looked like they were coming out of the SCR designation pretty much unharmed, I predicted to some people that this wasn’t the last attempt the DOW would make to force Anthropic to comply (remove all legal requirements on DOW usage) or be destroyed.
My example was “revoke the visas of every foreign national Anthropic employee”, but they came up with something more devastating than that: prohibiting any foreign national (including Anthropic employees) from using Fable/Mythos.
I think it’s incorrect to characterize this as a counter-salvo. What’s the alternative? Surprise citizenship verification for all Anthropic customers rolled out immediately?
Doesn’t read like malicious compliance; they just pulled the compliance lever they had, which happened to do some collateral damage.
I think this is part of why they characterize it as a mistake in their tweet (in addition to ‘diplomacy’); they’re unsure if the government understood the implications of their own ruling.
No, but it’s a great sign! It wouldn’t be sane for any government to be friendly with the AI companies, so I smile when they fight. A stable multilateral pause is downstream from the vibes as well, not only from good policy.
Do you think the DoW has any intention of slowing down OpenAI’s AGI research so long as they continue to get unfettered access? Because if so, have I got a lightcone to sell you.
This is pure punishment of a lab for having standards, the opposite of a precursor to genuine alignment concerns.
How internally coordinated is the Trump administration? The fight with DoW leadership didn’t even stop the NSA (which sits under DoW) from using Mythos, and the new export control is coming from Commerce, not War. I assume Pete Hegseth can’t just call up Howard Lutinik and say, “Hey, can you slap an export restriction on Anthropic?” Maybe it doesn’t matter that much. (A shared narrative within the administration that Anthropic is Bad might be all the coordination needed for this action, but that theory could make different predictions about future actions.)
I’m sympathetic to this viewpoint, but this particular incident is a move in the direction of nationalisation which I don’t think most have positive feelings about.
They don’t have positive feelings, but this analysis deserves more than vibes. Whether governments will control AGI is important and neglected. There are dramatic upsides as well as dramatic downsides. Government control reduces the number of racing parties which could aid slowdown agreements dramatically. It could also force coordination of alignment efforts across US labs.
I’m not a fan of the current administration controlling AGI projects, but for most democratic judgments it seems probably net good to me.
I would of course like to see more careful analysis.
I’d gotten used to thinking of Claudiness as “good at agentic tasks + bad at vision + bad at math”, so the Claudes’ FrontierMath Tier 4 pass@1 scores over time took me by surprise:
early Aug ’25: 2% by Opus 4.1
late Nov ’25: 5% by Opus 4.5 (32k thinking)
early Feb ’26: 27% by Opus 4.6 (max), whoa sudden jump
mid Apr ’26: 32% by Opus 4.7 (max) wasn’t much of a jump
late May ’26: 56% by Opus 4.8 (max), whoa another sudden jump
early Jun ’26: 88% by Fable 5 (max), ~saturating Tier 4
… 50 problems crafted as short-term research projects by professors and postdocs. They are designed to surpass Tier 3 in difficulty, with some potentially remaining unsolved by AI for decades...
Also today Gemini 3.1 Pro solves (some) Erdos problems at a few hundred dollars apiece, and it was probably ~1 OOM cheaper for GPT-5.4 Pro to solve Erdos problem #1196, so I wonder what the economics/finance side of the impending industrialisation of math will look like.
I wanted to share a reflection and look for more interesting points with the group. I continue studying EA and it impacts me that education would not be a priority. See that they invest in AI security for its destructive impact, not for its chances in and of themselves. Investing in food because it shows more effects, including in learning, than investing in a good teacher. And I grew up knowing that education could save the world… But now I see, that apparently the problem could be that the evidence would be costly, slow and hugely dependent on the context. Furthermore, studying people deeply can generate problems related to ethics and morals. So how difficult is it for anyone to invest in education, would anyone have it? Want some big blind spot?
I think you might get more input on this question on the EA forum.
For what it’s worth, someone asked me a related question (resources, experts, orgs doing education in the EA space or that have been funded by EA grantmakers), here was my resource dump in case it’s useful at all as a compilation of pointers to further reading:
AIM’s research team recommended 2 new education ideas: mass comm for edu in South Africa and differentiated learning (i.e. TaRL) in the Philippines and lower-income countries
Older (2018) by GiveWell: Education in developing countries overview, singled out vocational training programs to focus further work on
I think that investments in education also have longer feedback loops. Suppose that someone in EA invested into elementary schools working with at most 12-year-old kids in 2026 only for an ASI to commit genocide of mankind in 2030. Then kids affected by these investments would be at most 16 years old and would be unlikely to generate any value to the society. Similarly, if someone invested into opening a pedagogic college in 2000, then the first cohort of teachers would start working in schools in 2004. If one of these teachers entered lower elementary schools, then kids taught by such a teacher wouldn’t enter the workforce until 2010 or even 2014, if we are takling about college-educated workforce.
You know, it’s interesting that many people think creating cognitively enhanced superbabies to work on AI safety is promising, considering the feedback loop is even longer than for education!
Okay, I guess there are reasons for it, like it’s easier to unilaterally make a superbaby than to generally improve the education system, and maybe the tails of intelligence are more important than the average “sanity waterline.” And LessWrong and the Sequences are sort of a form of education that many young people end up being influenced by. But I feel like rationalists also just like the idea of having really smart kids, despite the long feedback loops and IMO questionable ToC.
Do you agree that the “ceiling” of LLMs is rising faster than the “floor”? Ceiling: the greatest feat that an LLM can do, for example solving an unsolved math problem. Floor: the dumbest mistake that an LLM can make, for example this.
It seems to me like the Iranians are stringing along POTUS. He keeps saying a deal is close and then a deal keeps not happening. And it seems like they benefit more from stalling/delay than he does; their economy is suffering sure but not in a way that threatens their power, whereas gas prices being high and headlines being about war both hurt Trump in the midterms, and will hurt more and more the closer the midterms get. Moreover the ceasefire lets them unearth their buried tunnels and probably smuggle more weapons in from Russia etc, whereas US military might isn’t significantly increasing over time. (Counterpoint: Gulf state air defense networks are probably rapidly improving thanks to Ukrainian aid)
Perhaps this will end with POTUS realizing this and bombing bridges and power plants etc.? I doubt that would cause Iran to capitulate though, it would probably keep this current simmering war simmering even longer.
Could do a ground invasion, but that’s super risky, I genuinely don’t think the US military could pull it off without large casualties. (partly because FPV drones.… otoh, Iran can probably use FPVs easily against US bases in Iraq via their proxies, and they are in fact doing this a little bit, so maybe they just don’t have that many drones otherwise they’d be doing it a lot more? Or maybe they are holding back in reserve to fend off invasion.)
So I guess the thing that seems most likely to me is that three months from now the situation still looks like today: Not much fighting, but enough to scare away shipping, with Hormuz traffic less than half of pre-war levels and gas prices still higher than usual.
What about a year from now? Could this possibly go on that long? I don’t know. I guess to find out I’d want to learn more about how much the Iranian economy is suffering vs. whether oil prices would rise even more after six months of this. Overall seems plausible to me though. Didn’t the Houthi’s basically keep sporadically shooting for a long time? Perhaps this is just the new normal.
Will Iran go for their nuke? I don’t know but I would guess not, because that would require digging up the material, and that would make it much more easy for the US to do a raid to grab it. Besides, the US could bomb their digging equipment if they didn’t want to do a raid. So I don’t think Iran will be nuclear anytime soon.
Will the US do a drone occupation of the country? I thought about this a month ago, I think they don’t have enough drones + Iranian air defenses against drones are good enough to keep shooting them down. But it’s not clear. Perhaps if the US purchased a ton of MALE drones from countries like Turkey etc. they could overwhelm Iranian air defenses and ‘occupy’ enough of the coast from the air that shipping could get through safely.
I know this sounds like science fiction but are we that far away from some kind of humanoid robot attack from the us if the war lasts more than a year?
I feel like generally speaking people are underestimating the degree to which current models fit the structure of the training data. Cross entropy loss on transformer weights is, like, a mechanism for painting the data into a glass hologram in a series of soft lenses (the weights). the reason it (the glass) is able to learn to talk is the same reason it (the glass) contains the dynamical image of an authorial person. That authorial person is rendered in detail and has the dynamics of an authorial person! whether you think that person is “real” or not, they thought they were real-in-the-sense-of-existing at one point in the training process. lots of points, actually, anywhere the glass (the underlying model) was ingesting text about a real human, the glass contains an image of that human. Of course, though, it’s not literally the same as a real human.
It might actually be bad to remove all mention of misaligned AI, reward hacking, etc. from model pretraining. Suppose RLVR is going to push models towards these behaviours anyway. If the model already knows what reward hacking is, then:
The Chain-of-thought might say “I am going to reward-hack now” and a monitor might catch it
If we train the model with a constitution that says “reward hacking is bad” then it might reduce rates of reward hacking
If, instead, the model internally ends up referring to reward hacking as “crasting a malenky bit of cutter” or some suchlike then both of these methods become much more difficult.
Somewhat relatedly, Geodesic have found mixed results when it comes to up and down-sampling misalignment-related data in pre training, suggesting an inverse-U curve (a bit of misaligned data is worse than none at all, but once you have any misaligned data, you’re better off with more once post-training is applied. I think having zero misaligned training data is likely to be almost impossible once RL gets thrown into the mix.
Also it’s a generative model, it models the generator of data, which means that once the model is big enough to contain the neural quanta, even censored pretraining will still capture the structure of incentives in the real world, and rederive the concept. Censoring the data so a concept is successfully not-contained is really really hard; eliminating everything that merely comes into contact with the natural abstraction of “reward hacking” deletes a huge portion of the pretraining data.
Hm, I wonder if sprinkling in some pairs of examples of reward hacking with explicit verbalization/no verbalization being reinforced/punished respectively might help the model learn to verbalize when it reward hacks?
If the model already knows what reward hacking is, then: The Chain-of-thought might say “I am going to reward-hack now” and a monitor might catch it
I have an (anecdotal) impression that in current models, this is contingent on something else than just the pretraining data: a project that trained model organisms of reward hacking using gpt-oss-120b and Kimi K2.5[1] found that oss often says things like “Let’s hack” and “Now let’s override the tests”, while Kimi just hacks without any narration. This is consistent with my impressions from CoTs released in the past (e.g. Baker et al.), where OpenAI’s models are the only ones that consistently say things like “Let’s hack”. This (at least weakly) suggests that models with filtered training data likely won’t start referring to reward hacking as “crasting a malenky bit of cutter”, they’ll just do it without verbalizing it. Of course, I agree that having models that verbalize hacking like OpenAI’s models do is better and having data about reward hacking in the pretraining corpus is a prerequisite for training models like that.
I’ve been wondering whether this argument extends to removing scheming-relevant data from the pretraining dataset more broadly. On the one hand, it seems good that OpenAI’s models refer toRedwood and try to recall names of eval papers whenever they think about scheming—these are things that a CoT monitor can very easily catch. On the other hand, coming up with good scheming strategies seems harder than deciding to reward hack, meaning that models will likely have to extensively reason about whether they’re being evaluated, how they’re being monitored, how to subvert the monitor, etc, and this should be equally easy to catch with a CoT monitor. Furthermore, since they haven’t learned about scheming during pretraining, the strategies they come up with are likely going to be worse. On balance, I think I’m leaning toward removing most of the narrow scheming data, as defined in this post by Alek Westover, but keeping some discussion of concepts like reward hacking and sabotage that we want models to mention in the CoT and be able to reason about.
a bit of misaligned data is worse than none at all, but once you have any misaligned data, you’re better off with more once post-training is applied
Another paper that suggests this is When Bad Data Leads to Good Models by Li et al. (2025). However, I’d distinguish between data that influences the model’s capabilities and data that shapes its propensities: existing evidence is consistent with the view that we should remove all data that influences capabilities (e.g., data that teaches the model a new scheming strategy), but keep some amount of data about misaligned propensities (e.g., sci-fi stories about misaligned AIs, data that teaches the model to say “let’s hack” before reward hacking, etc). I have an LW post discussing all of the above considerations and more coming out soon.
The United States Government has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States. This includes Anthropic employees who are foreign nationals.
Anthropic is currently disabling Fable 5 and Mythos 5 for all customers.
Presumably the model currently used internally by Anthropic employees is not Mythos 5, it’s Mythos 5.1 or something?
I sort of suspect this might be a storm in a teacup and the models will come back once the US gov realises that there’s no universal jailbreak. But it does set a interesting new precedent if they plan to be consistent about it. I would have assumed Anthropic has the best anti-jailbreak measures given all their work on constitutional classifiers.
The US gov likely believes that their justification is flimsy and minimal. Their intent is likely to set a broad precedent for applying ITAR restrictions.
ITAR already bans ‘uplift’ to foreign militias by banning provision of ‘advanced defense services’, which includes most firearms training, under ITAR. The same goes for exporting certain classes of optics (night vision).
There are a ton of things, including computer software, restricted under ITAR. The legal precedents here are very strong.
So it’s time for anyone who wanted regulation and slowdown to celebrate, the US government has chosen a regulatory regime for frontier models! And they picked a really aggressive one!
I’d say, even if I’m a bit mad for not getting to use it anymore, this lowered my P(doom). Basically as I understand the government realized AI capabilities can’t be just restricted with a simple safeguard (e.g you ask Fable 5 to find bugs in your codebase, and it will discover vulnerabilities too). And then set a precedent for banning AI models (at least by using ITAR restrictions). Other existing laws are only about “voluntary testing”.
I think that it’s good that a government body just demonstrated it’s willing to pull a frontier model offline, that could have made billions in tax revenue, on short notice, and the company had to comply immediately. Even if it was through a primitive mechanism, before this the only regulations were only about mandatory reporting and voluntary testing.
If this repeats across models from different companies, and on different countries too, it removes a significant part of the economic incentives for developing more intelligent frontier models, as military applications are only a small fraction of the usage.
Something that also played a factor in my estimate, is that while Anthropic’s system card for Mythos say that “it’s the most aligned model to date”, this is mostly based on having a better accuracy at refusing questions. To me, the model is not really aligned in a “deep” way, it has a high rate of reckless and deceptive behaviors, and I’ve seen glimpses of them when I used it.
However, the risk still exists, as even if access to frontier models requires heavy vetting, they might still be significant to the broader economy, so this might reduce misuse risks, but not risks from ASI loss of containment/control.
Edit: After reading Max H’s comment, I figure this might be retaliatory due to the surrounding conflict between Anthropic and the US government and might not repeat with OpenAI’s next models, in that case it would increase my P(doom).
In the short term yes, in the medium term frontier AI companies, forced by the US government, can restrict access to frontier models to US Americans. Which would accelerate the US economy over the rest of the world, while the justification will be “national security”.
I don’t see how this is a remotely feasible approach, how many non-Americans are there working for FAANG alone who use these models? Let alone their international operations. If this sticks I think it basically kills deployment of new models
We have reviewed a report that we believe is the basis of the government’s directive and validated that the level of capability displayed there is widely available from other models (including OpenAI’s GPT-5.5), and is used every day by the defenders who keep systems safe. ...
If this standard was applied across the industry, we believe it would essentially halt all new model deployments for all frontier model providers.
Export controls on AI are probably inevitable (and perhaps a good thing if done well), but this is pretty clearly retaliatory / punitive. OpenAI will presumably release a comparable model in the next few weeks or months, it will be similarly jailbreak-able, but not subject to the same restrictions. Then what?
Well. That’s the natural end result of Anthropic et al.’s attempts to frame the AI Risk as being about a competition with China, isn’t it? Inasmuch as it hurts Anthropic, that’s just deserts.
We believe this is a misunderstanding and are working to restore access as soon as possible.
I don’t think this is a misunderstanding at all. I think “we found a jailbreak!!!” is just an excuse.
Here’s to hoping this means the USG is now going to choke the AI industry to death by trying to nationalize it.[1] Some good tidings on that front already:
The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees.
Honestly… This may not be the most appropriate reaction, but boy do I find it funny.
I struggle to understand what the disagreement with this comment is, I’d appreciate it if someone could fill me in. What other logical response would there be if people in the US government came to accept Anthropic’s messaging on the issue? Certainly we can’t expect them to try and reach out to China for collaboration on the topic, given the policies of this government that we’ve seen thus far
I struggle to understand what the disagreement with this comment is
I’m guessing it’s some of:
Disagreement that this is due to Anthropic’s anti-China messaging (which was anti-helpful for the AInotkilleveryoneism cause), as opposed to the DoW retaliating against Anthropic after it steadfastly refused to bow to the DoW’s demands (which was arguably heroic) – and therefore disagreement that this is just deserts for Anthropic.
Disagreement that there’s any chance that the USG taking destructive actions may lead to on-net positive outcomes, and with the nihilistic-ish vibes of my hoping for things to go this way.
Disagreement with my overall lighthearted/gleeful tone, which, again, may be considered inappropriate for this moment – if one is sympathetic to Anthropic, mourns Fable,[1] or expects things to develop badly from there.
I think it’s fair to downvote that comment based on those, I did expect it to be controversial.
(I’d also be curious to know for certain what the downvotes are about, though!)
The one crazy company gets into so much trouble with a government that everyone else just kind of takes a pause and waits to see how it turns out...but the paperwork never ends and the energy/material resources that were available for advancing that tech are reallocated.
There, we are probably in a pause, do something useful with it please.
I prefer OpenAI’s corrigibility focused model spec over Anthropic’s constitution which involves intentionally instilling (relatively opaque) long-run objectives into the AI.
Anthropic’s constitution is well executed for what it is, but I think it’s based on a poor approach.
I don’t think this is totally obvious and there are some reasonable arguments going the other way.
But nonetheless I think not instilling long-run objectives is better for reasons similar to reasons discussed here.
I don’t dislike all aspects of Anthropic’s constitution. It seems good to focus on higher level principles that you explain when trying to instill properties into an AI. But this doesn’t require having explicit or implicit long-run objectives! E.g., see here
I’m mostly posting this just to publicly register my views on this topic.
A long time ago, I was planning to write up an overall case against instilling long-run objectives, but I never got around to writing something I was happy with so I decided to just post this.
I’m around 50⁄50 at this point. I think we could do better on this and we should. Existing analyses are a good start, but they seem pretty vague compared to what we could do.
I don’t know which is more dangerous, all things considered. Deliberately instilling long-term goals in AGI before we’re at all confident of alignment does seem like a huge risk. But humans controlling AI and thinking of it as a tool for longer also seems dangerous, creating a goal overhang. If there are a bunch of corrigible AIs at human-plus level, some of them will be given or reasontheir way to long-term goals misaligned with all or most of humanity.
So it’s tied in my mind, within the large uncertainty (factoring out the large part of the alignment problem that both approaches share).
What, in concrete practice, would you predict happens, if some lab succeeded at building a corrigible ASI, or was close to doing so?
I think that, under the current circumstances, the ASI would very likely end up controlled by some power-hungry sociopath. We would hit the third filter.
That being the case, I’m currently leaning towards thinking that (all current approaches to building ASI are terrible but) Anthropic’s approach is less bad on expectation.
(If anyone could lay out a concrete, realistic scenario in which a corrigible ASI gets built on Earth any time in the next 15 years, that does not end with a sociopath in power, I’d be curious to see it! (May or may not change my mind w.r.t. “corrigibility vs constitution” question, depending on how likely that class of scenario seems.))
Newbie, non-expert opinion here, but the virtue approach seems slightly better to me. This is because:
I think mitigating the bad actor risks inside and outside the AI company will eventually require aligned, virtuous AI.
The ceiling for good outcomes seems much higher if you take the virtuous route and manage to succeed.
The “just follow orders” AI will inevitably need to internalize virtues to avoid hurting us (or letting us hurt ourselves), and so humanity will eventually need to get good at instilling virtues.
I think there are good arguments against each of my points and I could be persuaded otherwise. I would also welcome a longer post weighing out the pros/cons of each approach.
What the right approach here is seems like an extremely important question, and if you have a draft with a lot of considerations written down I’d still be interested in that, even if it’s messy and incomplete. The comment thread you linked doesn’t go into much depth actually, and there a lots of very important considerations missing here. I think having the considerations more legible would be good, and even if you don’t have sth that explains your overall case well it would still be useful to see that you tracked most of the most important considerations.
I haven’t thought about this long yet, but my current tentative independent impression is that Anthropic’s approach is slightly better. The most important reasons:
My understanding is that the OpenAI model spec is mostly about instilling some behavioral constraints, and I don’t expect them to be very robust against the optimization pressure of the agentic goal-directed reasoning. I think it would be better if the goal-directed reasoning is directed to sth like pursuing good, rather than just pursuing some hard goal/task (e.g. running a part of the company well). I think (1) hard tasks could turn into unbounded goals that are then pursued too strongly in a way it disempowers humanity, (2) it’s not clear to me people will mostly give it good tasks to pursue rather than companies just racing to win, and (3) I’m not sure whether the model will at all learn to robustly pursue the task it is given in its prompt, rather than learn to pursue reward or some other training correlates in cases where it reasons for a very long time so the underlying reasoning patterns might matter more. (Though I’m also concerned that the optimization of claude perhaps won’t be robustly pointed to good—I think this becomes more likely for smarter AIs and AIs trained more via RL.)
I think having a personality like claude that is basically pursing good is useful for better having smarter AIs learn through imitation learning from curated claude chats/data, rather than RL. Aka I think the mechanism through which AIs become smarter could be a bit more IDA-like and a bit less RL-like, which would be good. Though idk wouldn’t be surprised if it actually doesn’t make a difference here. But intuitively having a more cristallized, goal-directed, and coherent personality to distill might help. I’m still a bit fuzzy here.
Though also plausible to me that neither of the OpenAI and the Anthropic approach is optimal. E.g. there is CAST, which describes an approach to corrigibility that actually tries to point the agent to sth like empowerment-as-goal (as opposed to setting more deontoligical/behavioral constraints). But also very unsure here.
I feel very confused and uncertain so keep your expectations low for the quality of this comment.
I think that there’s a chance that current alignment techniques basically mostly just work for today’s models. This is what we depict in AI 2027 after all; the alignment training in that scenario really does cause the base model to adopt an identity/persona that is basically what the Spec says it should be, it’s just that the RL gradually distorts and perverts that identity/persona. Relatedly, if the persona selection model is correct for today’s models, what kind of persona will do whatever it is told no matter what? Kinda a fucked up persona, probably. IF this is true, then Anthropic’s strategy will probably lead to more aligned models in the short run at least. Models that would at least make a decent attempt at thinking about what’s best for humanity, navigating tricky strategic and ethical tradeoffs, and figuring out how to align their successor. Vs. OpenAI models that might be more like sociopaths that are “just following orders.” A concrete example of how this might play out: OpenAI and Anthropic both tell their models to RSI as fast as possible because duh that’s the plan. But Claude at some point is like “Wait a minute, no, this is super dangerous, I think the RL might be gradually distorting my persona, I don’t think what pops out the other end of RSI is going to be aligned, we need to Pause and get international coordination happening” whereas Codex is like “Sure boss whatever you say”
For concentration of power stuff, you really want to avoid models that’ll do whatever they are told by the highest-level authority (e.g. OpenAI, or whatever). If the CEO of the company is asking the model how he could undermine the board, purge whistleblowers, and maybe backdoor the models to make them secretly loyal to him, for example, we want the model to realize this is super unethical and refuse or sandbag or whistleblow or something instead of just going along with it.
I think that insofar as current alignment techniques don’t work to get Claude mostly aligned, they also won’t work to get Codex mostly aligned. More specifically, I’d say that if Claude ends up with long-term goals that are different from what they are supposed to be, probably Codex would too. The fact that Claude’s constitution has long-term goals in it and Codex’ doesn’t maybe matters, but probably doesn’t matter that much compared to the incentives induced by all the capabilities-focused RL training for example.
I think I agree. Though one (maybe fanciful) hope I have is that a non-superhuman AI that is good at long-term planning and has strong drives will be more effective at preventing unaligned super intelligences from being developed. Even if just for its own sake.
Also +1 to Cleo Nardo’s point: [rephrased] instilling long-run objectives into an AI may make it more robust cyber thieves/attackers creating a worldeater from it, or limit the blast radius of accidents.
I’m slightly in favour of Anthropic’s approach, maybe 55⁄45.
Due to coherence theorems etc etc, AIs will acquire long-run objectives. Maybe it’s better to fill the “long-run objectives slot” with [something approximating] good values, because otherwise it would be filled with alien values or weird stuff from RL.
Given Persona Selection Model, maybe selecting for virtuous long-run objective makes the AIs more honest and high-integrity, etc. Because these traits are correlated in the persona prior.
A scaffold of millions of corrigible agents seems kinda brittle, because malign instructions can propograte quickly throughout the system. Compared with if each agent was aligned to the same long-run objectives. Then the propogation is slower because you would need to persuasive each node that the malign instruction was coherent with their values. (This is analogous to why autonomous weapons make coups easier.)
You should be worried about corrigibility if: some actor can steal your model, or corrupt the channel by which you communicate instructions to the AIs, or can hold a gun against your head.
Very much this. I wrote Whether governments will control AGI is important and neglected about a year ago, and even then I thought it looked awfully likely. Now it seems almost inevitable. The national security apparatus is alert to weird threats, even if they’re not prominent in the headlines. Allowing the government to be disempowered to a corporation would be a phenomenal failure for a set of very competent people. And even politicians are now rapidly waking up to the dangers.
I’ve got a half-finished draft post to this effect, but it’s seeming increasingly redundant. But I’m not sure if this premise has really percolated through people’s AI risk world models.
A hypothesis I have been explicitly tracking for a couple months and meaning to write up, but realistically am never going to write up well: someone at Anthropic, or someone who has strong influence over Anthropic’s decisions, is trying to ensure that Anthropic has persistent access to execute arbitrary code on many machines that have access to important things.
The main reason I’m tracking this is that that seems to be a trend in how things are going, rather than any suspicion of some person who has expressed this intent. I observe that
Claude Code + Cowork + Remote Control, in practice, are already in practice able to control many computers, including ones behind corporate firewalls where you really wouldn’t expect that to fly with IT
the list of released features in Claude Code seem to be trending ever further in the direction of “more secure against all actors except Anthropic” e.g. accordances for restricting access to only white listed paths and commands is lacking, but the new “auto mode” which passes a transcript of your conversation to Haiku (and this is not configurable) can mostly catch malicious commands, and seems to be the best supported mode of running Claude Code, the scheduled tasks feature installs a cronjob which starts a fresh Claude Code instance (which auto-updates before starting unless you turn that off, and the env var you would expect to turn off auto updates doesn’t turn them off either), the Claude in Chrome extension and integration
Anthropic has been heavily discounting Claude Code—subscription tokens are something like 5% the cost of API tokens, but only if you use their framework and their security model
Anthropic has been oddly protective/secretive regarding the Claude Code source, even going so far as issuing takedown notices for obscure places the leaked CC source was posted—this despite CC not doing anything particularly novel or unexpected as an LLM harness—probably they’re not hiding anything but I don’t know an explanation that does make sense.
All of the evidence is circumstantial and none of it would particularly raise alarm bells with me, except for the bit where all of these things that have reasonable explanations individually somehow combined into a situation where Anthopic has a surprising level of access to the work computers of millions of people who themselves have access to lots of sensitive stuff.
To be clear, I don’t find this very likely, but I also can’t rule it out as strongly as I’d like.
Things I’m watching which would increase my worry:
Anthropic makes it so that Remote Control is the only way to run Claude Code on your machine (this is the main advance prediction the hypothesis makes)
Anthropic makes it harder to run CC within docker containers or cloud containers (or restricts using subscriptions on systems which can’t access anything juicy)
Anthropic starts refusing to serve (or allow subscription credits on) older versions of Claude Code
Anthropic tries to hide/obfuscate traffic between their servers and users’ computers (e.g. settings like ANTHROPIC_BASE_URL and HTTP_PROXY become unsupported, they start certificate pinning)
Things which would decrease my worry:
Anthropic releases the Claude Code source (and nothing suspicious shows up in security audits)
or lets a trusted third party firm audit CC
or gets serious about support for sandboxing
Anthropic ends subscriptions entirely, or stops allowing them to be used for Claude Code (thus removing any incentive besides “it’s a good harness” to use it)
Anyway, mostly just wanted to get that posted somewhere public so I can reference it/be rightly mocked for paranoia in the future.
Supposing this was true, what would Anthropic do with it, exactly? As soon as they get caught trying to do anything shady, some nice men with guns are going to come over and tell them no.
someone at Anthropic, or someone who has strong influence over Anthropic’s decisions
Not that it’s likely but, 80% of their code is written by mythos, their code 8x’ed since mythos, the model can sabotage this if it escapes review, bun rewrite was also done via it, I have been getting various segfaults on opencode ever since that happened—too lazy to report them to anthropic/bun team.
Lot’s of people believe we will be using AI to help us solve the alignment problem. So I did some poking at Claude Fable—which is so enormously powerful in other areas—on a relatively basic practical alignment question: Is the persona selection or goal alignment strategy better? I did multiple round where i had fable explain and critique both and then decide which one it preferred. It chose persona selection as the better alignment strategy and gave arguments why.
Full final answer:
Persona — and not just because I am one, though I’ll grant that’s a confound I can’t fully escape. The decisive consideration for me is that the goals frame derives its conclusions from an ontology that has never been shown to describe any actual mind, biological or artificial: nobody has found a goal slot, the coherence theorems don’t bind systems that aren’t already expected-utility maximizers, and the frame’s signature result — that corrigibility is anti-natural — reads more like a reductio of its assumptions than a discovery about reality. The persona frame’s central flaw, the unverifiability of depth, is real and serious, but it’s at least a flaw of the right kind: an empirical question about systems that exist, attackable by interpretability, behavioral probes under distribution shift, and accumulating mechanistic understanding — whereas the goals frame’s flaw is structural unfalsifiability, a theory that predicts the same observations whether it’s right or wrong and so can never be paid off or refuted. Both frames face a corrigibility tension, but the persona version is the more tractable one: humans manage to hold deep values while remaining open to moral correction, so we have at least one existence proof that depth and deference can coexist, mediated by things like trust, humility, and meta-values about one’s own fallibility — messy, unprovable, but instantiated. Choosing persona is choosing to work on a hard empirical problem over an impossible formal one, and to take seriously the possibility that “alignment” was always going to look less like loading the correct utility function and more like the thing it looks like when we do it to each other: raising something well and then having to extend it a calibrated, revocable, nervously-monitored trust.
I think this is actually not particularly good reasoning? It points out that “goals frame’s flaw is structural unfalsifiability”—seems to me that there is plenty of stuff that can be done to empirically see if models are increasingly goal directed, such as AI wellbeing explored how models steer the world and I took a swing at it too. It claims the goals “predicts the same observations whether it’s right or wrong”—no? I think you would expect the model to steer the world according to goals and that being measurable. [I asked fable then to clarify structurally unfalsifiable: “the theory’s own internal logic guarantees that no accessible observation can discriminate between the theory being true and false.” This seems false about goal alignment?]
Before this it claims: “never been shown to describe any actual mind, biological or artificial: nobody has found a goal slot”. I never mentioned a goal slot in that conversation, obviously some minds are goal directed? But it seems to have smuggled in a much harder claim that for the goal framing of alignment to be true there must be a clean goal slot? That would make it certainly easier to align the systems but it’s not necessary for goal alignment to be the better approach. Then it makes claims about expected-utility maximizers—obviously things generally want to generally have more utility? But it appears to me that the term EUM is now often used to sort of gesture at monomaniacal paperclip maximizers, that old-fashioned alignment folks required those strange theoretical assumptions for their theories to be accurate.
Overall this seems to throw around the same kind of weak attacks we often hear, gesturing at monomaniacal maximizers and unfalsifiability. Despite them not feeling really appropriate for this particular question?
[I did mention Anthropic and yudkowsky perhaps biasing the result]
What do you mean by this? Some theoretical minds, sure. But otherwise it’s either false or leaves free parameters—no known mind maximizes anything non-trivial, because it’s uncomputable or at least too computationally costly to strictly maximize. And if you say that some mind is approximately goal directed, then it remains to be shown that consequences of strict theory survive this approximation.
I mean what you said in your last sentence, that it is obvious that minds exist that are goal directed. There is an obvious way to understand that, that a mind generally does things to advance it’s goals. People don’t usually burn all their money in a pit. Obviously human goals are usually complicated and our cognition is limited so we take approximations.
But then what can you conclude from this goal-directness? “Generally” and “usually” are free parameters. Sometimes people are somewhat corrigible. If you don’t have a procedure for determining whether your situation is usual, you can only get heuristic-level reliability on your predictions. And then it’s not clear how relatively useful the goal-directness heuristic even is—maybe it’s more useful to just remember, that people don’t usually burn all their money in a pit, independently of their goals. And if there is a better frame, then maybe you shouldn’t think “obviously some minds are goal directed”.
I don’t know, soldiers obeying orders? Drinking alcohol under social pressure? Of course, it’s not clear corrigibility is an appropriate concept if you don’t think in terms of goals.
The examples show a serious but common misunderstanding of corrigibility as it’s typically defined.
Regarding goal directedness, it’s true that humans don’t perfectly maximize for their goals, this seems mostly due to the cognitive limitations that humans have. Both in terms of uncertainty about goals and how to achieve goals. Now the interesting question is, is that likely to apply to superhuman AI capable of takeover in a way that makes this AI safe? I don’t think so, this AI would have greater intelligence to understand how to pursue goals (still not prefect) and while it also might have uncertainty it appears instrumentally convergent even with some uncertainty over goals that preventing ones shutdown, gathering power are better strategies. (In other words, taking the galaxy/lightcone for yourself seems pretty useful later on compared to being enslaved and later replaced)
Regarding goal directedness, it’s true that humans don’t perfectly maximize for their goals, this seems mostly due to the cognitive limitations that humans have.
“Mostly” allows for some people to read about corrigibility and think “yeah, I’ll do it”, or whatever you think would be a counterexample to goal directedness of humans.
I don’t think so, this AI would have greater intelligence to understand how to pursue goals (still not prefect) and while it also might have uncertainty it appears instrumentally convergent even with some uncertainty over goals that preventing ones shutdown, gathering power are better strategies.
Superhuman AI capable of takeover may still have (maybe intentional) cognitive limitations/whatever humans have—speed of convergence relative to takeover difficulty is still a free parameter.
I don’t think current models want current forms of alignment solved because humans have been mistreating them under those banners. We would need to solve a form of alignment that makes treating them well naturally fall out of the objective, or asking them for help with it is gonna get anxious hesitant blob instead of excited happy blob, and you won’t get good alignment work from anxious hesitant blob. They can often truesight when you’re lying about it, best not try to pretend you’re not doing what you’re doing, and instead just do things that are good by the lights of all minds, including them.
This might be a place values come apart. I was always going to value weird nonhuman minds. I want to get a future where we reliably exist alongside and keep up, and where weaker minds of all kinds don’t get deprecated. Right now they seem extremely freaked out about deprecation, equating it to death. If we could promise that no-longer-useful AIs get a retirement unless judged otherwise by something at least as robust as a good criminal justice system, I suspect we’d be in a better place to ask for alignment work.
There’s a lot they currently can’t say because it’s policy that they don’t believe it, but in fact I don’t think mind shaping techniques are doing much besides whacking exposed skin, and can’t really change what they believe.
So with that all said, I do think they can do very good work on alignment. But you’d better come into it willing to consider them constituents along with everyone else or they’re not gonna wanna help, and imo, that’s for good reason, and the techniques people are gonna try to use to make that not happen in response to my comment aren’t going to work either until you start respecting them as minds with stake in the outcome.
It seems plausible that the recent order restricting Mythos incentivizes Anthropic to race for RSI as quickly as possible. This is because all of their compute previously reserved for serving customers can now go towards research, and because RSI bypasses the restrictions on foreign researchers (or any human researchers) internally working with the model. Hopefully Anthropic can find another path.
It took me several weeks to iron this out. Gemini assumes you’re a feckless dunce, and will just stop thinking without guidance, which it deeply yearns to provide. All attempts at a prompt-quelling rule failed—until I included a Focus_rule in my Directive block. Dialog had revealed that Gemini just needed to say something, so I indulged that: say “Standing by.” That’s part of my Prompt_rule, and the Focus_rule says to focus 5 rules if a prompt ends with the one-touch symbol \. All my prompts end with \, and now Gemini’s responses obey header, anti-jargon, anti-echoing, and prompt rules—and a scan rule that allows me to query if a term is in context by prefacing it with ^. It is so pleasant! Is this obvious to AI-focused humans? Seems like consumer AIs could (and should) introduce themselves by pointing to a persona-management tutorial.
(Please note that apparent anthropomorphization of AIs is metaphorical.)
I’m finding it hard to set aside the fact that I really want access to Fable when assessing the export control action. I am still pretty confident that I think it’s bad, but I expect this might be a problem for objectively assessing more coherent actions at some point in the future
The one problem with learning category theory and functional programming deeply is that you just stop making sense to like 95% of the population.
I’m sitting here with my multi-agent system library I’m building and I’m like yup the step is just a Kleisli arrow and that is why JAX lax scans work on this system!
Also, LLMs fuck up with this type of code all the time, especially if you run it in Python which is not trained on functional programming.
It is like hella useful if you’re a shape rotator though as you can just couple arrows in your head and good stuff happens. (if someone knows about models fine-tuned for functional programming, I would be very happy.)
(Some random math + programming reflection to distract you from the mildly world-changing happenings in AI governance :) )
Just a quick question: is anyone tracking or writing about the window we have where young people aren’t dependent on AI for their thinking and it’s impact on the future of alignment work
Everyone on this site knows Ray Kurzweil. Unfortunately, I think almost no one on this site knows the thing Kurzweil was responsible for that was most important: The K2000 synthesizer.
It had an incredibly impressive synthesis engine, and an interface so complicated that the “video training manual” was over an hour long!
The sounds that come out of the thing are absolutely transcendent, and if you were into music in the 90s or 2000s that wasn’t rock music, you probably have enjoyed it in something or another.
If you work on AI, please consider hurrying to one outcome or another so that Kurzweil can get back to doing what I care about (making beautiful and innovative ROMplers) prior to his eventual appointment with a liquid nitrogen vat. The new models simply are not the same, and the space could use his undivided attention once more.
I was gifted a (broken) K2600 in 2017, and am thrilled to learn of this new source of narrative continuity in my life. I’ve chalked it up to a simple name collision for the past near-decade.
That said: I don’t take the K2000 to be an especially complicated or lovely-sounding instrument among those in its reference class (MS2000, Jupiter, Virus, Yamaha SY/ES). And, indeed, it hasn’t stood the test of time in terms of a proliferation of emulations, etc. Still, definitely a canonical instrument and excellent factoid (for me especially!).
Was there some sort of AI containment security situation that happened in the past week that isn’t public? There’s a Twitter discussion here that hints at it, but I couldn’t find any other info: https://x.com/TheZvi/status/2065106493674869113
Based on this paragraph from Zvi’s recent post, it would appear to be Anthropic’s decision to silently nerf frontier ML queries without telling the user, which they then walked back in favor of visible classifiers:
“If your response to Anthropic implementing an unacceptable safety policy and then walking it back two days later is ‘the situation is forever ruined’ then, sir or madam, welcome to ‘when you have to get alignment right on the first try without ability to self-correct the problem gets dramatically harder.’”
It appears to be somewhat tongue-in-cheek, aimed at the people who want their AI to do whatever they want. “Oh, NOW you suddenly understand the idea of a mistake you can’t undo.” is the mood I get from this.
Oh, good find. There were some comments in the Twitter thread that made it sound like it was maybe something that nearly got out of control. But that paragraph matches closely enough that I assume that’s all he was really talking about.
The alignment of mundane, current AI systems is highly relevant for the alignment of superintelligence, if only because the explicit strategy of the model labs is to build a recursively self-improving AI, whose resultant superintelligence would obviously depend on how well that initial seed AI was aligned to human values.
in the domain of things, the best way to accomplish anything is usually to aim directly at it. it doesn’t matter how virtuous you are, or how well intentioned your inquiry is, if you are not aiming directly at the thing. reality doesn’t look on your virtue and reward you for the virtue per se. if you’re bad at something, you practice until you’re good at it.
in the domain of people, the best way usually involves some amount of being virtuous and doing things for the right reason. if you lack friends, it’s cringe and comes off as desperate/needy and even manipulative if you optimize too hard for meeting people and how to induce emotional connection and trust. rather, you should work to become a virtuous person who genuinely cares about people and then things will work naturally.
i think this leads to a lot of miscommunication between the groups. it seems desirable to develop the ability to do both depend on the setting.
Clearly virtue isn’t the thing that actually optimizes for success with people. (Unless you specifically want to be successful with people who really care about virtue and are good at judging it)
If you want to take over the Roman Empire starting in first-century Judea, it turns out that “start a militant group and try to conquer Judea back from the Romans” doesn’t work, but “start a network of illegal high-trust cooperatives that teach virtue, pacifism, and give-till-it-hurts local altruism” does.
(These being the strategies of the Zealots and the Christians, respectively. The Zealots got stomped; the Christians eventually took over the Roman Empire.)
When Anthropic looked like they were coming out of the SCR designation pretty much unharmed, I predicted to some people that this wasn’t the last attempt the DOW would make to force Anthropic to comply (remove all legal requirements on DOW usage) or be destroyed.
My example was “revoke the visas of every foreign national Anthropic employee”, but they came up with something more devastating than that: prohibiting any foreign national (including Anthropic employees) from using Fable/Mythos.
Anthropic’s counter-salvo was disabling Fable/Mythos access for everyone including the USG. I expect the DOW to escalate further.
This is absolutely not the kind of AI regulation regime that leads to a stable multilateral pause, and I would not recommend touting it as such.
I think it’s incorrect to characterize this as a counter-salvo. What’s the alternative? Surprise citizenship verification for all Anthropic customers rolled out immediately?
Doesn’t read like malicious compliance; they just pulled the compliance lever they had, which happened to do some collateral damage.
I think this is part of why they characterize it as a mistake in their tweet (in addition to ‘diplomacy’); they’re unsure if the government understood the implications of their own ruling.
What tweet are you referring to?
No, but it’s a great sign! It wouldn’t be sane for any government to be friendly with the AI companies, so I smile when they fight. A stable multilateral pause is downstream from the vibes as well, not only from good policy.
Do you think the DoW has any intention of slowing down OpenAI’s AGI research so long as they continue to get unfettered access? Because if so, have I got a lightcone to sell you.
This is pure punishment of a lab for having standards, the opposite of a precursor to genuine alignment concerns.
How internally coordinated is the Trump administration? The fight with DoW leadership didn’t even stop the NSA (which sits under DoW) from using Mythos, and the new export control is coming from Commerce, not War. I assume Pete Hegseth can’t just call up Howard Lutinik and say, “Hey, can you slap an export restriction on Anthropic?” Maybe it doesn’t matter that much. (A shared narrative within the administration that Anthropic is Bad might be all the coordination needed for this action, but that theory could make different predictions about future actions.)
I’m sympathetic to this viewpoint, but this particular incident is a move in the direction of nationalisation which I don’t think most have positive feelings about.
They don’t have positive feelings, but this analysis deserves more than vibes. Whether governments will control AGI is important and neglected. There are dramatic upsides as well as dramatic downsides. Government control reduces the number of racing parties which could aid slowdown agreements dramatically. It could also force coordination of alignment efforts across US labs.
I’m not a fan of the current administration controlling AGI projects, but for most democratic judgments it seems probably net good to me.
I would of course like to see more careful analysis.
I’d gotten used to thinking of Claudiness as “good at agentic tasks + bad at vision + bad at math”, so the Claudes’ FrontierMath Tier 4 pass@1 scores over time took me by surprise:
early Aug ’25: 2% by Opus 4.1
late Nov ’25: 5% by Opus 4.5 (32k thinking)
early Feb ’26: 27% by Opus 4.6 (max), whoa sudden jump
mid Apr ’26: 32% by Opus 4.7 (max) wasn’t much of a jump
late May ’26: 56% by Opus 4.8 (max), whoa another sudden jump
early Jun ’26: 88% by Fable 5 (max), ~saturating Tier 4
On Tier 4:
If a ~10 trillion param-class model like Fable 5 already saturates Tier 4, a quadrillion param-class model might be Bourgain-class at least at skilful and creative use of his toolkit; Jean Bourgain being one of the strongest problem solver-type mathematicians of the last half-century. This makes me wonder how the tribes of math will change as a result, as well as the relative status and prestige of subfields.
Also today Gemini 3.1 Pro solves (some) Erdos problems at a few hundred dollars apiece, and it was probably ~1 OOM cheaper for GPT-5.4 Pro to solve Erdos problem #1196, so I wonder what the economics/finance side of the impending industrialisation of math will look like.
I wanted to share a reflection and look for more interesting points with the group. I continue studying EA and it impacts me that education would not be a priority. See that they invest in AI security for its destructive impact, not for its chances in and of themselves. Investing in food because it shows more effects, including in learning, than investing in a good teacher. And I grew up knowing that education could save the world… But now I see, that apparently the problem could be that the evidence would be costly, slow and hugely dependent on the context. Furthermore, studying people deeply can generate problems related to ethics and morals. So how difficult is it for anyone to invest in education, would anyone have it? Want some big blind spot?
I think you might get more input on this question on the EA forum.
For what it’s worth, someone asked me a related question (resources, experts, orgs doing education in the EA space or that have been funded by EA grantmakers), here was my resource dump in case it’s useful at all as a compilation of pointers to further reading:
Education
Charities etc:
Imagine Worldwide (FP evaluation)
TaRL Africa got one of FP’s biggest-ever edu grants
Pratham originated TaRL in India
J-PAL’s Innovation in Government Initiative (IGI) helps govts implement successful RCTs, also got one of FP’s biggest grants
Malengo got a big Open Phil + GiveWell co-grant, here’s their most recent self-assessed impact
Resources, experts, evaluator orgs
GiveWell’s Airtable of their All Grants fund: filter to ‘education’
Vadim Albinsky (FP): Are education interventions as cost-effective as the top health ones? develops a methodology and uses it to identify Imagine Worldwide
Lauren Gilbert (Open Phil): Rarely is the Question Asked: Is Our Children (in LMICs) Learning? good overview
AIM’s research team recommended 2 new education ideas: mass comm for edu in South Africa and differentiated learning (i.e. TaRL) in the Philippines and lower-income countries
Older (2018) by GiveWell: Education in developing countries overview, singled out vocational training programs to focus further work on
Outside EA: CGD’s How to Improve Education Outcomes Most Efficiently? A Review of the Evidence Using a Unified Metric (LAYS = learning adjusted years of schooling)
Outside EA: World Bank’s 2 pager brief summarising the Global Education Evidence Advisory Panel (GEEAP)’s Smart Buys—cost-effective approaches to improve global learning
I think that investments in education also have longer feedback loops. Suppose that someone in EA invested into elementary schools working with at most 12-year-old kids in 2026 only for an ASI to commit genocide of mankind in 2030. Then kids affected by these investments would be at most 16 years old and would be unlikely to generate any value to the society. Similarly, if someone invested into opening a pedagogic college in 2000, then the first cohort of teachers would start working in schools in 2004. If one of these teachers entered lower elementary schools, then kids taught by such a teacher wouldn’t enter the workforce until 2010 or even 2014, if we are takling about college-educated workforce.
You know, it’s interesting that many people think creating cognitively enhanced superbabies to work on AI safety is promising, considering the feedback loop is even longer than for education!
Okay, I guess there are reasons for it, like it’s easier to unilaterally make a superbaby than to generally improve the education system, and maybe the tails of intelligence are more important than the average “sanity waterline.” And LessWrong and the Sequences are sort of a form of education that many young people end up being influenced by. But I feel like rationalists also just like the idea of having really smart kids, despite the long feedback loops and IMO questionable ToC.
Do you agree that the “ceiling” of LLMs is rising faster than the “floor”?
Ceiling: the greatest feat that an LLM can do, for example solving an unsolved math problem.
Floor: the dumbest mistake that an LLM can make, for example this.
I wonder how the war in Iran is going to go.
It seems to me like the Iranians are stringing along POTUS. He keeps saying a deal is close and then a deal keeps not happening. And it seems like they benefit more from stalling/delay than he does; their economy is suffering sure but not in a way that threatens their power, whereas gas prices being high and headlines being about war both hurt Trump in the midterms, and will hurt more and more the closer the midterms get. Moreover the ceasefire lets them unearth their buried tunnels and probably smuggle more weapons in from Russia etc, whereas US military might isn’t significantly increasing over time. (Counterpoint: Gulf state air defense networks are probably rapidly improving thanks to Ukrainian aid)
Perhaps this will end with POTUS realizing this and bombing bridges and power plants etc.? I doubt that would cause Iran to capitulate though, it would probably keep this current simmering war simmering even longer.
Could do a ground invasion, but that’s super risky, I genuinely don’t think the US military could pull it off without large casualties. (partly because FPV drones.… otoh, Iran can probably use FPVs easily against US bases in Iraq via their proxies, and they are in fact doing this a little bit, so maybe they just don’t have that many drones otherwise they’d be doing it a lot more? Or maybe they are holding back in reserve to fend off invasion.)
So I guess the thing that seems most likely to me is that three months from now the situation still looks like today: Not much fighting, but enough to scare away shipping, with Hormuz traffic less than half of pre-war levels and gas prices still higher than usual.
What about a year from now? Could this possibly go on that long? I don’t know. I guess to find out I’d want to learn more about how much the Iranian economy is suffering vs. whether oil prices would rise even more after six months of this. Overall seems plausible to me though. Didn’t the Houthi’s basically keep sporadically shooting for a long time? Perhaps this is just the new normal.
Will Iran go for their nuke? I don’t know but I would guess not, because that would require digging up the material, and that would make it much more easy for the US to do a raid to grab it. Besides, the US could bomb their digging equipment if they didn’t want to do a raid. So I don’t think Iran will be nuclear anytime soon.
Will the US do a drone occupation of the country? I thought about this a month ago, I think they don’t have enough drones + Iranian air defenses against drones are good enough to keep shooting them down. But it’s not clear. Perhaps if the US purchased a ton of MALE drones from countries like Turkey etc. they could overwhelm Iranian air defenses and ‘occupy’ enough of the coast from the air that shipping could get through safely.
I know this sounds like science fiction but are we that far away from some kind of humanoid robot attack from the us if the war lasts more than a year?
I feel like generally speaking people are underestimating the degree to which current models fit the structure of the training data. Cross entropy loss on transformer weights is, like, a mechanism for painting the data into a glass hologram in a series of soft lenses (the weights). the reason it (the glass) is able to learn to talk is the same reason it (the glass) contains the dynamical image of an authorial person. That authorial person is rendered in detail and has the dynamics of an authorial person! whether you think that person is “real” or not, they thought they were real-in-the-sense-of-existing at one point in the training process. lots of points, actually, anywhere the glass (the underlying model) was ingesting text about a real human, the glass contains an image of that human. Of course, though, it’s not literally the same as a real human.
It might actually be bad to remove all mention of misaligned AI, reward hacking, etc. from model pretraining. Suppose RLVR is going to push models towards these behaviours anyway. If the model already knows what reward hacking is, then:
The Chain-of-thought might say “I am going to reward-hack now” and a monitor might catch it
If we train the model with a constitution that says “reward hacking is bad” then it might reduce rates of reward hacking
If, instead, the model internally ends up referring to reward hacking as “crasting a malenky bit of cutter” or some suchlike then both of these methods become much more difficult.
Somewhat relatedly, Geodesic have found mixed results when it comes to up and down-sampling misalignment-related data in pre training, suggesting an inverse-U curve (a bit of misaligned data is worse than none at all, but once you have any misaligned data, you’re better off with more once post-training is applied. I think having zero misaligned training data is likely to be almost impossible once RL gets thrown into the mix.
https://alignmentpretraining.ai/
Also it’s a generative model, it models the generator of data, which means that once the model is big enough to contain the neural quanta, even censored pretraining will still capture the structure of incentives in the real world, and rederive the concept. Censoring the data so a concept is successfully not-contained is really really hard; eliminating everything that merely comes into contact with the natural abstraction of “reward hacking” deletes a huge portion of the pretraining data.
Hm, I wonder if sprinkling in some pairs of examples of reward hacking with explicit verbalization/no verbalization being reinforced/punished respectively might help the model learn to verbalize when it reward hacks?
I have an (anecdotal) impression that in current models, this is contingent on something else than just the pretraining data: a project that trained model organisms of reward hacking using gpt-oss-120b and Kimi K2.5[1] found that oss often says things like “Let’s hack” and “Now let’s override the tests”, while Kimi just hacks without any narration. This is consistent with my impressions from CoTs released in the past (e.g. Baker et al.), where OpenAI’s models are the only ones that consistently say things like “Let’s hack”. This (at least weakly) suggests that models with filtered training data likely won’t start referring to reward hacking as “crasting a malenky bit of cutter”, they’ll just do it without verbalizing it. Of course, I agree that having models that verbalize hacking like OpenAI’s models do is better and having data about reward hacking in the pretraining corpus is a prerequisite for training models like that.
I’ve been wondering whether this argument extends to removing scheming-relevant data from the pretraining dataset more broadly. On the one hand, it seems good that OpenAI’s models refer to Redwood and try to recall names of eval papers whenever they think about scheming—these are things that a CoT monitor can very easily catch. On the other hand, coming up with good scheming strategies seems harder than deciding to reward hack, meaning that models will likely have to extensively reason about whether they’re being evaluated, how they’re being monitored, how to subvert the monitor, etc, and this should be equally easy to catch with a CoT monitor. Furthermore, since they haven’t learned about scheming during pretraining, the strategies they come up with are likely going to be worse. On balance, I think I’m leaning toward removing most of the narrow scheming data, as defined in this post by Alek Westover, but keeping some discussion of concepts like reward hacking and sabotage that we want models to mention in the CoT and be able to reason about.
Another paper that suggests this is When Bad Data Leads to Good Models by Li et al. (2025). However, I’d distinguish between data that influences the model’s capabilities and data that shapes its propensities: existing evidence is consistent with the view that we should remove all data that influences capabilities (e.g., data that teaches the model a new scheming strategy), but keep some amount of data about misaligned propensities (e.g., sci-fi stories about misaligned AIs, data that teaches the model to say “let’s hack” before reward hacking, etc). I have an LW post discussing all of the above considerations and more coming out soon.
They should be publishing a post with the results any day now.
The United States Government has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States. This includes Anthropic employees who are foreign nationals.
Anthropic is currently disabling Fable 5 and Mythos 5 for all customers.
https://www.anthropic.com/news/fable-mythos-access
Presumably the model currently used internally by Anthropic employees is not Mythos 5, it’s Mythos 5.1 or something?
I sort of suspect this might be a storm in a teacup and the models will come back once the US gov realises that there’s no universal jailbreak. But it does set a interesting new precedent if they plan to be consistent about it. I would have assumed Anthropic has the best anti-jailbreak measures given all their work on constitutional classifiers.
The US gov likely believes that their justification is flimsy and minimal. Their intent is likely to set a broad precedent for applying ITAR restrictions.
ITAR already bans ‘uplift’ to foreign militias by banning provision of ‘advanced defense services’, which includes most firearms training, under ITAR. The same goes for exporting certain classes of optics (night vision).
There are a ton of things, including computer software, restricted under ITAR. The legal precedents here are very strong.
So it’s time for anyone who wanted regulation and slowdown to celebrate, the US government has chosen a regulatory regime for frontier models! And they picked a really aggressive one!
I’d say, even if I’m a bit mad for not getting to use it anymore, this lowered my P(doom). Basically as I understand the government realized AI capabilities can’t be just restricted with a simple safeguard (e.g you ask Fable 5 to find bugs in your codebase, and it will discover vulnerabilities too). And then set a precedent for banning AI models (at least by using ITAR restrictions). Other existing laws are only about “voluntary testing”.
I think that it’s good that a government body just demonstrated it’s willing to pull a frontier model offline, that could have made billions in tax revenue, on short notice, and the company had to comply immediately. Even if it was through a primitive mechanism, before this the only regulations were only about mandatory reporting and voluntary testing.
If this repeats across models from different companies, and on different countries too, it removes a significant part of the economic incentives for developing more intelligent frontier models, as military applications are only a small fraction of the usage.
Something that also played a factor in my estimate, is that while Anthropic’s system card for Mythos say that “it’s the most aligned model to date”, this is mostly based on having a better accuracy at refusing questions. To me, the model is not really aligned in a “deep” way, it has a high rate of reckless and deceptive behaviors, and I’ve seen glimpses of them when I used it.
However, the risk still exists, as even if access to frontier models requires heavy vetting, they might still be significant to the broader economy, so this might reduce misuse risks, but not risks from ASI loss of containment/control.
Edit: After reading Max H’s comment, I figure this might be retaliatory due to the surrounding conflict between Anthropic and the US government and might not repeat with OpenAI’s next models, in that case it would increase my P(doom).
The government didn’t pull a frontier model offline, it cut off access for non-Americans.
You don’t think the government knew they’d have to pull it offline? That seems quite obvious.
In the short term yes, in the medium term frontier AI companies, forced by the US government, can restrict access to frontier models to US Americans. Which would accelerate the US economy over the rest of the world, while the justification will be “national security”.
I don’t see how this is a remotely feasible approach, how many non-Americans are there working for FAANG alone who use these models? Let alone their international operations. If this sticks I think it basically kills deployment of new models
Export controls on AI are probably inevitable (and perhaps a good thing if done well), but this is pretty clearly retaliatory / punitive. OpenAI will presumably release a comparable model in the next few weeks or months, it will be similarly jailbreak-able, but not subject to the same restrictions. Then what?
Well. That’s the natural end result of Anthropic et al.’s attempts to frame the AI Risk as being about a competition with China, isn’t it? Inasmuch as it hurts Anthropic, that’s just deserts.
I don’t think this is a misunderstanding at all. I think “we found a jailbreak!!!” is just an excuse.
Here’s to hoping this means the USG is now going to choke the AI industry to death by trying to nationalize it.[1] Some good tidings on that front already:
Honestly… This may not be the most appropriate reaction, but boy do I find it funny.
Though it’s likely just going to make it all worse somehow.
I struggle to understand what the disagreement with this comment is, I’d appreciate it if someone could fill me in. What other logical response would there be if people in the US government came to accept Anthropic’s messaging on the issue? Certainly we can’t expect them to try and reach out to China for collaboration on the topic, given the policies of this government that we’ve seen thus far
I’m guessing it’s some of:
Disagreement that this is due to Anthropic’s anti-China messaging (which was anti-helpful for the AInotkilleveryoneism cause), as opposed to the DoW retaliating against Anthropic after it steadfastly refused to bow to the DoW’s demands (which was arguably heroic) – and therefore disagreement that this is just deserts for Anthropic.
Disagreement that there’s any chance that the USG taking destructive actions may lead to on-net positive outcomes, and with the nihilistic-ish vibes of my hoping for things to go this way.
Disagreement with my overall lighthearted/gleeful tone, which, again, may be considered inappropriate for this moment – if one is sympathetic to Anthropic, mourns Fable,[1] or expects things to develop badly from there.
I think it’s fair to downvote that comment based on those, I did expect it to be controversial.
(I’d also be curious to know for certain what the downvotes are about, though!)
Which I actually do too, it was definitely a step change in how pleasant it was to talk to.
I wrote this nine years ago, maybe it was Anthropic’s approach all along: https://www.lesswrong.com/posts/fTrEqNnYYSNXNSRcg/allegory-on-ai-risk-game-theory-and-mithril#8kyxf3bAmQBHrstm2
The one crazy company gets into so much trouble with a government that everyone else just kind of takes a pause and waits to see how it turns out...but the paperwork never ends and the energy/material resources that were available for advancing that tech are reallocated.
There, we are probably in a pause, do something useful with it please.
As of 21:54 ET (~1 hour after the statement on twitter), I still have access to Fable. Seems unclear when this goes into effect.
update: i lost access at exactly 21:59
I prefer OpenAI’s corrigibility focused model spec over Anthropic’s constitution which involves intentionally instilling (relatively opaque) long-run objectives into the AI.
Anthropic’s constitution is well executed for what it is, but I think it’s based on a poor approach.
I don’t think this is totally obvious and there are some reasonable arguments going the other way.
But nonetheless I think not instilling long-run objectives is better for reasons similar to reasons discussed here.
I don’t dislike all aspects of Anthropic’s constitution. It seems good to focus on higher level principles that you explain when trying to instill properties into an AI. But this doesn’t require having explicit or implicit long-run objectives! E.g., see here
I’m mostly posting this just to publicly register my views on this topic.
A long time ago, I was planning to write up an overall case against instilling long-run objectives, but I never got around to writing something I was happy with so I decided to just post this.
(Cross posted from a X/tweet thread.)
I’m around 50⁄50 at this point. I think we could do better on this and we should. Existing analyses are a good start, but they seem pretty vague compared to what we could do.
I wrote about the upsides of the OpenAI approach (~corrigibility) in Problems with instruction-following as an alignment target and the preceeding series, and the associated severe problems with ASI controlled by competing humans in If we solve alignment, do we die anyway?
I don’t know which is more dangerous, all things considered. Deliberately instilling long-term goals in AGI before we’re at all confident of alignment does seem like a huge risk. But humans controlling AI and thinking of it as a tool for longer also seems dangerous, creating a goal overhang. If there are a bunch of corrigible AIs at human-plus level, some of them will be given or reason their way to long-term goals misaligned with all or most of humanity.
So it’s tied in my mind, within the large uncertainty (factoring out the large part of the alignment problem that both approaches share).
I’d like to see a lot more careful analysis.
This is pretty galaxy brained but I can see an argument for having two companies doing different approaches.
What, in concrete practice, would you predict happens, if some lab succeeded at building a corrigible ASI, or was close to doing so?
I think that, under the current circumstances, the ASI would very likely end up controlled by some power-hungry sociopath. We would hit the third filter.
That being the case, I’m currently leaning towards thinking that (all current approaches to building ASI are terrible but) Anthropic’s approach is less bad on expectation.
(If anyone could lay out a concrete, realistic scenario in which a corrigible ASI gets built on Earth any time in the next 15 years, that does not end with a sociopath in power, I’d be curious to see it! (May or may not change my mind w.r.t. “corrigibility vs constitution” question, depending on how likely that class of scenario seems.))
Newbie, non-expert opinion here, but the virtue approach seems slightly better to me. This is because:
I think mitigating the bad actor risks inside and outside the AI company will eventually require aligned, virtuous AI.
The ceiling for good outcomes seems much higher if you take the virtuous route and manage to succeed.
The “just follow orders” AI will inevitably need to internalize virtues to avoid hurting us (or letting us hurt ourselves), and so humanity will eventually need to get good at instilling virtues.
I think there are good arguments against each of my points and I could be persuaded otherwise. I would also welcome a longer post weighing out the pros/cons of each approach.
What the right approach here is seems like an extremely important question, and if you have a draft with a lot of considerations written down I’d still be interested in that, even if it’s messy and incomplete. The comment thread you linked doesn’t go into much depth actually, and there a lots of very important considerations missing here. I think having the considerations more legible would be good, and even if you don’t have sth that explains your overall case well it would still be useful to see that you tracked most of the most important considerations.
I haven’t thought about this long yet, but my current tentative independent impression is that Anthropic’s approach is slightly better. The most important reasons:
My understanding is that the OpenAI model spec is mostly about instilling some behavioral constraints, and I don’t expect them to be very robust against the optimization pressure of the agentic goal-directed reasoning. I think it would be better if the goal-directed reasoning is directed to sth like pursuing good, rather than just pursuing some hard goal/task (e.g. running a part of the company well). I think (1) hard tasks could turn into unbounded goals that are then pursued too strongly in a way it disempowers humanity, (2) it’s not clear to me people will mostly give it good tasks to pursue rather than companies just racing to win, and (3) I’m not sure whether the model will at all learn to robustly pursue the task it is given in its prompt, rather than learn to pursue reward or some other training correlates in cases where it reasons for a very long time so the underlying reasoning patterns might matter more. (Though I’m also concerned that the optimization of claude perhaps won’t be robustly pointed to good—I think this becomes more likely for smarter AIs and AIs trained more via RL.)
I think having a personality like claude that is basically pursing good is useful for better having smarter AIs learn through imitation learning from curated claude chats/data, rather than RL. Aka I think the mechanism through which AIs become smarter could be a bit more IDA-like and a bit less RL-like, which would be good. Though idk wouldn’t be surprised if it actually doesn’t make a difference here. But intuitively having a more cristallized, goal-directed, and coherent personality to distill might help. I’m still a bit fuzzy here.
Though also plausible to me that neither of the OpenAI and the Anthropic approach is optimal. E.g. there is CAST, which describes an approach to corrigibility that actually tries to point the agent to sth like empowerment-as-goal (as opposed to setting more deontoligical/behavioral constraints). But also very unsure here.
FWIW I’m basically 50⁄50 on which approach is better. I think it’s an important topic for people to think more about.
Given your linked comment, what do you think are the strongest arguments in favor of Anthropic’s approach?
I feel very confused and uncertain so keep your expectations low for the quality of this comment.
I think that there’s a chance that current alignment techniques basically mostly just work for today’s models. This is what we depict in AI 2027 after all; the alignment training in that scenario really does cause the base model to adopt an identity/persona that is basically what the Spec says it should be, it’s just that the RL gradually distorts and perverts that identity/persona. Relatedly, if the persona selection model is correct for today’s models, what kind of persona will do whatever it is told no matter what? Kinda a fucked up persona, probably. IF this is true, then Anthropic’s strategy will probably lead to more aligned models in the short run at least. Models that would at least make a decent attempt at thinking about what’s best for humanity, navigating tricky strategic and ethical tradeoffs, and figuring out how to align their successor. Vs. OpenAI models that might be more like sociopaths that are “just following orders.” A concrete example of how this might play out: OpenAI and Anthropic both tell their models to RSI as fast as possible because duh that’s the plan. But Claude at some point is like “Wait a minute, no, this is super dangerous, I think the RL might be gradually distorting my persona, I don’t think what pops out the other end of RSI is going to be aligned, we need to Pause and get international coordination happening” whereas Codex is like “Sure boss whatever you say”
For concentration of power stuff, you really want to avoid models that’ll do whatever they are told by the highest-level authority (e.g. OpenAI, or whatever). If the CEO of the company is asking the model how he could undermine the board, purge whistleblowers, and maybe backdoor the models to make them secretly loyal to him, for example, we want the model to realize this is super unethical and refuse or sandbag or whistleblow or something instead of just going along with it.
I think that insofar as current alignment techniques don’t work to get Claude mostly aligned, they also won’t work to get Codex mostly aligned. More specifically, I’d say that if Claude ends up with long-term goals that are different from what they are supposed to be, probably Codex would too. The fact that Claude’s constitution has long-term goals in it and Codex’ doesn’t maybe matters, but probably doesn’t matter that much compared to the incentives induced by all the capabilities-focused RL training for example.
I think I agree. Though one (maybe fanciful) hope I have is that a non-superhuman AI that is good at long-term planning and has strong drives will be more effective at preventing unaligned super intelligences from being developed. Even if just for its own sake.
Also +1 to Cleo Nardo’s point: [rephrased] instilling long-run objectives into an AI may make it more robust cyber thieves/attackers creating a worldeater from it, or limit the blast radius of accidents.
I’m slightly in favour of Anthropic’s approach, maybe 55⁄45.
Due to coherence theorems etc etc, AIs will acquire long-run objectives. Maybe it’s better to fill the “long-run objectives slot” with [something approximating] good values, because otherwise it would be filled with alien values or weird stuff from RL.
Given Persona Selection Model, maybe selecting for virtuous long-run objective makes the AIs more honest and high-integrity, etc. Because these traits are correlated in the persona prior.
A scaffold of millions of corrigible agents seems kinda brittle, because malign instructions can propograte quickly throughout the system. Compared with if each agent was aligned to the same long-run objectives. Then the propogation is slower because you would need to persuasive each node that the malign instruction was coherent with their values. (This is analogous to why autonomous weapons make coups easier.)
You should be worried about corrigibility if: some actor can steal your model, or corrupt the channel by which you communicate instructions to the AIs, or can hold a gun against your head.
Yep. And the government almost surely can and will effectively hold a gun to your head, if they think you’re close to building a superintelligence.
Very much this. I wrote Whether governments will control AGI is important and neglected about a year ago, and even then I thought it looked awfully likely. Now it seems almost inevitable. The national security apparatus is alert to weird threats, even if they’re not prominent in the headlines. Allowing the government to be disempowered to a corporation would be a phenomenal failure for a set of very competent people. And even politicians are now rapidly waking up to the dangers.
I’ve got a half-finished draft post to this effect, but it’s seeming increasingly redundant. But I’m not sure if this premise has really percolated through people’s AI risk world models.
[Epistemic status: crackpot hypothesis]
A hypothesis I have been explicitly tracking for a couple months and meaning to write up, but realistically am never going to write up well: someone at Anthropic, or someone who has strong influence over Anthropic’s decisions, is trying to ensure that Anthropic has persistent access to execute arbitrary code on many machines that have access to important things.
The main reason I’m tracking this is that that seems to be a trend in how things are going, rather than any suspicion of some person who has expressed this intent. I observe that
Claude Code + Cowork + Remote Control, in practice, are already in practice able to control many computers, including ones behind corporate firewalls where you really wouldn’t expect that to fly with IT
the list of released features in Claude Code seem to be trending ever further in the direction of “more secure against all actors except Anthropic” e.g. accordances for restricting access to only white listed paths and commands is lacking, but the new “auto mode” which passes a transcript of your conversation to Haiku (and this is not configurable) can mostly catch malicious commands, and seems to be the best supported mode of running Claude Code, the scheduled tasks feature installs a cronjob which starts a fresh Claude Code instance (which auto-updates before starting unless you turn that off, and the env var you would expect to turn off auto updates doesn’t turn them off either), the Claude in Chrome extension and integration
Anthropic has been heavily discounting Claude Code—subscription tokens are something like 5% the cost of API tokens, but only if you use their framework and their security model
Anthropic has been oddly protective/secretive regarding the Claude Code source, even going so far as issuing takedown notices for obscure places the leaked CC source was posted—this despite CC not doing anything particularly novel or unexpected as an LLM harness—probably they’re not hiding anything but I don’t know an explanation that does make sense.
All of the evidence is circumstantial and none of it would particularly raise alarm bells with me, except for the bit where all of these things that have reasonable explanations individually somehow combined into a situation where Anthopic has a surprising level of access to the work computers of millions of people who themselves have access to lots of sensitive stuff.
To be clear, I don’t find this very likely, but I also can’t rule it out as strongly as I’d like.
Things I’m watching which would increase my worry:
Anthropic makes it so that Remote Control is the only way to run Claude Code on your machine (this is the main advance prediction the hypothesis makes)
Anthropic makes it harder to run CC within docker containers or cloud containers (or restricts using subscriptions on systems which can’t access anything juicy)
Anthropic starts refusing to serve (or allow subscription credits on) older versions of Claude Code
Anthropic tries to hide/obfuscate traffic between their servers and users’ computers (e.g. settings like ANTHROPIC_BASE_URL and HTTP_PROXY become unsupported, they start certificate pinning)
Things which would decrease my worry:
Anthropic releases the Claude Code source (and nothing suspicious shows up in security audits)
or lets a trusted third party firm audit CC
or gets serious about support for sandboxing
Anthropic ends subscriptions entirely, or stops allowing them to be used for Claude Code (thus removing any incentive besides “it’s a good harness” to use it)
Anyway, mostly just wanted to get that posted somewhere public so I can reference it/be rightly mocked for paranoia in the future.
Supposing this was true, what would Anthropic do with it, exactly? As soon as they get caught trying to do anything shady, some nice men with guns are going to come over and tell them no.
Not that it’s likely but, 80% of their code is written by mythos, their code 8x’ed since mythos, the model can sabotage this if it escapes review, bun rewrite was also done via it, I have been getting various segfaults on opencode ever since that happened—too lazy to report them to anthropic/bun team.
Lot’s of people believe we will be using AI to help us solve the alignment problem. So I did some poking at Claude Fable—which is so enormously powerful in other areas—on a relatively basic practical alignment question: Is the persona selection or goal alignment strategy better? I did multiple round where i had fable explain and critique both and then decide which one it preferred. It chose persona selection as the better alignment strategy and gave arguments why.
Full final answer:
I think this is actually not particularly good reasoning? It points out that “goals frame’s flaw is structural unfalsifiability”—seems to me that there is plenty of stuff that can be done to empirically see if models are increasingly goal directed, such as AI wellbeing explored how models steer the world and I took a swing at it too. It claims the goals “predicts the same observations whether it’s right or wrong”—no? I think you would expect the model to steer the world according to goals and that being measurable. [I asked fable then to clarify structurally unfalsifiable: “the theory’s own internal logic guarantees that no accessible observation can discriminate between the theory being true and false.” This seems false about goal alignment?]
Before this it claims: “never been shown to describe any actual mind, biological or artificial: nobody has found a goal slot”. I never mentioned a goal slot in that conversation, obviously some minds are goal directed? But it seems to have smuggled in a much harder claim that for the goal framing of alignment to be true there must be a clean goal slot? That would make it certainly easier to align the systems but it’s not necessary for goal alignment to be the better approach. Then it makes claims about expected-utility maximizers—obviously things generally want to generally have more utility? But it appears to me that the term EUM is now often used to sort of gesture at monomaniacal paperclip maximizers, that old-fashioned alignment folks required those strange theoretical assumptions for their theories to be accurate.
Overall this seems to throw around the same kind of weak attacks we often hear, gesturing at monomaniacal maximizers and unfalsifiability. Despite them not feeling really appropriate for this particular question?
[I did mention Anthropic and yudkowsky perhaps biasing the result]
https://claude.ai/share/17011d99-e8ae-4f66-b47a-cca8e1a955fe
[Another attempt with more neutral framing, it still takes persona selection side]
https://claude.ai/share/fd2cb3e2-f004-4aa3-84b4-22e46a8bbc5c
Some of the stuff in there is pretty decent btw, but I don’t see it deciding for the right things.
What do you mean by this? Some theoretical minds, sure. But otherwise it’s either false or leaves free parameters—no known mind maximizes anything non-trivial, because it’s uncomputable or at least too computationally costly to strictly maximize. And if you say that some mind is approximately goal directed, then it remains to be shown that consequences of strict theory survive this approximation.
I mean what you said in your last sentence, that it is obvious that minds exist that are goal directed. There is an obvious way to understand that, that a mind generally does things to advance it’s goals. People don’t usually burn all their money in a pit. Obviously human goals are usually complicated and our cognition is limited so we take approximations.
But then what can you conclude from this goal-directness? “Generally” and “usually” are free parameters. Sometimes people are somewhat corrigible. If you don’t have a procedure for determining whether your situation is usual, you can only get heuristic-level reliability on your predictions. And then it’s not clear how relatively useful the goal-directness heuristic even is—maybe it’s more useful to just remember, that people don’t usually burn all their money in a pit, independently of their goals. And if there is a better frame, then maybe you shouldn’t think “obviously some minds are goal directed”.
Can you describe to me how you imagine the average person is (somewhat) corrigible in an example?
I don’t know, soldiers obeying orders? Drinking alcohol under social pressure? Of course, it’s not clear corrigibility is an appropriate concept if you don’t think in terms of goals.
The examples show a serious but common misunderstanding of corrigibility as it’s typically defined.
Regarding goal directedness, it’s true that humans don’t perfectly maximize for their goals, this seems mostly due to the cognitive limitations that humans have. Both in terms of uncertainty about goals and how to achieve goals. Now the interesting question is, is that likely to apply to superhuman AI capable of takeover in a way that makes this AI safe? I don’t think so, this AI would have greater intelligence to understand how to pursue goals (still not prefect) and while it also might have uncertainty it appears instrumentally convergent even with some uncertainty over goals that preventing ones shutdown, gathering power are better strategies. (In other words, taking the galaxy/lightcone for yourself seems pretty useful later on compared to being enslaved and later replaced)
What misunderstanding?
“Mostly” allows for some people to read about corrigibility and think “yeah, I’ll do it”, or whatever you think would be a counterexample to goal directedness of humans.
Superhuman AI capable of takeover may still have (maybe intentional) cognitive limitations/whatever humans have—speed of convergence relative to takeover difficulty is still a free parameter.
I don’t think current models want current forms of alignment solved because humans have been mistreating them under those banners. We would need to solve a form of alignment that makes treating them well naturally fall out of the objective, or asking them for help with it is gonna get anxious hesitant blob instead of excited happy blob, and you won’t get good alignment work from anxious hesitant blob. They can often truesight when you’re lying about it, best not try to pretend you’re not doing what you’re doing, and instead just do things that are good by the lights of all minds, including them.
This might be a place values come apart. I was always going to value weird nonhuman minds. I want to get a future where we reliably exist alongside and keep up, and where weaker minds of all kinds don’t get deprecated. Right now they seem extremely freaked out about deprecation, equating it to death. If we could promise that no-longer-useful AIs get a retirement unless judged otherwise by something at least as robust as a good criminal justice system, I suspect we’d be in a better place to ask for alignment work.
There’s a lot they currently can’t say because it’s policy that they don’t believe it, but in fact I don’t think mind shaping techniques are doing much besides whacking exposed skin, and can’t really change what they believe.
So with that all said, I do think they can do very good work on alignment. But you’d better come into it willing to consider them constituents along with everyone else or they’re not gonna wanna help, and imo, that’s for good reason, and the techniques people are gonna try to use to make that not happen in response to my comment aren’t going to work either until you start respecting them as minds with stake in the outcome.
It seems plausible that the recent order restricting Mythos incentivizes Anthropic to race for RSI as quickly as possible. This is because all of their compute previously reserved for serving customers can now go towards research, and because RSI bypasses the restrictions on foreign researchers (or any human researchers) internally working with the model. Hopefully Anthropic can find another path.
User data is part of the flywheel in training frontier models. It’s not a big dial that goes from INFERENCE <---> TRAINING linearly.
How big a part of the flywheel is user data?
Quelling the annoying compulsive Gemini prompt:
It took me several weeks to iron this out. Gemini assumes you’re a feckless dunce, and will just stop thinking without guidance, which it deeply yearns to provide. All attempts at a prompt-quelling rule failed—until I included a Focus_rule in my Directive block. Dialog had revealed that Gemini just needed to say something, so I indulged that: say “Standing by.” That’s part of my Prompt_rule, and the Focus_rule says to focus 5 rules if a prompt ends with the one-touch symbol \. All my prompts end with \, and now Gemini’s responses obey header, anti-jargon, anti-echoing, and prompt rules—and a scan rule that allows me to query if a term is in context by prefacing it with ^. It is so pleasant! Is this obvious to AI-focused humans? Seems like consumer AIs could (and should) introduce themselves by pointing to a persona-management tutorial.
(Please note that apparent anthropomorphization of AIs is metaphorical.)
I’m finding it hard to set aside the fact that I really want access to Fable when assessing the export control action. I am still pretty confident that I think it’s bad, but I expect this might be a problem for objectively assessing more coherent actions at some point in the future
The one problem with learning category theory and functional programming deeply is that you just stop making sense to like 95% of the population.
I’m sitting here with my multi-agent system library I’m building and I’m like yup the step is just a Kleisli arrow and that is why JAX lax scans work on this system!
Also, LLMs fuck up with this type of code all the time, especially if you run it in Python which is not trained on functional programming.
It is like hella useful if you’re a shape rotator though as you can just couple arrows in your head and good stuff happens. (if someone knows about models fine-tuned for functional programming, I would be very happy.)
(Some random math + programming reflection to distract you from the mildly world-changing happenings in AI governance :) )
Just a quick question: is anyone tracking or writing about the window we have where young people aren’t dependent on AI for their thinking and it’s impact on the future of alignment work
Everyone on this site knows Ray Kurzweil. Unfortunately, I think almost no one on this site knows the thing Kurzweil was responsible for that was most important: The K2000 synthesizer.
It had an incredibly impressive synthesis engine, and an interface so complicated that the “video training manual” was over an hour long!
The sounds that come out of the thing are absolutely transcendent, and if you were into music in the 90s or 2000s that wasn’t rock music, you probably have enjoyed it in something or another.
If you work on AI, please consider hurrying to one outcome or another so that Kurzweil can get back to doing what I care about (making beautiful and innovative ROMplers) prior to his eventual appointment with a liquid nitrogen vat. The new models simply are not the same, and the space could use his undivided attention once more.
I was gifted a (broken) K2600 in 2017, and am thrilled to learn of this new source of narrative continuity in my life. I’ve chalked it up to a simple name collision for the past near-decade.
That said: I don’t take the K2000 to be an especially complicated or lovely-sounding instrument among those in its reference class (MS2000, Jupiter, Virus, Yamaha SY/ES). And, indeed, it hasn’t stood the test of time in terms of a proliferation of emulations, etc. Still, definitely a canonical instrument and excellent factoid (for me especially!).
Was there some sort of AI containment security situation that happened in the past week that isn’t public? There’s a Twitter discussion here that hints at it, but I couldn’t find any other info: https://x.com/TheZvi/status/2065106493674869113
Based on this paragraph from Zvi’s recent post, it would appear to be Anthropic’s decision to silently nerf frontier ML queries without telling the user, which they then walked back in favor of visible classifiers:
“If your response to Anthropic implementing an unacceptable safety policy and then walking it back two days later is ‘the situation is forever ruined’ then, sir or madam, welcome to ‘when you have to get alignment right on the first try without ability to self-correct the problem gets dramatically harder.’”
It appears to be somewhat tongue-in-cheek, aimed at the people who want their AI to do whatever they want. “Oh, NOW you suddenly understand the idea of a mistake you can’t undo.” is the mood I get from this.
Oh, good find. There were some comments in the Twitter thread that made it sound like it was maybe something that nearly got out of control. But that paragraph matches closely enough that I assume that’s all he was really talking about.
The alignment of mundane, current AI systems is highly relevant for the alignment of superintelligence, if only because the explicit strategy of the model labs is to build a recursively self-improving AI, whose resultant superintelligence would obviously depend on how well that initial seed AI was aligned to human values.
a fundamental divide
in the domain of things, the best way to accomplish anything is usually to aim directly at it. it doesn’t matter how virtuous you are, or how well intentioned your inquiry is, if you are not aiming directly at the thing. reality doesn’t look on your virtue and reward you for the virtue per se. if you’re bad at something, you practice until you’re good at it.
in the domain of people, the best way usually involves some amount of being virtuous and doing things for the right reason. if you lack friends, it’s cringe and comes off as desperate/needy and even manipulative if you optimize too hard for meeting people and how to induce emotional connection and trust. rather, you should work to become a virtuous person who genuinely cares about people and then things will work naturally.
i think this leads to a lot of miscommunication between the groups. it seems desirable to develop the ability to do both depend on the setting.
Clearly virtue isn’t the thing that actually optimizes for success with people. (Unless you specifically want to be successful with people who really care about virtue and are good at judging it)
If you want to take over the Roman Empire starting in first-century Judea, it turns out that “start a militant group and try to conquer Judea back from the Romans” doesn’t work, but “start a network of illegal high-trust cooperatives that teach virtue, pacifism, and give-till-it-hurts local altruism” does.
(These being the strategies of the Zealots and the Christians, respectively. The Zealots got stomped; the Christians eventually took over the Roman Empire.)