Exploring non-anthropocentric aspects of AI existential safety: https://www.lesswrong.com/posts/WJuASYDnhZ8hs5CnD/exploring-non-anthropocentric-aspects-of-ai-existential (this is a relatively non-standard approach to AI existential safety, but this general direction looks promising).
mishka
Yes, I actually think that currently the White House is happy about the ability of US oil companies to capture new markets and to reap some windfall profits and is not overly concerned with higher prices.
But I am pretty sure that if things start getting really bad (and the economic and political price to pay for all this starts to mount), they’ll impose some export controls (the alternative being to stop the war before they want to stop it, and even that might not work, because Iran might decide not to open the strait no matter what). Everything has a price, and the government is not invulnerable, especially if its own narrow political base (the “MAGA”) rebels.
If you are a ‘middle power’ that is not America or China, and you now realize that these decisions will be made mostly without caring about you, what do you do now?
What would even help? Having your own ‘European’ model only helps if it is substantially stronger than Opus 4.8 and GPT-5.5.
We know what some of those “middle powers” are actually doing. Some of their orgs are launching “RSI on a medium-sized compute” projects and expect to have a shot at overcoming the leaders who might be getting too comfortable in their current paradigm.
The most prominent of those efforts is the new Sakana division launched a week ago: https://sakana.ai/rsi-lab/. They explicitly disclaim competing on compute (italics are mine):
As the world enters the era of artificial intelligence, Japan has a unique opportunity to reclaim its position at the frontier of global innovation. However, to achieve global leadership in AI and scientific discovery, we cannot simply stick to the conventional approach of brute-forcing monolithic models. We must leapfrog the current paradigm.
History shows us how Japan’s historical dominance in manufacturing was not achieved through abundant natural resources but by fundamentally redesigning the institution of the factory floor. Through the philosophy of continuous, compounding self-improvement, Japan created systems that achieved more with less.
This same principle applies to intelligence itself. Human cognition did not emerge from limitless resources; it was forged through the open-ended, compounding process of evolution operating under strict constraints. Similarly, building AI in Japan provides the ultimate design constraint. Rather than relying on brute-force scaling, we are driven to pursue elegance, adaptability, and autonomy.
Earlier, some strong-looking “RSI on a medium-sized compute” projects were launched in the UK. The most notable of those is probably Recursive Superintelligence, Inc., which has Jeff Clune, the author of the “AI-generating algorithms” paradigm, and which has even convinced Peter Norvig to do research for them (he left Google in February and started to work for Recursive in March). Recursive scored Series A funding of 650 million at a 4.65B valuation in May.
Unfortunately, the safety postures among those projects are very uneven and are quite likely to be insufficient.
It is valid to conclude, as Liron Shapira does here, ‘I want actions like this so much in general that I am happy to see the precedent set even if the implementation was both spiteful and extremely stupid.’
I do not share that opinion.
You reference the wrong tweet by Liron here. You should reference this one: https://x.com/liron/status/2065607736822264300
The one you reference is a later retweet/reply in that same thread agreeing with a much more nuanced take by Eliezer:
So please stop tweeting about how I must be celebrating this. I’m not one of the kids who immediately goes into overacted victory paroxysms about any hits on a perceived enemy. I care about the effect on where things end up a year later, and that’s a little harder to know the first day, you know?
:-) But are you from Camp 1 or from Camp 2? :-) That should be the starting point of a discussion on this subject :-)
If you are from Camp 1, I don’t have much to say. How to have generally fruitful intercamp communications on this subject is an unsolved problem at the moment, and we don’t even know if the subjective realities of people from different camps are similar. In the comments to that post I referenced, Carl Feynman, who is a prominent representative of Camp 1, conjectures that the subjective realities of different people differ more sharply than is usually assumed; I don’t know to what extent his guess that intercamp differences are due to that is correct, but this is an interesting question to ponder.
But if you are a fellow Camp 2 member, then what could be a simple explanation of the structure of the variety of qualia I am observing?
We don’t even have a good description of qualia, and we don’t understand the nature of the space of my subjective colors, and so on. We might have some conjectures, some attempts to approach that, but I don’t see much progress.
I have some ideas on how we might be able to actually make some progress here, but that would be a longer text.
Now what consciousness is for is a different story. This indeed seems to have a decently simple explanation (roughly speaking, it seems to facilitate better “generalized introspection”, which improves predictive power, which improves survival and fitness; so functionally speaking, this does seem to be a plausible story).
So if one wants an evolutionary explanation of why it is selected for, this looks like a plausible story. And this part might be where the two camps more or less agree.
But, from a typical Camp 2 point of view, this does not solve the mystery of subjective phenomenology; it just explains why it might be evolutionarily helpful to have something like that.
These days it’s good to be aware of the https://www.lesswrong.com/posts/NyiFLzSrkfkDW4S7o/why-it-s-so-hard-to-talk-about-consciousness post by Rafael Harth, which describes the division of people into Camp 1 and Camp 2 with regard to the issues of phenomenological consciousness and qualia, and of the discussion in the comments to that post.
Yes, reputational risk would be high, especially with Anthropic being “the good guys” compared to other labs, so they have more to lose reputation-wise.
Yet the conversation around labs supposedly “nerfing” their models (mostly as they try to cut inference costs) is a constant thing (and I think people do expect this “nerfing phenomenon” to be true to some extent, and to be the reason why general performance is somewhat “zig-zag”, despite the general upward trend).
And with this “expectation of occasional nerfing”, I don’t know if such “nerfing” being application area-specific would make it much worse. Sure, in the absence of constant complaints about “nerfing”, the risks would be too high. But with so many people believing that “nerfing” is already a strong factor in the overall picture, I honestly don’t know…
Indeed. And this involves a trade-off with a much higher rate of false positives, although they are hoping to get it down in the future:
https://x.com/ClaudeDevs/status/2064949876463645026
I have already got a downgrade to Opus 4.8 while analyzing potentially incorrect math in a published ML preprint (I did ask the model to try to fix that math). But I would not have known without the explicit notice; the quality of the work was high.
One would think it would be very easy to do this kind of thing completely silently, that is, without disclosing it in the system card. After all, the model is still trying to be helpful, according to the system card, just not maximally helpful. So it would not be super easy to detect.
One might want to ponder the reasons for the explicit mention of this policy in the system card.
(I would conjecture that one of the reasons is to send a message to other labs, such as OpenAI and Google DeepMind: “do this as well with your next more advanced generation of models; too many people are trying to do RSI projects these days, stop helping those projects too effectively, you don’t want them to progress too fast with those RSI efforts”. It’s rather annoying, but it is true that too many orgs are launching RSI projects these days without disclosing much about the safety side of those efforts. And they are heavily relying on research and coding help from the existing AI systems.)
Yes, it’s not difficult to custom-run continual learning on a modestly sized LLM.
Although, interestingly enough, people do try to avoid the overhead of gradient training while doing that. For example, a recent Sakana approach uses hypernetworks to instantly generate LoRA adapters: https://pub.sakana.ai/doc-to-lora/.
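To illustrate the general idea only (this is a toy sketch, not Sakana's actual implementation): a hypothetical "hypernetwork", here just a fixed random linear map, turns a document embedding into low-rank factors B and A in a single forward pass, and the adapted weight is W + BA, with no gradient steps on the base weights. All names, sizes, and the random linear map below are my assumptions for illustration.

```python
import random

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def make_hypernetwork(embed_dim, d_out, d_in, rank, seed=0):
    """Hypothetical hypernetwork: one fixed linear map from a document
    embedding to the flattened LoRA factors B (d_out x rank) and A (rank x d_in)."""
    rng = random.Random(seed)
    n_params = d_out * rank + rank * d_in
    H = [[rng.gauss(0.0, 0.1) for _ in range(embed_dim)] for _ in range(n_params)]

    def generate_adapter(doc_embedding):
        # Single forward pass: no training loop, no gradients.
        flat = [sum(h * e for h, e in zip(row, doc_embedding)) for row in H]
        b_flat, a_flat = flat[:d_out * rank], flat[d_out * rank:]
        B = [b_flat[i * rank:(i + 1) * rank] for i in range(d_out)]
        A = [a_flat[i * d_in:(i + 1) * d_in] for i in range(rank)]
        return B, A

    return generate_adapter

def apply_lora(W, B, A, scale=1.0):
    """Return the adapted weight W + scale * (B @ A); W itself is left untouched."""
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

# Toy sizes (assumed): a 4x6 base weight, rank-2 adapter, 3-dim document embedding.
d_out, d_in, rank, embed_dim = 4, 6, 2, 3
W = [[0.0] * d_in for _ in range(d_out)]
generate_adapter = make_hypernetwork(embed_dim, d_out, d_in, rank)
B, A = generate_adapter([1.0, 0.5, -0.25])
W_adapted = apply_lora(W, B, A)
```

In a real system the hypernetwork itself would of course have to be trained beforehand; the point is only that adapting to a new document afterwards costs one forward pass rather than a fine-tuning run.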
I would conjecture that this would never happen due to this particular war.
Basically, the US is currently a net exporter of oil (unlike some earlier periods in its history), and the government can impose export restrictions if necessary.
Your post does show that trying to maintain control over a superpowerful AI ends in disaster with high probability.
We do know that there are efforts which disclaim maintaining control over their AIs (which presumably involves different risks, but probably not the risks described in this post, at least if each AI in question is sufficiently distributed geographically, rather than locally concentrated).
Do you assume that those efforts are doomed to lose to efforts based on control in terms of speed of technical progress and therefore can be disregarded, or do you mean to analyze that class of efforts and their safety problems elsewhere?
You will build ASI first and then establish an eternal utopia, right?
Note that one needs control over the leading ASI if one wants to become a dictator, but if one is actually aiming for utopia, then control-based approaches are likely to be highly counter-productive. Humans don’t have a good track record in their utopia-building attempts, to say the least.
EDIT: I am aware of at least one project whose leader is disclaiming control and pushing for a different approach (Ben Goertzel) and of at least one project whose leader has a history of being very skeptical of the control approach and of pushing for different approaches not involving long-term human control over AI (Ilya Sutskever). It’s likely that there are more of those. With Ben, it’s difficult to say if his org has a chance, but they are specifically pushing for a very distributed architecture, which is not easy to fork or to take over.
Thanks for posting this. To ponder how this might affect practitioners, we probably want to consider a larger quotation (page 13 of the system card, currently https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c342ee809620.pdf, but I don’t think they promise to keep this URL).
Unlike their safeguards for cyber, biology, chemistry, and distillation, where a triggered safeguard downgrades the request in question to Claude Opus 4.8 and notifies the user about the downgrade, for this case the treatment is completely different:
Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We’ll continue to improve the precision of our detection methods following the launch of this model.
What happens in a year or two, when we continue shipping code that’s consistently more complex than it needs to be?
In the traditional software lifecycle, this does not look sustainable.
It seems that the reason agentic coding sort of works today is that LLMs make “refactoring by rewriting/regenerating from scratch” affordable, and when one regenerates from scratch, one avoids one traditional source of problems (the accumulation of defects on top of defects on top of defects on top of …), hence the LLM complexity overhead remains bounded rather than increasing in an unbounded fashion and sinking the project.
Of course, in a year or two people expect to benefit from better coding agents, with better taste and less propensity for unnecessary complexity (and so they do expect that eventually this practice of “refactoring by regenerating from scratch” will wash the unnecessary complexity away as agents become better).
One way I see is going to https://www.lesswrong.com/questions and scrolling down past “Top Questions” to “Recent Activity” and the “New Question” + button.
Yes, on one hand, I mostly referenced alignment not being the terminal goal. It is not valuable on its own. It is only valuable to the extent it helps us to achieve what we want (“existential safety”, “human flourishing”, and so on). The assumption that one should achieve those goals via achieving alignment is currently the majority viewpoint, but there is no consensus (a number of people think going via alignment specifically to human goals and human values might be counterproductive and might reduce our chances to achieve our terminal goals).
More specifically, it is not clear that humans can safely handle supercapabilities without creating existential risks and all kinds of smaller bad effects. We see how humans behave today, causing plenty of damage with lower capabilities. So the particular form of alignment referenced above (via control and corrigibility) might not be what one wants (although some limited form of corrigibility, as in being able to be heard and to have one’s opinion taken into account, is needed). A number of people think that direct human control over supercapabilities is a straightforward road to extinction.
Classical alignment was different (alignment to the “coherent extrapolated volition of humanity”, without direct control), but even that might be too anthropocentric to work well. Basically, one wants a scheme that survives recursive self-improvement, which involves radical self-modifications of the world. The classical approach hopes to impose this form of alignment onto the ecosystem of superintelligent beings and expects it to hold throughout radical self-modifications of the world, despite the fact that in this scheme superintelligent beings have no intrinsic interest in upholding it throughout those self-modifications. Of course, this is extremely difficult to achieve, because it looks very fragile and unforgiving of any errors (which is why many people are extremely pessimistic about our chances).
So a number of people are pushing for non-anthropocentric approaches where superintelligent entities have a strong intrinsic interest in keeping certain properties of the world invariant throughout radical self-modifications of the world, with the properties humans need being corollaries of those non-anthropocentric properties. People are often reluctant to use the word “alignment” for this class of approaches (because no direct alignment to anthropocentric properties is involved).
Not a proof, of course, just a “strong suspicion”.
I don’t think we want to “solve alignment”. We want to solve “existential safety”, “achieving sustainable human flourishing”, things like that.
Those are our “terminal goals”. The relationship between them and solving alignment is highly non-obvious.
It’s really difficult to measure small epsilon (and to know if it’s definitely non-zero or definitely non-negative).
What this seems to suggest is that the situation might actually be fractal-like, which is not a pleasant thought… How does one act if it is actually fractal-like?
EDIT: if people feel it’s only a 51% or (50+epsilon)% chance that their contribution is positive, this is suggestive of the underlying reality being close to fractal, that is, a situation where the long-term consequences of any action are so unpredictable that the chance of its long-term impact being positive is close to 50% (neutral in expectation).
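To make the difficulty of measuring a small epsilon a bit more concrete, here is a back-of-the-envelope sketch. It assumes (my assumption, purely for illustration) that one could observe n independent outcomes, each positive with probability 0.5 + epsilon, and asks roughly how large n must be before the edge stands out by about z standard errors. The function name and the default z = 2 are hypothetical choices.

```python
import math

def samples_needed(epsilon, z=2.0):
    """Rough number of independent observations needed before an edge of
    `epsilon` over 50% stands out by about `z` standard errors.

    Near p = 0.5 the standard error of an observed frequency is about
    0.5 / sqrt(n), so we need epsilon >= z * 0.5 / sqrt(n),
    i.e. n >= (z / (2 * epsilon)) ** 2.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return math.ceil((z / (2.0 * epsilon)) ** 2)

for eps in (0.1, 0.01, 0.001):
    print(f"epsilon={eps}: about {samples_needed(eps)} observations")
```

The required number of observations grows like 1/epsilon^2, and for the long-term consequences of one's actions one rarely gets anything close to that many independent observations.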
But here is a key piece of information that Nate Soares understands well and the general public does not understand at all.
If it turns out to be necessary to do some radical post-Transformer paradigm breakthroughs in order to achieve ASI, then the ability of AI systems to find those breakthroughs is highly correlated with the ability of AI systems to do high-grade math research.
I think AI labs understand this quite well.
Right.
Another thing I found useful is a voluntary per-occurrence “tax” (e.g. when for each actually smoked cigarette you move a noticeable chunk of money, e.g. $20, somewhere (could be charity, could be just some place “within your own family”)).
It’s a bit more flexible (that’s how I stopped smoking, and I gradually raised the amount, from 10, to 20, to 30 dollars per cigarette, but money was worth more back then). This exercises the circuits in one’s brain which do not want to spend money (I am not sure that this would work for everyone).
The main weakness of this method is the same as the main weakness of OP. One needs a well-defined event, an act to prevent.
For example, if the bad habit is a “pattern of eating incorrectly”, it might be more difficult to handle in this fashion. What exactly should be decided by the coin toss, or what exactly should trigger the voluntary “sin tax”? With food it might be more difficult to specify, because there are so many ways one might slip… A particular feature (like systematically eating too close to bed time) can be alleviated this way…
The government has yet to demonstrate the ability to restrict internal deployments of unnamed models (not literally unnamed, but without public-facing names).
It would not be difficult to fine-tune or modify a model a bit and have a “formally defensible reason” to call it something else.
Of course, if the government really wants to control this kind of thing (at least for large and highly visible US-based corporations), it can likely do so, but that would take more than serving export control orders.