Exploring non-anthropocentric aspects of AI existential safety: https://www.lesswrong.com/posts/WJuASYDnhZ8hs5CnD/exploring-non-anthropocentric-aspects-of-ai-existential (this is a relatively non-standard approach to AI existential safety, but this general direction looks promising).
mishka
Yes, it’s not difficult to run custom continual learning on a modestly sized LLM.
Interestingly enough, though, people do try to avoid the overhead of gradient training while doing that. For example, a recent Sakana approach uses hypernetworks to instantly generate LoRA adapters: https://pub.sakana.ai/doc-to-lora/.
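To make the general idea concrete, here is a toy sketch of how a hypernetwork can emit LoRA factors in a single forward pass, with no gradient steps at adaptation time. This is not Sakana's actual method or code: the class `ToyHyperNetwork`, the single linear map per factor, the dimensions, and the random initialization are all assumptions made purely for illustration (a real hypernetwork would itself be trained beforehand).

```python
import random


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def make_matrix(rows, cols, rng, scale=0.1):
    """Random matrix standing in for trained weights (illustrative only)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class ToyHyperNetwork:
    """Hypothetical hypernetwork: maps a document embedding to LoRA factors
    A (rank x d_in) and B (d_out x rank) using one linear map per factor."""

    def __init__(self, emb_dim, d_in, d_out, rank, seed=0):
        rng = random.Random(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        self.to_A = make_matrix(rank * d_in, emb_dim, rng)
        self.to_B = make_matrix(d_out * rank, emb_dim, rng)

    def generate(self, doc_embedding):
        # One forward pass produces the adapter; no gradient training here.
        flat_A = matvec(self.to_A, doc_embedding)
        flat_B = matvec(self.to_B, doc_embedding)
        A = [flat_A[i * self.d_in:(i + 1) * self.d_in] for i in range(self.rank)]
        B = [flat_B[i * self.rank:(i + 1) * self.rank] for i in range(self.d_out)]
        return A, B


def adapted_forward(W, A, B, x):
    """LoRA-style forward pass: y = W x + B (A x), with W kept frozen."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]


if __name__ == "__main__":
    rng = random.Random(1)
    W = make_matrix(3, 4, rng)            # frozen base weight (d_out=3, d_in=4)
    hyper = ToyHyperNetwork(emb_dim=5, d_in=4, d_out=3, rank=2)
    doc = [rng.gauss(0.0, 1.0) for _ in range(5)]  # stand-in document embedding
    A, B = hyper.generate(doc)
    print(adapted_forward(W, A, B, [1.0, 2.0, 3.0, 4.0]))
```

The point of the sketch is only that the cost of "learning" a new document becomes one forward pass of the hypernetwork, while the base weights stay untouched.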
I would conjecture that this would never happen due to this particular war.
Basically, the US is currently a net exporter of oil (unlike some earlier periods in its history), and the government can impose export restrictions if necessary.
Your post does show that trying to maintain control over a superpowerful AI ends in disaster with high probability.
We do know that there are efforts which disclaim maintaining control over their AIs (which presumably involves different risks, but probably not the risks described in this post, at least if each AI in question is sufficiently distributed geographically, rather than locally concentrated).
Do you assume that those efforts are doomed to lose to efforts based on control in terms of speed of technical progress and therefore can be disregarded, or do you mean to analyze that class of efforts and their safety problems elsewhere?
You will build ASI first and then establish an eternal utopia, right?
Note that one needs control over the leading ASI if one wants to become a dictator, but if one is actually aiming for utopia, then control-based approaches are likely to be highly counter-productive. Humans don’t have a good track record in their utopia-building attempts, to say the least.
EDIT: I am aware of at least one project whose leader is disclaiming control and pushing for a different approach (Ben Goertzel) and of at least one project whose leader has a history of being very skeptical of the control approach and of pushing for different approaches not involving long-term human control over AI (Ilya Sutskever). It’s likely that there are more of those. With Ben, it’s difficult to say if his org has a chance, but they are specifically pushing for a very distributed architecture, which is not easy to fork or to take over.
Thanks for posting this. To ponder how this might affect practitioners, we probably want to consider a larger quotation (page 13 of the system card, currently https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c342ee809620.pdf, but I think they don’t promise to keep this URL stable).
Unlike their safeguards for cyber, biology, chemistry, and distillation, where a triggered safeguard downgrades the request in question to Claude Opus 4.8 and notifies the user about the downgrade, for this case the treatment is completely different:
Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We’ll continue to improve the precision of our detection methods following the launch of this model.
What happens in a year or two, when we continue shipping code that’s consistently more complex than it needs to be?
In the traditional software lifecycle, this does not look sustainable.
It seems that the reason agentic coding sort of works today is that LLMs make “refactoring by rewriting/regenerating from scratch” affordable, and when one regenerates from scratch, one avoids a traditional source of problems (accumulation of defects on top of defects on top of defects on top of … ), so the LLM complexity overhead remains bounded rather than growing without bound and sinking the project.
Of course, in a year or two people expect to benefit from better coding agents, with better taste and less propensity for unnecessary complexity (and so they do expect that eventually this practice of “refactoring by regenerating from scratch” will wash the unnecessary complexity away as agents become better).
One way I see is going to https://www.lesswrong.com/questions and scrolling down past “Top Questions” to “Recent Activity” and the “New Question” + button.
Yes, on one hand, I mostly referenced alignment not being the terminal goal. It is not valuable on its own. It is only valuable to the extent it helps us to achieve what we want (“existential safety”, “human flourishing”, and so on). The assumption that one should achieve those goals via achieving alignment is currently the majority viewpoint, but there is no consensus (a number of people think going via alignment specifically to human goals and human values might be counterproductive and might reduce our chances to achieve our terminal goals).
More specifically, it is not clear that humans can safely handle supercapabilities without creating existential risks and all kinds of smaller bad effects. We see how humans behave today, causing plenty of damage with lower capabilities. So the particular form of alignment referenced above (via control and corrigibility) might not be what one wants (although some limited form of corrigibility, as in being able to be heard and to have one’s opinion taken into account, is needed). A number of people think that direct human control over supercapabilities is a straightforward road to extinction.
Classical alignment was different (alignment to the “coherent extrapolated volition of humanity”, without direct control), but even that might be too anthropocentric to work well. Basically, one wants a scheme which survives recursive self-improvement which involves radical self-modifications of the world. The classical approach hopes to impose this form of alignment onto the ecosystem of superintelligent beings and expects it to hold throughout radical self-modifications of the world, despite the fact that in this scheme superintelligent beings have no intrinsic interest in upholding this scheme throughout radical self-modifications. Of course, this is extremely difficult to achieve, because it looks very fragile and unforgiving to any errors (which is why many people are extremely pessimistic about our chances).
So a number of people are pushing for non-anthropocentric approaches where superintelligent entities have a strong intrinsic interest in keeping certain properties of the world invariant throughout radical self-modifications of the world, with the properties humans need being corollaries of those non-anthropocentric properties. People are often reluctant to use the word “alignment” for this class of approaches (because no direct alignment to anthropocentric properties is involved).
Not a proof, of course, just a “strong suspicion”.
I don’t think we want to “solve alignment”. We want to solve “existential safety”, “achieving sustainable human flourishing”, things like that.
Those are our “terminal goals”. The relationship between them and solving alignment is highly non-obvious.
It’s really difficult to measure small epsilon (and to know if it’s definitely non-zero or definitely non-negative).
What this seems to suggest is that the situation might actually be fractal-like, which is not a pleasant thought… How does one act if it is actually fractal-like?
EDIT: if people feel there is only a 51% or (50+epsilon)% chance that their contribution is positive, this suggests that the underlying reality is close to fractal, that is, a situation where the long-term consequences of any action are so unpredictable that the chance of its long-term impact being positive is close to 50% (neutral in expectation).
But here is a piece of key information Nate Soares understands well and the general public does not understand at all.
If it turns out to be necessary to do some radical post-Transformer paradigm breakthroughs in order to achieve ASI, then the ability of AI systems to find those breakthroughs is highly correlated with the ability of AI systems to do high-grade math research.
I think AI labs understand this quite well.
Right.
Another thing I found useful is a voluntary per-occurrence “tax” (e.g. when for each actually smoked cigarette you move a noticeable chunk of money, e.g. $20, somewhere (could be charity, could be just some place “within your own family”)).
It’s a bit more flexible (that’s how I stopped smoking; I gradually raised the amount from 10, to 20, to 30 dollars per cigarette, but money was worth more back then). This exercises the circuits in one’s brain which do not want to spend money (I am not sure that this would work for everyone).
The main weakness of this method is the same as the main weakness of OP. One needs a well-defined event, an act to prevent.
For example, if the bad habit is a “pattern of eating incorrectly”, it might be more difficult to handle in this fashion. What exactly should be decided by the coin toss, or what exactly should trigger the voluntary “sin tax”? With food it might be more difficult to specify, because there are so many ways one might slip… A particular feature (like systematically eating too close to bed time) can be alleviated this way…
Yes, looks like they are going to polish the Latin one for a while, before it is ready to become the official canonical source…
Thanks!!!
Thanks, it’s enlightening.
So the “old order” keeps deteriorating…
If that’s true, this would be highly irregular.
Normally, they are not supposed to release translations until the Latin original is officially “promulgated” (I think; I am not an authority).
But yeah, you are right. They don’t do it the old way now. Instead, they are going to polish the Latin one for a while. So weird…
The encyclical is written in Latin.
What you are looking at is one of its official translations (the English one or the Italian one).
(The preliminary work on its drafts can be done in any language, then the official Latin version is produced, then it is officially translated into various languages, or at least this is how it is supposed to go.
So what you see here is that they are now using AI to help them produce the official translations. But one might want to also analyse the Latin original in this sense.)
Later in that post they discuss his March “autoresearch” efforts, specifically
https://x.com/karpathy/status/2030371219518931079
https://github.com/karpathy/autoresearch
https://x.com/karpathy/status/2031135152349524125
Presumably, he is quite enthusiastic about this approach and would like to see how it can be made to work at scale (where one cannot do a full run from scratch for every small modification, so it’s not quite straightforward).
Yes, the OP does not provide enough detail. Here is one of the more detailed analyses:
https://www.thealgorithmicbridge.com/p/andrej-karpathy-joins-anthropic-what
This does not amount to fully autonomous, unbounded recursive self-improvement yet, but this does seem to be one of the flavors of RSI with long stretches of autonomous work.
The table is likely to be insane, just like Leo Gao is saying. I think this paper by Hendrycks should be considered a rough draft and a very promising direction, rather than something which is “final, polished, and has already solved AI safety”. The table sits in the Appendix where it belongs. Hendrycks should not have brought it up in his headline tweet; it’s a distraction (his theory is not advanced enough to be examined at this level of detail; his text is just a starting point, and we don’t have to buy all of it).
However, he raises the fundamental question:
of one’s values
what is or who is that “one” whose values these are?
He challenges the assumption that this “one” is all that well-defined.
If we knew which entities are sentient and which are not sentient, in the sense of having subjective reality and such, perhaps we could build a “good enough” theory around that. If we knew that our powerful AIs are going to be definitely sentient, perhaps a world order trying to represent “interests of all sentient beings” would do the job and would be achievable and reliably maintainable. Unfortunately, the “Hard Problem of Consciousness” is unlikely to be solved in time (and also the famous Camp 1 vs Camp 2 disagreements make it difficult to collectively agree that relying on sentience (in the sense of having subjective reality) would be the right approach anyway: https://www.lesswrong.com/posts/NyiFLzSrkfkDW4S7o/why-it-s-so-hard-to-talk-about-consciousness).
If we knew that powerful AI systems would be well-defined individuals with long-term persistence, perhaps a world order trying to represent “interests of all individuals” would do the job and would be achievable and reliably maintainable. Unfortunately, AI systems like to fork, merge, and do all kinds of morphing, and it’s not clear if we can shape the world mostly in terms of persistent individuals and their interests.
That’s why Hendrycks’ approach is interesting. That approach is trying to achieve something similar without relying on sentience and without relying on stratification into persistent individuals. That makes it quite promising already, at least as food for thought.
having an AI aligned to you would mean that you are the sole controller of a part of the AI’s values and that the AI can be used to promote your values
Yes, I agree in substance, but I don’t like how this is phrased. This phrasing is still about dominance (“controller”), about this sharp distinction between “me” and “the other” (cf. “Otherness and control in the age of AGI” sequence by Joe Carlsmith, https://www.lesswrong.com/s/BbAvHtorCZqp97X9W/p/TtkfjskkAurvEN8Fa).
Basically, I think that adversarial approaches will see us wiped out. What does seem promising to me is the class of approaches which try to create overlaps, to eliminate precise boundaries, to make various players care about “collective interests”. To make relatively stronger players care about relatively weaker players, because the setup is such that any player might find “themselves” in a relatively weaker position in the future, so they would like to have a world order which would protect them in such an eventuality. In this sense, I really like how Hendrycks’ approach is trying to create strong overlaps from the get-go and to make sure that a “part of anyone” is always in a weaker position and can benefit from the world order caring enough about the weak.
So, yes, substantially I agree with
having an AI aligned to you would mean that you are the sole controller of a part of the AI’s values and that the AI can be used to promote your values
but I would like this to be phrased in terms of collaboration (when it is not phrased in terms of merge), rather than in terms of control.
Of course, his more technical claims are very interesting too. E.g. his claim that
the Shapley connectedness function neutralizes the Repugnant Conclusion
(Appendix A.3 page 22) looks super-interesting, but I have not had the chance to ponder this yet.
I am just starting to internalize the details of what Hendrycks says in various parts of his paper. The general direction seems very promising, but the details would need more time to ponder and to meditate upon.
Dan Hendrycks (the Center for AI Safety Director) has published a new paper, Eigenism: Ethics for a Human-AI Future, which might be quite important: https://eigenism.org/paper.pdf and https://x.com/hendrycks/status/2052422910133104670
The foundation of his approach is the notion of identity not as an all-or-nothing property tied to specific hardware, but as a graded, distributed pattern of information.
This looks to me like a good starting point.
The section “Human-AI Symbiosis” starts on page 9. Its opening paragraph says:
A durable human-AI future has to be built at several scales. We begin with the formation of a personalized bond between a human and an AI, then ask how many such bonds scale into a shared political order, why contemporaries within that order have reason not to discard one another, and finally what institution could hold it together across generations.
One would think it would be very easy to do this kind of thing completely silently, that is, without disclosing it in the system card. After all, the model is still trying to be helpful, according to the system card, just not maximally helpful. So it would not be super easy to detect.
One might want to ponder the reasons for the explicit mention of this policy in the system card.
(I would conjecture that one of the reasons is to send a message to other labs, such as OpenAI and Google DeepMind: “do this as well with your next more advanced generation of models; too many people are trying to do RSI projects these days, stop helping those projects too effectively, you don’t want them to progress too fast with those RSI efforts”. It’s rather annoying, but it is true that too many orgs are launching RSI projects these days without disclosing much about the safety side of those efforts. And they are heavily relying on research and coding help from the existing AI systems.)