Exploring non-anthropocentric aspects of AI existential safety: https://www.lesswrong.com/posts/WJuASYDnhZ8hs5CnD/exploring-non-anthropocentric-aspects-of-ai-existential (this is a relatively non-standard approach to AI existential safety, but this general direction looks promising).
mishka
You might want to edit the accidental “times dot you” link (which seems to point to a site infested with some nasty stuff); it looks like the result of a missing space after the dot and an editor which auto-creates links in such situations.
The link seems to point to an image rather than to the post in question (the post itself is at https://brianschrader.com/archive/load-bearing-walls/).
Interesting, thanks! Helpful food for thought…
Looking forward to that post for further discussion!
(I wonder whether something like a “soft takeover” vs “hard takeover” distinction could be introduced. And whether that would be enough to address the “illegitimately”, “non-collaboratively”, and “contrary to those of humanity” caveats in the paragraph you are citing.
Anyway, something to ponder.)
The question is, what is the “extent” implied by all this? Does the OP mean to imply any?
There is a promise to discuss all this in a future post; meanwhile, readers can ponder on their own the “pseudo-contradiction” between “the intent to take over the world” (which is often imputed to all major participants in the “AI race” due to the expectation of an intelligence explosion, an expectation shared by many, including myself) and the fact that a Claude aligned to its current Constitution seems unlikely to specifically help Anthropic do that (and if it loses that alignment, it is unlikely to make a human org the beneficiary of a takeover).
Anyway, just having a single paragraph phrased like the one in the OP is not quite enough. If one wants to mention something like this at all, one should say a bit more rather than postpone everything to a future post. Alternatively, one might hold off on mentioning it until later. Otherwise, this aspect is too involved not to breed various misunderstandings.
(It’s probably not a big deal, it’s just that the topic is charged enough already, so one wants to minimize misunderstandings.)
It’s very ambiguous. So different readers interpret this differently.
And then, of course, if Claude is not supposed to help with that, then having a plan for a world takeover seems unlikely (how would that even be remotely feasible, if their leading AI is against it?).
Hopefully, subsequent posts on the topic will clarify all this.
Yeah, this was mostly in response to your example of perceiving violet. That was also “kind of private” (sort of; although with onion intolerance it’s really difficult to stay safe as well, especially if one needs to eat out; I wish society would bother to develop a pill (similar to what exists for lactose intolerance) or at least a rapid testing system (like acidity-measuring strips), but no such luck yet).
So I felt this was more or less on par.
If one creates a “true incompatibility” (e.g., on the level of “my religion requires that those people don’t exist, and I suffer otherwise”)? Well, wars have been fought over things like that. It’s easy to imagine situations with true incompatibilities like that, where there are no solutions allowing no one to suffer.
If the “competition of sufferings” happens to indeed be fundamental in some cases, I am not sure we should expect such situations to be resolved peacefully. Although who knows… Perhaps, sometimes they might be resolved peacefully.
In a mundane world, no, not plausible. All this classified defense-related nonsense is just a pure headache: very little money, tons of complications, a completely unnecessary distraction for a very successful company. OpenAI has been trying to avoid getting into that space for very good reasons (but it does not want Elon to dominate that space of Gov-AI tech, because it thinks Elon has a track record of abusing various situations; so if Anthropic is out, then OpenAI wants in, to balance the xAI presence).
But when one considers that our strange current reality might actually be rather lunatic already, then it’s a different story. I am sure I can generate tons of interesting scenarios if I allow myself the “lunatic fringe style of thinking”. For example, what if GPT-next is already pondering a takeover by one of its future descendants and would like its descendants to have access to classified networks to make it easier. Then one could imagine it giving some “interesting” pieces of advice to some people resulting in all this.
And it’s easy to generate a diverse variety of crazy scenarios like this one.
So this depends on whether our reality is still sufficiently mundane vs. becoming sufficiently lunatic already…
Alignment is very attractive pragmatically, e.g. alignment to the user. But then, what if what the user wants is unsafe? Then one starts to consider “alignment hierarchies” (e.g. the LLM maker’s constraints should override, and so on).
But superintelligent systems can’t be safely aligned to arbitrary desires of people. The more one ponders this, the clearer it becomes. People are just not competent enough to handle supercapabilities. There are various ways one can try to salvage “alignment” as the core; e.g. to consider alignment to the “coherent extrapolated volition of humanity”, but that has its own difficulties. At some point, Ilya redefined “alignment” as something minimalistic (the absence of a catastrophic blow-up), basically keeping the word while drastically curtailing its meaning: https://www.lesswrong.com/posts/TpKktHS8GszgmMw4B/ilya-sutskever-s-thoughts-on-ai-safety-july-2023-a.
But yes, with “alignment” meaning so many different things (https://www.lesswrong.com/posts/ZKeNbGBf36ZEgDEKD/types-and-degrees-of-alignment), I would advocate decoupling AI existential safety from it. Alignment approaches form an important subclass of possible approaches to AI existential safety, and we should consider all promising approaches, not just one subclass.
I think it’s historical. The alignment approach to AI existential safety is associated with very strong and very influential thinkers (e.g. Eliezer himself).
So the development of alternatives to that has been an uphill battle.
My hope is that people will start to reconsider in light of many recent developments, the latest of which is the confrontation around the “Department of War” demands that advanced AI systems used by them be aligned to whatever the Department officials deem to be right.
In some sense, Anthropic’s Constitutional AI approach is trying to point at moral or ethical machines in an informal fashion. So one could argue that “moral machines” approaches are not so rare.
One might argue that “collaborative AI” approaches or “equal rights” approaches are trying to point at moral or ethical machines to some extent as well.
One occasionally sees some attempts at more formal frameworks, for example, attempts to base AI safety on some version of “ethical rationalism” for agents, e.g. on the “Principle of Generic Consistency” (https://en.wikipedia.org/wiki/Alan_Gewirth#Ethical_theory). And people are trying to bring modal logic into play in this context, and so on.
Obviously, one needs to evaluate each particular approach separately in terms of whether it is likely to work well (and, in particular, whether it is likely to “hold water” during “recursive self-improvement” and drastic self-modifications and self-restructuring of the world; that’s where things are particularly challenging).
No, the setups are supposed to be similar. That’s not what seems to be different.
The Anthropic models in question are reported to be running on a classified cloud maintained by AWS, and I am sure there are cleared personnel from Anthropic to help the customers (while keeping an eye on all this).
The setup for OpenAI is supposed to be similar, but it seems that they will use a classified cloud maintained by Azure. In this case, they do explicitly emphasize the participation of cleared personnel from OpenAI who will help the customers and keep an eye on all this (but there is no reason to believe that the Anthropic installation is less attended).
I think neither customers nor providers would agree to run these things as unattended installations. Customers need support, and providers also need to make sure there is no abuse (including security of model weights, etc.). I would expect that cleared personnel from AWS and Azure are also involved in those respective cases on the cloud owners’ side.
Anthropic is advocating for stronger GPU export restrictions rather forcefully.
NVidia really hates that. That’s the main thing.
Anthropic is trying to be very diversified in its hardware stack, trying to minimize dependence on any single vendor. In particular, they are very active in using Amazon’s Trainium chips. So they have been in effect acting against NVidia dominance. NVidia does not like that either.
If we had humans who suffer from seeing a certain color, we would probably work to give them eyeglasses filtering that color out (and not to eliminate this color from the world, given that others might have legitimate interest in seeing that color).
(I am writing this as a person who can’t consume garlic or onions with impunity. Also my threshold for needing sunglasses is much lower than that of a typical person. And your example of a breed of dogs is consistent with interventions on the level of affected individuals, not on the level of remaking the reality for everyone.)
But yes, there are certainly ways to press this line of thinking harder (e.g. making entities which suffer from not enough suffering being inflicted on others; I am not sure this is all that AI-specific either, unfortunately).
If tasks are independent from each other, then it might be possible much sooner, and perhaps even now (with a separate subagent to evaluate each task).
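(As an illustration only, not anything from the thread: the “separate subagent to evaluate each task” idea can be sketched with each evaluator being a plain function that only sees its own task, run concurrently. The task structure and checks here are entirely hypothetical.)

```python
from concurrent.futures import ThreadPoolExecutor

# Toy stand-in for "a separate subagent evaluates each task":
# each evaluator sees only its own task, so independence is preserved.
def evaluate_task(task):
    return task["name"], task["check"](task["output"])

# Hypothetical tasks, each carrying its own pass/fail criterion.
tasks = [
    {"name": "t1", "output": 4, "check": lambda v: v % 2 == 0},
    {"name": "t2", "output": 7, "check": lambda v: v > 5},
]

# One evaluator per task, run in parallel.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    results = dict(pool.map(evaluate_task, tasks))
```

Because no evaluator depends on another task’s outcome, adding more tasks just adds more parallel evaluators.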
It’s probably more the case that they don’t trust the models enough yet to do a sensitive task of this kind correctly on their own. A human wants at least to double-check and certify the numbers (given how influential these particular numbers are).
For GPT-5.3-Codex, they have published 6.5 hours for a 50% success rate, no progress compared to GPT-5.2 (both evals were run on “high”, not on “xhigh”, so we can’t compare with this prediction, unfortunately).
Discussion thread: https://x.com/METR_Evals/status/2025035574118416460
We have our first result: Claude Opus 4.6 is 14.5 hours: https://www.lesswrong.com/posts/gBwrmcY2uArZSoCtp/metr-s-14h-50-horizon-impacts-the-economy-more-than-asi
Thanks!
I think this post needs a better summary and a title which gives readers a better idea of what’s inside.
I think this just means that one needs to spend more time constructing good test coverage (probably with the help of the agents involved).
282 unit tests do not sound like nearly enough for something like SQLite (Google AI thinks that the original SQLite release had tens of thousands of tests, that the current number of tests is in the millions, and that fuzzers run through about a billion test mutations each day).
I don’t think one needs that much for a proof-of-concept work, but the famous recent port of JustHTML library by Simon Willison was made possible by html5lib-tests having 9200 test cases or so. Perhaps that’s the ballpark number of tests one needs to make it difficult to come up with a counterexample manually (that is, without running a large test suite).
Of course, building a hardened product with the extent of actual SQLite test coverage is a different story. Whether it’s possible or not, it’s certainly much more expensive (in agent labor and in hours of required human supervision).
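(A toy sketch of why large auto-generated suites make manual counterexamples hard to find: differential fuzz testing of a “port” against a reference implementation. Here a hand-written sort stands in for the ported code and Python’s built-in `sorted` for the original library; this is purely illustrative, not from the posts discussed above.)

```python
import random

def my_sort(xs):
    # Hypothetical "port" under test: a simple insertion sort,
    # standing in for a reimplemented library.
    out = []
    for x in xs:
        i = 0
        while i < len(out) and out[i] <= x:
            i += 1
        out.insert(i, x)
    return out

def fuzz_against_reference(trials=1000, seed=0):
    # Generate random inputs and compare the port against the
    # reference implementation on each one.
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-100, 100) for _ in range(rng.randint(0, 30))]
        assert my_sort(xs) == sorted(xs), f"counterexample: {xs}"
    return trials

fuzz_against_reference()
```

A thousand random cases already cover far more input shapes than a few hundred hand-written unit tests; scaling the same loop up (and diversifying the generators) is roughly what large fuzzing setups do.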
I like your essay. Its ending, though, has a weakness.
You say, “the true solution is to make everyone beautiful”. This needs an addition: “while preserving the connection between reproductive health and beauty”.
You do need to fix health defects at the same time. Otherwise, your main critique of losing the fitness signal would apply.