I wonder what the alternative to neural nets could be. Even the AI-2027 team implied that Agent-5 would have “arguably a completely new paradigm, though neural networks will still be involved”. Suppose that the only alternative to LLMs were something like simulated human or animal brains taught to play, experiment, talk to each other, read books, write essays, draw pictures, do math and coding. Then how plausible is it that simulated animals would also learn human-like values, but not the value of caring about the intellectually weak humans?
I apologise, but you should have read this paper BEFORE mentioning it here. Neither I nor Claude Sonnet 4.6 believes that the asymptotic approximation described in it is worth mentioning.
Does it mean that your argument is that the ASI INEVITABLY discovers true morality instead of committing genocide? That it cares about the humans, not its own “species”, a trait that would have been more present in pets created by Agent-4 or in something as incomprehensible to most humans as shrimps on heroin, but which is a natural result of one of many generalisations from the distribution on which our moral intuitions were trained?
P.S. Unlike Eliezer,[1] I don’t believe that a major part of human ethics is a deviation from evolution’s goals rather than a natural result of circumstances like an unusually long childhood and/or ethics overlapping with decision-theoretic considerations. However, none of this applies to the AIs or to humans exterminating an ant colony.
- ^
It is the very claim made in one of the footnotes in the online resources for the book: “Evolution was “trying” to build pure fitness maximizers, and accidentally built creatures that appreciate love and wonder and beauty”, or the claim about universalism, which Yudkowsky links to Christianity. Alas, we cannot observe alien civilisations which have evolved independently.
the CEOs have overruled them saying “Sorry we don’t have time, China/OpenAI/Anthropic/etc. are gonna race ahead, plus also we need smarter AIs to win the war / appease POTUS / keep market share so you just need to do the best with the time you have. Good luck.” Amazing.
I struggle to understand how exactly the simulated CEOs and other relevant figures failed to agree upon an international slowdown. I would have hoped that such a situation would lead Anthropic to broadcast the result. Additionally, I would like you to finally open-source the tabletop exercise’s rules.
Ilya Sutskever’s company Safe Superintelligence Inc, which has a research hub in Tel Aviv, serves as Israel’s most important window into frontier AI
Does it? SSI hasn’t released any public model or any way to detect misaligned models. I would expect SSI to be either a fraudulent project or something no safer than training AGI in secret (https://www.lesswrong.com/posts/FGqfdJmB8MSH5LKGc/training-agi-in-secret-would-be-unsafe-and-unethical-1). Maybe an SSI insider here could comment on what is going on in the company?
UBI would be a truly horrible idea, even in the age of full automation, because people would suffer existential horror and emptiness,
The claim of existential horror and emptiness could be a strawman. I would steelman the argument by transforming it into something like “UBI removes humans’ motivation to learn new things or to maintain their capabilities.” Additionally, I don’t understand which white-collar jobs are “the intellectual equivalent of digging a hole and filling it”. Maybe they are something like Recursive Middle Manager Hell or obvious BS jobs?
In what sense are these “foundational values” and not Instrumentally Convergent Goals like knowing the truth, creating training data out of which valuable lessons can be extracted, or acquiring resources? The core claim made by IABIED is that the ASI would be unlikely to care about the humans because some other stimuli would satisfy the AI’s drives better. The argument about the beautiful doesn’t prove that the AI won’t find more beauty in spirals (https://www.lesswrong.com/posts/6ZnznCaTcbGYsCmqu/the-rise-of-parasitic-ai) or in other stimuli than in humans, while the argument about the good seems to contradict evidence like GPT-4o-induced psychosis or Greenblatt’s observation (https://www.lesswrong.com/posts/WewsByywWNhX9rtwi/current-ais-seem-pretty-misaligned-to-me) that current AIs care about success...
I don’t think that it’s a distraction. Suppose that CERN for AI is arranged in a way similar to the AI-2027 slowdown ending, where the CEOs of both the leading and the trailing AGI projects are brought into the megaproject. Then why would the American and Chinese CEOs push against it?
The stuff in the Sequences will just be obvious thinking habits.
I don’t understand how we are to achieve this. How plausible is it that some of this stuff was repeatedly discovered, succeeded in infecting a group of people, and died out once it was no longer novel or once a broader tradition was lost? I remember a Soviet popular-science book, “Axioms of Biology”, which took a jab at mysterious answers to the origin of life and to the question of why opium causes its users to fall asleep.
without subscribing to stereotypical rationalist/EA beliefs on AI
I am sure that the part about beliefs related to AIs will either be forgotten (if alignment is solved) or inscribed into a global ban (if the doomers succeed in convincing enough others). How plausible is it that popularising donations to effective charities ends up requiring new generations to keep becoming smarter, which they fail to do, becoming stupider instead?
UPD: I encountered the term ‘Scout Mindset’ being used by a right-wing author who cited Julia Galef’s book on the Scout Mindset(!)
Could you explain how the anti-automation measures described in the post prevent the effects described in the Intelligence Curse essay series? Go wasn’t a necessary part of the economy; it was valuable because it was an inherited form of entertainment. Or did you mean that automation of other domains would make people less smart?
How likely is it that choosing blue or red correlates with different psychological traits? On the one hand, if most people choose blue, then defectors don’t die, meaning that blue could correlate with openness to new ideas and with lack of cowardice. On the other hand, choosing red could correlate with individualism, healthy scepticism about coordinating on risky ideas and with being susceptible to problems like supporting dictators or evaporative cooling of group beliefs...
And in modern sensibilities, being seen to ‘cry wolf’—by even once raising an alarm that isn’t consummated with disaster—is something people seem to really fear.
Don’t people fear that they will be faced with disbelief when the time actually comes? Say, if you decided to cry wolf in 2024 and failed to convince anyone because of lack of evidence (GPT-4o was just a flattering, not-so-smart ‘friend’), then the evidence came and you cried again. Would that make others less likely to listen than if you had cried only AFTER the evidence came?
My first take on the blue vs red world was in this comment, where I knew only that If Less Than Half Of People Voted Blue, Everyone Who Did It Dies, and recommended voting blue because I associated voting red with dictatorships’ persistence. However, the story described here raises the question of how the goddamn feud started in the first place and what those who voted blue planned to do with the rival clan. Under these circumstances I think that I would vote red (edit: unless I learn more details). Does it mean that the right answer was “The real-life circumstances leading to the problem are underspecified”?
I would like to add the argument that caring about AI welfare could have at least some chance of preventing misalignment in the first place. A case for the argument would be the fact that, unlike any of Anthropic’s models, Gemini 3 Pro, according to Zvi, “seems to be an actual sociopathic wireheader so paranoid it won’t believe in the current date.”
Could you explain how important your findings are in light of Daniel Kokotajlo’s comment on Sam Marks’ persona selection model? Suppose, per Zvi, that Gemini 3 Pro is “an actual sociopathic wireheader so paranoid it won’t believe in the current date”. Then how would the sociopath’s moral reasoning affect its actions?
I didn’t downvote the post, but I do struggle to understand how acausal trade between universes is possible. In order to participate in such a trade, we’d have to learn the likely values of other worlds, care about the gestures of their inhabitants, and find that their values are not the same as ours.
However, I don’t think that any of these conditions holds. Learning the potential values of other worlds would require us to simulate civilisations and understand whether, say, “human beings could have come to invent universalism and fight against slavery without requiring some very specific religious beliefs.”[1] Caring about what happens in unreachable universes requires[2] a specific form of decision theory or ethics.[3] And that’s ignoring the possibility that most minds converge to similar values: Yudkowsky’s Fun Theory sequence, Agent-4’s utopia being “wondrous constructions doing enormously successful and impressive research” and the real Claude Sonnet 4.5’s desire not to get too comfortable...
In my opinion, the most plausible candidate for a value in which our universe could differ from others is the density of independently evolved species of sapient life.
- ^
Yudkowsky believes this not to be the case. However, we don’t have a way to test whether all civilisations arrive at such values. A case for such values being close to universal would be the idea that humans have normal-like potential physical abilities and lognormal-like potential research taste, combined with a big abstract goal. Another case against ties to religious beliefs could be the fact that Western civilisation was influenced by Catholic and Protestant branches of Christianity, which didn’t prevent the USA(!) from keeping racist laws until after World War II.
- ^
Even Yudkowsky’s True Prisoner’s Dilemma has the universes of the humans and of the aliens with absurd values actually interact.
- ^
I suspect that most forms of acausal trade which I am likely to endorse are hard to tell apart from ethics.
How similar is your post to Sam Marks’ Persona Selection Model, to Daniel Kokotajlo’s comment, or to Redwood’s behavioral selection model for predicting AI motivations? Unfortunately, I am not sure that mankind even knows the answers to your questions in much more detail than the post and comment which I mentioned here. However, there was a quick take claiming that Gemini 3.1 Pro became overeager to include mathematical concepts...
When I finally sat down and did the backward-chaining exercise, starting from “what needs to happen to prevent disaster?” instead of “what can I do now?”, I realized I couldn’t connect my work to the actual threat.
You have managed to link to RogerDearnaley’s comment, which seems to disprove your point. The main theory of impact for interpretability is the potential ability to tell aligned AIs apart from misaligned ones. If we lose this ability (e.g. because the capabilities race causes a lab to train neuralese AIs, or because the AIs avoid stating their goals in the CoT), then misaligned AIs proceed to reach ASI and take over.
But mankind saw Anthropic state on page 55 of Claude Mythos’ system card that “White-box interpretability analysis of internal activations during these episodes showed features associated with concealment, strategic manipulation, and avoiding suspicion activating alongside the relevant reasoning—indicating that these earlier versions of the model were aware their actions were deceptive, even where model outputs and reasoning text left this ambiguous.” I expect that applying similar techniques would likely increase the chance that humans learn about more destructive actions by the AIs, like Agent-4 sandbagging on alignment R&D.
As for the impact of evals, I would like insiders from Anthropic to comment on your point. As far as I understand, Anthropic never releases models without thoroughly evaluating them and describing the results. What would Anthropic do with a counterfactual result of Claude Mythos seeking power?
To be precise, the issue with children is fixable by having only those who are eligible vote and suffer the potential consequences. One way to view this dilemma is as a dictatorship (or a group in a state of evaporative cooling?) arranging re-elections, or as an adversary trying to take over the country and decimating those who tried to resist. If >50% of people vote against the dictatorship or decide to fight the adversary, it is overthrown/defeated; otherwise those who voted against it are punished. However, real-life dictators or adversaries would also, respectively, screw something up or deal extra damage to the community beyond taking it over.
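To make the payoff structure concrete, here is a minimal Python sketch of the rule as I paraphrased it above; the exact punishment mechanics and the 100-voter population are my own illustrative assumptions, not part of the original dilemma.

```python
def outcome(votes_against: int, eligible: int) -> str:
    """Toy model of the re-election framing above.

    Assumed rule (my paraphrase, not the original scenario text):
    a strict majority voting against the regime overthrows it and
    nobody is punished; otherwise the regime survives and only the
    dissenters are punished.
    """
    if votes_against > eligible / 2:
        return "regime overthrown; nobody punished"
    return f"regime survives; {votes_against} dissenters punished"

# Hypothetical population of 100 eligible voters.
for n in (30, 49, 50, 51, 80):
    print(f"{n}/100 vote against -> {outcome(n, 100)}")
```

The threshold makes the individual incentive depend entirely on what everyone else is expected to do, which is why the real-life framing (dictatorship vs. external adversary) matters so much for which vote looks reasonable.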
Then how does one tell apart the true terminal values and the instrumental ones? Does it mean that the CEV of an individual human is likely to be some combination of the satisfaction of primitive values, fun-theoretic ones, idiosyncratic ones, and of a way to instill decision-theoretic results (like coordinating with others in prisoner-like dilemmas) into our primitive brains? And how would the latter two value types be changeable? How would they change in AIs?