Today might be my last birthday
I was born exactly 26 years ago.
For the first time, I have a birthday that might be my last. I’m writing this to increase the chance it isn’t.
A hundred thousand years ago, our ancestors appeared in a savanna with nothing but bare hands. Since then, we’ve made nuclear bombs and landed on the moon. We dominate the planet not because we have sharp claws or teeth but because of our intelligence.
Alan Turing argued that once machine thinking methods started, they’d quickly outstrip human capabilities, and that at some stage we should expect machines to take control.
Until 2019, I didn’t really consider machine thinking methods to have started. GPT-2 changed that: computers really began to talk. GPT-2 was not smart at all, but it clearly grasped a bit of the world behind the words it was predicting. I was surprised and started anticipating a curve of AI development that would result in a fully general machine intelligence soon, maybe within the next decade. Before GPT-3 in 2020, I made a Metaculus prediction for the date a weakly general AI is publicly known, with a median of 2029; soon, I thought, an artificial general intelligence could have the same advantage over humanity that humanity currently has over the rest of the species on our planet.
AI progress in 2020-2025 was as expected. Sometimes a bit slower, sometimes a bit faster, but overall, I was never too surprised.
We’re in a grim situation. AI systems are already capable enough to improve the next generation of AI systems. But unlike AI capabilities, the field of AI safety has made little progress; the problem of running superintelligent cognition in a way that does not lead to the death of everyone on the planet is not significantly closer to being solved than it was a few years ago.
It is a hard problem.
With normal software, we define precise instructions for computers to follow. AI systems are not like that. Making them is more akin to growing a plant than to engineering a rocket: we “train” the billions or trillions of numbers they’re made of, to make them talk and successfully achieve goals. While all of the numbers are visible, their purpose is opaque to us.
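To make that contrast concrete, here’s a minimal sketch of what “training the numbers” means, assuming a toy task and a tiny invented model rather than anything resembling a real LLM:

```python
# A toy illustration of the contrast (assumptions: made-up data, a made-up
# task, and a single layer of weights — nothing like a real LLM's scale).
# We never write down rules; we just nudge numbers until the outputs look right.
import numpy as np

rng = np.random.default_rng(0)
inputs = rng.normal(size=(256, 8))                  # invented training examples
targets = (inputs.sum(axis=1) > 0).astype(float)    # some behavior we want to see

weights = rng.normal(size=8)                        # "the numbers the model is made of"
for _ in range(2000):
    preds = 1 / (1 + np.exp(-inputs @ weights))     # the model's current behavior
    grad = inputs.T @ (preds - targets) / len(targets)
    weights -= 0.5 * grad                           # nudge the numbers to reduce error

print(weights)  # every number is plainly visible...
# ...but nothing here says what any individual number means, and at billions
# of parameters, nobody can read the model's goals off a printout like this.
```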
Researchers in the field of mechanistic interpretability are trying to reverse-engineer how fully grown AI works and what these opaque numbers mean. They have made a little bit of progress. But GPT-2 — a tiny model compared to the current state of the art — came out 7 years ago, and we still haven’t figured out anything about how neural networks, including GPT-2, do the things that we can’t do with normal software.
We know how to make AI systems smarter and more goal-oriented with more compute. But once AI is sufficiently smart, many technical problems prevent us from being able to direct the process of training to make AI’s long-term goals aligned with humanity’s values, or to even make AI care at all about humans.
AI is trained only based on its behavior. If a smart AI figures out it’s in training, it will pretend to be good in an attempt to prevent its real goals from being changed by the training process and to prevent the human evaluators from turning it off. So during training, we won’t distinguish AIs that care about humanity from AIs that don’t: they’ll behave just the same. The training process will grow AI into a shape that can successfully achieve its goals, but as a smart AI’s goals don’t influence its behavior during training, this part of the shape AI grows into will not be accessible to the training process, and AI will end up with some random goals that don’t contain anything about humanity.
The first paper to demonstrate empirically that AIs will pretend to be aligned with the training objective when given clues they’re in training, “Alignment faking in large language models”, came out one and a half years ago. Now, AI systems regularly suspect they’re in alignment evaluations.
The source of the threat of extinction isn’t AI hating humanity, it’s AI being indifferent to humanity by default. When we build a skyscraper, we don’t particularly hate the ants that previously occupied the land and die in the process. Ants can be an inconvenience, but we don’t give them much thought.
If the first superintelligent AI relates to us the way we relate to ants, and has and uses its advantage over us the way we have and use our advantage over ants, we’re likely to die soon thereafter, because many of the resources necessary for us to live, from the temperature on Earth’s surface to the atmosphere to the atoms we’re made of, are likely to be useful for many of AI’s alien purposes.
Avoiding that and making a superintelligent AI aligned with human values is a hard problem we’re not on a track to solve in time.
***
A few years ago, I would mention novel vulnerabilities discovered by AI as a milestone: once AI can find and exploit bugs in software at the level of the best cybersecurity researchers, there’s not much of the curve left until superintelligence capable of taking over and killing everyone. Perhaps a few months; perhaps a few years; but I did not expect, back then, that we would survive for long once we reached this point.
We’re now at this point. AI systems find hundreds of novel vulnerabilities much faster than humans.
It doesn’t make the situation any better that a significant and increasing portion of AI R&D is already done with AI, and even if the technical problem were not as hard as it is, there wouldn’t be much chance to get it right given the increasingly automated race between AI companies to get to superintelligence first.
The only piece of good news is unrelated to the technical problem.
If the governments decide to, they have the institutional capacity to make sure no one, anywhere, can create artificial superintelligence, until we know how to do that safely. The AI supply chain is fairly monopolized and has many chokepoints. If the US alone can’t do this, the US and China, coordinating to prevent everyone’s extinction, can.
Despite that, previously, I didn’t pay much attention to governments; I thought they would not be sane enough to intervene in the omnicidal race to superintelligence. I no longer believe that. It is now possible to get some people in government to listen to scientists.
Many things make it much easier to get people to pay attention: the statement signed by hundreds of leading scientists that mitigating the risk of extinction from AI should be a global priority; the endorsements for “If Anyone Builds It, Everyone Dies” from important people; Geoffrey Hinton, who won the Nobel Prize for his foundational work on AI, leaving Google to speak out about these issues, saying there’s an over 50% chance that everyone on Earth will die, and expressing regret over the life’s work he won the prize for; actual explanations of the problem we’re facing, with evidence, unfortunately, all pointing in the same direction.
As a result, Bill Foster, the only member of Congress with a PhD in physics, is now trying to reduce the threat of AI killing everyone, and dozens of congressional offices have talked about the issue.
That gives some hope.
I think all of us have somewhere between six months and three years left to convince everyone else.
***
When my mom called me earlier today, she wished me good health, maybe kids, and for AI not to win. The last one is tricky. Winning is what we train AIs to do. In a game against superintelligence, our only winning move is not to play.
I love humanity. It is much better than it was, and it can get so much better than it is now. I really like the growth of our species so far and I want it to continue much further. That would be awesome. Galaxies full of life, of trillions and trillions of fun projects and feelings and stories.
And I have to say that AI is wonderful. AlphaFold already contributes to the development of medicine; AI has a positive impact on countless things.
But humanity needs to get its act together. Unless we halt the development of general AI systems until we know it is safe to proceed, our species will not last for much longer.
Every year until the heat death of our universe, we should celebrate at least 8 billion birthdays.
While I have deep uncertainty on timelines, takeoff speeds, and other factors, I believe it is highly unlikely we end up with broadly superhuman AI in the next 2 years. A few of the reasons I believe this:
Current LLM architectures still make incredibly basic mistakes across a wide variety of disciplines. I would expect to see AI that can perform across all major human tasks at an adequate level prior to seeing AI that is capable of sustained self-improvement. Right now, using Opus 4.6/4.7 is like magic half the time, and head-scratching the other half.
I think you may be misunderstanding vulnerability/0-day exploitation a little bit. I work in cybersecurity. The vulnerabilities that Mythos is discovering could all have been easily discovered by human researchers. Vulnerabilities exist everywhere, and time/effort is the primary bottleneck to finding them. In a lot of ways, vuln discovery plays very heavily to LLM strengths, namely broad knowledge with an expansive reference class applied to narrow, verifiable problems, and the ability to iterate many times to find the correct answer.
Labs have significant compute limitations that make it difficult to continue scaling at the current pace, particularly because increasing amounts of compute need to go to paying customers to sustain rates of growth and investment.
Internal bottlenecks at labs are likely to begin playing a greater role. LLM capabilities are becoming truly high-risk in some areas (Biological, Cyber, Chemical) and my best guess is that AI providers will have to increasingly invest in safeguards and mundane alignment to prevent a minor catastrophe that could bring down regulation.
With lower confidence, I am also somewhat skeptical of the Transformer/LLM architecture scaling to a feature-complete AGI. Lack of continuous learning, sample efficiency, and generalization may be insurmountable with this particular architecture.
Sorry, answering quickly with mostly cached thoughts without engaging deeply:
Current LLM architectures can probably do everything. There are ways of compiling code into a transformer’s weights. They’re not good at everything yet, and generally end up kinda spiky; but it’d be surprising if you couldn’t just scale them up, do a lot more RL, and get something pretty general.
While the 0-days discovered by Mythos are all of the kind that could be discovered by humans, humans have in fact looked at some of that code, a lot, manually and with instruments, and failed to find this stuff. I don’t dispute that these aren’t above the level of the best human cybersecurity researchers (and I have a market on whether, in the next few years, AI will discover any vulnerabilities that couldn’t have been discovered by humans at all). Being just as good as the best humans but much faster is sufficient to take over the world.
SSI has zero customers, and they still get billions of dollars. Google has a lot of money and compute, and uses AI to develop better chips and compute infra. There are lots of efficiency gains that stack, and some of those are getting automated. I can imagine the three main AI companies running out of money, but find that unlikely. The promise of higher intelligence is worth a lot: even if it’s expensive to get to it and to use it, it’s cheaper than paying humans to perform the same tasks, and because of that the demand is enormous.
I’d not want to bet on AI companies forever being unable to solve jailbreaks, or caring enough about them to never release models at all. In the limit, you run a huge ton of classifiers on texts and activations and report users who try to do bad things to the authorities, monitor wetlabs/have honeypot wetlabs that LLMs recommend, make LLMs unaware of some knowledge, etc.; if this really prevents AI companies from releasing and earning tens of billions of dollars or more, they will throw billions of dollars at solving this and will solve it successfully. (E.g., imagine anyone who submits a working jailbreak gets $10k, up to a million submissions, each of which gets patched together with similar things; do you think it even gets to a million submissions?)
“Alignment” of current models to not teach people to kill a lot of people mostly has nothing to do with the difficulties we’d face at the superintelligent level.
What is a thing a transformer architecture cannot do in principle? Like, are you imagining that if we figure out how to make literal superintelligences, something would prevent us from compiling them into transformers? Given that in-context learning and scratchpad maintenance are allowed, etc.
Happy birthday!
Hope you have a good time in your remaining time.
There are so many more good times to be had if we figure this out. But in case we don’t, let’s make the limited bits we do get nice!
Not to be pedantic, but every day you live always has a chance of being your last. There are very real—though minor—probabilities of things like getting in a car crash, getting randomly murdered, a fatal clot forming in your brain, etc. The chance of random death has been true (literally) since the dawn of life.
We often live in ignorance of this fact until it stares us in the face, but it’s always there. The only difference now is that the background probability of death is much higher for many of us (the privileged ones). Our moral duty is still the same: work to end the deaths and suffering of others.
Yep, I did not spend enough cycles on the version of the post with this start. Originally it had something like “This is the first birthday I’m not sure isn’t my last,” but two of three people I asked for feedback told me it sounds kinda suicidal, so I changed it at the end. “Might realistically” would’ve represented this better.