I would absolutely like a post on the densing law from you, so consider me interested.
I'd make an even stronger claim: the quoted claim is basically false for a lot of things parents would hope to persuade their children away from. The actual reason we have laws against, say, conversion therapy isn't that we're worried conversion therapists or parents will persuade people out of being LGBTQ, but rather to prevent people from torturing LGBTQ children, or creating environments where they will reliably kill themselves, in the attempt to convert them.
The same goes for anti-grooming and anti-abuse laws: in practice they prevent people from using abusive tactics to persuade a child to have sex or do something else a parent or other person wants them to do, and the concern isn't that the child will be persuaded, but that people will damage the child's health in the attempt to persuade them.
One of my core takeaways from the heritability of a lot of traits is that parents have very little control over what the child will actually be like, and most claims that parents can reliably persuade a child to do something against their genetics are bullshit.
On the original topic of AIs being able to easily persuade humans to change their preferences, I do agree that if you let technology advance far enough, humans would become persuadable into arbitrary things, so some form of paternalism is likely to be necessary in the long run. But that is thankfully well past the critical period where we need to handle x-risks from AI, by which point x-risks will either have already played out into full-blown existential catastrophe or we will have managed to reach existential security, so superpersuasion/massively changing people's preferences mostly doesn't matter from my perspective.
I agree that Anthropic will attempt to add more RL tasks and potentially update Mythos's weights even after the pre-train, and that this could affect doubling times. But my point was that Mythos suggests you can in fact just keep scaling up parameters and pre-training compute multiple times, and that the memes about compute/pre-training scaling being dead weren't correct at all.
Indeed, 500x the compute of GPT-4 to train a GPT-6-level model purely by pre-training is probably possible by 2028 if you were willing to forgo RL (though when it's deployed it's more likely to be 2029, due to RL and inference soaking up compute). And if the scale-up of AI in 2026 is as potent as people believe (and indeed this is likely to happen once the nerfed models release sometime this year), AI companies can get enough revenue to comfortably build GPT-7, which would have 5000x the compute of GPT-4 and which I suspect would be built in 2031.
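To make the implied growth rates concrete, here's a rough back-of-the-envelope sketch. The 500x/5000x multipliers and the 2028/2031 dates are from above; the assumption that GPT-4's pre-training run finished around 2023 is mine.

```python
# Back-of-the-envelope: what annualized compute growth do the multipliers
# above imply? Assumes GPT-4's pre-training finished around 2023 (my
# assumption, not a claim from the comment above).
GPT4_YEAR = 2023

for label, multiplier, year in [("GPT-6-level (500x GPT-4)", 500, 2028),
                                ("GPT-7-level (5000x GPT-4)", 5000, 2031)]:
    years = year - GPT4_YEAR
    annual_growth = multiplier ** (1 / years)
    print(f"{label} by {year}: ~{annual_growth:.1f}x training compute per year")

# Prints roughly 3.5x/year for the 2028 figure and 2.9x/year for the 2031
# figure, i.e. these targets assume frontier training compute keeps growing
# at roughly 3x per year.
```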
But there is something I want to say here, and that is how the shift to ever larger post-training/RL could let us incrementally solve continual learning/learning on the job/continual weight updates, like brains do. One of the gifts of AI 2027 is that it points out in the January 2027 section that, modulo some very important details like catastrophic forgetting of earlier tasks as later tasks are RLed in, fast enough weight updates sustained for long enough via perpetually adding more RL tasks are essentially equivalent to human-level continual learning. Agent-2 isn't quite there yet, but it is on the path: it is already able to continuously learn from the world through its weights, just more slowly than humans:
With Agent-1’s help, OpenBrain is now post-training Agent-2. More than ever, the focus is on high-quality data. Copious amounts of synthetic data are produced, evaluated, and filtered for quality before being fed to Agent-2. On top of this, they pay billions of dollars for human laborers to record themselves solving long-horizon tasks. On top of all that, they train Agent-2 almost continuously using reinforcement learning on an ever-expanding suite of diverse difficult tasks: lots of video games, lots of coding challenges, lots of research tasks. Agent-2, more so than previous models, is effectively “online learning,” in that it’s built to never really finish training. Every day, the weights get updated to the latest version, trained on more data generated by the previous version the previous day.
I expect something like this to ultimately be done once continual learning is targeted, albeit at a slower pace than what AI 2027 describes.
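For concreteness, here is a minimal toy sketch of the kind of loop that quote describes. This is not any lab's actual pipeline; every function, task name, and number below is an illustrative stand-in.

```python
import random

def rollout_reward(weights, task):
    # Stand-in for running the model on a task and scoring the episode.
    return weights.get(task, 0.0) + random.gauss(0, 0.1)

def continual_rl_loop(days=30, lr=0.05):
    weights = {}  # stand-in for model parameters
    task_suite = ["coding_0", "game_0", "research_0"]
    for day in range(days):
        # Generate rollouts with yesterday's weights, then filter for quality
        # (the synthetic-data step in the quote).
        data = [(t, rollout_reward(weights, t)) for t in task_suite]
        good = [(t, r) for t, r in data if r > 0.0]
        # One RL-style update per "day"; with no replay of older tasks, this
        # is where the catastrophic-forgetting caveat above would bite.
        for task, reward in good:
            weights[task] = weights.get(task, 0.0) + lr * reward
        # The suite keeps expanding: new games, coding challenges, research tasks.
        task_suite.append(f"new_task_{day}")
    return weights

if __name__ == "__main__":
    print(continual_rl_loop())
```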
I don’t think that the recent overall evidence has clearly been positive about misalignment risk. For instance, timelines are looking quite short now, which IMO is the main negative update to counteract the positive one I discuss in this comment.
I’d soften this update, but this is reasonable.
The evidence does point towards 2-5 year timelines being reasonable to think about, but I wouldn’t yet argue that 2-5 year timelines are a median case. The bigger takeaway from the AI boom is that we have very good reason to believe that the median is in this century, and probably in the earlier half of the century when talking about AI research, and almost all of the tail scenarios where AGI is so difficult it takes hundreds of years to develop are no longer consistent with the evidence we have.
This is because, as it turned out, compute scaling could largely substitute for finding the human brain's special sauce, since more compute lets you run bigger and better experiments to find algorithmic progress (indeed, Gundlach found that essentially all algorithmic progress is downstream of certain algorithms becoming more efficient as compute scales up).
Thus, if I somehow condition on the current paradigm not working out like the AI companies think, my median would change to the 2043 median of the CCF model described by Scott Alexander here.
An issue here, re the CEV of powerful people like Vladimir Putin not being that bad compared to nothingness, is that currently even dictators require most people to be in at least somewhat functional states in order to have power, and more generally the generosity of capitalism rests at least in part on humans being useful when fed and educated.
But in a world where AI automates away human labor, the incentives for powerful people (absent even mild values of compassion/generosity) go towards simply not giving humans anything they need to survive, because humans are useless compared to more efficient AI systems, and towards having loyal AI servants do everything for them instead.
It’s basically the same problem as AIs that don’t value you having instrumental incentives to kill you and everyone else to take their resources.
The best piece on this subject is Defining The Intelligence Curse, with a link here (though Jan Bentley makes the point that since AIs can keep improving, you don't actually want to be a rentier, and instead just keep charging forward, leading to something like the ascended economy scenario described by Scott Alexander here).
Now to be clear, I think Vladimir Putin wouldn't straight up kill ethnic Russians or withhold what they need to survive, because I'd expect he slightly cares about them, and slightly caring about people is enough to make them fantastically wealthy in the AI era. But 1) this is not something I would say for all world leaders, and 2) the fact that AI-empowered humans, or AIs themselves, only need to care about humans a little to prevent this scenario also makes AI alignment less important from a survival perspective. It means even methods that fail to precisely shape AI goals may be effective enough for survival, at least partially defuses the counting argument for misalignment being deadly, and in practice makes alignment relatively easy even for incompetent humans who don't know what they are doing.
This is why I think eval-awareness tests should be complemented by tests of how many false positives the AI produces, i.e. tests of whether AIs can recognize when they aren't in an eval, not just when they are.
Shout-out to John Wentworth who predicted something close to this failure mode in the post The Case Against AI Control Research here.
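For concreteness, here is a minimal sketch of the complementary measurement I have in mind; the model_thinks_its_an_eval callable is hypothetical, standing in for however you elicit the model's judgment.

```python
# Score an eval-awareness probe on both eval-style and deployment-style
# prompts, so you measure a false-positive rate and not just a detection rate.
# `model_thinks_its_an_eval` is a hypothetical callable returning True/False.

def eval_awareness_rates(model_thinks_its_an_eval, eval_prompts, deployment_prompts):
    # True positives: eval prompts the model flags as evals.
    tp = sum(model_thinks_its_an_eval(p) for p in eval_prompts)
    # False positives: deployment-like prompts the model also flags as evals.
    fp = sum(model_thinks_its_an_eval(p) for p in deployment_prompts)
    return {
        "true_positive_rate": tp / len(eval_prompts),
        "false_positive_rate": fp / len(deployment_prompts),
    }

# A model that calls everything an eval scores 100% on the usual
# eval-awareness test, but its 100% false-positive rate shows it can't
# actually tell evals apart from deployment.
```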
Re: “FWIW, I don’t think Mythos is a qualitative step change as much as a quantitative one”
I think this is half-true, half-false. IMO the actually qualitative step change was finding a way to turn vulnerabilities into exploits, which neither Opus nor Sonnet did, combined with Mythos doing the vulnerability and exploit analysis autonomously, without knowing about the vulnerabilities in advance, and with only very basic scaffolding. But yes, there were definitely quantitative changes as well.
On this:
and the quantitative gap is going to close in a few months (compared to ambient (open-weights) capabilities).
This is cruxy, as my guess is that open-source would likely take at least a year to replicate the capabilities, and it's very plausible it instead takes 2 years. One of the reasons is that Mythos's capabilities were probably enabled by just going back to pre-training, and far more compute will go to frontier labs than to open-source, which fundamentally tends to be compute-poor relative to frontier labs.
Re: this take you mentioned, I think my divergence from Jan Kuivelt is that I consider the methods used by Aisle sufficiently damning that I basically anti-updated from their claim. In particular, they had to disclose that their false-positive rates were much higher than Mythos likely has, and they gave the vulnerability to the model before asking it to find that vulnerability, which trivializes the problem to the point of being a useless evaluation. The core thing Mythos did, which they didn't test, was finding the vulnerability in the code without being straight-up handed the individual pieces of the vulnerability.
Overall, this has increased my confidence that open-source cannot reasonably replicate what Mythos did anytime soon.
I disagree that it’s one-time, and think parameter scaling has a long way to go still.
We will eventually hit a limit, but that limit is measured in ASML machines, and nothing else matters nearly as much as ASML machines, which Dwarkesh talked about here in the podcast.
In essence, Mythos is a return to form, where parameter scaling/compute scaling matters as much as data scaling, if not more.
An apple picking model for AI R&D
Most people continue to not want to die to misaligned AI. Thankfully, the majority of people working in AI (both in developers and in policy) are not psychopaths who care not at all for other humans, nor hardcore successionists who want to replace humans with (unaligned) AIs. Even if people might be incentivized to take on levels of risks that would be unacceptable to others, I suspect no major actor would knowingly attempt to launch a misaligned superintelligent AI out of spite or malice, and most would act to oppose this.
I mostly agree with this, with the big caveat that Elon Musk is far closer to unconditional successionism. While he does still believe that humans should make it into his glorious future, I think he is much closer to a hardcore successionist than people realize, and in particular, if xAI/Elon Musk were the company with access to a misaligned superintelligent AI, there's an uncomfortably high chance they would release it. I agree everyone else would react and oppose this, but in such worlds AI x-risk is quite high, since slowing down/pausing is absolutely infeasible while we deal with xAI's superintelligence.
I got this impression from this section of the Elon Musk podcast by Dwarkesh Patel.
The US public continues to be incredibly skeptical of AI and big tech. Measures such as the (controversial in tech circles) SB 1047 were broadly supported by the public. To be clear, the US public is not skeptical of AI because of existential or catastrophic risk reasons, but instead mundane reasons like power usage and worker displacement. Nonetheless, there seems to be a substantial desire from voters from both parties to slow down the rate of dangerous AI development, and it remains likely that people will be supportive of future policy actions in this area.
I’ll flag here that I’m worried about 3 developments.
The first is that political polarization heats up towards the 2028 and 2032 primaries, with at least one party becoming pro-AI.
The second is that incidents like Sam Altman being attacked with Molotov cocktails, by people extremely loosely associated with AI safety/existential risk or not associated at all, could reduce public support for the AI safety cause.
The third is that even if neither happens, the things average people fundamentally want are at best unlikely to reduce existential risk, and at worst increase it by making x-risk-reducing AI policy harder in the future. This is admittedly more diffuse in my evidence sources, but the best one is Anton Leicht's post about preemption deals worth making (go to the sections titled Two Levels of AI debate and A tale of two PACs).
I don't think we can reduce the risk of the first two, but as Anton Leicht says, we can reduce the risk of the third by making useful preemption deals, which he discusses in the sections at and below The Path Forward.
One big difference from quite a few other people is that I don’t think AI safety actually benefits from vague anti-AI populism becoming dominant compared to every other faction.
I don't necessarily agree that ASI will create a specifically global dictatorship, but yeah, ASI, or even AI that automates away human labor, likely destroys liberal democracy gradually or suddenly, and empowers truly one-man dictatorships in a way that no other technology does.
One of the things I have come to accept/internalize is that democracy as a form of government probably doesn’t survive the 21st century, and I’m instead thinking about how to have prosperity in a world where democracy is gone.
The key point of the post is that legitimacy in the eyes of the population becomes less necessary than you think, because goals will shift away from ideological goals and towards resource goals, driven by people becoming less necessary to the economy and to politics than raw materials and robots.
You don't need to convince a population of the legitimacy of your government so long as you have robots that can secure the mines/resources necessary to build more robots and goods, and this decoupling of legitimacy/popularity from military/economic effectiveness is the key transition that AI and robotics allow.
I disagree with this conclusion, actually, because I didn't say that AI developers or AIs themselves would attempt to exterminate humanity; I only said that my analysis was compatible with that outcome, and so was more general than you thought.
In order to reach this conclusion, you also need opinions on how likely this is to happen.
I agree that strictly speaking, they don’t need to keep them alive anymore, and to be clear, this analysis holds almost as well if you replaced people with AI, with the exception of the points on violence, so most of the analysis doesn’t depend on people being around to live in it or being commanded.
A take on values of the future, assuming AIs automate away politics and economics from humans.
There has been exploration of this topic before, like Jim Buhler's What Values Will Control The Future Sequence and the appendix of relevant work here, as well as books like Foragers, Farmers and Fossil Fuels, which, full disclosure, influenced a lot of my views on how values evolve, and gives a much more plausible picture than ones that view value evolution as converging to CEV/moral truth, or that emphasize arbitrary societal factors for value evolution rather than energy considerations.
Indeed, I think it's so plausible that we can actually non-trivially constrain the values of post-AGI society even without much empirical evidence.
However, given that AI that automates away humans is likely coming within at most the next 20-30 years, it's worth thinking at least a bit about what values will be dominant in the future.
While we still mostly don't have very good predictions about what the post-AGI era will look like, we have uncovered some answers, and have also homed in on some important hinge questions, such that we aren't completely blind to what values will be dominant in the future.
One of the central questions for a lot of value evolution boils down to: does acausal trade actually become practical for AIs, in a way that removes the constraint previous governments faced of only being able to hold territory they can send armies to faster than rebellions can overthrow them?
If the answer is yes, then AI values become more arbitrary, and value lock-in could in theory affect the entire accessible universe.
If the answer is no, then it’s a lot easier to constrain what the AI values, and value lock-in doesn’t matter, and alignment also matters less as a problem.
One particular example of this is that we can be pretty confident people will probably be fine with ludicrously large amounts of inequality in both the political and economic dimensions, more than in any other societal type we've had in history, including the farming era. The reason is that with advanced AI, the mechanisms that keep wealth and income inequalities in check will weaken to the point of no longer existing, and the ability of anyone else, like developing countries, to catch up will end, because their labor stops mattering and natural resources can be, and already are, owned by other rich actors in the world like corporations, which Phil Trammell talks about a lot more here.
Economic inequality alone could lead us to evolve into valuing political inequality, via the wealthy buying up land to give themselves powers reserved to states and defending their riches with robots. But one other issue is that, while it probably isn't a problem in the short to medium run and is probably overrated as a problem from an alignment perspective, AI that can genuinely persuade massive numbers of people to do stuff IRL/be superpersuasive is probably going to come in the longer term, via 2 effects:
More citizens of states will be uploads by default, and uploaded brains are probably easier to hijack/jailbreak than current biological brains because you can reset them arbitrarily to a known and potentially even maximally vulnerable state, which isn’t possible to do for a biological brain so far (indeed a lot of jailbreaks/adversarial examples like KataGo adversarial attacks rely on the fact that it’s super easy to trick AIs/reset them continuously).
It’s probably going to be easier to modify citizens using all sorts of tools like genetics, nanotech, uploading and more, and this means rulers can erode the values of the population to the values their rulers want.
Also, AI will break the pattern of no one person ruling alone, at least assuming alignment is solved, because you can automate away the police and militaries that would usually check your power.
The level of inequality in economics and politics that many people will probably accept is closer to the gap between superheroes and average citizens in modern comics, or mythic/non-Abrahamic gods ruling over a class of normal citizens, than to basically any other society we've had in history, and the old deal described below will come back, but far, far more intensely and closer to the limiting process:
Especially revered was the “Old Deal”, Morris’ term for the generalised social contract between classes in agrarian societies: that some have the duty to be commanders (or “shepherds of the people”, in the preferred phrasing of many a king), others to obey those commands, and if everyone follows this script then things work fine.
Gender/sex inequality is an area where I expect the exact opposite trend, continuing the industrial-era trends, mostly because gender will become more arbitrary and divorced from economic usefulness (indeed, in a fully automated AI economy, gender roles do not matter anymore, and we don't even need to reach the limiting process for this to have big impacts).
Attitudes to violence might become polarized. On the large scale, wars are inefficient and will get more inefficient relative to other outcomes like peaceful trade or defined borders that neither side will trespass. But on smaller scales, war/murder/violence in general will have lower costs relative to benefits than ever before in history, because atomic-precision manufacturing plus backups makes the average citizen way, way harder to kill (you now have to destroy all their backups rather than just end their life), which means the murder rate and the assault/serious violence/rape rate could end up diverging a lot.
That said, this isn’t as trivial to determine without empirical evidence, and thus I’m way less confident in this prediction than basically all of my other predictions to date.
This is pretty straightforwardly not true, there are plenty of academics (for example) who are as smart as rationalists but don’t do very broad instrumental reasoning.
Fair point, I was generalizing too much here.
I agree with the literal claim that plenty of people don't fantasize about becoming all-powerful dictators, but I'd say the percentage of people who don't fantasize about becoming dictators (including privately, in their heads), or who don't believe an all-powerful dictator is necessary to solve problems/have a good future, is much closer to 25-30% than to 90% or more, and that's more of an upper bound than a lower bound.
The reasons why partially delve into politics that would cause way more heat than light if I discussed them here, but one of them is that a lot of citizens don't want to get involved in politics and want someone else to solve their problems for them, while one of the distinctive traits of a lot of non-dictatorial systems of government is that the average person has to be more involved with politics, which lots of people hate.
An all-powerful dictatorship in which average citizens make none of the decisions doesn't require them to pay attention to their government/politicians, and a lot of people genuinely want the ability to not care about politics at all.
I basically agree with what you notice, and think this is what you'd expect if rationalists had mostly normal goals relative to other people, which are mostly selfish and dictatorial, but were more intelligent and could think farther ahead about what instrumental goals their terminal goals imply.
Or put another way, the things rationalists are doing here are things lots of other people would likely do if they were more intelligent, and the truth of the matter is that most people just like all-powerful dictatorships, almost no matter their ideology.
I agree that current coding agents aren't good enough and tend to add more code than is worth it for a lot of current projects, and the old wisdom that programs are written for humans to read is still correct, mostly because coding agents are complementary to humans and you can't fully automate SWE yet.
But if future coding agents fully automate SWEs away, which could happen in the next 2-4 years, then vibecoding will probably be superior to human coding precisely because agents are willing to make code longer and more complex.
One big part of the reason is that users largely hate having to learn the rules of a system, expect the code to work all the time, and have very, very general use-cases for programs. Combined with the incentives to fully automate work away, this means that in a compute-limited world it's inevitable that there will be many lines of code and lots of complexity, because the code has to deal with the complexity of reality, and approaches that attempt to simplify it a la Solomonoff induction rely way too much on brute-force simulation, which won't happen in the next 50-100 years even if AI fully automates the economy and politics (80% chance).
I like this portion of a comment by JDP describing the situation:
Here’s the thing about something like Microsoft Office. Alan Kay will always complain that he had word processing and this and that and the other thing in some 50,000 or 100,000 lines of code — orders of magnitude less code. And here’s the thing: no, he didn’t. I’m quite certain that if you look into the details, what Alan Kay wrote was a system. The way it got its compactness was by asking the user to do certain things — you will format your document like this, when you want to do this kind of thing you will do this, you may only use this feature in these circumstances. What Alan Kay’s software expected from the user was that they would be willing to learn and master a system and derive a principled understanding of when they are and are not allowed to do things based on the rules of the system. Those rules are what allow the system to be so compact.
You can see this in TeX, for example. The original TeX typesetting system can do a great deal of what Microsoft Word can do. It’s somewhere between 15,000 and 150,000 lines of code — don’t quote me on that, but orders of magnitude less than Microsoft Word. And it can do all this stuff: professional quality typesetting, documents ready to be published as a math textbook or professional academic book, arguably better than anything else of its kind at the time. And the way TeX achieves this quality is by being a system. TeX has rules. Fussy rules. TeX demands that you, the user, learn how to format your document, how to make your document conform to what TeX needs as a system.
Here’s the thing: users hate that. Despise it. Users hate systems. The last thing users want is to learn the rules of some system and make their work conform to it.
The reason why Microsoft Word is so many lines of code and so much work is not malpractice — it would only be malpractice if your goal was to make a system. Alan Kay is right that if your goal is to make a system and you wind up with Microsoft Word, you are a terrible software engineer. But he’s simply mistaken about what the purpose of something like Microsoft Word is. The purpose is to be a virtual reality — a simulacrum of an 80s desk job. The purpose is to not learn a system. Microsoft Word tries to be as flexible as possible. You can put thoughts wherever you want, use any kind of formatting, do any kind of whatever, at any point in the program. It goes out of its way to avoid modes. If you want to insert a spreadsheet into a Word document anywhere, Microsoft Word says “yeah, just do it.”
It’s not a system. It’s a simulacrum of an 80s desk job, and because of that the code bloat is immense, because what it actually has to do is try to capture all the possible behaviors in every context that you could theoretically do with a piece of paper. Microsoft Word and PDF formats are extremely bloated, incomprehensible, and basically insane. The open Microsoft Word document specification is basically just a dump of the internal structures the Microsoft Word software uses to represent a document, which are of course insane — because Microsoft Word is not a system. The implied data structure is schizophrenic: it’s a mishmash of wrapped pieces of media inside wrapped pieces of media, with properties, and they’re recursive, and they can contain other ones. This is not a system.
For that reason, you wind up with 400 million lines of code. And what you’ll notice about 400 million lines of code is — hey, that’s about the size of the smallest GPT models. You know, 400 million parameters. If you were maximally efficient with your representation, if you could specify it in terms of the behavior of all the rest of the program and compress a line of code down on average to about one floating point number, you wind up with about the size of a small GPT-2 type network. I don’t think that’s an accident. I think these things wind up the size that they are for very similar reasons, because they have to capture this endless library of possible behaviors that are unbounded in complexity and legion in number.
This is close to correct, and is the reason why the control agenda is focused around interventions before you catch the AI, because after you catch the AI, the situation becomes easier in hard-to-predict ways.
One caveat to this is that the AI likely has more tries than just one, but not unlimited tries, plausibly on the order of 10-1000 (though we probably don't need this many real tries because of proliferation).
But yes, especially in the regime where we need to automate AI safety research, we probably get multiple tries if we can play our cards well, and the AI doesn’t have nearly as many iteration attempts to take over as is often assumed.
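As a toy illustration of why the number of real tries matters: the numbers and the independence assumption below are mine, not from the comment above.

```python
# If each real takeover attempt is independently caught with probability
# p_catch, how fast does the chance of catching the AI at least once grow
# with the number of attempts it gets? (Toy model; numbers are illustrative.)

def p_caught_at_least_once(p_catch: float, attempts: int) -> float:
    return 1 - (1 - p_catch) ** attempts

for attempts in (1, 10, 100, 1000):
    print(attempts, round(p_caught_at_least_once(0.2, attempts), 3))

# With p_catch = 0.2 this prints ~0.2, ~0.893, ~1.0, ~1.0: a regime of
# 10-1000 tries is very different from unlimited tries, and also very
# different from the AI getting exactly one shot.
```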