Thanks for the detailed response! It clarifies some of my concerns, and I think we have a lot of agreement overall. I'm also going to go in near-reverse order.
To a first approximation, compute_cost = size*speed. If AGI requires brain size, then the first to cross the finish line will likely be operating not greatly faster than the minimum speed, which is real-time. But this does not imply the agents learn at only real-time speed, as learning is parallelizable across many agent instances. Regardless, none of these considerations depend on whether the AGI is trained in a closed simbox or an open sim with access to the internet.
To me, the time/cost issue with the simboxes you propose lies in the data needed to train the AGIs from within the sim while preventing information leakage. Unlike with current training, we can't just give them the whole internet, as that contains loads of information about humans, how ML works, the fact that they are in a sim, etc., all of which would be very dangerous. Instead, we would need to recapitulate the entire *data-generating process* within the sim, which is what would be expensive. Naively, the only way to do this is to actually simulate a large population of agents interacting with the sim world for a long time: at minimum simulated years assuming human-level data efficiency, and much, much longer at current DL data efficiency. It is possible, I guess, to amortise this work and create one 'master sim', so that we can try various AGI designs which all share the same dataset. That would be good experimentally for isolating the impact of architecture/objective vs dataset, but under the reward-proxy learning approach a large factor in alignment success is the dataset itself, which would be very expensive to recreate in-sim without information transfer from our reality.
Training current ML models is fast because they can reuse all the datasets already generated by human civilisation. Bootstrapping to similar levels of intelligence in a sim, without wholesale transfer of information from our reality, will require a concomitant amount of computational effort: closer to simulating our civilisation than to simulating a single agent.
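To make that gap concrete, here is a crude back-of-envelope sketch; the per-agent compute figure matches the brain-scale number used elsewhere in this thread, but the population size and duration are purely placeholder assumptions of mine:

```python
# Rough comparison: ops to train one agent in real time vs. ops to regenerate a
# civilisation-scale dataset in-sim by actually running many agents.
# All population/duration numbers below are loose assumptions, not estimates I'm defending.

BRAIN_OPS_PER_SEC = 1e16      # assumed compute for one brain-scale agent
SECONDS_PER_YEAR = 3.15e7

def agent_cost(years: float) -> float:
    """Ops to run a single brain-scale agent for `years` of simulated time."""
    return BRAIN_OPS_PER_SEC * SECONDS_PER_YEAR * years

def data_generation_cost(num_agents: float, years: float) -> float:
    """Ops to run a population of agents long enough to generate the in-sim
    analogue of a civilisation's worth of training data."""
    return num_agents * agent_cost(years)

single = agent_cost(30)                    # one agent for ~30 sim-years: ~1e25 ops
civ = data_generation_cost(1e6, 100)       # e.g. a million agents for a century: ~3e31 ops
print(f"single agent: {single:.1e} ops, data-generating population: {civ:.1e} ops")
```

Even with far more modest population assumptions, the data-generating population dominates the single-agent cost by many orders of magnitude, and that gap is what I mean by 'simulating our civilisation'.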
The ideal baseline cost of simboxing is only O(N+1) vs O(N) without: once good AGI designs are found, the simboxing approach requires only one additional unboxed training run (compared to never using simboxes). We can estimate this additional cost: it will be around or less than 1e25 ops (1e16 ops/s for a brain-size model * 1e9 seconds for 30 years equivalent), or less than $10 million (300 GPU-years) using only today's GPUs, i.e. nearly nothing.
I don't understand this. Presumably we will want to run a lot of training runs in the sim, since we will probably need to iterate a considerable number of times to actually succeed in training a safe AGI. We will also want to test across a large range of datasets and initial conditions, which will necessitate collecting a number of large and expensive sim-specific datasets. It is probably also necessary to simulate reasonably sized sim populations, which will further increase the cost.
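To illustrate why I think the single-run baseline understates the bill, here is a quick sketch; every multiplier below is a hypothetical guess of mine, not an estimate I'm defending:

```python
# Purely illustrative: how the quoted ~$10M single-run baseline grows once design
# iteration, population size, and in-sim data regeneration are priced in.
# Every multiplier is an assumption chosen only to show the shape of the problem.

BASELINE_RUN_COST_USD = 10e6   # the quoted <$10M for one additional unboxed training run

design_iterations = 20         # assumed number of sim training runs before a safe design is found
agents_per_sim = 100           # assumed population needed inside each simbox
data_regen_factor = 10         # assumed overhead of regenerating the dataset in-sim

total = BASELINE_RUN_COST_USD * design_iterations * agents_per_sim * data_regen_factor
print(f"~${total/1e9:.0f}B under these guesses, rather than ~$10M")   # ~$200B
```

The exact numbers don't matter; the point is that the multipliers stack, so the marginal cost of simboxing is not just a single extra training run.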
But let's suppose there still is significant optimization slack; then in a sense you've almost answered your own question... we can easily incorporate new algorithmic advances into new simboxes, or even upgrade agents mid-sim using magic potions or whatnot.
Perhaps I'm missing something here, but I don't understand how this is supposed to work. The whole point of the simbox is that there is no information leakage about our reality, and having AGI agents do ML research in a reality close enough to our own that its insights transfer defeats that entirely. On the other hand, if we instead give the sim some magical alternative to an ML-driven intelligence explosion, then we, the simulators, won't necessarily be able to reproduce the new techniques that get 'invented' in the sim.
Secondly, the algorithms of intelligence are much simpler than we expected, and brains already implement highly efficient or even near Pareto-optimal approximations of the ideal universal learning algorithms.
To the extent either of those major points is true, rapid FOOM is much less likely; to the extent both are true (as they appear to be), then very rapid FOOM is very unlikely.
I agree that FOOM is very unlikely on the view of the current scaling laws, which imply strongly sublinear returns on investment. The key unknown quantity at this point is the returns on 'cognitive self-improvement', as opposed to just scaling in parameters and data. We have never truly measured this, as we haven't yet developed appreciably self-modifying and self-improving ML systems. On the outside view, power-law diminishing returns are probably likely in this domain as well, but we just don't know.
Similarly, I agree that if contemporary ML is already in its asymptotically optimal scaling regime, i.e. if it is a fundamental constraint of the universe that intelligence can do no better than power-law scaling (albeit with potentially much better coefficients than now), then FOOM is essentially impossible and I think some form of humanity stands a pretty reasonable chance of survival. There is some evidence that ML is in the same power-law scaling regime as biological brains, as well as many algorithms from statistics, but I don't think the evidence conclusively rules out a radically better paradigm which perhaps neither we nor evolution have found, potentially because it requires some precise combination of a highly parallel brain and a fast serial CPU-like processor that evolution couldn't build from biological components. Personally (and it would be great if you could convince me otherwise), I think there are a lot of unknown unknowns in this space, and the evidence from current ML and neuroscience isn't that strong against there being unknown, better alternatives that could lead to FOOM. Ideally, we would understand the origins of scaling laws well enough to derive computational-complexity bounds on the general capabilities of learning agents.
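As a toy illustration of what 'strongly sublinear returns' means here (the exponent below is just a representative value in the rough ballpark of published language-model compute scaling laws, not a precise figure):

```python
# Toy power-law scaling curve: loss ~ a * C**(-b). With a small exponent b,
# each doubling of compute buys only a small, roughly constant relative
# improvement, i.e. returns on investment are strongly sublinear.

a, b = 1.0, 0.05   # illustrative constants; b is in the rough range reported for LM scaling

def loss(compute: float) -> float:
    return a * compute ** (-b)

for doublings in range(0, 41, 10):
    c = 2.0 ** doublings
    print(f"{doublings:2d} doublings of compute -> loss {loss(c):.3f}")
# Under these constants, 10 doublings (~1000x compute) only improves loss by ~30%.
```

The open question is whether the analogous exponent for cognitive self-improvement is similarly small, or whether that curve even exists.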
But even without rapid FOOM, we can still have disaster: for example, consider the scenario of world domination by a clan of early uploads of some selfish/evil dictator or trillionaire. There's still great value in solving alignment here, and (to my eyes at least) much less work focused on that area.
Yes, of course: solving alignment in this regime is extremely valuable. With any luck, reality will be such that we end up in this regime, and I think alignment is actually solvable here, whereas I'm very pessimistic about a full FOOM scenario. Indeed, I think we should spend a lot of effort figuring out whether FOOM is even possible and, if it is, how to stop the agents we build from FOOMing, since this scenario is where a large amount of p(doom) comes from.
Assume there was 1) large algorithmic slack, and 2) some other approach that was both viable and significantly different; then it would have to:
not use adequate testing of alignment (ie simboxes)
or not optimize for product of intelligence potential and measurable alignment/altruism
If there is enough algorithmic slack that FOOM is likely, then I think our capacity to simulate such an event in simboxes will be highly limited, and so we should focus much more on designing general safe objectives which, ideally, we can mathematically show scale over huge capability gaps, if such safe objectives exist at all. We should also put a lot of effort into figuring out how to constrain AGIs so that they don't want to, or can't, FOOM. I completely agree, though, that in general we should spend a lot of effort building simboxes and measurably testing for alignment before deploying anything.
I think there are two fundamental problems with the extensive simboxing approach. The first is just that, given the likely competitive dynamics around near-term AGI (i.e. within the decade), these simboxes are going to be extremely expensive in both compute and time, which means that anybody unilaterally simboxing will probably just be beaten by someone else releasing an unaligned AGI with less testing.
If we think about the practicality of these simboxes, they would seem to require (at minimum) simulating many hundreds or thousands of agents over relatively long real timescales. Moreover, due to the GPU constraints and Moore's-law arguments you bring up, we can only simulate each agent at close to 'real time'. So years in the simbox must correspond to years in our reality, which is far too slow for an imminent singularity. This is especially an issue given that we must maintain no transfer of information (such as datasets) from our reality into the sim, which means at minimum years of sim-time to bootstrap intelligent agents (taking human data-efficiency as a baseline). Also, each of these early AGIs will likely be incredibly expensive in compute, so that maintaining reasonable populations of them in simulation will be very expensive and probably infeasible initially. If we could get policy coordination ensuring that all actors likely to develop AGI go through a thorough simboxing testing regimen, that would be fantastic and would solve this problem.
Perhaps a more fundamental issue is that simboxing does not address the fundamental cause of p(doom): recursive self-improvement of intelligence and the resulting rapid capability gains. The simbox can probably simulate capability gains reasonably well (i.e. gaining 'magical powers' in a fantasy world), but I struggle to see how it could properly test gains in intelligence from self-improvement. Suppose the AI in the fantasy simbox brews a 'potion' that makes it 2x as smart. How do we simulate this? We could just increase the agent's compute in line with the scaling laws, but a) early AGIs are almost certainly near the frontier of our compute capability anyway, and b) much of recursive self-improvement is presumably down to algorithmic improvements, which we almost necessarily cannot simulate (since if we knew better algorithms we would have included them in our AGIs in the simulation in the first place!).
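To make the 'potion' problem concrete: the only lever we actually control is compute, and under power-law scaling that lever runs out almost immediately. A rough sketch, where both the exponent and the hardware headroom are assumptions of mine:

```python
# If "2x smarter" is cashed out as some fixed reduction in loss, a power law
# loss ~ C**(-b) implies an enormous required compute multiplier, which quickly
# exceeds whatever hardware headroom the simbox actually has.
# The exponent and headroom below are assumptions for illustration only.

b = 0.05                      # assumed scaling exponent
HARDWARE_HEADROOM = 10.0      # assumed: at most 10x the agent's current compute is available

def compute_multiplier_for_loss_ratio(loss_ratio: float) -> float:
    """Compute multiplier needed to scale loss by `loss_ratio` (e.g. 0.5 = halve it)."""
    return loss_ratio ** (-1.0 / b)

needed = compute_multiplier_for_loss_ratio(0.5)
print(f"need ~{needed:.1e}x compute, headroom is {HARDWARE_HEADROOM:.0f}x -> "
      f"{'feasible' if needed <= HARDWARE_HEADROOM else 'not simulable in hardware'}")
```

And none of this captures the second route, algorithmic improvement, which by construction we cannot grant the agent inside the sim.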
This matters so much because the probable breakdown of proxies for human values, under the massive distributional shift induced by recursive self-improvement, is the fundamental difficulty of alignment in the first place.
Perhaps this is unique to my model of AI risk, but almost all of the probability of doom channels through p(FOOM), such that p(doom | no FOOM) is quite low in comparison. This is because if we don't have FOOM, then no extremely large amount of optimization power is unleashed, the reward proxies for human values and flourishing don't end up radically off-distribution, and so they probably don't break down. There are definitely a lot of challenges left in this regime, but to me it looks solvable, and I agree with you that in worlds without rapid FOOM, success will almost certainly look like considerable iteration on alignment: a bunch of agents undergoing some kind of automated, simulated alignment testing across a wide range of scenarios, plus using the generalisation capabilities of machine learning to learn reward proxies that actually generalise reasonably well within the distribution of capabilities actually obtained. The main risk, in my view, comes from the FOOM scenario.
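Spelled out, the decomposition I have in mind is just the law of total probability,

$$p(\mathrm{doom}) = p(\mathrm{doom} \mid \mathrm{FOOM})\,p(\mathrm{FOOM}) + p(\mathrm{doom} \mid \neg\mathrm{FOOM})\,p(\neg\mathrm{FOOM}),$$

with my claim being that the second term is comparatively small, so most of p(doom) lives in the first.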
Finally, I just wanted to say that I’m a big fan of your work and some of your posts have caused major updates to my alignment worldview—keep up the fantastic work!