Beliefs and state of mind into 2025

This post is to record the state of my thinking at the start of 2025. I plan to update these reflections in 6-12 months depending on how much changes in the field of AI.

1. Summary

It is best not to pause AI progress until at least one major AI lab achieves a system capable of providing approximately a 10x productivity boost for AI research, including performing almost all tasks of an AI researcher. Extending the time we remain in such a state is critical for ensuring positive outcomes.

If it were possible to stop AI progress sometime before that point and focus just on mind uploading, that would be preferable; however, I don't think that is feasible in the current world. Alignment work before such a state suffers from diminishing returns on human intelligence and from the lack of critical data on how things will actually play out.

2. When and how to pause?

The simple position is that superintelligent AI will be dangerous, and therefore we should stop building it, or at least pause until we figure out more. However, I am genuinely unsure how long to pause, and when. I think the most important thing is having mildly superintelligent AI to help solve alignment, and staying at that stage for as long as practical.

Just because something is dangerous, making it slower doesn't make it safer. For example, making childbirth take one week would obviously be far worse. The details determine the best course. The major argument against an immediate pause on AI is that it would likely apply to software rather than hardware, increasing the hardware overhang without a counteracting increase in safety to make it worthwhile.

2.1 Diminishing returns

2.1.1 Background on when progress gives diminishing returns

For a lot of technology, progress is linear to exponential. We are used to steady progress, and to the expectation of steady progress, and we make plans accordingly. Progress often comes from technology and processes building on themselves; a common example is Moore's law, where the existing tools built with current chips are essential to building the next, better chips. Sometimes, however, growth is slower than linear, with diminishing returns, even with the best planning and resources.

The clearest example of this is probably pure mathematics. Unlike in technology, where a new machine benefits everyone in the field, a genius proving a hard theorem does not automatically help schoolchildren learn their times tables or beginner algebra. Instead it makes the gap from a novice to the leading edge of humanity's knowledge greater than it was before. This means it takes longer for a novice to reach the boundary than before, and it excludes ever more people from contributing at all, as they simply cannot reach that level even with unlimited time. In the limit, with fixed human intelligence and population, instead of steady progress we get diminishing returns and eventually almost completely stalled progress: so much accumulated knowledge would be required to reach the boundary of human knowledge that practically no-one could even reach it, let alone extend it. Furthermore, even though some knowledge will be stored in text, it is possible that no-one alive would actually understand it if the field went out of fashion. I believe we are already seeing something like this in mathematics, and to me it is clearly happening in fundamental physics.
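
This dynamic can be made concrete with a minimal toy model (my own, with purely illustrative constants): each generation must first absorb the accumulated body of knowledge at a fixed rate before it can add to it, so as the frontier grows, each generation contributes less than the one before.

```python
# Toy model of diminishing returns with a fixed human learning rate.
# All constants are made up purely for illustration.

CAREER_YEARS = 40        # productive years available to one researcher
LEARNING_RATE = 1.0      # units of existing knowledge absorbed per year
CONTRIBUTION_RATE = 0.5  # new units produced per year spent at the frontier


def career_contribution(frontier: float) -> float:
    """Knowledge one researcher adds, given the current size of the frontier."""
    years_to_catch_up = frontier / LEARNING_RATE
    years_at_frontier = max(0.0, CAREER_YEARS - years_to_catch_up)
    return years_at_frontier * CONTRIBUTION_RATE


frontier = 0.0
for generation in range(1, 9):
    added = career_contribution(frontier)
    frontier += added
    print(f"generation {generation}: added {added:5.2f}, frontier now {frontier:6.2f}")
# With these constants each generation adds half as much as the last; progress
# stalls as catching up consumes the whole career, even though individual
# ability and effort stay constant.
```

Raising LEARNING_RATE (smarter humans) or CAREER_YEARS (longer lives) lifts the ceiling in this toy picture, which is one way to frame the later discussion of IQ increases, lifespan extension, and WBE.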

Physics is a bit different because experimental data can guide theories; however, there needs to be enough data to make a difference. Say we have an important experiment with a true/false result. The actual result, when known, will not double the rate of progress, since both options will already have been considered beforehand, and both lines of theory are likely already in diminishing-returns territory. For example, the LHC discovered the Higgs (totally expected, so it didn't change much) and found an absence of low-energy SUSY, which was less expected, but that is not enough data to enable major progress. You could argue there have been 40+ years of little progress; I would argue there will likely be even less in the next 40 years with fixed human intelligence. In other words, quantum gravity and the like will not be solved by unmodified humans, but are almost certain to be solved by some kind of ASI.

2.1.2 Diminishing returns and alignment

I believe there is clear evidence of this happening with alignment research and progress. The period 5-10 years before GPT-3.5/4 gave us more progress than the 0-5 years immediately before. Major groups like MIRI essentially seem to have given up.

If alignment research is similar to other fields, then an unlimited period of time before GPT-4, that is, without actual data, would not have led to major further progress. It would in fact quite likely have entrenched existing ideas, some of which are likely wrong. From an outside/meta level, for a new field without the needed experimental results, you would not expect all of the theories to be correct.

Therefore a simple pause on AI capabilities to allow more time for alignment wouldn’t have helped.

2.2 Pause at each step?

One approach is to pause at each major advance and allow alignment research to advance into diminishing-returns territory, with the hope that there will be enough progress to align a superintelligence at that stage, and to continue with capabilities if not. There are several problems I see with this.

2.2.1 Slowing down progress increases the overhang

In what I call the overhang I include not just increases in computing hardware but also the integration of robots into society, for example humanoid robots at all stages of the supply chain. Even with constant computing hardware, increasing robot integration increases takeover risk.

2.2.2 Society may be unstable at any level of pre-AGI technology from now until alignment is solved

There are known sustainability issues; however, the unappreciated ones may be greater.

Centralization could be irreversible

Regimes like North Korea will be even worse, and more technically feasible, with AI. Imagine NK but with everyone carrying a constantly listening smartphone matched to an LLM. Any kind of opposition to authority would simply be impossible to coordinate. Once a democracy failed there would be no going back, and with time the number of NK-like states would accumulate.

Society could adapt badly to pre-AGI

For example, AI partners, disinformation, polarization, etc. could lead to the fragmentation of society. If anything, we seem to be having more issues as a society over time. If this is true, then our current society would be better than a future one at deciding what the post-Singularity world should look like. The more fragmented and polarized society becomes, the less clear it is what our CEV is and how to achieve it. We do not appear to be getting smarter, wiser, less warlike, or better adjusted, so we should make important decisions now rather than later.

2.3 Pause at the point where AI increases AI researcher productivity by about 10x, and aim to maximize time there

If alignment can’t be solved pre-AGI, and we can’t wait for WBE, then what is the optimal course? To me it is maximizing the time during which AI is on the verge of being very dangerous. That is the period in which we can learn the most, because we get useful results and the AI helps with the alignment work itself.

2.3.1 Scheming?

Even if the AI is scheming, unless it is actively trying to take over it would be hard for it to succeed in its plans. For example, you can have the AI design other AIs with different architectures better suited to interpretability and optimize them to capabilities similar to the original AI's, then use those AIs for further work.

2.3.2 If we get safely to this point, does prior alignment work matter at all?

At the 10x stage, what you need are researchers who understand the issues but are prepared to update rapidly on new results and to use new tools. Existing models of what is dangerous will likely not be fully correct.

2.3.3 Avoiding racing is important: resist the urge to improve the AI

If the aim is to maximize the time during which at least one AI lab is in this situation, then a race is the worst situation to be in. The lab or group of labs should have, and believe they have, a lead of at least 1 year, preferably 2-3. Then they can resist the temptation to simply have the AI continue optimizing itself to stay ahead. A major research lab or group that achieved such an AI would be compute constrained: more researchers would not help, as they would not have access to such an AI. Only a researcher with enough AI and compute to be 10x would be very useful.

2.3.4 Time required

Because of the speed-up enabled by AI, you will reach diminishing returns much faster. Just 1 year could well be enough, and 10 years would be more than enough, to figure out how to align a super AI, or at least to have the confidence to move to the next step in the unlikely event that it is required. (I expect a 10x AI would know how to create an aligned superintelligence, or at least one aligned as well as the available computation allows.)
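
To make that intuition concrete, here is a back-of-the-envelope sketch with purely illustrative numbers of my own (the headcount and multipliers are assumptions, not estimates from this post): it simply multiplies out how many effective researcher-years a lab accumulates over a lead of one to three calendar years.

```python
# Back-of-the-envelope: effective researcher-years of alignment work during a lead.
# RESEARCHERS is a hypothetical headcount with enough compute to use the AI fully.

RESEARCHERS = 200

for multiplier in (1, 3, 10):
    for lead_years in (1, 2, 3):
        effective = RESEARCHERS * multiplier * lead_years
        print(f"{multiplier:2d}x AI, {lead_years}-year lead: "
              f"~{effective:,} effective researcher-years")
# At 10x, even a 1-year lead yields roughly a decade's worth of the team's
# normal output, which is why a relatively short stay at this stage could
# reach diminishing returns on alignment research.
```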

2.4 What is the ideal fantasy way to get to superintelligence?

If we accept that superintelligence is inevitable at some stage, what is the best or most natural path to get there if we were not constrained by reality? It is clearly not making an AI that we don't fully understand. One way would be if everyone's IQ increased by 1 point per year from a fixed date. This would share the gains evenly among everyone alive. (Children born 20 years later would start 20 points higher.) However, that would cause large disruption as people became dissatisfied with their careers.

Another way would be if each generation were 20 IQ points smarter than the last. That may not be very disruptive, as parents routinely cope with smarter children. Finally, you could extend the human lifespan to, say, 500 years and view the first 50 as a kind of childhood, with IQ steadily increasing after that. Some sci-fi has people becoming uploads later in life.

In terms of what is possible for us, whole brain emulation (WBE), or mind uploading, seems the most physically achievable. It seems both desirable and likely as part of a post-Singularity society. To me it would be preferable to go straight to WBE without superintelligence first; however, that is less likely to be possible for us in the current technological and geopolitical environment.

The plan for TAI should be: first align a mildly superintelligent system, then optimize it to physical limits, ensure some geopolitical stability, install defenses against anticipated attacks, and then pursue WBE as soon as possible.

2.5 Past outcomes

In Superintelligence, I think Bostrom says somewhere that if he knew the formula for intelligence, he wouldn't disclose it because of alignment dangers. However, I definitely would have if I had lived in 2005 and known such a formula. I think at that stage we would have been constrained by computational power and there would have been no dangerous overhang. In a compute-constrained world we would have had more time to detect and adapt to misalignment, scheming, etc. By the formula for intelligence I specifically mean the neural code, or a system as efficient and adaptable as biology.

2.6 Prediction

I expect an AGI that can 10x AI researcher output by 2028-2032. I believe the current architecture can't scale to that level (75% confidence), but it may help discover new architectures by suggesting experiments and by studying biology and the neural code. I believe there is a good chance a much better architecture will be discovered by more directly studying biology.