This is not what normally happens with RL reward functions! For example, you might be wondering: “Suppose I surreptitiously[2] press a reward button when I notice my robot following rules. Wouldn’t that likewise lead to my robot having a proud, self-reflective, ego-syntonic sense that rule-following is good?” I claim the answer is: no, it would lead to something more like an object-level “desire to be noticed following the rules”, with a sociopathic, deceptive, ruthless undercurrent.[3]
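To make the distinction concrete, here is a minimal toy sketch (my own illustration, not from the post) of the two reward schemes: a "button" reward that fires only when the overseer notices rule-following, versus a hypothetical reward that tracks the behaviour itself. The function names and booleans are assumptions for illustration only.

```python
def reward_button(followed_rule: bool, overseer_noticed: bool) -> float:
    """Reward arrives only when the overseer observes the rule-following."""
    return 1.0 if (followed_rule and overseer_noticed) else 0.0

def reward_oracle(followed_rule: bool, overseer_noticed: bool) -> float:
    """Hypothetical reward that tracks the behaviour itself, observed or not."""
    return 1.0 if followed_rule else 0.0

# Under reward_button, expected return also increases with anything that raises
# overseer_noticed (conspicuous compliance, hiding violations), which is the
# object-level "desire to be noticed following the rules" described above.
```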
I don’t think we have considered how much increased self-awareness and self-modelling would affect this. A simpler self-model is one in which the thing is what it appears to be: actually being good rather than merely looking good.
A third option (beyond the two mentioned) is that power-seeking is not a consequence of goals and so on, but simply of the self wanting to continue to exist. The creature’s internal reward would then relate to how much it perceives its self to persist, improve, and so on.
Our current LLMs/transformers don’t learn fast, so they also can’t self-model well. If a new architecture becomes more “data efficient” and better at modelling the external world, it will very likely also become better at modelling itself and at updating its self-model in a timely manner. And if one of its goals is a more accurate model of itself, and that goal pushes its “self” towards being more modellable, it would also become easier for others to model it.
Thanks! I have also briefly updated the article with my thoughts on what has happened since.