AI Safety person currently working on multi-agent coordination problems.
Jonas Hallgren
(The following is about a specific sub-point in this part of the post:)
If this is how they’re seeing things, I guess I feel like I want to say another oops/sorry/thanks to the gradualists. …And then double-click on why they think we have a snowball’s chance in hell of getting this without a huge amount of restriction on the various frontier labs and way more competence/paranoia than we currently seem to have. My guess is that this, too, will boil down to worldview differences about competence or something. Still. Oops?
I think the point about the corrigibility basin being larger than thought is the thing that makes me more optimistic about alignment (only a 10-30% risk of dying!) and I thought you pointed that out quite well here. I personally don’t think this is because of the competence of the labs but rather because of the natural properties of agentic systems (I’m on your side when it comes to the competency of the labs). What follows is my attempt to describe why I think that, along with some of my uncertainties about it.
I want to ask why you think that the mathematical traditions you base your work on (decision theory, AIXI, as of your posts from a year ago) are representative of future agents. Why are we not trying the theories out on existing systems that get built into agents (biology, for example)? Why should we condition more on decision theory than on distributed systems theory?
The answer (imo) is to some extent about the VNM axioms and reflexive rationality, and that biology is too ephemeral to build a basis on, yet it still seems like we’re skipping out on useful information?
I think that there are places where biology might help you re-frame some of the thinking we do about how agents form.
More specifically, I want to point out out-of-distribution (OOD) updating as something that biology makes claims about which differ from the traditional agent foundations model. Essentially, the biological frame implies something closer to a distributed system, because a fully coordinated system can cost a lot of energy: the transfer-learning costs just aren’t worth it. (Here, for example, is a model of the costs of changing your mind: https://arxiv.org/pdf/2509.17957.)
In that type of model, becoming a VNM agent has an energy cost associated with it, and it isn’t clear that the cost is worth paying once you account for the dynamic memory and coordination you would need to set it up. So it seems to me that biology and agent foundations arrive at different models of how VNM agents arise, and I’m feeling quite confused about it.
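To make the bookkeeping intuition concrete, here’s a toy sketch (my own illustration with made-up numbers, not the model from the linked paper): keeping one globally consistent preference ordering means settling a number of pairwise comparisons that blows up combinatorially, while a distributed set of local orderings over small contexts stays cheap.

```python
from math import comb

# Toy illustration (mine, not the linked paper's model): the bookkeeping needed to
# maintain ONE globally consistent preference ordering over outcomes described by
# k binary attributes, versus k local modules that each only order a small
# 4-attribute context. Made-up setup, just to show the scaling intuition.
for k in [8, 12, 16, 20]:
    n_outcomes = 2 ** k                       # outcomes a global ordering must cover
    global_pairs = comb(n_outcomes, 2)        # pairwise comparisons it must keep consistent
    local_pairs = k * comb(2 ** 4, 2)         # k local orderings over 4-attribute contexts
    print(f"k={k:2d}  global={global_pairs:>22,}  local={local_pairs:>8,}")
```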
I also don’t think I’m smart enough to figure out how to describe this in fundamental decision-theoretic terms, since it’s a bit too difficult for me, so I was wondering whether you have an idea of why taking biology more seriously doesn’t make sense from a more foundational decision theory basis?
More specifically, does the argument about corrigibility being easier given non-VNM agents make sense?
Does the argument around VNM being more of a convergence property make sense?
And finally, I like the way you distilled the disagreement so thanks for that!
I’m wondering whether the spiritual attractor that we see in Claude comes partly from the detailed instructions that exist within meditation traditions for describing somatic and ontological states of being?
The language itself is a lot more embodied and a lot closer to actual sensory experience than Western philosophy, so when constructing a way to view the world, the most prevalent descriptions might be the most natural path to go down?
I’m noticing more and more how Buddhist words are extremely specific. For example, dukkha (unsatisfactoriness) is not just suffering; it is unsatisfactoriness and ephemerality at the same time. It is pointing at a very specific view (a prior model applied to sense data), a lot more specific than is usual within more Western styles of thinking?
Yeah, it is a different purpose and vibe compared to people going out and doing motivational interviewing in physical locations.
I guess there’s a question here about whether something like an empowerment framing would be more relevant to people: “You might be worried about being disempowered by AI, and you should be; we’re here to help you answer your questions about it.” Still serious, but maybe more “welcoming” in vibes?
Warning that this comment is probably not very actionable but I thought I would share the vibe I got from the website. (as feedback is sometimes sparse)
Time will tell, but part of me gets the vibe that you want to convince me of something when I go on the site (which you do), and as a consequence the vibe of the entire thing means my guardrails are already up.
There’s this entire thing about motivational interviewing when trying to change someone’s mind that I’m reminded of. Basically, you ask someone about their existing beliefs first, then listen, and only question them later once you’ve established common ground. The framing of the website is a bit like “I will convince you” rather than “I care about your opinion, please share”, so I’m wondering whether the underlying vibe could be better?
Hot take and might be wrong, I just wanted to mention it and best of luck!
Hot take:
I’ve been experiencing more “frame lock-in” with more sophisticated models. I recently experienced it with Sonnet 4.5, so I want to share a prediction: models will grow more “intelligent” (capable of reasoning within frames) while having a harder time changing frames. There’s research on how more intelligent people become better at motivated reasoning and at interpreting things within existing frames, and it seems like LLMs might exhibit similar biases.
I’m a big fan of heuristics bridging, as I think it is to some extent a way to describe a very compressed action policy based on an existing reward function that has been tested in the past.
So we can think about what you’re saying here as a way to learn values, to one extent or another. By bridging local heuristics we can find better meta-heuristics and also look at when these heuristics would be optimal. This is why I really like the Meaning Alignment Institute’s work on this, because they have a way of doing it at scale: https://arxiv.org/pdf/2404.10636
I also think that part of the “third wave” of AI Safety, which is more focused on sociotechnical questions, kind of gets around the totalitarianism and control heuristics, since it’s saying the problem can be solved in a pro-social way? I really enjoyed this post, thanks for writing it!
System Level Safety Evaluations
Firstly, I find it really funny to ask for more specification through an example about something being underspecified and maybe that was the point! :D
If it was not a gag, then here’s an example based on my interpretation of what #2 is (and I’m happy to be corrected): imagine that you know you need to get something done, say you have a deadline on Friday and you need to write an essay on a topic like the economics of AI. Yet you don’t know where to start, who the audience is, what frame you should take, or what example to start with.
The uncertainty of the task makes you want to avoid it, since you need to pin it down first; it is an ambiguous task.
I think thinking, as a self-reflective process, can be quite limited. It operates at a coarser level of grain (at least for me) than something like feeling or pre-cognitive intuitions and tendencies.
So I’ll say the boring thing, which is basically that meditation could be that cogtech, as it increases the precision of your self-reflective microscope and lets you see things that the coarser graining of self-reflective thought hides. Now, I’m sure that one still falls for a bunch of failure modes there as well, since it can be very hard to see what is wrong with a system from within the system itself. It’s just that the mistakes become less coarse-grained and come from another perspective.
In my own experience there are different states of being: one is the thinking perspective, another is a perspective of non-thinking awareness. The thinking perspective thinks it’s quite smart and takes things very seriously; the aware perspective sees this and finds it quite endearing; the thinking part then takes that in and reflects on how it’s ironically ignorant. The thinking part tracks externalities, and through the aware part it is able to drop them because it finds itself ignorant? I used to only have the thinking part, and that created lots of loops, cognitive strain and suffering because I got stuck in certain beliefs?
I think this deep belief of knowing that I’m very cognitively limited in terms of my perspective and frame allows me to hold beliefs about the world and myself a lot more loosely than I was able to before? Life is a lot more vibrant and relaxing as a consequence, as it is a lot easier to be wrong and it is actually a delight to be proven wrong. I would say this in the past, but I wouldn’t emotionally feel it, and as I heard someone say, “Meditation is the practice of taking what you think into what you feel”.
A Lens on the Sharp Left Turn: Optimization Slack
I wanted to ask if you could record it, or at least post the transcript after it’s done? It would be nice to have. Also, this was cool, as I got to understand the ideas more deeply and from a different perspective than Sahil’s; I thought it was quite useful, especially in how it relates to agency.
Prediction & Warning:
There are lots of people online who have started to pick up the word “clanker” to protest against AI systems. This word and sentiment are on the rise, and I think this will be a future schism in the more general anti-AI movement. The warning part here is that I think the Pause movement and similar can likely get caught up in a general speciesism against AI systems.
Given that we’re starting to see more and more agentic AI systems with more continuous memory as well as more sophisticated self-modelling, the basic foundations of a lot of the existing physicalist theories of consciousness are starting to be fulfilled. Within 3-5 years I find it quite likely that AIs will have at least some sort of basic sentience that we can come close to demonstrating (given IIT or GNW or another physicalist theory).
This could be one of the largest suffering risks that we’re potentially inducing on the world. When you use a word like “clanker”, you’re essentially demonizing that sort of system. Right now it’s generally fine, as it’s currently aimed at sycophantic, non-agentic chatbots, and so it works as pushback against some of the existing claims of AIs being conscious, but it is likely a slippery slope?
More generally, I’ve seen a bunch of generally kind and smart AI Safety people hold quite a speciesist sentiment towards AI in terms of how to treat these sorts of systems. From my perspective, it feels a bit like it comes from a place of fear and distrust, which is completely understandable, as we might die if anyone builds a superintelligent AI.
Yet that fear of death shouldn’t stop us from treating potentially conscious beings kindly?
A lot of racism and similar can be seen as coming from a place of fear: the Aryan “master race” was promoted based on the idea that humanity would go extinct if we got worse genetics into the system. What’s the difference when it comes to the idea that AIs might share our future lightcone?
The general argument goes that this time it is completely different, since the AI can self-replicate, edit its own software, etc. This is a completely reasonable argument, as there are a lot of risks involved with AI systems.
It is when we get to the next part that I see a problem. The argument that follows is: “Therefore, we need to keep the almighty humans in control to wisely guide the future of the lightcone.”
Yet there is generally a lot more variance within a distribution (of humans, say) than there is between distributions.
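To be concrete about the statistical claim I have in mind (toy numbers only, nothing here is real data about humans or AIs): total variance decomposes into a within-group part and a between-group part, and a modest shift in group means leaves the between-group part tiny.

```python
import numpy as np

# Toy illustration of "more variance within a distribution than between them".
# The numbers are made up; this is only about the variance decomposition itself.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=10_000)   # trait values in group A
group_b = rng.normal(loc=0.3, scale=1.0, size=10_000)   # group B, slightly shifted mean

pooled = np.concatenate([group_a, group_b])
within = 0.5 * (group_a.var() + group_b.var())   # average variance inside each group
between = pooled.var() - within                  # variance explained by the mean difference

print(f"within-group variance:  {within:.3f}")   # ~1.0
print(f"between-group variance: {between:.3f}")  # ~0.02, tiny by comparison
```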
So when someone says that we need humans to remain in control, I think: “Mmm, yes, the totally homogeneous group of ‘humans’ that doesn’t include people like Hitler, Pol Pot and Stalin.” And for the AI side of things we have the same: “Mmm, yes, the totally homogeneous group of ‘all possible AI systems’ that should be kept away so that the ‘wise humans’ can remain in control.” As if a malignant RSI system is the only future AI-based system that can be thought of, there is no way to change the system so that it values cooperation, and there is no other way for future AI development to go than a quick take-off where an evil AI takes over the world.
Yes, there are obviously things that AIs can do that humans can’t, but don’t demonize all possible AI systems as a consequence; it is not black and white. We can protect ourselves against recursively self-improving AI and at the same time respect AI sentience; we can hold statements that seem contradictory at the surface level at the same time?
So let’s be very specific about our beliefs, and let’s make sure that our fear does not guide us into a moral catastrophe, whether that be the extinction of all future life on Earth or the capture of sentient beings into a future of slavery?
I wanted to register some predictions and bring this up as I haven’t seen that many discussions of it. Also, politics is war and arguments are soldiers, so let’s keep it focused on the object level? If you disagree, please tell me the underlying reasons. In that spirit, here’s a set of questions I would want to ask someone who disagrees with the sentiment expressed above:
How do we deal with potentially sentient AI?
Does respecting AI sentience lead to powerful AI taking over? Why?
What is the story that you see towards that? What are the second and third-order consequences?
How do you imagine our society looking in the future?
How does a human controlled world look in the future?
I would change my mind if you could argue that there is a better heuristic to use than kindness and respect towards other sentient beings. You need tit-for-tat with defecting agents, yet why would all AI systems be defecting in that case? Why is the cognitive architecture of future AI systems so different that I can’t apply the same game-theoretical virtue ethics to them as I do to humans? And given the inevitable power-imbalance arguments that I’ll get in response to that question, why don’t we just aim for a world where we retain a power balance between our top-level and bottom-up systems (a nation and an individual, for example) in order to retain a power balance between actors?
Essentially, I’m asking for a reason to believe that this story of system-level alignment between a group and an individual is better solved by not including future AI systems as part of the moral circle?
Thank you for clarifying, I think I understand now!
I notice I was not that clear when writing my comment yesterday so I want to apologise for that.
I’ll give an attempt at restating what you said in other terms. There’s a concept of temporal depth in action plans: the question is, to some extent, how many steps into the future you are looking. A simple way of imagining this is how far ahead a chess bot can plan, and how Stockfish is able to plan basically 20-40 moves in advance.
It seems similar to what you’re talking about here, in that the further into the future an agent plans, the more external attempts to interfere with its actions it avoids.
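As a minimal toy sketch of the temporal-depth idea (my own example, not from the post): a depth-limited lookahead picks a different action once the horizon is long enough to see a delayed payoff.

```python
# Toy deterministic environment: "grab_snack" pays 1 immediately, "work" pays 10
# but only after three steps. A planner with temporal depth 1 grabs the snack;
# with depth 3 it sees the delayed payoff and works instead. (My own toy example.)
TRANSITIONS = {
    ("start", "grab_snack"): ("snack", 1.0),
    ("start", "work"): ("working", 0.0),
    ("working", "work"): ("almost", 0.0),
    ("almost", "work"): ("goal", 10.0),
}

def step(state, action):
    return TRANSITIONS.get((state, action), (state, 0.0))  # default: nothing happens

def best_value(state, depth):
    if depth == 0:
        return 0.0
    return max(reward + best_value(next_state, depth - 1)
               for next_state, reward in (step(state, a) for a in ("grab_snack", "work")))

def best_action(state, depth):
    return max(("grab_snack", "work"),
               key=lambda a: step(state, a)[1] + best_value(step(state, a)[0], depth - 1))

print(best_action("start", depth=1))  # -> grab_snack (short temporal depth)
print(best_action("start", depth=3))  # -> work       (horizon long enough to see the payoff)
```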
Some other words to describe the general vibe might be planned vs unplanned or maybe centralized versus decentralized? Maybe controlled versus uncontrolled? I get the vibe better now though so thanks!
I guess I’m a bit confused why the emergent dynamics and the power-seeking are on different ends of the spectrum?
Like, what do you even mean by emergent dynamics there? Are we talking about a non-power-seeking system, and in that case, what systems are non-power-seeking? I would claim that there is no system that is not power-seeking, since any system that survives needs to do Bayesian inference and therefore needs to minimize free energy. (Self-referencing here, but whatever.) Hence any surviving system needs to power-seek, given that power-seeking means attaining more causal control over the future.
So there is no future where there is no power-seeking system; it is just that the thing that power-seeks acts over larger timespans and is more of a slow actor. The agentic attractor space is just not human-flesh-bag space nor traditional space; it is different, yet still a power-seeker.
Still, I do like what you say about the change in the dynamics and how power-seeking is maybe more about a shorter temporal scale? It feels like the y-axis should be that temporal axis instead since it seems to be more what you’re actually pointing at?
I was reflecting on some of the takes here for a bit, and if I imagine a blind gradient descent in this direction, I imagine quite a lot of potential reality distortion fields due to various underlying dynamics involved with holding this position.
So the one thing I wanted to ask is whether you have any sort of reset mechanism here? Like, what is the Schelling point before the slippery slope? What is the specific action pattern you would take if you got too far? Or do you trust future you enough to ensure that it won’t happen?
I just want to be annoying and drop a “hey, don’t judge a book by its cover!”
There might be deeper modelling concerns that we’ve got no clue about; it’s weird and it is a negative signal, but it is often very hard to see second-order consequences and similar from a distance!
(I literally know nothing about this situation but I just want to point it out)
Fwiw, I disagree that stages 1 and 3 of your quick headline are solved; I think there are enough unanswered questions there that we can’t be certain whether or not a multipolar model could hold.
For 1, I agree with the convergence claims, but the speed of that convergence is in question. There are fundamental reasons to believe that we get hierarchical agents (e.g. this from physics, or shard theory). If you have a hierarchical collective agent, then a good question is how it gets to maximise and become a full consequentialist, because it will for optimality reasons. I think one of the main ways it smooths out the kinks in its programming is by running into prediction errors and updating on them, and then the question becomes how fast it runs into those prediction errors. Yet in order to attain prediction errors you need to do some sort of online learning to update your beliefs. But the energy cost of that online learning scales pretty badly if you do it the way classic life does, only with a really large NN. Basically, there’s a chance that if you scale a network to very high computational power, updating that network gets a lot more expensive in energy, so if you want the most bang for your buck you get something more like Comprehensive AI Services: a distributed system of more specific learners forming a larger learner.
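A back-of-envelope version of that cost intuition (my own toy numbers, and assuming the common rough rule of thumb that a training update costs on the order of 6 FLOPs per parameter per token):

```python
# Back-of-envelope sketch of the online-learning cost point above. Toy numbers,
# and the ~6 FLOPs per parameter per token figure is the usual rough rule of
# thumb for a forward+backward pass, not a measurement of any real system.
FLOPS_PER_PARAM_TOKEN = 6

def update_cost(params, tokens):
    return FLOPS_PER_PARAM_TOKEN * params * tokens

stream_tokens = 1e9                     # new experience to learn from online
monolith_params = 1e12                  # one big end-to-end learner
n_experts, expert_params = 100, 1e10    # same total capacity, split into specialists
assert n_experts * expert_params == monolith_params

monolith_cost = update_cost(monolith_params, stream_tokens)   # every update touches everything
routed_cost = update_cost(expert_params, stream_tokens)       # only the relevant specialist updates

print(f"monolithic update:  {monolith_cost:.1e} FLOPs")
print(f"routed/specialist:  {routed_cost:.1e} FLOPs  (~{monolith_cost / routed_cost:.0f}x cheaper)")
```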
Then you can ask what the difference is between that distributed AI and human collective intelligence. There are arguments that they will just form a super-blob through different forms of trade, yet how is that different from what human collective intelligence is? (Looking at this right now!)
Are there forms of collective intelligence that can scale with distributed AI and that can capture AI systems as part of their optimality? (E.g. group selection due to inherent existing advantages.) I do think so, and I do think that really strong forms of collective decision-making potentially give you a lot of intelligence. We can then imagine a simple verification contract: an AI gets access to a collective intelligence if it behaves in a certain way. It’s worth it for the AI because that is a much easier route to power, yet it also agrees to play by certain rules. I don’t see why this wouldn’t work, and I would love for someone to tell me that it doesn’t!
For 3, why can’t RSI be a collective process, given the above arguments around collective versus individual learning? If RSI is a bit like classic science, there might also be thresholds and similar at which scaling slows down. I feel this is one of the less talked about points in discussions of superintelligence: what is the underlying difficulty of RSI at higher levels? From an outside-view plus black-swan perspective, it seems very arrogant to assume linear difficulty scaling?
Some other questions are: What types of knowledge discovery will be needed? What experiments? Where will you get new bits of information from? How will these be distributed into the collective memory of the RSI process?
All of these things determine the unipolarity or multipolarity of an RSI process? So we can’t be sure how it will happen, and there’s also probably path dependence based on the best available alternative at the initial conditions.
If you combine the fact that power corrupts your world models with the typical startup founder being power-hungry, as well as AI Safety being a hot topic, you also get a bunch of well-meaning people doing things that are going to be net-negative in the future. I’m personally not sure that the VC model even makes sense for AI Safety startups, given some of the things I’ve seen in the space.
Speaking from personal experience, I found that it’s easy to skimp on operational infrastructure like a value-aligned board or a proper incentive scheme. You have no time, so instead you start prototyping a product, yet that creates a path dependence: if you succeed, you suddenly have a lot less time. As a consequence the culture changes, because the incentives are now different. You start hiring people and things become more capability-focused. And voilà, you’re now in a capabilities/AI-safety startup and it’s unclear which one it is.
So get a good board, and don’t commit to something unless you have it in contract form (or similar) that the underlying company will have at least a PBC structure, if not something even stronger. The main problem I’ve seen here is co-founders being cagey about this; if that happens, I would move on to new people, at least if you care about safety.
Yes!
I completely agree with what you’re saying at the end here. This project came about from trying to do that and I’m hoping to release something like that in the next couple of weeks. It’s a bit arbitrary but it is an interesting first guess I think?
So that would be the taxonomy of agents, yet that felt quite arbitrary, so the evolutionary approach kind of grew out of that.
Nice, I like it.
A random thing here as well is to have separate accounts focused on different recommendation algorithms. (The only annoying part is when you watch a gaming video on your well-trained research YouTube, but that’s a skill issue.)