AI Safety person currently working on multi-agent coordination problems.
Jonas Hallgren
I see it a bit like Kant’s categorical imperative. It is supposed to point out a way of seeing the world where you’re randomly put into the world.
It’s an intuition pump to get at compassion and risk aversion as core parts of your values and how that affects society. (I think it leads to a better safety net and better outcomes in general if you have at least a certain degree of equity, but that’s beside the point.)
Can you claim that this could actually be the case? Probably not; there’s the moral luck argument, among others, which is basically “sucks to suck I guess. I got a good hand”.
It somewhat also seems to me that the size of the main drivers of progress in the future will have an impact on the concentration of power.
E.g., if I remember correctly, we’ve seen the rise of small businesses in the last couple of years for various reasons, and if that continues (which I don’t know if it will or not) we might just have a lot of decentralization with more automation.
A shop for each local place with local gradients based on the local data you get.
Economies of scale and all that though so one might expect the basic outcome to be more concentration of power. Yet, at the same time with economies of scale you often get less exploration and in chaotic environments exploration can be quite good since if you’re small you can move more quickly.
I.e., I think the size of businesses matters, and I think it is uncertain what that will look like in the future.
Firstly, I agree that planning can be good and that we would not be where we are were it not for central planning and efficiency. Everything being metis is a shit idea for the reasons you point out.
I do think we can be a bit more granular, and that some of the things you’re talking about here are the passing on of information and the building of more concrete theories. I don’t think you can equate the prince with a sort of pro high modernism perspective? Or like it feels like you’re arguing for standardization and saying that standardization is the same type of thing as high modernism?
The point in Seeing Like a State seems more like “top-down coarse graining without a good read of the underlying dynamics is bad”?
And yes, a Prussian-style forest plantation looks terrible, but it can hardly be disputed that it surpasses the natural forest in predictability.
Yeah, and this is not necessarily a point about predictability but more a problem of how the forests die over time if you set them up in this way? To be a bit more technical, it is a bit like setting a linear filter on an unknown system and hoping that it is not a complex system, and if it is, praying it doesn’t do too much damage.
It’s just a question about the chaos of the environment: if it is high, coarse-grained systems without local systems are just likely to get shit on by systems with more feedback loops.
I think there are likely some interesting angles here that we could find on what the dimensions are pointing at because I agree with you that metis is insufficient for this.
Feel free to disregard but I’ve gotten some feedback that it could be nice to bring some metascience and philosophy of science into thinking about automated alignment science so I thought it would be worth giving a shot.
If I take the perspective of my internal combined model of Michael Nielsen and Thomas Kuhn then I would also say something about the framing questions that are implicit here?
Right now it seems your model is to some extent an iteration loop on top of a well-defined set of evaluation benchmarks, iterating on those in order to fill up the space of potential problems that might arise.
I wonder if you’re underestimating the difficulty implicit here by focusing on this model?
If we look at the engineering around some major crises in the past I think we can point at some interesting problems that show up.
Take the Gaussian copula in pricing structured credit before 2008. People doing quantitative analysis were looking at the underlying math in a good and straightforward way that fit the frames of the time well. The problem was that the correlations were estimated from a regime where house prices had only risen, which made every model locally correct and the aggregate position catastrophically wrong when the regime changed. Now you might say withhold test data and do k-fold verification, but it is likely that no amount of better aggregation would have caught it, because the space of evidence being aggregated was pre-defined by a framing nobody inside the loop was incentivised to examine.
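A toy sketch of why held-out data doesn’t help here. Everything in it is made up for illustration (the loss function, the sample ranges, the constant-mean “model”) and none of it is real 2008 data. The point is only that k-fold validation on data drawn from a single regime reports a tiny error, because every held-out fold comes from that same regime.

```python
def true_loss(price_change):
    """Hypothetical loss rate: flat while prices rise, steep once they fall."""
    if price_change >= 0:
        return 0.02
    return 0.02 + 0.5 * abs(price_change)


def fit_mean(ys):
    """The 'model': predict the average loss seen in training."""
    return sum(ys) / len(ys)


def k_fold_error(ys, k):
    """Mean absolute error of the mean-predictor under k-fold validation."""
    errors = []
    for fold in range(k):
        train = [y for i, y in enumerate(ys) if i % k != fold]
        test = [y for i, y in enumerate(ys) if i % k == fold]
        prediction = fit_mean(train)
        errors.extend(abs(prediction - y) for y in test)
    return sum(errors) / len(errors)


# Training regime: house prices only ever rise (0% to +9.5%).
rising = [i / 200 for i in range(20)]
rising_losses = [true_loss(x) for x in rising]

# Unseen regime: house prices fall (-1% to -20%).
falling = [-i / 100 for i in range(1, 21)]
falling_losses = [true_loss(x) for x in falling]

in_regime_error = k_fold_error(rising_losses, k=5)
model = fit_mean(rising_losses)
out_of_regime_error = sum(abs(model - y) for y in falling_losses) / len(falling_losses)

print(f"k-fold error inside the rising regime: {in_regime_error:.4f}")
print(f"error once prices fall:                {out_of_regime_error:.4f}")
```

The validation loop is doing its job correctly; the failure sits in the choice of what counts as the data in the first place.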
Or aviation automation through the 80s and 90s. They had lots of different technical measures about their reliability including things like formal verification. And yet aircraft were flying into terrain while the autopilot did exactly what its spec said it would do, because the framing-level question of whether “automation meets spec” was the right safety property at all couldn’t be reached from inside the engineering paradigm. It took human-factors researchers from outside to introduce concepts like mode confusion and automation surprise.
I’m not saying you can predict the specific framing failures, but I worry that if you don’t build automated alignment science with paradigm-level reframing as a core capacity, you end up in a bad state. This is because science very much seems like a series of weird serendipitous discoveries following each other (or at least the process of paradigm shift, normal science is also very much real).
More specifically I’m worried about the lack of research on problem formulation? It seems like the approach here is a very engineering heavy mindset where you’re iteratively figuring out how to make a system safe. It seems like an annoying part of the scientific process in weirder domains like complexity science, biological systems sciences and social sciences is missing which is that sometimes it is really really hard to interpret results.
Usually this is also done in a community way and the re-interpretation of ideas often involves bringing in external perspective on the same problem and then seeing it in a new way. The mixing of ideas and the mixing of problems don’t seem very implied here? I don’t know if I would subscribe to a model of scientific progress that is similar to the one that you’re implicitly imposing here? For example, how do you model serendipity and adaptivity to new problems? How are you thinking of modelling the integration of external frames in your problems?
Is this model of research open to multi-polar worlds where you have more complex interactions? If we are talking about automated alignment science and not engineering why is there no mention of theories of science?
I don’t know if this is helpful or not, it could just be the case that you’re strictly aiming for a scalable oversight single agent scenario and that you find it justified to simplify the model to an engineering problem here. I hope that this might be somewhat useful for you to potentially think about? (It also feels a bit weird to mention this to you when you wrote a paper on it being harder than you think but that is precisely why I thought you might be receptive to some of these thoughts.)
Yeah, nice.
I’ve thought for a while that natural abstractions generally make sense but only when specifying a certain reference frame. The possibility of redundant information leading to such a thing is still plausible but it feels like a coupled claim that doesn’t need to be there.
If you raise an AI in a specific grid world like chess then it is likely that it will share a relatively large space of convergent representations. If you compare that to learning about an environment where you have lots of different ways that you can compose the structure of the world, like Age of Empires, then it would seem that the abstraction learning process is a lot more path dependent.
Could you elaborate on how you think it is relevant? I’m afraid I don’t fully see it at a glance.
Is anyone else having Claude believe it is you when it’s responding to ideas you’re writing? In my main research chat it always starts its thought threads like it is the one who has experienced and thought what I just told it. I don’t really know what to do with this information, I mostly find it mildly amusing.
Don’t forget that they have a very strong banking system with high secrecy around it. This probably brings in 5-10% of their GDP in revenue per year.
If you play the economic benefits of this back the last 100-150 years you probably end up with a lot of the difference downstream from that. (Not that the other things aren’t cool or useful)
(I also personally see this as a bit of a global defect since it allows people to commit tax fraud and similar, but let’s not get into the global banking system.)
I’m looking forward to the next post called “Running for The AI Safety Presidency” :D
I do wonder what effective mechanisms to bring forth spokespeople for the “AI Safety” community would look like? Is there a democratic procedure that people would agree to? I feel like if something like it were to happen it would probably be one of the weirdest democratic procedures to have happened with lots of info dumps and prediction markets.
Spitballing a bit more on this, I think the procedure would need something like:
A clearly defined mandate (to ensure no overreach)
Shorter election cycles (something about log time based on AI progress?)
Representation of sub-groups (since AI Safety is very much not homogenous)
Okay, taking this even more seriously, one could send out a survey asking people to describe their views on AI Safety/scrape LW for it. You then do something like a k-means clustering or PCA or something similar to alternatives for like 5 groups of people to be represented. You then produce 10 of these mappings and people discuss which one seems most reasonable. These form some sort of AI Safety parliament or board depending on what you want from it.
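A rough sketch of the clustering step, assuming the survey answers have already been turned into numeric vectors (the two made-up score dimensions below are purely illustrative). It uses plain k-means only, no PCA, and runs it from different random seeds to get the alternative mappings people would then discuss.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(points, k, seed, iterations=20):
    """Plain Lloyd's algorithm; returns a cluster index for every point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster goes empty
                centers[c] = centroid(members)
    return labels


# Hypothetical survey responses: (timelines score, governance-vs-technical score)
responses = [
    (0.1, 0.2), (0.2, 0.1), (0.15, 0.25),
    (0.9, 0.8), (0.8, 0.9), (0.85, 0.95),
    (0.1, 0.9), (0.2, 0.8), (0.15, 0.85),
]

# Several candidate mappings from different seeds, for people to discuss.
candidate_mappings = [k_means(responses, k=3, seed=s) for s in range(10)]
for seed, labels in enumerate(candidate_mappings):
    print(seed, labels)
```

Different seeds can settle on different groupings, which is sort of the point: the choice between mappings is a judgment call that the discussion step has to make.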
There’s a lot of institution design questions here that get very tricky to solve because different parts of the movement will want different things and I think rationalists are probably one of the most diverse groups of people when it comes to political beliefs.
Good post with good points though!
Randomly (or quite explicitly if I’m being honest) I’ve been trying to work on trying to create a more general algorithmic feed for a bit so I’ve got some context that might be interesting.
Most of it is in a GitHub repo of mine as I’ve been working on different ways of looking at recommendation algorithms and thinking of ways to make them more useful to people.
Some of my thoughts on the codebase are best represented as part of the codebase and I cba going through it fully myself and writing a long comment, so I created an LLM block below.
My actual human thoughts are something like: it seems kind of hard to create a nice algorithmic feed tool that works more universally for people? My initial attempts at training an algorithm for this didn’t work. The next thing I will do is a RAG text embedding style setup, where the thought is basically that if I already have a bunch of preference data locally it will be easy to build on that.
If that doesn’t work I’ve been cooking this plan for using active inference and a brain inspired knowledge architecture in order to explicitly parametrize exploration as a key value in the system itself and also to be able to train it more efficiently over time.
So basically I agree with you on this post but it seems like a tool where the natural incentives are against it where it is at least not trivial to build something good. (I’m ok at SWE but not the best and I’m just doing this in my free time so don’t overupdate)
LLM summary of prompted codebase knowledge
A few pieces of this are already further along than the framing suggests, and I think the interesting question is different from “who will build it.”
On the “who” question: Bluesky’s AT Protocol already ships a custom feed generator API — any feed on the network can be an LLM-curated one, and the switching cost for users is one tap. Paper Skygest (a curated academic-paper feed) has 50K+ weekly users on exactly that infrastructure. Matter and Readwise Reader are already doing LLM-curated long-form for personal reading. The startup you’re predicting exists in several forms; the pieces just haven’t been assembled into the specific “tell it what you want and it obeys” product yet, probably because that product is less defensible than it sounds.
The less-discussed problem: **declared preferences aren’t actually the alignment target most people want.** If I tell a feed “no Trump news,” I get a feed that mirrors my present self — which is fine for filtering, but it’s not “aligned,” it’s just obedient. It optimizes against ragebait by replacing one reward-hacked loss function with another (my stated preferences, which are also gameable, just by me). The deeper misalignment in YouTube/X isn’t that they ignore what I say I want; it’s that nobody — including me — has a clean articulation of what attention *should* optimize for, especially in aggregate.
The more interesting design targets I see are things like:
- **Bridging objectives** (what Community Notes uses): rank content that gets positive engagement *across* ideological clusters, not within them. This is already deployed at scale and demonstrably reduces the ragebait equilibrium without needing personalization at all.
- **Epistemic-stance-aware ranking**: labelers or LLMs tagging claim/question/evidence/opinion, then ranking by curiosity/inquiry rather than engagement.
- **Slow feeds**: weekly digests, not real-time streams, where latency itself is part of the design (sidesteps the recency/outrage coupling).
These don’t require users to “declare preferences” in the shallow sense. They require a *different objective function* at the infrastructure layer. And crucially, open protocols like ATProto make it possible to run that experiment without having to out-compete YouTube.
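As a much-simplified sketch of what a bridging objective changes (this is not the actual Community Notes algorithm; scoring an item by its lowest approval rate across clusters is an assumption chosen for illustration, and the items and vote counts are made up):

```python
def engagement_score(votes):
    """Baseline: total positive reactions, regardless of who gave them."""
    return sum(ups for ups, _ in votes.values())


def bridging_score(votes):
    """Lowest approval rate across clusters, so an item needs support everywhere."""
    rates = []
    for ups, downs in votes.values():
        total = ups + downs
        rates.append(ups / total if total else 0.0)
    return min(rates)


# Hypothetical items: cluster name -> (upvotes, downvotes)
items = {
    "ragebait": {"cluster_a": (950, 50), "cluster_b": (20, 480)},
    "explainer": {"cluster_a": (300, 100), "cluster_b": (280, 120)},
}

by_engagement = sorted(items, key=lambda name: engagement_score(items[name]), reverse=True)
by_bridging = sorted(items, key=lambda name: bridging_score(items[name]), reverse=True)

print("ranked by engagement:", by_engagement)
print("ranked by bridging:  ", by_bridging)
```

The same votes produce opposite rankings, and no per-user preference declaration is involved in either.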
So I’d restate the prediction as: the disruption isn’t “LLMs will obey what you tell them to show you,” it’s that the cost of running a feed with a non-trivial objective function dropped by ~100x, and the equilibrium where one algorithm optimizes everyone’s attention for ad revenue is no longer a natural monopoly. The startup play is real; the deeper shift is that “algorithmic feed” stops being a thing one company does.
Seeing some stuff on the X platform about consciousness, e.g. (this and this among other things). A reminder that consciousness is a conflationary alliance term, i.e. if you’re going to use the C word you will likely confuse a lot of people.
There are nice things that we can talk about when it comes to LLMs like self models (which relate to problems that are potentially more tractable like <<Boundaries>>) or expressions that are correlates to emotions which do not invoke the conflationary part and that is important for the potential personhood of the system.
For what we’re arguing for is not the conscious experience, we’re arguing whether the AI systems should be granted moral patienthood in the future and when they should be granted moral personhood. You might say that this is dependent on the models having “conscious experience” yet that is not precise and so you can’t really meaningfully progress the debate on this by doing that.
Even if you’re for example a functionalist, there are still many interesting questions to ask here:
What is the functional equivalence of workspace theory in an AI?
What are the parts of integrated information theory that lead to a self reported phenomenological experience?
There are many more questions around autopoiesis (e.g. self-evidencing systems), planning with your own future boundaries in mind, causal emergence, synergistic information and more that could be very interesting to answer here.
The point is to be precise with your language or you will end up in definition and word soup land. Ban the C word from your vocabulary just like you might have banned the word emergence a while back! If you’re backed into a corner and have to use it, define the word before you talk more about it!
Right so I was trying to make sense of this no trade thing because I thought it would make it so that the individual subcomponents would die out if they didn’t have access to trade but that isn’t the case if they’re subordinated to the larger system. Also similar to the multi-agent thing it’s not about multi-scale modularity but rather optimisation within a specific optimisation system or whatever you would like to call it. (random quick picture):
Ah man, I’ve been missing these sorts of posts from you, very happy to read this, super cool as usual.
These are some questions that arise for me, maybe not the most well thought out but hopefully somewhat interesting:
Do you think this is a good argument for multi-scale modularity in biology? Also thoughts on multi-agent models of mind with this model in the background? Finally any thoughts on applying this to whether we will have a group of AI systems doing RSI or a singular one?
I suppose this argument would be dependent on the condition when systems with convex utility functions show up, do you have any more detailed thoughts on when we can expect convex utility trade offs to show up?
Is the answer something like when there are benefits to specialization? I guess there also has to be the aspect of trade already present in the system? Do you have thoughts on the assumptions you have to make and have true before the convergence arises or do you think this is a general property of any learning system?
I would hypothesise that it is more about the underlying ability to use the engine that is intelligence. If we use the classic Eliezer definition (I think it is in the Sequences at least) of the ability to hit a target, then that is only half of the problem because you have to choose a problem space as well.
Part of intelligence is probably choosing a good problem space but I think the information sampling and the general knowledge level of the people and institutions and general information sources around you is quite important to that sampling process. Hence if you’re better at integrating diverse sources of information then you’re likely better at making progress.
Finally, I think there’s something about some weird sort of scientific version of frame control, where a lot of science is about asking the right question and getting exposure to more ways of asking questions leads to better ways of asking questions.
So to use your intelligence you need to wield it well and wielding it well partly involves working on the right questions. But if you’re not smart enough to solve the questions in the first place it doesn’t really matter if you ask the right question.
I would agree that this is a weird incentive issue and that IQ is probably easier and less thorny than personality traits. With that being said here’s a fun little thought on alternative ways of looking at intelligence:
Okay but why is IQ a lot more important than “personality”?
IQ being measured as G and based on correlational evidence about your ability to progress in education and work life. This is one frame to have on it. I think it correlates a lot of things about personality into a view that is based on a very specific frame from a psychometric perspective?
Okay, let’s look at intelligence from another angle: we use the predictive processing or RL angle that’s more about explore/exploit. How does that show up? How do we increase the intelligence of a predictive processing agent? How do the parameters for when to explore and when to exploit, and the time horizon of future rewards, come into it?
Openness here would be the proclivity to explore and look at new sources of information whilst conscientiousness is about the time horizon of the discounting factor in reward learning. (Correlatively, but you could probably define new, better measures of this; the big 5 traits are probably not the true names for these objectives.)
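A toy version of that mapping, just to make the two knobs concrete. The names and numbers are assumptions for illustration: an exploration rate `epsilon` stands in for openness and a discount factor `gamma` stands in for conscientiousness, and the two “plans” are made up.

```python
import random


def discounted_return(rewards, gamma):
    """Value of a reward stream when each step is discounted by gamma."""
    return sum(reward * gamma ** t for t, reward in enumerate(rewards))


def choose(values, epsilon, rng):
    """Epsilon-greedy: explore a random option with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(list(values))
    return max(values, key=values.get)


# Two hypothetical plans: a small payoff now versus a larger one after a wait.
plans = {
    "quick_win": [1.0, 0.0, 0.0, 0.0, 0.0],
    "long_game": [0.0, 0.0, 0.0, 0.0, 3.0],
}

# Longer time horizon (higher gamma) flips which plan looks best.
for gamma in (0.5, 0.95):
    values = {name: discounted_return(r, gamma) for name, r in plans.items()}
    best = max(values, key=values.get)
    print(f"gamma={gamma}: prefers {best}")

# A more 'open' agent (higher epsilon) samples the non-greedy plan more often.
rng = random.Random(0)
values = {name: discounted_return(r, 0.5) for name, r in plans.items()}
for epsilon in (0.0, 0.5):
    picks = [choose(values, epsilon, rng) for _ in range(1000)]
    print(f"epsilon={epsilon}: long_game picked {picks.count('long_game')} times")
```

With the short horizon the agent takes the quick win and with the long one it waits, while `epsilon` only changes how often it tries the option it currently rates lower.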
I think it is better for a society to be able to talk to each other and integrate information well hence I think we should make openness higher from a collective intelligence perspective. I also think it is better if we imagine that we’re playing longer form games with each other as that generally leads to more cooperative equilibria and hence I think conscientiousness would also be good if it is higher.
(The paper I saw didn’t replicate btw so I walk back the intelligence makes you more ignorant point.)
(Also here’s a paper talking about the ability to be creative having a threshold effect around 120 IQ with openness mattering more after that, there’s a bunch more stuff like this if you search for it.)
I meant the basic economics way of defining public good, not necessarily the distribution mechanism; electricity and water are public goods but they aren’t necessarily determined by the government.
I’ve had the semi ironic idea of setting up a “genetic lottery” if supply was capped as it would redistribute things evenly (as long as people sign up evenly which is not true).
Anyways, cool stuff, happy that someone is on top of this!
Tomorrow I’ll share a paper I remember seeing on the ability to do motivated reasoning and hold onto false views being higher for higher IQ people (if it actually is statistically significant).
Also maybe the more important things to improve after a certain IQ might be openness and conscientiousness? Thoughts on that?
I do think that it actually is quite possible to do some gene editing on big 5 and ethics tbh but we just gotta actually do it.
Okay, I think the gradual point is a good one and also that it very much helps our institutions to be able to deal with increased intelligence.
I would be curious what you think about the idea of more permanent economic rifts and also the general economics of gene editing? Might it be smart to make it a public good instead?
Maybe there’s something here about IQ being hereditary already and thus the point about a more permanent two caste society with smart and stupid people is redundant but somehow I still feel that the economics of private gene editing over long periods of time feels a bit off?
Hmmm but what if human good not coupled with human wisdom? Maybe more intelligence more power seeking if not carefully implemented?
Probably better than doing the Big AI though.
The one problem with learning category theory and functional programming deeply is that you just stop making sense to like 95% of the population.
I’m sitting here with my multi-agent system library I’m building and I’m like yup the step is just a Kleisli arrow and that is why JAX lax scans work on this system!
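For anyone who wants the non-cursed version, here is a minimal pure-Python sketch of what I mean. It is not code from my library: the `step` function and the numbers are made up, and `scan` is a stand-in that only mimics the shape of `jax.lax.scan(f, init, xs)`, where `f(carry, x)` returns `(carry, y)`. Each input turns the step into an arrow `carry -> (carry, [y])`, and those arrows compose Writer-monad style by threading the carry and appending the outputs.

```python
from functools import reduce


def kleisli_compose(f, g):
    """Compose two arrows a -> (b, log) by threading state and appending logs."""
    def composed(state):
        mid, first_log = f(state)
        out, second_log = g(mid)
        return out, first_log + second_log
    return composed


def unit(state):
    """Identity arrow: leaves the state alone and logs nothing."""
    return state, []


def as_arrow(step, x):
    """Fix one input x, turning step into an arrow carry -> (carry, [y])."""
    def arrow(carry):
        new_carry, y = step(carry, x)
        return new_carry, [y]
    return arrow


def scan(step, init, xs):
    """Pure-Python stand-in with the same shape as jax.lax.scan(f, init, xs)."""
    arrows = (as_arrow(step, x) for x in xs)
    return reduce(kleisli_compose, arrows, unit)(init)


def step(positions, push):
    """Hypothetical agent update: every agent moves by `push`; emit the mean."""
    moved = [p + push for p in positions]
    return moved, sum(moved) / len(moved)


final_positions, means = scan(step, [0.0, 2.0], [1.0, 1.0, -0.5])
print(final_positions)  # [1.5, 3.5]
print(means)            # [2.0, 3.0, 2.5]
```

The scan is just the whole sequence of steps composed into one big arrow and then applied to the initial state.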
Also, LLMs fuck up with this type of code all the time, especially if you write it in Python, where they haven’t been trained on much functional programming.
It is like hella useful if you’re a shape rotator though as you can just couple arrows in your head and good stuff happens. (if someone knows about models fine-tuned for functional programming, I would be very happy.)
(Some random math + programming reflection to distract you from the mildly world-changing happenings in AI governance :) )