AI Safety person currently working on multi-agent coordination problems.
Jonas Hallgren
I’m looking forward to the next post called “Running for The AI Safety Presidency” :D
I do wonder what effective mechanisms for bringing forth spokespeople for the “AI Safety” community would look like. Is there a democratic procedure that people would agree to? I feel like if something like this were to happen, it would probably be one of the weirdest democratic procedures in history, with lots of info dumps and prediction markets.
Spitballing a bit more on this, I think the procedure would need something like:
A clearly defined mandate (to ensure no overreach)
Shorter election cycles (something about log time based on AI progress?)
Representation of sub-groups (since AI Safety is very much not homogenous)
Okay, taking this even more seriously: one could send out a survey asking people to describe their views on AI Safety, or scrape LW for it. You then do something like k-means clustering or PCA (or a similar alternative) to get, say, 5 groups of people to be represented. You then produce 10 of these mappings and people discuss which one seems most reasonable. These form some sort of AI Safety parliament or board, depending on what you want from it.
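To make that concrete, a rough sketch of what the clustering step could look like (the embedding model, k=5, and the “closest to centroid” heuristic are all placeholder assumptions, not a worked-out proposal):

```python
# Hypothetical sketch: embed free-text survey answers, cluster them, and surface
# the answer closest to each cluster centre as a candidate "representative" view.
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def representative_views(answers: list[str], k: int = 5) -> list[str]:
    model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder model choice
    embeddings = model.encode(answers)                # (n_answers, dim) matrix
    km = KMeans(n_clusters=k, n_init=10).fit(embeddings)
    reps = []
    for c in range(k):
        # pick the answer whose embedding is closest to this cluster's centroid
        dists = np.linalg.norm(embeddings - km.cluster_centers_[c], axis=1)
        dists[km.labels_ != c] = np.inf
        reps.append(answers[int(np.argmin(dists))])
    return reps
```

You’d then hand the 5 representative answers (plus cluster sizes) to people as one candidate mapping, and repeat with different seeds/methods to get the 10 versions.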
There’s a lot of institution design questions here that get very tricky to solve because different parts of the movement will want different things and I think rationalists are probably one of the most diverse groups of people when it comes to political beliefs.
Good post with good points though!
Randomly (or quite explicitly, if I’m being honest) I’ve been working on creating a more general algorithmic feed for a bit, so I’ve got some context that might be interesting.
Most of it is in a github repo of mine, as I’ve been working on different ways of looking at recommendation algorithms and thinking of ways to make them more useful to people.
Some of my thoughts on the codebase are best represented as part of the codebase, and I cba to go through it fully myself and write a long comment, so I created an LLM block below.
My actual human thoughts are something like: it seems kind of hard to create a nice algorithmic feed tool that works more universally for people? My initial attempts at training an algorithm for this didn’t work; the next thing I will try is a RAG-style text-embedding setup, the thought being that if I already have a bunch of preference data locally, it will be easy to build on that.
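For what it’s worth, a minimal sketch of that RAG-style idea, assuming locally stored liked posts and an off-the-shelf embedding model (the model name and plain cosine similarity are placeholders, not what’s actually in the repo):

```python
# Toy version: embed posts I've already liked locally, then rank new candidates
# by how similar they are to that liked set. No exploration term, no training.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder model choice

def rank_candidates(liked_posts: list[str], candidates: list[str]) -> list[str]:
    liked = model.encode(liked_posts, normalize_embeddings=True)
    cands = model.encode(candidates, normalize_embeddings=True)
    scores = cands @ liked.T                  # cosine similarity matrix
    order = np.argsort(-scores.mean(axis=1))  # rank by mean similarity to likes
    return [candidates[i] for i in order]
```

The obvious failure mode is that this just mirrors past preferences, which is part of why I want exploration as an explicit parameter in the later version.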
If that doesn’t work, I’ve been cooking a plan to use active inference and a brain-inspired knowledge architecture in order to explicitly parametrize exploration as a key value in the system itself, and also to be able to train it more efficiently over time.
So basically I agree with you on this post, but it seems like a tool where the natural incentives are against it, and where it is at least not trivial to build something good. (I’m OK at SWE but not the best, and I’m just doing this in my free time, so don’t overupdate.)
LLM summary of prompted codebase knowledge
A few pieces of this are already further along than the framing suggests, and I think the interesting question is different from “who will build it.”
On the “who” question: Bluesky’s AT Protocol already ships a custom feed generator API — any feed on the network can be an LLM-curated one, and the switching cost for users is one tap. Paper Skygest (a curated academic-paper feed) has 50K+ weekly users on exactly that infrastructure. Matter and Readwise Reader are already doing LLM-curated long-form for personal reading. The startup you’re predicting exists in several forms; the pieces just haven’t been assembled into the specific “tell it what you want and it obeys” product yet, probably because that product is less defensible than it sounds.
The less-discussed problem: **declared preferences aren’t actually the alignment target most people want.** If I tell a feed “no Trump news,” I get a feed that mirrors my present self — which is fine for filtering, but it’s not “aligned,” it’s just obedient. It optimizes against ragebait by replacing one reward-hacked loss function with another (my stated preferences, which are also gameable, just by me). The deeper misalignment in YouTube/X isn’t that they ignore what I say I want; it’s that nobody — including me — has a clean articulation of what attention *should* optimize for, especially in aggregate.
The more interesting design targets I see are things like:
- **Bridging objectives** (what Community Notes uses): rank content that gets positive engagement *across* ideological clusters, not within them. This is already deployed at scale and demonstrably reduces the ragebait equilibrium without needing personalization at all. (Toy sketch after this list.)
- **Epistemic-stance-aware ranking**: labelers or LLMs tagging claim/question/evidence/opinion, then ranking by curiosity/inquiry rather than engagement.
- **Slow feeds**: weekly digests, not real-time streams, where latency itself is part of the design (sidesteps the recency/outrage coupling).
These don’t require users to “declare preferences” in the shallow sense. They require a *different objective function* at the infrastructure layer. And crucially, open protocols like ATProto make it possible to run that experiment without having to out-compete YouTube.
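A toy illustration of the bridging objective (this is *not* the actual Community Notes algorithm, which uses matrix factorization over raters and notes; it just shows the shape of the objective):

```python
# Score an item by its *worst* approval rate across clusters rather than by total
# engagement, so only content with cross-cluster appeal gets ranked up.
from collections import defaultdict

def bridging_score(votes: list[tuple[str, bool]]) -> float:
    """votes: (cluster_id, liked) pairs for a single item."""
    by_cluster: dict[str, list[bool]] = defaultdict(list)
    for cluster, liked in votes:
        by_cluster[cluster].append(liked)
    rates = [sum(v) / len(v) for v in by_cluster.values()]
    # require support from more than one cluster to score at all
    return min(rates) if len(rates) > 1 else 0.0
```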
So I’d restate the prediction as: the disruption isn’t “LLMs will obey what you tell them to show you,” it’s that the cost of running a feed with a non-trivial objective function dropped by ~100x, and the equilibrium where one algorithm optimizes everyone’s attention for ad revenue is no longer a natural monopoly. The startup play is real; the deeper shift is that “algorithmic feed” stops being a thing one company does.
Seeing some stuff on the X platform about consciousness, e.g. this and this, among other things. A reminder that consciousness is a conflationary alliance term, i.e. if you’re going to use the C word you will likely confuse a lot of people.
There are nicer things we can talk about when it comes to LLMs, like self-models (which relate to potentially more tractable problems such as <<Boundaries>>) or expressions that are correlates of emotions. These don’t invoke the conflationary part, and that matters for the potential personhood of the system.
What we’re arguing about is not the conscious experience itself; we’re arguing about whether AI systems should be granted moral patienthood in the future, and when they should be granted moral personhood. You might say that this depends on the models having “conscious experience”, yet that is not precise, so you can’t really meaningfully progress the debate by framing it that way.
Even if you’re for example a functionalist, there are still many interesting questions to ask here:
What is the functional equivalent of workspace theory in an AI?
What are the parts of integrated information theory that lead to a self-reported phenomenological experience?
There are many more questions around autopoiesis (e.g. self-evidencing systems), planning with your own future boundaries in mind, causal emergence, synergistic information, and more that could be very interesting to answer here.
The point is to be precise with your language or you will end up in definition and word soup land. Ban the C word from your vocabulary just like you might have banned the word emergence a while back! If you’re backed into a corner and have to use it, define the word before you talk more about it!
Right, so I was trying to make sense of this no-trade thing because I thought it would mean the individual subcomponents would die out if they didn’t have access to trade, but that isn’t the case if they’re subordinated to the larger system. Also, similar to the multi-agent thing, it’s not about multi-scale modularity but rather optimisation within a specific optimisation system, or whatever you would like to call it. (Random quick picture):
Ah man, I’ve been missing these sorts of posts from you, very happy to read this, super cool as usual.
These are some questions that arise for me, maybe not the most well thought out, but hopefully somewhat interesting:
Do you think this is a good argument for multi-scale modularity in biology? Also, thoughts on multi-agent models of mind with this model in the background? Finally, any thoughts on applying this to whether we will have a group of AI systems doing RSI or a singular one?
I suppose this argument would depend on the conditions under which systems with convex utility functions show up; do you have any more detailed thoughts on when we can expect convex utility trade-offs to appear?
Is the answer something like when there are benefits to specialization? I guess there also has to be the aspect of trade already present in the system? Do you have thoughts on the assumptions that have to hold before the convergence arises, or do you think this is a general property of any learning system?
I would hypothesise that it is more about the underlying ability to use the engine that is intelligence. If we take the classic Eliezer definition (I think it is in the Sequences at least) of the ability to hit a target, then that is only half of the problem, because you have to choose a problem space as well.
Part of intelligence is probably choosing a good problem space, but I think the information sampling, and the general knowledge level of the people, institutions, and information sources around you, is quite important to that sampling process. Hence, if you’re better at integrating diverse sources of information, you’re likely better at making progress.
Finally, I think there’s something like a weird scientific version of frame control here: a lot of science is about asking the right question, and getting exposure to more ways of asking questions leads to better ways of asking questions.
So to use your intelligence you need to wield it well and wielding it well partly involves working on the right questions. But if you’re not smart enough to solve the questions in the first place it doesn’t really matter if you ask the right question.
I would agree that this is a weird incentive issue and that IQ is probably easier and less thorny than personality traits. With that being said here’s a fun little thought on alternative ways of looking at intelligence:
Okay but why is IQ a lot more important than “personality”?
IQ is measured as g and based on correlational evidence about your ability to progress in education and work life. That is one frame to have on it. I think it folds a lot of things about personality into a view that is based on a very specific psychometric frame?
Okay, let’s look at intelligence from another angle: the predictive processing or RL angle that’s more about explore/exploit. How does that show up? How do we increase the intelligence of a predictive processing agent? How do the parameters of when to explore versus exploit, and the time horizon of future rewards, come into it?
Openness here would be the proclivity to explore and look at new sources of information, whilst conscientiousness is about the time horizon of the discounting factor in reward learning. (Correlationally, at least; you could probably define new, better measures of this, and the Big 5 traits are probably not the true names for these objectives.)
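To make that mapping concrete, a purely illustrative toy sketch (my framing, not an established model), where “openness” plays the role of an exploration rate and “conscientiousness” the role of a discount factor:

```python
# Toy RL framing: "openness" ~ exploration rate, "conscientiousness" ~ discount
# factor on future reward. Illustrative only.
import random

def choose_action(q_values: dict[str, float], openness: float) -> str:
    # higher "openness" -> more sampling of unfamiliar actions
    if random.random() < openness:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

def update_value(q: float, reward: float, next_value: float,
                 conscientiousness: float, lr: float = 0.1) -> float:
    # higher "conscientiousness" -> longer effective time horizon
    return q + lr * (reward + conscientiousness * next_value - q)
```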
I think it is better for a society to be able to talk to each other and integrate information well, hence I think we should raise openness from a collective intelligence perspective. I also think it is better if we imagine that we’re playing longer-form games with each other, as that generally leads to more cooperative equilibria, and hence I think higher conscientiousness would also be good.
(The paper I saw didn’t replicate btw, so I walk back the “intelligence makes you more ignorant” point.)
(Also, here’s a paper on creative ability having a threshold effect around 120 IQ, with openness mattering more after that; there’s a bunch more stuff like this if you search for it.)
I meant the basic economics way of defining a public good, not necessarily the distribution mechanism; electricity and water are public goods but they aren’t necessarily provided by the government.
I’ve had the semi-ironic idea of setting up a “genetic lottery” if supply were capped, as it would redistribute things evenly (as long as people sign up evenly, which is not true).
Anyways, cool stuff, happy that someone is on top of this!
Tomorrow I’ll share a paper I remember seeing on the capacity for motivated reasoning, and for holding onto false views, being higher for higher-IQ people (if it actually is statistically significant).
Also maybe the more important things to improve after a certain IQ might be openness and conscientiousness? Thoughts on that?
I do think that it actually is quite possible to do some gene editing on big 5 and ethics tbh but we just gotta actually do it.
Okay, I think the gradual point is a good one and also that it very much helps our institutions to be able to deal with increased intelligence.
I would be curious what you think about the idea of more permanent economic rifts and also the general economics of gene editing? Might it be smart to make it a public good instead?
Maybe there’s something here about IQ being hereditary already, and thus the point about a more permanent two-caste society of smart and stupid people is redundant, but somehow I still feel that the economics of private gene editing over long periods of time is a bit off?
Hmmm but what if human good not coupled with human wisdom? Maybe more intelligence more power seeking if not carefully implemented?
Probably better than doing the Big AI though.
Nice.
Reminds me of what John Wentworth says about agent foundations stuff; he’s very keen on people defining type signatures (like he does in the beginning here). I would also very much like to claim this is a good practice when asking for feedback on your drafts. If it takes someone 30+ minutes to go through, you can think for a minute at least about what you want from them, and in that way you also get a lot clearer about what you’re trying to express. E.g. “Hey, could you check through this and get back to me with a 3-sentence summary of what the core point is and whether it is embarrassing to post or not” is better than “hey, could you give me feedback on this?” (I think there was some post like this on LW a year or two ago.)
Hmmm, some good points. Clearly if I write something too complex this will take way too long, and therefore the cool voice is good.
Okay, yes humans can be cool but all humans cool?
Maybe some governments not cool? How does AI vs Bio affect if one big cool or many small cool? What if like homo deus we get separate Very Smart group of humans and one not so smart?
Human worse thing done worse than LLM worse thing done? Less control over range of expression? Moral mazes lead to psychopaths in control? Maybe non cool humans take control?
Yet, slow process good point. Coolness better chance if longer to remain cool.
Maybe democracy + debate cool? Totalitarianism not cool? Coolness is not group specific for AI or human? Coolness about how cool the decision process is? What does coolness attractor look like?
Cool.
There was something about the meta-cognition here that was very nice. It’s nice for generative models to understand how someone thinks, and getting a deep dive from one frame on another frame was somehow even more interesting?
The following might be a bit of a weird thought/suggestion, but have you considered writing such a thing for a researcher whose way of thinking you really enjoy? I think it might potentially be quite useful to read!
If you want to improve collective epistemics I think a potentially underrated tool is AI for investigative journalism and corruption tracking.
I think the coming wave of multi-step reasoning and data-compiling systems could be pointed at keeping leaders accountable and showcasing when people’s actions differ from their previous commitments.
One can also likely look into paper trails in a much more automated way, and through that reduce corruption and other bad things.
Hopefully the truth remains an asymmetric weapon, at least in the limit. We have not been in a very good time for truth recently on average, but I see it a bit like the stock market: it might be strange for a long time, but eventually it should return to baseline because truth outperforms non-truth on average.
Like an Our World in Data but aimed at corruption, and with a generalised set of tools that anyone could apply to do automatic data analysis.
Nicely put, there’s an extension here as well which is to say that given specific incentive dynamics there might even be convergent value systems.
E.g. if all you know is an iterated prisoner’s dilemma and that is your full environment, then cooperation is a moral truth and a convergent value in that structure.
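A quick toy simulation of that point, using the standard payoff matrix (nothing novel here, just the classic result that conditional cooperation beats defection under repetition):

```python
# Iterated prisoner's dilemma: (my payoff, their payoff) for each action pair.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(opponent_history): return opponent_history[-1] if opponent_history else "C"
def always_defect(opponent_history): return "D"

def play(p1, p2, rounds=100):
    h1, h2, s1, s2 = [], [], 0, 0
    for _ in range(rounds):
        a1, a2 = p1(h2), p2(h1)          # each player sees the other's history
        r1, r2 = PAYOFF[(a1, a2)]
        h1.append(a1); h2.append(a2); s1 += r1; s2 += r2
    return s1, s2

print(play(tit_for_tat, tit_for_tat))    # (300, 300): mutual cooperation wins
print(play(tit_for_tat, always_defect))  # (99, 104): defection barely pays
```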
The question then kind of becomes p(set of values|environmental and developmental conditions).
This also relates to things like natural abstractions, and the question is which conditional you should take. Maybe it is natural latents, but idk if that captures value well.
You might also be interested in Dennett’s book Intuition Pumps and Other Tools for Thinking; on top of some nice ways of thinking, he has an interesting model of free will in there.
It is more about privacy and plausible deniability, in that you don’t want to become fully predictable, as you will be exploited otherwise. If someone says you don’t have free will or that you’re fully predictable, then this is really bad for your non-exploitability. This would then also point at higher-order causation, in a sort of synergistic-information way, where you want some degree of your future actions not being predictable just from linear propagation of submodules, or more speculatively subgoals? (The last one is quite a hot take though.)
Yeah, totally fair on the philosophy of science thing. I’ve more talked to AI and metascience people who mention principles from philosophy of science, which makes more sense to me. A little bit like how virtue ethics is nice to talk about with certain AI Safety people, whilst it’s less enjoyable to talk to a professor of virtue ethics (maybe; not too high a sample size here).
(I think James Evans from Knowledge Lab is a cool person at the intersection of AI and metascience; his main work is on knowledge and improving science, and over the last 3 years he’s pivoted to how AI can help with this. An example of something he wrote is this article on agentic AI and the next intelligence explosion.)
I thought it would be good to have a speed-up reference table. (I picked an arbitrary verification-vs-generation cost constant of 0.3, but use whatever makes sense to you):
Amdahl’s Law: speedup = 1 / s
With context-loading: speedup = 1 / (s + c·r·(1 - s))
Where:
s = non-automatable fraction (0.05)
r = fraction of automated work the human must review to maintain context
c = cost of reviewing vs doing it yourself (0.3)
Results:
r=0%: 20x
r=10%: 12.7x
r=25%: 8.2x
r=50%: 5.2x
r=75%: 3.8x
r=100%: 3.0x
So the model seems to imply a 3x to 8x speedup as a ceiling, if there is a set of tasks only humans can do and they have to review stuff?
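For anyone who wants to play with the numbers, a minimal sketch reproducing the table above (variable names are mine):

```python
# Amdahl's law with a review/context-loading term, as in the comment above.
# s = non-automatable fraction, r = fraction of automated work reviewed,
# c = cost of reviewing relative to doing the work yourself.
def speedup(s: float, r: float, c: float) -> float:
    return 1.0 / (s + c * r * (1.0 - s))

s, c = 0.05, 0.3
for r in (0.0, 0.10, 0.25, 0.50, 0.75, 1.0):
    print(f"r={r:.0%}: {speedup(s, r, c):.1f}x")
```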
Don’t forget that they have a very strong banking system with high secrecy around it. This probably brings in 5-10% of their GDP in revenue per year.
If you play the economic benefits of this back over the last 100-150 years, you probably end up with a lot of the difference being downstream of that. (Not that the other things aren’t cool or useful.)
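As a back-of-the-envelope illustration only (the extra-growth figures are arbitrary assumptions, not estimates for Switzerland): even a small persistent growth edge compounds a lot over that horizon.

```python
# Toy compounding: an extra g percentage points of annual growth for n years
# multiplies GDP by (1 + g) ** n. The g values below are made up.
for g in (0.002, 0.005, 0.01):
    for n in (100, 150):
        print(f"extra growth {g:.1%} over {n} years -> {(1 + g) ** n:.2f}x")
```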
(I also personally see this as a bit of a global defect, since it allows people to commit tax fraud and similar, but let’s not get into the global banking system.)