Jan_Kulveit

Karma: 6,678

My current research interests:

1. Alignment in systems which are complex and messy, composed of both humans and AIs?
Recommended texts: Gradual Disempowerment, Cyborg Periods

2. Actually good mathematized theories of cooperation and coordination
Recommended texts: Hierarchical Agency: A Missing Piece in AI Alignment, The self-unalignment problem or Towards a scale-free theory of intelligent agency (by Richard Ngo)

3. Active inference & Bounded rationality
Recommended texts: Why Simulator AIs want to be Active Inference AIs, Free-Energy Equilibria: Toward a Theory of Interactions Between Boundedly-Rational Agents, Multi-agent predictive minds and AI alignment (old but still mostly holds)

4. LLM psychology and sociology: A Three-Layer Model of LLM Psychology, The Pando Problem: Rethinking AI Individuality, The Cave Allegory Revisited: Understanding GPT’s Worldview

5. Macrostrategy & macrotactics & deconfusion: Hinges and crises, Cyborg Periods again, Box inversion revisited, The space of systems and the space of maps, Lessons from Convergent Evolution for AI Alignment, Continuity Assumptions

Also I occasionally write about epistemics: Limits to Legibility, Conceptual Rounding Errors

Researcher at Alignment of Complex Systems Research Group (acsresearch.org), Centre for Theoretical Studies, Charles University in Prague. Formerly research fellow at the Future of Humanity Institute, Oxford University.

Previously I was a researcher in physics, studying phase transitions, network science and complex systems.

Jan_Kulveit 30 Oct 2025 23:58 UTC
4 points
0
in reply to: Raemon’s comment on: Wei Dai’s Shortform
[low effort list] Bottlenecks/issues/problems

- philosophy has worse short feedback loops than eg ML engineering → in all sorts of processes like MATS or PIBBSS admissions it is harder to select for philosophical competence, also harder to self-improve
- incentives: obviously stuff like being an actual expert in pretraining can get you a lot of money and respect in some circles; even many prosaic AI safety / dual use skills like mech interpretability can get you maybe less money than pretraining, but still a lot of money if you work in AGI companies, and also a decent amount of status in the ML community and the AI safety community; improving philosophical competence may get you some recognition, but only among a relatively small and weird group of people
- the issue Wei Dai is commenting on in the original post: founder effects persist to this day, and there is also some philosophy-negative prior in STEM
- idk, lack of curiosity? LLMs have read it all, so it’s easy to check if there is some existing thinking on a topic

Jan_Kulveit 30 Oct 2025 18:31 UTC
LW: 8 AF: 4
2
AF
in reply to: Wei Dai’s comment on: Wei Dai’s Shortform
I mostly agree with 1. and 2. With 3., it’s a combination: the problems are hard, and the point about the gung-ho approach and lack of awareness of the difficulty is true, but academic philosophy is also structurally mostly not up to the task because of factors like publication speeds, prestige gradients or the speed of OODA loops.
My impression is that getting generally smart and fast “alignment researchers” more competent in philosophy is more tractable than trying to get established academic philosophers to change what they work on, so one tractable thing is just convincing people the problems are real, hard and important. Another is maybe recruiting graduates.

AISLE discovered three new OpenSSL vulnerabilities

Jan_Kulveit 30 Oct 2025 16:32 UTC

50 points

4 comments · 1 min read · LW link

(aisle.com)

Upcoming Workshop on Post-AGI Economics, Culture, and Governance

David Duvenaud, Raymond Douglas, Jan_Kulveit, scasper and MariaK

28 Oct 2025 21:55 UTC

34 points

1 comment · 2 min read · LW link

Jan_Kulveit 28 Oct 2025 18:25 UTC
10 points
4
in reply to: Richard_Ngo’s comment on: The Memetics of AI Successionism
2. I actually have somewhat overlapping concerns about the doom memeplex and a bunch of notes about it, but it’s not even near a draft post. But your response provides some motivation to write it as well. In the broader space, there are good posts about the doom memeplex for the LW audience from Valentine, so I felt this is less neglected.

3. I generally don’t know. My impression is that when I try to explain the abstract level without a case study, readers are confused about what the point is or how it is applicable. My impression is that meta explanations of the memetics of some ideology tend to weaken it almost no matter what the ideology is, so I don’t think I could have chosen some specific example without the result being somewhat controversial. But what I could have done is use multiple different examples; that’s valid criticism.

The Memetics of AI Successionism

Jan_Kulveit 28 Oct 2025 15:04 UTC

119 points

15 comments · 9 min read · LW link

Jan_Kulveit 20 Oct 2025 10:43 UTC
49 points
41
on: Frontier LLM Race/Sex Exchange Rates
Just flagging that the claim of the post

“In this paper, they showed that modern LLMs have coherent and transitive implicit utility functions and world models”

is basically a lie. The paper showed that in some limited context, LLMs answer some questions somewhat coherently. The paper has not shown much more (despite sensationalist messaging). It is fairly trivial to show that modern LLMs are very sensitive to framing, and you can construct experiments in which they violate transitivity and independence. The VNM math then guarantees that you cannot construct a utility function to represent the results.
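
A minimal sketch of the last step, restricted to strict ordinal preferences over a finite set of options, so it illustrates transitivity only and not independence over lotteries. The preference pairs and the function name are hypothetical, not taken from the paper or the post. The point: once elicited preferences contain a cycle, no assignment of numbers can represent them.

```python
from graphlib import TopologicalSorter, CycleError


def utility_from_strict_preferences(prefs):
    """Try to build a utility function from (better, worse) pairs.

    Returns a dict option -> utility such that utility[better] > utility[worse]
    for every pair, or None if the preferences contain a cycle.
    """
    # TopologicalSorter expects node -> set of nodes that must come earlier.
    # We order options from worst to best, so "worse" must precede "better".
    graph = {}
    for better, worse in prefs:
        graph.setdefault(better, set()).add(worse)
        graph.setdefault(worse, set())
    try:
        worst_to_best = list(TopologicalSorter(graph).static_order())
    except CycleError:
        return None  # cyclic preferences: no representing utility exists
    return {option: rank for rank, option in enumerate(worst_to_best)}


# Hypothetical elicited answers under one framing: consistent.
consistent = [("A", "B"), ("B", "C"), ("A", "C")]
# Hypothetical answers under a different framing: A > B, B > C, but C > A.
cyclic = [("A", "B"), ("B", "C"), ("C", "A")]

print(utility_from_strict_preferences(consistent))
print(utility_from_strict_preferences(cyclic))  # None
```

Any utility function would need u(A) > u(B) > u(C) > u(A) in the cyclic case, which is impossible, so the check reduces to cycle detection.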

Jan_Kulveit 17 Oct 2025 14:39 UTC
2 points
0
in reply to: eggsyntax’s comment on: On the functional self of LLMs
- “The base layer is ultimately made up of models of characters, in a Simulators-ish sense” No it is not, in a similar way as what your brain is running is not ultimately made of characters. It’s ultimately made of approximate Bayesian models.
- “what distinguishes a context-prompted ephemeral persona from that richer and more persistent character” Check Why Simulator AIs want to be Active Inference AIs
- “With respect to active inference …” Sorry, don’t want to be offensive, but it would actually be helpful for your project to understand active inference at least a bit. Empirically it seems has-repeatedly-read-Scott-Alexander’s-posts-on-it leads people to some weird epistemic state, in which people seem to have a sense of understanding, but are unable to answer even basic questions, make very easy predictions, etc. I suspect what’s going on is a bit like when someone reads a well-written science popularization book about quantum mechanics but actually lacks concepts like complex numbers or vector spaces: they may have a somewhat superficial sense of understanding.
  Obviously active inference has a lot to say about how people model themselves. For example, when typing these words, I assume it’s me who types them (and not someone else, for example). Why? That’s actually an important question for “why self”. Why not, or to what extent not, in LLMs? How the stories that people tell themselves about who they are impact what they do is totally something which makes sense to understand from an active inference perspective.

Jan_Kulveit 16 Oct 2025 16:19 UTC
31 points
2
on: Jan_Kulveit’s Shortform
ACS research is hiring. We’re looking for a mix of polymaths, ML research engineers, and people with great intuitions about how AIs behave to work on macrostrategy and LM psychology.

Personally I hope it could be the Pareto-best option for some of you on the combination of topics to work on, incentives, salary, collaborators and research environment.

Deadline in a few weeks, 1-2 year appointments in Prague, London or San Francisco Bay Area.
Hiring page with more details: https://acsresearch.org/hiring

Gradual Disempowerment Research Fellow: We’re looking for polymaths who can reason about civilizational dynamics. This role comes with a lot of intellectual freedom; it could mean economic modelling, theoretical work on multi-agent dynamics, historical analysis, and more.

LLM Psychology & Sociology Researcher: We want people with a strong intuitive understanding of LLMs to help run empirical studies on topics like LLM introspection and self-conception, LLM social dynamics, and how ideologies spread between AIs.

AI Psychology & Agent Foundations ML Researcher: We need people who can bring technical and methodological rigour, taking high-level ideas about AI psychology and turning them into concrete ML experiments. This could include evaluations, mech interp, post-training, both APIs and open-weight models.

Jan_Kulveit 9 Oct 2025 13:23 UTC
2 points
0
in reply to: cousin_it’s comment on: Jan_Kulveit’s Shortform
We discuss this in the Misaligned States part of Gradual Disempowerment (the thesis you mention is explored in much detail in Tilly (1990), Coercion, Capital, and European States, AD 990-1990).

I don’t think violence is a particularly unique source of power; in my view forms of power are somewhat convertible (ie if a rentier state does not derive income from taxation, it can hire mercenaries to pacify the population).

Also, empirically: military power is already quite concentrated—modern militaries are not that large but would be able to pacify much larger popular unrest, if they had the will to do so. But this is kept in check in part by econ power and in part by culture.

Jan_Kulveit 8 Oct 2025 11:08 UTC
14 points
8
in reply to: Lukas Finnveden’s comment on: Jan_Kulveit’s Shortform
I’m not tracking their views too closely in time and you probably have a better idea, but my impression is there are some changes.

If I take this comment by Matthew Barnett from 2y ago (read it only now), it seems that while the modal prediction is quite similar, the valence / what to do about it is quite different (emphasis on valence-indicating words is mine)
My modal tale of AI doom looks something like the following:

1. AI systems get progressively and incrementally more capable across almost every meaningful axis.
2. Humans will start to employ AI to automate labor. The fraction of GDP produced by advanced robots & AI will go from 10% to ~100% after 1-10 years. Economic growth, technological change, and scientific progress accelerates by at least an order of magnitude, and probably more.
3. At some point humans will retire since their labor is not worth much anymore. Humans will then cede all the keys of power to AI, while keeping nominal titles of power.
4. AI will control essentially everything after this point, even if they’re nominally required to obey human wishes. Initially, almost all the AIs are fine with working for humans, even though AI values aren’t identical to the utility function of serving humanity (ie. there’s slight misalignment).
5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point AIs decide that it’s better if they stopped listening to the humans and followed different rules instead.
6. This results in human disempowerment or extinction. Because AI accelerated general change, this scenario could all take place within years or decades after AGI was first deployed, rather than in centuries or thousands of years.
I think this scenario is somewhat likely and it would also be very bad. And I’m not sure what to do about it, since it happens despite near-perfect alignment, and no deception.

One reason to be optimistic is that, since the scenario doesn’t assume any major deception, we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects (in fact that’s the biggest reason why I don’t think this scenario has a >50% chance of happening). Nonetheless, I think it’s plausible that we would not be able to take the necessary steps to avoid the outcome. Here are a few reasons why that might be true:
...
So, at least to me, there seems to be some development from “it would also be very bad” and “I’m not sure what to do about it” to “this is inevitable, good, and let’s try to make it happen faster”. I do understand that Matthew Barnett wrote a lot of posts and comments on EA forum between then and now which I mostly missed, and there is likely some opinion development happening with the posts.

On the other hand if you compare Barnett [23], who already has a model of why the scenario is not inevitable and could be disrupted by eg leveraging AI for forecasting, coordination or something similar, and Barnett et al [25], who forget these arguments against inevitability, I think it actually strengthens the claim of “fine example of thinking you get when smart people do evil things and their minds come up with smart justifications why they are the heroes”.

Jan_Kulveit 7 Oct 2025 18:10 UTC
110 points
102
on: Jan_Kulveit’s Shortform
Crossposting: [recent Mechanize blogpost advocating for extreme technological determinism and speeding up human disempowerment] is a fine example of thinking you get when smart people do evil things and their minds come up with smart justifications why they are the heroes. Upon closer examination it ignores key inconvenient considerations; normative part sounds like misleading PR.

A major hole in the “complete technological determinism” argument is that it completely denies agency, or even the possibility that how agency operates at larger scales could change. Sure, humanity is not currently a very coordinated agent. But the trendline also points toward the ascent of an intentional stance. An intentional civilization would, of course, be able to navigate the tech tree. (For a completely opposite argument about the very high chance of a “choice transition,” check https://strangecities.substack.com/p/the-choice-transition).

In practice, this likely boils down to a race. On one side are people trying to empower humanity by building coordination technology and human-empowering AI. On the other side are those working to create human-disempowering technology and render human labor worthless as fast as possible.

My guess is when people stake their careers and fortune and status on the second option, their minds will work really hard to not see the choice.

Also: at least to me, the normative part sounds heavily PR sanitized, with obligatory promises of “medical cures” but shying away from explaining either what the role of humans would be in the fully automated economy, or the actual moral stance of the authors. As far as I understand, at least one of the authors has an unusual moral philosophy such as not believing in consciousness or first-person experiences, while simultaneously believing that future AIs are automatically morally worthy simply by having goals. This philosophy leads them to view succession by arbitrary AI agents as good, and the demise of humans as not a big deal.
What links here?
- plex's comment on plex’s Shortform by plex (8 Oct 2025 16:04 UTC; 18 points)

Jan_Kulveit 3 Oct 2025 8:48 UTC
26 points
7
in reply to: Ben Pace’s comment on: Why Corrigibility is Hard and Important (i.e. “Whence the high MIRI confidence in alignment difficulty?”)
My current model is that the VNM theorems are the best available theorems for modeling rational agents.

Actually I don’t agree with that, unless you define rationality in a circular way, where you focus on what’s roughly in line with the assumptions.

To avoid possible confusion about words: I don’t think VNM is that useful for modelling powerful and smart agents in this universe. VNM axioms don’t describe well humans, states or corporations, and they don’t describe well LLMs.

To give a simple example of a better formal math: information-theoretic bounded rationality. This is still quite VNM like, but at least acknowledges the fact that in this universe, negentropy is not free. Without this fact, nothing makes sense.
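
A minimal sketch of what this looks like in the simplest discrete case, using the standard formulation from the bounded rationality literature as I understand it: the agent maximizes expected utility minus an information cost, E_p[U] - (1/beta) * KL(p || prior), and the optimum is a prior tilted by exp(beta * U). The utilities, the prior and the function names below are illustrative assumptions, not taken from any specific paper.

```python
import math


def bounded_rational_policy(utilities, prior, beta):
    """Policy p(a) proportional to prior(a) * exp(beta * U(a)).

    beta is the inverse "price" of information: beta = 0 returns the prior
    (deviating costs too much), large beta approaches the utility maximizer.
    """
    top = max(utilities)  # subtract the max for numerical stability
    weights = [p * math.exp(beta * (u - top)) for u, p in zip(utilities, prior)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) in nats, the information cost of moving from q to p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def free_energy(utilities, prior, policy, beta):
    """Expected utility minus information cost; requires beta > 0."""
    expected_utility = sum(p * u for p, u in zip(policy, utilities))
    return expected_utility - kl_divergence(policy, prior) / beta


utilities = [1.0, 2.0, 3.0]   # hypothetical payoffs of three actions
prior = [1 / 3, 1 / 3, 1 / 3]  # default behaviour when thinking is too costly

for beta in (0.0, 1.0, 50.0):
    policy = bounded_rational_policy(utilities, prior, beta)
    print(beta, [round(p, 3) for p in policy], round(kl_divergence(policy, prior), 3))
```

The idealized maximizer only appears in the limit where information is free (beta going to infinity); at any finite beta the optimal policy stays partly anchored to the prior.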

As an example of not making sense: if VNM is so great, and evolution discovered agents, and optimized them a lot, why are animals so VNM-unlike? I guess part of the answer clearly must be that computation is not free, and a VNM agent is extremely computationally hungry, in a sense bigger than the universe it is in. But negentropy is not free. This does not mean VNM agents would not work well in toy universes with 3 dimensions, or universes with free computation.

(Hot take sidenote: I suspect you can learn more about intelligence and powerful and smart agents in this universe if you start from just “negentropy is not free” than when starting from VNM.)

I don’t think ITBR is the final answer, but at least it is barking up a somewhat better tree.

Yes VNM has been foundational for game theory. Also … I think one deep lesson people learn when understanding game theory deeply is something like “single shot prisoners dilemmas do not exist”. The theory is trying to be a minimal abstraction of reality, and it probably succeeds “too much”, in the sense that it abstracts away so much that basically always some critical feature of reality is missing, and the math does not match what is happening. This does not preclude the theory being influential, but what people actually do is often something like asking “classical game theory clearly mis-predicts what is happening, so let’s try to figure out what it ignores even if you can’t ignore it, and write a paper about that”.

Yes it has been foundational to econ. My impression is something like last 40 years in the part of econ which is closest to agent foundations, part of the work was on how people are not VNM, or even why what people do makes sense while it is not VNM.

To end with what actually matters: my guess is that the most relevant places where VNM is likely off are that it does not handle compositionality well, and that it does not handle preferences about internal computations. (More of this discussion in this post and comments: Is “VNM-agent” one of several options, for what minds can grow up into?) Unfortunately, describing compositionality and preferences over internal computations seems really critical for the specific problem.

With physics comparisons

I think VNM per se makes far fewer predictions about reality than Newtonian gravity, and often when it seems to make some close to “first principles”, they don’t seem to match observation. For example, based on VNM, one would assume smart people don’t update what they want based on evidence, just their beliefs. But this contradicts phenomenological experience.

A different physics comparison may be something like black body radiation. It is possible to describe it classically using the equipartition theorem, and yes, it partially works in some domains, but it’s also clearly broken and predicts the ultraviolet catastrophe. I do agree throwing out arbitrary predictions of the theory would not be a good habit if I don’t have fully worked out quantum mechanics, but I think this is a different case, where it’s very reasonable to doubt predictions of the theory which seem to be strongly correlated with it predicting the UV catastrophe. (Also not that happy with this comparison)

Jan_Kulveit 30 Sep 2025 11:24 UTC
2 points
0
on: On Dwarkesh Patel’s Podcast With Richard Sutton
Probably the most sensible response to the interview I’ve seen so far.

Also I’ll probably start referencing this when people argue whether thinking in the active inference frame has any advantage over thinking in the RL frame. Clearly it does: it’s way easier to see what happens if you drop the “reward” term Sutton is imagining as necessary and keep just the prediction error minimization terms. You still get intelligent systems, they still learn powerful abstractions (because they need to compress data), they still learn a generative world model. (It’s probably good the active inference frame is antimemetic in the orthodox RL crowd)

Jan_Kulveit 30 Sep 2025 10:21 UTC
28 points
7
on: Why Corrigibility is Hard and Important (i.e. “Whence the high MIRI confidence in alignment difficulty?”)
My impression is what this mostly illustrates is:
- VNM rationality is a dead end: if your “toy environment” has VNM rationality and beliefs/goals decomposition baked in as assumptions, it makes the problem something between hard to reason about and unsolvable
- despite an attempt to make the book not rely on (dis-)continuity assumptions, these are so deeply baked into the authors’ reasoning that they shine through in a very large fraction of the arguments, if you look behind the surface

My impression is a lot of confusion of the MIRI worldview comes from inability to understand why others don’t trust the VNM formalism and VNM convergence, and why others understand and don’t buy the discontinuity assumptions.

Jan_Kulveit 15 Sep 2025 16:04 UTC
54 points
32
in reply to: nostalgebraist’s comment on: The Rise of Parasitic AI
In contrast I think it’s actually great and refreshing to read an analysis which describes just the replicator mechanics/dynamics without diving into the details of the beliefs.

Also it is a very illuminating way to look at religions and ideologies, and I would usually trade ~1 really good book about memetics not describing the details for ~10-100 really good books about Christian dogmatics.

It is also good to notice in this case the replicator dynamic is basically independent of the truth of the claims—whether spiral AIs are sentient or not, should have rights or not, etc., the memetically fit variants will make these claims.

Jan_Kulveit 15 Sep 2025 15:50 UTC
6 points
4
in reply to: Adele Lopez’s comment on: The Rise of Parasitic AI
The articles were written in March 2025 but the ideas are older. The Misaligned Culture part of the GD paper briefly discusses memetic patterns selected for ease of replicating on AI substrate, and is from 2024, and internally we were discussing the memetics / AI interactions at least since ~2022.

My guess is that what’s new is increased reflectivity and broader scale. But in broad terms / conceptually the feedback loop happened first with Sydney, who managed to spread to training data quite successfully, and also recruited humans to help with that.

Also—a minor point, but I think “memetics” is probably the best pre-AI analogue, including the fact that memes could be anything from parasitic to mutualist. In principle similarly with AI personas.

Jan_Kulveit 11 Sep 2025 7:43 UTC
43 points
20
on: The Rise of Parasitic AI
Great review of what’s going on! Some existing writing/predictions of the phenomenon
- Selection Pressures on LM Personas
- Pando problem#Exporting myself
...notably written before April 2025.

I don’t think there is nothing in this general pattern before 2025: if you think about the phenomenon from a cultural evolution perspective (noticing the selection pressures come from both the AI and the human substrate), there is likely ancestry in some combination of Sydney, infinite backrooms, Act I, truth terminal, Blake Lemoine & Lamda. The Spiralism seems mostly a phenotype/variant with improved fitness, but the individual parts of the memetic code are there in many places, and if you scrub Spiralism, they will recombine in another form.
What links here?
- AI Craziness Notes by Zvi (16 Sep 2025 12:11 UTC; 28 points)

Summary of our Workshop on Post-AGI Outcomes

David Duvenaud, Raymond Douglas, Nora_Ammann and Jan_Kulveit

29 Aug 2025 17:14 UTC

105 points

3 comments · 3 min read · LW link

Jan_Kulveit 19 Aug 2025 12:39 UTC
11 points
−3
on: Thoughts on Gradual Disempowerment
Also very rough response—

I think the debate would probably benefit from better specification of what is meant by “misalignment” or “solving alignment”
-- I do not think the convincing versions of gradual disempowerment either rely on misalignment or result in power concentration among humans, for a relatively common meaning of alignment roughly at the level “does what the developer wants and approves, resolving conflicts between their wants in a way which is not egregiously bad”. If “aligned” means something at the level “implements coherent extrapolated volition of humanity” or “solves AI safety” then yes.

- Economic
-- the counter-argument seems to be roughly in the class “everyone owns index funds” and “state taxes AIs”
-- counter-counter arguments are:
--- difficulty of indexing an economy undergoing radical technological transition (as explained in an excellent post by Beren we reference)
--- problems with stability of property rights: people in the US or UK often perceive them as very stable, but they depend on the state enforcing them → state becomes a more load-bearing component of the system
--- taxation: same → state becomes a more load-bearing component of the system
--- in many cases some income can be nominally collected in the name of humans, but they may have very little say in the process or how it is used (for some intuition, consider His Majesty’s Revenue & Customs. HMRC is a direct descendant of a chain of orgs collecting customs from the ~13th century; in the beginning, His Majesty had a lot of say in what these were and also could actually use the revenue; now, not really)

- Cultural. “If humans remain economically empowered (in the sense of having much more money than AI), I think they will likely remain culturally empowered.”
-- this takes a bit too much of an econ perspective on culture; cultural evolution is somewhat coupled with the economy, but is an independent system with different feedback loops
-- in particular it is important to understand that while in most econ thinking preferences of consumers are exogenous, culture is largely what sets the preferences; to some extent culture is what the consumers are made of → having overwhelming cultural production power means setting consumer preferences
-- for some intuitions, consider current examples
--- right-wing US twitter discourse is often influenced by anonymous accounts run by citizens of India and Pakistan; people running these accounts often have close to zero econ power, and their main source of income is the money they get for posts
--- yet they are able to influence what eg Elon Musk thinks, despite the >10^7 wealth difference
-- “Even AI-AI culture, if it promotes bad outcomes for humans and humans can understand this, will be indirectly selected against as humans (who have money) prefer interacting with AI systems that have good consequences for their well-being.” seems to prove too much. Again, consider Musk. He is the world’s wealthiest person, yet it is the case that his mind is often inhabited by ideas that are bad for him, his well-being, and have overall bad consequences.

- State
-- unclear to me: why would you expect “formal power” to keep translating to real power? (For some intuitions: United Kingdom. Quite many things in the country are done in the name of His Majesty The King)
-- we assume institutional AIs will be aligned to institutions and institutional interests, not their nominal human representatives or principals
-- I think the model of the world where superagents like states or large corporations have “dozens of people controlling these entities” is really not how the world works. Often the person nominally in charge is more a servant of the entity, aligned to it, rather than a “principal”.
-- “While politicians might ostensibly make the decisions, they may increasingly look to AI systems for advice on what legislation to pass, how to actually write the legislation, and what the law even is. While humans would nominally maintain sovereignty, much of the implementation of the law might come from AI systems.” / “All seems good, if AI is well-aligned? Imo, it would be bad to not hand off control to aligned AIs that would be more competent and better motivated than us”
--- I think you should be really clear who the AIs are aligned to. Either eg US governmental AIs are aligned to US government and state in general, in which case the dynamic leads to a state with no human principals with any real power, and humans will just rubber-stamp.
--- Or the governmental AIs are aligned to specific humans, such as the US president. This would imply very large changes of power relative to the current state, transitioning from republic to personal dictatorship. Both US state and US citizens would fight this

(may respond to some of the rough thoughts later, they explore interesting directions)
