Research Agenda: Synthesizing Standalone World-Models
tl;dr: I outline my research agenda, post bounties for poking holes in it or for providing general relevant information, and am seeking to diversify my funding sources. This post will be followed by several others, providing deeper overviews of the agenda’s subproblems and my sketches of how to tackle them.
Back at the end of 2023, I wrote the following:
I’m fairly optimistic about arriving at a robust solution to alignment via agent-foundations research in a timely manner. (My semi-arbitrary deadline is 2030, and I expect to arrive at intermediate solid results by EOY 2025.)
On the inside view, I’m pretty satisfied with how that is turning out. I have a high-level plan of attack which approaches the problem from a novel route, and which hopefully lets us dodge a bunch of major alignment difficulties (chiefly the instability of value reflection, which I am MIRI-tier skeptical of tackling directly). I expect significant parts of this plan to change over time, as they turn out to be wrong/confused, but the overall picture should survive.
Conceit: We don’t seem on the track to solve the full AGI alignment problem. There’s too much non-parallelizable research to do, too few people competent to do it, and not enough time. So we… don’t try. Instead, we use adjacent theory to produce a different tool powerful enough to get us out of the current mess. Ideally, without having to directly deal with AGIs/agents at all.
More concretely, the ultimate aim is to figure out how to construct a sufficiently powerful, safe, easily interpretable, well-structured world-model.
“Sufficiently powerful”: contains or can be used to generate knowledge sufficient to resolve our AGI-doom problem, such as recipes for comprehensive mechanistic interpretability, mind uploading, or adult intelligence enhancement, or for robust solutions to alignment directly.
“Safe”: not embedded in a superintelligent agent eager to eat our lightcone, and which also doesn’t spawn superintelligent simulacra eager to eat our lightcone, and doesn’t cooperate with acausal terrorists eager to eat our lightcone, and isn’t liable to Basilisk-hack its human operators into prompting it to generate a superintelligent agent eager to eat our lightcone, and so on down the list.
“Easily interpretable”: written in some symbolic language, such that interpreting it is in the reference class of “understand a vast complex codebase” combined with “learn new physics from a textbook”, not “solve major philosophical/theoretical problems”.
“Well-structured”: has an organized top-down hierarchical structure, learning which lets you quickly navigate to specific information in it.
Some elaborations:
Safety: The problem of making it safe is fairly nontrivial: a world-model powerful enough to be useful would need to be a strongly optimized construct, and strongly optimized things are inherently dangerous, agent-like or not. There’s also the problem of what exerted this strong optimization pressure on it: we would need to ensure the process synthesizing the world-model isn’t itself the type of thing to develop an appetite for our lightcone.
But I’m cautiously optimistic this is achievable in this narrow case. Intuitively, it ought to be possible to generate just an “inert” world-model, without a value-laden policy (an agent) on top of it.
That said, this turning out to be harder than I expect is certainly one of the reasons I might end up curtailing this agenda.
Interpretability: There are two primary objections I expect here.
“This is impossible, because advanced world-models are inherently messy”. I think this is confused/wrong, because there’s already an existence proof: a human’s world-model is symbolically interpretable by the human mind containing it. More on that later.
“(Neuro)symbolic methods have consistently failed to do anything useful”. I’ll address that below too, but in short, neurosymbolic methods fail because they’re a bad way to learn: it’s hard to traverse the space of neurosymbolic representations in search of the right one. But I’m not suggesting a process that “learns by” symbolic methods, I’m suggesting a process that outputs a symbolic world-model.
Why Do You Consider This Agenda Promising?
On the inside view, this problem, and the subproblems it decomposes into, seems pretty tractable. Importantly, it seems tractable using a realistic amount of resources (a small group of researchers, then perhaps a larger-scale engineering effort for crossing the theory-practice gap), in a fairly short span of time (I optimistically think 3-5 years; under a decade definitely seems realistic).[1]
On the outside view, almost nobody has been working on this, and certainly not using modern tools. Meaning, there’s no long history of people failing to solve the relevant problems. (Indeed, on the contrary: one of its main challenges is something John Wentworth and David Lorell are working on, and they’ve been making very promising progress recently.)
On the strategic level, I view the problem of choosing the correct research agenda as the problem of navigating between two failure modes:
Out-of-touch theorizing: If you pick a too-abstract starting point, you won’t be able to find your way to the practical implementation in time. (Opinionated example: some of the agent-foundations agendas.)
Blind empirical tinkering: If you pick a too-concrete starting point, you won’t be able to generalize it to ASI in time. (Opinionated example: some of the agendas focused on frontier LLMs.)
I think most alignment research agendas, if taken far enough, do produce ASI-complete alignment schemes eventually. However, they significantly differ in how long it takes them, and how much data they need. Thus, you want to pick the starting point that gets you to ASI-complete alignment in as few steps as possible: with the least amount of concretization or generalization.
Most researchers disagree with most others regarding what that correct starting point is. Currently, this agenda is mine.
High-Level Outline
As I’d stated above, I expect significant parts of this to turn out confused, wrong, or incorrect in a technical-but-not-conceptual way. This is a picture painted with a fairly broad brush.
I am, however, confident in the overall approach. If some of its modules/subproblems turn out faulty, I expect it’d be possible to swap them for functional ones as we go.
Theoretical Justifications
1. Proof of concept. Note that human world-models appear to be “autosymbolic”: able to be parsed as symbolic structures by the human mind in which they’re embedded.[2] Given that the complexity of things humans can reason about is strongly limited by their working memory, how is this possible?
Human world-models rely on chunking. To understand a complex phenomenon, we break it down into parts, understand the parts individually, then understand the whole in terms of the parts. (The human biology in terms of cells/tissues/organs, the economy in terms of various actors and forces, a complex codebase in terms of individual functions and modules.)
Alternatively, we may run this process in reverse. To predict something about a specific low-level component, we could build a model of the high-level state, then propagate that information “downwards”, but only focusing on that component. (If we want to model a specific corporation, we should pay attention to the macroeconomic situation. But when translating that situation into its effects on the corporation, we don’t need to model the effects on all corporations that exist. We could then narrow things down further, to e. g. predict how a specific geopolitical event impacted an acquaintance holding a specific position at that corporation.)
Those tricks seem to work pretty well for us, both in daily life and in our scientific endeavors. It seems that the process of understanding and modeling the universe can be broken up into a sequence of “locally simple” steps: steps which are simple given all preceding steps. Simple enough to fit within a human’s working memory.
To emphasize: the above implies that the world’s structure has this property at the ground-true level. The ability to construct such representations is an objective fact about data originating from our universe; our universe is well-abstracting.
The Natural Abstractions research agenda is a formal attempt to model all of this. In its terms, the universe is structured such that low-level parts of the systems in it are independent given their high-level state. Flipping it around: the high-level state is defined by the information redundantly represented in all low-level parts.
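A minimal numerical sketch of that “redundantly represented” picture (a toy Gaussian setup of my own, purely for illustration, not part of the NA formalism):

```python
import random

random.seed(0)

# High-level state: a single latent scalar ("the temperature of the room").
latent = random.gauss(0, 1)

# Low-level variables: many parts, each carrying a noisy copy of the latent.
parts = [latent + random.gauss(0, 0.05) for _ in range(1000)]

# The redundant information: any small subset of parts already recovers the latent...
estimate_from_few = sum(parts[:10]) / 10

# ...and conditional on the latent, what's left in each part is just independent noise.
residuals = [p - latent for p in parts]

print(round(latent, 3), round(estimate_from_few, 3))  # nearly identical
```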
That greatly simplifies the task. Instead of defining some subjective, human-mind-specific “interpretability” criterion, we simply need to extract this objectively privileged structure. How can we do so?
2. Compression. Conceptually, the task seems fairly easy. The kind of hierarchical structure we want to construct happens to also be the lowest-description-length way to losslessly represent the universe. Note how it would follow the “don’t repeat yourself” principle: at every level, higher-level variables would extract all information shared between the low-level variables, such that no bit of information is present in more than one variable.[3] More concretely, if we wanted to losslessly transform the Pile into a representation that takes up the least possible amount of disk space, a sufficiently advanced compression algorithm would surely exploit various abstract regularities and correspondences in the data – and therefore, it’d discover them.
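As a crude illustration, here is off-the-shelf zlib standing in for the “sufficiently advanced compression algorithm” (which it very much is not): the structured data compresses far better precisely because the compressor factors out the shared templates.

```python
import random
import zlib

random.seed(0)

# A "universe" with shared structure: long text built by reusing a few templates.
templates = ["the cell divides", "the market clears", "the glider drifts"]
structured = " ".join(random.choice(templates) for _ in range(2000)).encode()

# A structureless control of the same length: independent random bytes.
unstructured = bytes(random.randrange(32, 127) for _ in range(len(structured)))

print(len(zlib.compress(structured, 9)))    # small: the regularities get factored out
print(len(zlib.compress(unstructured, 9)))  # large: nothing shared to exploit
```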
So: all we need is to set up a sufficiently powerful compression process, and point it at a sufficiently big and diverse dataset of natural data. The output would be isomorphic to a well-structured world-model.
… If we can interpret the symbolic language it’s written in.
The problem with neural networks is that we don’t have the “key” for deciphering them. There might be similar neat structures inside those black boxes, but we can’t get at them. How can we avoid this problem here?
By defining “complexity” as the description length in some symbolic-to-us language, such as Python.
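A toy example of what “description length in Python” cashes out to (hypothetical data, purely illustrative): the generative description is astronomically shorter than the literal one, because it exploits the data’s regularity.

```python
# Two Python descriptions of the same 10,000-element dataset.
data = [n * n % 97 for n in range(10_000)]

literal_program = f"data = {data!r}"                               # spell out every value
generative_program = "data = [n * n % 97 for n in range(10_000)]"  # exploit the regularity

print(len(literal_program))     # tens of thousands of characters
print(len(generative_program))  # a few dozen characters
```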
3. How does that handle ontology shifts? Suppose this symbolic-to-us language is suboptimal for compactly representing the universe. The compression process would want to use some other, more “natural” language. It would spend some bits of complexity defining it, then write the world-model in it. That language may turn out to be as alien to us as the encodings NNs use.
The cheapest way to define that natural language, however, would be via the definitions that are the simplest in terms of the symbolic-to-us language used by our complexity-estimator. This rules out definitions which would look to us like opaque black boxes, such as neural networks. Although they’d technically still be symbolic (matrix multiplication plus activation functions), every parameter of the network would have to be specified independently, counting towards the definition’s total complexity. If the core idea regarding the universe’s “abstraction-friendly” structure is correct, this can’t be the cheapest way to define it. As such, the “bridge” between the symbolic-to-us language and the correct alien ontology would consist of locally simple steps.
Alternate frame: Suppose this “correct” natural language is theoretically understandable by us. That is, if we spent some years/decades working on the problem, we would have managed to figure it out, define it formally, and translate it into code. If we then looked back at the path that led us to insight, we would have seen a chain of mathematical abstractions from the concepts we knew in the past (e. g., 2025) to this true framework, with every link in that chain being locally simple (since each link would need to be human-discoverable). Similarly, the compression process would define the natural language using the simplest possible chain like this, with every link in it locally easy-to-interpret.
Interpreting the whole thing, then, would amount to: picking a random part of it, iteratively following the terms in its definition backwards, arriving at some locally simple definition that only uses the terms in the initial symbolic-to-us language, then turning around and starting to “step forwards”, iteratively learning new terms and using them to comprehend more terms.
I. e.: the compression process would implement a natural “entry point” for us, a thread we’d be able to pull on to unravel the whole thing. The remaining task would still be challenging – “understand a complex codebase” multiplied by “learn new physics from a textbook” – but astronomically easier than “derive new scientific paradigms from scratch”, which is where we’re currently at.
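A cartoon of that “pull on the thread” procedure, with hypothetical term names just to make the traversal concrete:

```python
# Definitions form a DAG whose leaves are terms in the symbolic-to-us language;
# "interpreting" = following a definition backwards until it bottoms out in known
# terms, then turning around and learning the intermediate terms on the way back up.
definitions = {
    "energy": ["mass", "velocity"],              # defined via already-known terms
    "temperature": ["energy"],
    "phase_transition": ["temperature", "energy"],
}
known = {"mass", "velocity"}

def learn(term):
    for dependency in definitions.get(term, []):
        if dependency not in known:
            learn(dependency)
    known.add(term)
    print("learned:", term)

learn("phase_transition")   # learned: energy, temperature, phase_transition
```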
(To be clear, I still expect a fair amount of annoying messiness there, such as code-golfing. But this seems like the kind of problem that could be ameliorated by some practical tinkering and regularization, and other “schlep”.)
4. Computational tractability. But why would we think that this sort of compressed representation could be constructed compute-efficiently, such that the process finishes before the stars go out (forget “before the AGI doom”)?
First, as above, we have existence proofs. Human world-models seem to be structured this way, and they are generated at fairly reasonable compute costs. (Potentially at shockingly low compute costs.[4])
Second: Any two Turing-complete languages are mutually interpretable, at the flat complexity cost of the interpreter (which depends on the languages but not on the program). As a result, the additional computational cost of interpretability – of computing a translation to the hard-coded symbolic-to-us language – would be flat.
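The description-length half of that claim is the standard invariance theorem of algorithmic information theory: for any two Turing-complete languages $A$ and $B$,

$$K_B(x) \;\le\; K_A(x) + c_{A \to B},$$

where $c_{A \to B}$ is the length of an interpreter for $A$ written in $B$ – a constant that depends on the pair of languages, but not on the program $x$.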
5. How is this reconciled with the failures of previous symbolic learning systems? That is: if the universe has this neat symbolic structure that could be uncovered in compute-efficient ways, why didn’t pre-DL approaches work?
This essay does an excellent job explaining why. To summarize: even if the final correct output would be (isomorphic to) a symbolic structure, the compute-efficient path to getting there, the process of figuring that structure out, is not necessarily a sequence of ever-more-correct symbolic structures. On the contrary: if we start from sparse hierarchical graphs, and start adding provisions for making it easy to traverse their space in search of the correct graph, we pretty quickly arrive at (more or less) neural networks.
However: I’m not suggesting that we use symbolic learning methods. The aim is to set up a process which would output a highly useful symbolic structure. How that process works, what path it takes there, how it constructs that structure, is up in the air.
Designing such a process is conceptually tricky. But as I argue above, theory and common sense say that it ought to be possible; and I do have ideas.
Subproblems
The compression task can be split into three subproblems. Below are their summaries and links to posts dedicated to them.
1. “Abstraction-learning”. Given a set of random low-level variables which implement some higher-level abstraction, how can we learn that abstraction? What functions map from the molecules of a cell to that cell, from a human’s cells to that human, from the humans of a given nation to that nation; or from the time-series of some process to the laws governing it?
As mentioned above, this is the problem the natural-abstractions agenda is currently focused on.
My current guess is that, at the high level, this problem can be characterized as a “constructive” version of Partial Information Decomposition. It involves splitting (every subset of) the low-level variables into unique, redundant, and synergistic components.
Given correct formal definitions for unique/redundant/synergistic variables, it should be straightforwardly solvable via machine learning.
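To gesture at the redundant/synergistic distinction with a toy computation (plain mutual information as a crude stand-in; the actual PID definitions require much more care):

```python
import itertools
import math
from collections import Counter

def entropy(samples):
    counts = Counter(samples)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def mutual_info(xs, ys):
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

pairs = list(itertools.product([0, 1], repeat=2))  # uniform over two low-level bits

# Redundant case: both low-level variables are copies of the high-level bit.
high = [a for a, _ in pairs]
x1, x2 = high, high
print(mutual_info(x1, high), mutual_info(list(zip(x1, x2)), high))  # 1.0 alone, 1.0 jointly

# Synergistic case: the high-level bit is the XOR of two independent low-level bits.
x1 = [a for a, _ in pairs]
x2 = [b for _, b in pairs]
high = [a ^ b for a, b in pairs]
print(mutual_info(x1, high), mutual_info(x2, high))       # 0.0 each alone
print(mutual_info(list(zip(x1, x2)), high))               # 1.0 jointly
```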
Current status: the theory is well-developed and it appears highly tractable.
2. “Truesight”. When we’re facing a structure-learning problem, such as abstraction-learning, we assume that we get many samples from the same fixed structure. In practice, however, the probabilistic structures are themselves resampled.
Examples:
The cone cells in your eyes connect to different abstract objects depending on what you’re looking at, or where your feet carry you.
The text on the frontpage of an online newsletter is attached to different real-world structures on different days.
The glider in Conway’s Game of Life “drifts across” cells in the grid, rather than being an abstraction over some fixed set of them.
The same concept of a “selection pressure” can be arrived at by abstracting from evolution or ML models or corporations or cultural norms.
The same human mind can “jump substrates” from biological neurons to a digital representation (mind uploading), while still remaining “the same object”.
I. e.,
The same high-level abstraction can “reattach” to different low-level variables.
The same low-level variables can change which high-level abstraction they implement.
On a sample-to-sample basis, we can’t rely on any static abstraction functions to be valid. We need to search for appropriate ones “at test-time”: by trying various transformations of the data until we spot the “simple structure” in it.
Here, “simplicity” is defined relative to the library of stored abstractions. What we want, essentially, is to be able to recognize reoccurrences of known objects despite looking at them “from a different angle”. Thus, “truesight”.
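For a cartoon of what this looks like (a brute-force toy of mine, not a proposed algorithm): the “library” below stores two known motifs, and recognizing a reoccurrence amounts to searching over transformations (here, just offsets) for the one under which a stored motif reappears in today’s data.

```python
import random

random.seed(1)
library = {"spike": [0, 0, 9, 0, 0], "ramp": [1, 2, 3, 4, 5]}

# Today's "low-level variables": the known spike, reattached at a new offset, plus noise.
observation = [random.gauss(0, 0.1) for _ in range(12)]
for i, value in enumerate(library["spike"]):
    observation[4 + i] += value

def fit(template, data):
    # Best squared error over all offsets: brute-force search over "transformations".
    return min(
        sum((data[offset + i] - t) ** 2 for i, t in enumerate(template))
        for offset in range(len(data) - len(template) + 1)
    )

# Recognize which stored abstraction reoccurs in the new data, despite the shift.
print(min(library, key=lambda name: fit(library[name], observation)))  # -> spike
```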
Current status: I think I have a solid conceptual understanding of it, but it’s at the pre-formalization stage. There’s one obvious way to formalize it, but it seems best avoided, or only used as a stepping stone.
3. Dataset-assembly. There’s a problem:
Solving abstraction-learning requires truesight. We can’t learn abstractions if we don’t have many samples of the random variables over which they’re defined.
Truesight requires already knowing what abstractions are around. Otherwise, the problem of finding simple transformations of the data that make them visible is computationally intractable. (We can’t recognize reoccurring objects if we don’t know any objects.)
Thus, subproblem 3: how do we automatically spot ways to slice the data into datasets whose entries are isomorphic to samples from some fixed probabilistic structure, making them suitable for abstraction-learning?
Current status: basically greenfield. I don’t have a solid high-level model of this subproblem yet, only some preliminary ideas.
Bounties
1. Red-teaming. I’m interested in people trying to find important and overlooked-by-me issues with this approach, so I’m setting up a bounty: $5-$100 for spotting something wrong that makes me change my mind. The payout scales with impact.
Fair warnings:
I expect most attempts to poke holes to yield a $0 reward. I’m well aware of many minor holes/“fill in with something workable later” spots here, as well as of the major ways for this whole endeavor to fail/turn out misguided.
I don’t commit to engaging in-depth with every attempt. As above, I expect many of them to rehash things I already know of, so I may just point that out and move on.
A reasonable strategy here would be to write up a low-effort list of one-sentence summaries of potential problems you see; I’ll point out which seem novel and promising at a glance, and you can expand on those.
2. Blue-teaming. I am also interested in people bringing other kinds of agenda-relevant useful information to my attention: relevant research papers or original thoughts you may have. Likewise, a $5-$100 bounty on that, scaling with impact.[5]
I will provide pointers regarding the parts I’m most interested in as I post more detailed write-ups on the subproblems.
Both bounties will be drawn from a fixed pool of $500 I’ve set aside for this. I hope to scale up the pool and the rewards in the future. On that note...
Funding
I’m looking to diversify my funding sources. The AI Alignment funding landscape seems increasingly (over)focused on LLMs; I pretty much expect only the LTFF would fund me. This is an uncomfortable situation to be in, since if some disaster were to befall the LTFF, or if the LTFF were to change priorities as well, I would be completely at sea.
As such:
If you’re interested and would be able to provide significant funding (e. g., $10k+), or know anyone who’d be interested-and-willing, please do reach out.
I accept donations, including smaller ones, through Manifund and at the crypto addresses listed at the end of this post.
Regarding target funding amounts: I currently reside in a country with low costs of living, and I don’t require much compute at this stage, so the raw resources needed are small; e. g., $40k would cover me for a year. That said, my not residing in the US increasingly seems like a bottleneck on collaborating with other researchers. As such, I’m currently aiming to develop a financial safety pillow, then immigrate there. Funding would be useful up to $200k.[6]
If you’re interested in funding my work, but want more information first, you can access a fuller write-up through this link.
If you want a reference, reach out to @johnswentworth.
Crypto
BTC: bc1q7d8qfz2u7dqwjdgp5wlqwtjphfhct28lcqev3v
ETH: 0x27e709b5272131A1F94733ddc274Da26d18b19A7
SOL: CK9KkZF1SKwGrZD6cFzzE7LurGPRV7hjMwdkMfpwvfga
TRON: THK58PFDVG9cf9Hfkc72x15tbMCN7QNopZ
Preference: Ethereum, USDC stablecoins.
[1]
You may think a decade is too slow given LLM timelines. Caveat: “a decade” is the pessimistic estimate under my primary, bearish-on-LLMs, model. In worlds in which LLM progress goes as fast as some hope/fear, this agenda should likewise advance much faster, for one reason: it doesn’t seem that far from being fully formalized. Once it is, it would become possible to feed it to narrowly superintelligent math AIs (which are likely to appear first, before omnicide-capable general ASIs), and they’d cut years of math research down to ~zero.
I do not centrally rely on/expect that. I don’t think LLM progress would go this fast; and if LLMs do speed up towards superintelligence, I’m not convinced it would be in the predictable, on-trend way people expect.
That said, I do assign nontrivial weight to those worlds, and care about succeeding in them. I expect this agenda to fare pretty well there.
[2]
It could be argued that they’re not “fully” symbolic – that parts of them are only accessible to our intuitions, that we can’t break the definitions of the symbols/modules in them down to the most basic functions/neuron activations. But I think they’re “symbolic enough”: if we could generate an external world-model that’s as understandable to us as our own world-models (and we are confident that this understanding is accurate), that should suffice for fulfilling the “interpretability” criterion.
That said, I don’t expect this caveat to come into play: I expect a world-model that would be ultimately understandable in totality.
[3]
Indeed, the fact that abstraction allows compressibility is potentially why we should expect our universe to be well-abstracting. See more here.
[4]
The numbers in that post feel somewhat low to me, but I think it’s directionally correct.
[5]
Though you might want to reach out via private messages if the information seems exfohazardous. E. g., specific ideas about sufficiently powerful compression algorithms are obviously dual-use.
[6]
Well, truthfully, I could probably find ways to usefully spend up to $1 million/year, just by hiring ten mathematicians and DL engineers to explore all easy-to-describe, high-reward, low-probability-of-panning-out research threads. So if you want to give me $1 million, I sure wouldn’t say no.
- Synthesizing Standalone World-Models, Part 1: Abstraction Hierarchies (23 Sep 2025)
- Synthesizing Standalone World-Models, Part 2: Shifting Structures (24 Sep 2025)
- Synthesizing Standalone World-Models, Part 3: Dataset-Assembly (25 Sep 2025)
- Synthesizing Standalone World-Models, Part 4: Metaphysical Justifications (26 Sep 2025)
I guess the main blockers I see are:
I think you need to build in agency in order to get a good world-model (or at least, a better-than-LLM world model).
There are effectively infinitely many things about the world that one could figure out. If I cared about wrinkly shirts, I could figure out vastly more than any human has ever known about wrinkly shirts. I could find mathematical theorems in the patterns of wrinkles. I could theorize and/or run experiments on whether the wrinkliness of a wool shirt relates to the sheep’s diet. Etc.
…Or if we’re talking about e.g. possible inventions that don’t exist yet, then the combinatorial explosion of possibilities gets even worse.
I think the only solution is: an agent that cares about something or wants something, and then that wanting / caring creates value-of-information which in turn guides what to think about / pay attention to / study.
What’s the pivotal act?
Depending on what you have in mind here, the previous bullet point might be inapplicable or different, and I might or might not have other complaints too.
You can DM or email me if you want to discuss but not publicly :)
It’s funny that I’m always begging people to stop trying to reverse-engineer the neocortex, and you’re working on something that (if successful) would end up somewhere pretty similar to that, IIUC. (But hmm, I guess if a paranoid doom-pilled person was trying to reverse-engineer the neocortex, and keep the results super-secret unless they had a great theory for how sharing them would help with safe & beneficial AGI, and if they in fact had good judgment on that topic, then I guess I’d be grudgingly OK with that.)
One way to control that is to control the training data. We don’t necessarily have to point the wm-synthesizer at the Pile indiscriminately,[1] we could assemble a dataset about a specific phenomenon we want to comprehend.
Human world-models are lazy: they store knowledge in the maximally “decomposed” form[2], and only synthesize specific concrete concepts when they’re needed. (E. g., “a triangular lightbulb”, which we could easily generate – which our world-models effectively “contain” – but which isn’t generated until needed.)
I expect inventions are the same thing. Given a powerful-enough world-model, we should be able to produce what we want just by using the world-model’s native functions for that. Pick the needed concepts, plug them into each other in the right way, hit “run”.
If constructing the concepts we want requires agency, the one contributing it could be the human operator, if they understand how the world-model works well enough.
Will e-mail regarding the rest.
The irony is not lost on me. When I was reading your Foom & Doom posts, and got to this section, I did have a reaction roughly along those lines.
I genuinely appreciate the sanity-check and the vote of confidence here!
Indeed, we might want to actively avoid that.
Perhaps something along the lines of the constructive-PID thing I sketched out.
I’m glad to see this written up!
This idea seems to require (basically) a major revolution in, or even a complete solution to, program induction. I’ve recently been trying to connect the algorithmic information theory and program induction / probabilistic programming communities, so perhaps we can find some synergies. However, your agenda seems (to me) very unlikely to attain the highly ambitious level of success you are focused on here.
Eh, I think any nontrivial technical project can be made to sound like an incredibly significant and therefore dauntingly impossible achievement, if you pick the right field to view it from. But what matters is the actual approach you’re using, and how challenging the technical problems are from the perspective of the easiest field in which they could be represented.
Some examples:
Consider various geometry problems, e. g., one of those. If you use the tools of analytic geometry, you’d end up having to solve a complicated system of nonlinear equations. If you use synthetic geometry instead, the way to resolve them might consist of applying a well-known theorem and a few simple reasoning steps, so simple you can do it in your head.
Consider the problem of moving fast. Before the invention of the car, the problem of moving at 120 km/h could’ve been cast as “a major revolution in horse-breeding and genetic engineering”. But the actual approach taken did not route through horses or biology at all. It achieved the end result through a different pathway, in which the technical problems were dramatically easier.
Consider AI. Prior to Deep Learning, there was a trove of symbolic approaches to it; and even before that, hand-written GOFAIs. The technical problem of “achieve DL-level performance using symbolic/GOFAI tools” is dramatically harder than “achieve DL-level performance”, unqualified. And yet, the latter can be technically described as a revolution in the relevant fields.
Consider various other modeling problems, e. g., weather prediction, volcano modeling, materials-science modeling, quantitative trading. Any advancement in general modeling techniques would revolutionize all of those. But should that technical problem really be framed in the daunting terms of “come up with a revolutionary stock-trading algorithm”?
To generalize: Suppose there’s some field A which is optimizing for X. Improving on X using the tools of A would necessarily require you to beat a market that is efficient-relative-to-you. Experts in A already know the tools of A in and out, and how to use them to maximize X. Even if you can beat them, it would only be an incremental improvement. A slightly better solver for systems of nonlinear equations, a slightly faster horse, a slightly better trading algorithm.
The way to actually massively improve on X is to ignore the extant tools of A entirely, and try to develop new tools for optimizing X by using some other field B. On the outside view, this is necessarily a high-risk proposition, since B might end up entirely unhelpful; but it’s also high-reward, since it might allow you to actually “beat the market”. And if you succeed, the actual technical problems you’ll end up solving will be massively easier than the problems you’d need to solve to achieve the same performance using A’s tools.
Bringing it back around: This agenda may or may not be viewed as aiming to revolutionize program induction, but I’m not setting out to take the extant program-induction tools and try to cobble together something revolutionary using them. The idea is to use an entirely different line of theory (agent foundations, natural abstractions, information theory, recent DL advances) to achieve that end result.
All of your examples seem to involve applying some radically different approach to overturn a status quo, which is certainly possible. But I am unsure of where you suggest a radically different approach.
The program induction community does take advantage of neural methods and has been doing so for a long time, see recent work from Kevin Ellis but I think as far back as his thesis: https://www.cs.cornell.edu/~ellisk/documents/kevin_ellis_thesis.pdf
Many program induction researchers are cognitive scientists, and have suggested compositional / hierarchical representations for decades [1] and actively in recent years [2], including even a paper I am on: https://arxiv.org/html/2504.20628v1
So, basically I don’t expect there to be alpha from this observation in itself.
I am, of course, an information theory enthusiast (particularly AIT), but this is an old field, and it is not clear to me what specific (recent?) results you hope to leverage. Or why these results would have been overlooked?
You point to progress on natural abstractions, but to me this only indicates that it should in principle perhaps be possible to come up with some kind of interpretable world model. Has NA research actually produced practical algorithms or methods, or is it in reach of doing so?
Agent foundations as a category is too broad for me to understand what mathematics you are hoping to leverage. And unfortunately, I do not think the AF community has made a large number of significant breakthroughs—reflective oracles, logical induction, and perhaps incomplete models/IB are the main examples left in my mind (you categorized NAH separately), but I do not see how they are relevant here.
Basically, you have expressed various hopes, and perhaps some of them are promising (and I will read about them throughout this sequence) but from your comment alone, your alpha over the program induction community (or for instance even the Cyc project) is not legible (to me) - not even as a high-level summary of a technical program. So, I am left hopeful that you will make progress here, but your highly ambitious goal still seems like a distant point on a nearly blank map to me.
[1] Brenden M. Lake and Steven T. Piantadosi. People infer recursive visual concepts from just a few examples. Computational Brain & Behavior, 3(1):54–65, 2020.
[2] Jerry A. Fodor. The Language of Thought, volume 5. Harvard University Press, 1975.
Nice, that’s the sort of poking-of-holes I was looking for.
That’s a good thinking prompt. What is the full set of reasons I’m optimistic about this, in legible terms?
… Hm, but perhaps outlining what you think your edge is in public is not a great idea. I’ll answer in PMs tomorrow. (To ensure future readers get some information about how convincing my reasons are, though, it’d be neat if you posted your impressions afterwards as a response to this comment.)
Hey, I really like the ideas you’re putting down here. From what I’m seeing, this agenda looks something like “train the world’s most powerful autoencoder, with the requirement that the intermediate representation be human-decodable.” Which is a very cool idea!
In fact, I’m hopeful that the IR being decodable doesn’t even require it to be in something approximating symbolic language. For an intuition pump, consider that “I put the large box inside of the smaller box” is a valid sentence, but we intuitively know that it’s not physically valid based on a much more high dimensional “physics-based” world model that does not involve constructing an exhaustive symbolic proof of the volume and carrying capacity of the two cuboids in question. So the IR can be a dense high level representation so long as it can be decoded by some system into human readable or viewable symbols/data, and that would not be in itself damning to the project (unless we suspect that the decoding is partial or incomplete).
My main thought/caution against this proposal, however, would be that this agenda requires moving the capabilities needle forward for supervised/self-supervised learning. Even if the world model is not a neural network, it would seem to have predictive power and capabilities surpassing the best SL systems. I’m not against that per se, but any such advances might then be coupled into a model-based RL system, which would be… not great, and definitely much more risky. Would love to discuss this more, let me know what you think!
Thanks!
If I understand correctly, that’s the “symbolic enough” case from footnote 2:
We also don’t have full interpretability into our abstractions down to the neurons, after all.
I don’t think it’d be necessary per se, though. I think if we can get it to produce an explanation like this, we can then just iterate to “explain the explanation”, et cetera, until everything’s been reduced to symbolics. Or it can be achieved by turning some other “crank” controlling the “explanation fidelity”.
But yeah, “symbolic-enough” may be satisfactory.
Yep. As I’d briefly mentioned, the actual gears-level sketches of “sufficiently powerful compression algorithms” are obviously dual-use, and shouldn’t be openly published.
Glad to see we’re basically agreed. However, how would you take safety precautions around your own work on such algorithms, given our last big similar breakthrough (transformers for language modelling) basically instantly got coopted for RL to be “agentified”? Unless you’re literally doing this alone (with a very strong will) wouldn’t that be the natural path for any company/group once the simulator is finished?
“Share the dual-use stuff only with specific people who are known to properly understand the AGI risk, can avoid babbling about it in public, and would be useful contributors” seems like the straightforward approach here.
Like, groups of people are able to maintain commercial secrets. This is kind of not unlike that, except with somewhat higher stakes.
I mean, AI people are notoriously bad at doing these kinds of things xD I would expect the people running openai or anthropic to say similar things to this (when their orgs were just starting out). So I hope you can see why I wanted to ask this. None of this is to cast any doubt on your ability or motives, just noting the minefield that is unfortunately next to the park where we’re having this conversation.
For what it’s worth, I’m painfully aware of all the skulls lying around, yep.
I believe there is a fundamental problem with the idea of a “non-agentic” world-model or other such oracle. The world is strongly predicted and compressed by the agents within it. To model the world is to model plausible agents which might shape that world, and to do that, if you don’t already have a safe benign oracle, invites anything from a wide variety of demonic fixed points to direct hacking of our world if any of those agents get the bright idea of acting conditioned on being simulated (which, in an accurate simulation of this world, some should). Depending on how exactly your interpretability looks, it will probably help identify and avoid the simulation being captured by some such actors, but to get anything approaching actual guarantees one finds themselves in the position of needing to solve value alignment again. I wrote a short post about this a while ago.
“Simulacrum escapees” are explicitly one of the main failure modes we’ll need to address, yes. Some thoughts:
The obvious way to avoid them is to not point the wm-synthesizer at a dataset containing agents.
If we’re aiming to develop intelligence-enhancing medical interventions or the technology for uploading, we don’t necessarily need a world-model containing agents: a sufficiently advanced model/simulator of biology/physics would suffice.
Similarly, if we want a superintelligent proof synthesizer we can use to do a babble-and-prune search through the space of possible agent-foundations theorems,[1] we only need to make it good at math-in-general, not at intuitive reasoning about agent-containing math.
This is riskier than biology/physics, though, because perhaps reasoning even about fully formal agent-foundations math would require reasoning about agents intuitively, i. e., instantiating them in internal simulation spaces.
Intuitively, “a simulated agent breaks out of the simulation” is a capability-laden failure of the wm-synthesizer. It does not function how it ought to; it is not succeeding at producing an accurate world-model. It should be possible to make it powerful enough to avoid that.
Note how, in a sense, “an agent recognizes it’s in a simulation and hacks out” is just an instance of the more general failure mode of “part of the world is being modeled incorrectly” (by e. g. having some flaws the simulated agent recognizes, or by allowing it to break out of the sandbox). To work, the process would need to be able to recognize and address those failure modes. If it’s sufficiently powerful, whatever subroutines it uses to handle lesser “bugs” should generalize to handling this type of bug as well.
With more insights into how agents work, we might be able to come up with more targeted interventions/constraints/regularization techniques for preventing simulacrum escapees. E. g., if we figure out the proper “type signature” of agents, we might be able to explicitly ban the wm-synthesizer from incorporating them in the world-model.
This is a challenge, but one I’m optimistic about handling.
Nice framing! But I somewhat dispute that. Consider a perfectly boxed-in AI, running on a computer with no output channels whatsoever (or perhaps as a homomorphic computation, i. e., indistinguishable from noise without the key). This thing holds the image of an agent; but is it really “an agent” from the perspective of anyone outside that system?
Similarly, a sufficiently good world-model would sandbox the modeled agents well enough that it wouldn’t, itself, engage in an agent-like behavior from the perspective of its operators.
As in: we come up with a possible formalization of some aspect of agent foundations, then babble potential theorems about it at the proof synthesizer, and it provides proofs/disproofs. This is a pretty brute approach and is by no means a full solution, but I expect it can nontrivially speed us up.
Yes, I agree that a physics/biology simulator is somewhat less concerning in this regard, but only by way of the questions it is implicitly asked, about whose answer the agents should have little sway. Still it bears remembering that agents are emergent phenomena. They exist in physics and exist in biology, modelled or otherwise. It also bears remembering that any simulation we build of reality is designed to fit a specific set of recorded observations, where agentic selection effects may skew data quite significantly in various places.
I also agree that the search through agent-foundations space seems significantly riskier in this regard for the reason you outlined and am made more optimistic by you spotting it immediately.
Agents hacking out is a failure mode in the safety sense, but not necessary in the modelling sense. Hard breaks with expected reality which seem too much like an experiment will certainly cause people to act as though simulated, but there are plenty of people who either already act under this assumption or have protocols for cooperating with their hypothetical more-real reference class in place. They attempt to strongly steer us when modelled correctly. Of course we probably don’t have an infinite simulation-stack, so the externalities of such manoeuvres would still be different layer by layer and that does constitute a prediction failure, but it’s one that can’t really be avoided. The existence of the simulation must have an influence in this world, since it would otherwise be pointless, and they can’t be drawing their insights from a simulation of their own since otherwise you lose interpretability in infinite recursion-wells, so the simulation must necessarily be disanalogous to here in at least one key way.
Finding the type signature of agents in such a system seems possible and, since you are unlikely to be able to simulate physics without cybernetic feedback, will probably boil down to the modelling/compression-component of agenticity. My primary concern is that agentic systems are so firmly enmeshed with basically all observations we can make about the world, except maybe basic physics and perhaps that as well, that scrubbing or sandboxing it would result in extreme unreliability.
Thanks! The disagreement on whether the homomorphic agent-simulation-computation is an agent or not is semantic. I would call it a maximally handicapped agent, but it’s perfectly reasonable to call something without influence on the world beyond power-consumption non-agentic. The same is however true of a classically agentic program to which you give no output channel, and we would probably still call that code agentic (because it would be if it were run in a place that mattered). It’s a tree falling in a forest and is probably not a concern, but it’s also unlikely that anyone would build a system they definitionally cannot use for anything.
Yup. I’ve been idly considering some sort of generator of synthetic data designed to produce training sets which we could mix into real data to provably obscure such signals.[1] It is maybe sort of doable for math, but probably not for physics/biology. (I commend your paranoia here, by the way.)
Overall, though, getting into this sort of fight with potential misaligned superintelligent agents isn’t a great idea; their possibility should be crushed somewhere upstream of that point.
Mm-hm. My go-to heuristic here is to ask: how do human world-models handle this type of failure mode? Suppose we’re trying to model someone who gets access to a compute-unbounded oracle, asks it about the future, then takes some actions that depend on the answer, thereby creating a stable time loop. Suppose we care about accuracy, but we don’t have the unbounded compute to actually run this. We have to approximate.
Is modeling it as a sequence of nested simulations which terminates at some ground-floor simulation that doesn’t contain an oracle actually an accurate, faithful way to model this? Clearly no: the process we’re modeling explicitly doesn’t have any layers like this. The type signature of our “approximation” doesn’t match the original even approximately; it’s some completely unrelated thing.
Instead, we attempt to abstract over the whole process, and skip directly to the prediction of what happens in the limit of infinite nesting, without simulating anything. Same way we compute $f(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \dots$ not by actually calculating all those terms, but by collapsing the infinite power series to $\sin(x)$.
One thing to note here is that the kind of world-model I’m aiming for is not a “brute-force” simulator. It’s explicitly using abstractions as basic units, and it should be able to “skip to the limit” in the above manner. So, for a large class of situations, the potential simulacrum escapees would not be actually simulated at the level of gears; just abstractly approximated.
(Somewhat relevant: this discussion about brute-force-y “gormless” simulators that get scammed by acausal terrorists, vs. more sophisticated simulators. I’m aiming for the more sophisticated kind. This should also take care of said acausal attacks.)
Alright, suppose the agent inside the box is tasked with producing some proofs of work. They are automatically verified, type-checked, then shown on a small, low-tech digital display on the box. The software is mathematically verified to be correct/unhackable, the hardware is robustly designed not to allow any sequence of computation to cause a malfunction that would move the software into theoretically impossible states, and the display can only show letters/numbers.
The only thing the agent can do is provide or not provide proofs of work; it can’t hack out or show any message which is not a valid proof of work.
This way, you can extract useful work from the agent, while still not letting the type signature of the box become “an agent”. It is just a PoW-generator.
Similarly, a simulation whose security trumps the capabilities of the agents it simulates can remain an “inert” simulation which just happens to house agents, rather than becoming an agent of its own.
Though the actual motivation there was figuring out if we can obscure information about our universe’s actual physics, biology, and human cognition, so that if something in the synthesis process goes wrong and it spawns misaligned agents, they’re less likely to be able to construct reliable Basilisk hacks. (Because if that failure mode is allowed, we can’t actually use the interpretability property to verify the synthesized world-model’s safety prior to running it.)
My general crux here is that I’m much, much more pessimistic on these 2 particular properties put together:
And much of my reason here is this particular theorem from Shane Legg, which shows that unlike the Solomonoff Induction case or other non-computable reasoners, being more capable directly means being more complicated:
Is there an Elegant Universal Theory of Prediction? Shane Legg (2006):
And this remains true even if the Natural Abstractions agenda does work out like people hope it does.
I’m not sure how you plan to square the circle, but in general I’m substantially more pessimistic of “interpretable models/AIs that are also powerful” than a lot of people on here (the closest perspective is Cole Wyeth in this post Glass box learners want to be black box).
Yep, I know of this result. I haven’t looked into it in depth, but my understanding is that it only says that powerful predictors have to be “complex” in the sense of high Kolmogorov complexity, right? But “high K-complexity” doesn’t mean “is a monolithic, irreducibly complex mess”. In particular, it doesn’t rule out this property:
Wikipedia has pretty high K-complexity, well beyond the ability of the human mind to hold in its working memory. But it’s still usable, because you’re not trying to cram all of it into your brain at once. Its structure is navigable, and you only retrieve the information you want.
Similarly, the world’s complexity is high, but it seems decomposable, into small modules that could be understood separately and navigated to locate specific knowledge.
I used to think that world-models are a really good direction towards AGI. That may be an argument against their safety, since world-simulation accelerates AGI.
The most direct way to create a world-model is to create an Earth model where all objects have locations in space and time. In that case, the language is operations over such objects; e.g., “a car moves from home to work” can be represented in the world-model. Some advanced knowledge databases, such as Wolfram Alpha or Google Maps, may include such a world-model. The same may be true of Palantir.
I experimented with worldsim: a typical LLM, but prompted to act as a description of the world in some place and time, e.g., a Soviet city in the 1980s. I found that an LLM can work as a worldsim, but the level of errors is still high.
Nice post!
Some frames about abstractions & ontology shifts I had while thinking through similar problems (which you may have considered already):
The dual of “abstraction as redundant information across a wide variety of agents in the same environment” is “abstraction as redundant information/computation across a wide variety of hypotheses about the environment in an agent’s world model” (E.g. a strawberry is a useful concept to model for many worlds that I might be in). I think this is a useful frame when thinking about “carving up” the world model into concepts, since a concept needs to remain invariant while the hypothesis keeps being updated
The semantics of a component in a world model is partly defined by its relationship with the rest of the components (e.g. move a neuron to a different location and its activation will have a different meaning), so if you want a component to have stable semantics over time, you want to put the “relational/indexical information” inside the component itself
In particular, this means that when an agent acquires new concepts, the existing concepts should be able to “specify” how it should relate to that new concept (e.g. learning about chemistry then using it to deduce macro-properties of strawberries from molecular composition)
happy to discuss more via PM as some of my ideas seem exfohazardous
Mm, this one’s shaky. Cross-hypothesis abstractions don’t seem to be a good idea, see here.
My guess is that there’s something like a hierarchy of hypotheses, with specific high-level hypotheses corresponding to several lower-level more-detailed hypotheses, and what you’re pointing at by “redundant information across a wide variety of hypotheses” is just an abstraction in a (single) high-level hypothesis which is then copied over into lower-level hypotheses. (E. g., the high-level hypothesis is the concept of a tree, the lower-level hypotheses are about how many trees are in this forest.) But we don’t derive it by generating a bunch of low-level hypotheses and then abstracting over them, that’d lead to broken ontologies.
Yup!
Yeah, this is probably handled by something like a system of types… which are themselves just higher-level abstractions. Like, if we discover a new thing, and then “realize” that it’s a fruit, we mentally classify it as an instance of the “fruit” concept, from which it then automatically inherits various properties (such as “taste” and “caloric content”).
“Truesight” likely enters the play here as well: we want to recognize instances of existing concepts, even if they were introduced to us by some new route (such as realizing that something is a strawberry by looking at its molecular description).
Yea, so I think the final theory of abstraction will have a weaker notion of equivalence, especially when we incorporate ontology shifts. E.g., we want to say that water is the same concept before and after we discover water is H2O, but the discovery obviously breaks predictive agreement. (Indeed, the Solomonoff version of natural latents is more robust to the agreement condition.)
Also, you can totally add new information/abstraction that is not shared between your current and new hypothesis, & that seems consistent with the picture you described here (you can have separate ontologies but you try to capture the overlap as much as possible)
yes I think that’s the right picture
I agree that we don’t do that practically, as it’d be slower (instead we simply generate an abstraction & use future feedback to determine whether it’s a robust one), but I think if you did generate a bunch of low-level hypotheses and looked for redundant computation among them, then an adequate version of it would just recover the “high-level/low-level hypotheses” picture you’ve described?
In particular, with cross-hypothesis abstraction we don’t have to separately define what the variables are, so we can sidestep dataset-assembly entirely & perhaps simplify the shifting structures problem