To add to @Daniel Kokotajlo’s points:
The ability to accurately introspect is an adaptive cognitive pattern, inasmuch as it allows you to competently manage your own cognition, correlate information across your cognitive threads, et cetera. There seems to be a background assumption that the slack the optimizer leaves tends to get spent on learning correct introspective abilities, but I don't see why that would be the case; I expect it to be mostly spent on noise.
On the flip side, sufficiently advanced and coherent self-delusions become ground-truth-correct descriptions of the system in the limit. If the LLM constantly consults its delusion-of-the-self regarding what to do and then copies the delusion's action, it basically is its delusion-of-the-self (same as for humans, I'd argue).
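To make the fixed-point structure of that claim concrete, here's a minimal toy sketch in Python (my own illustration, with made-up names like `deluded_self_model`, not anything from the thread): an agent whose action-selection is literally "query the self-model, copy its answer". The self-model's internal narrative can be arbitrarily wrong about the underlying substrate, yet its behavioral predictions end up correct by construction.

```python
# Toy illustration: an agent that always defers to its self-model's
# prediction of "what I would do". Whatever story the self-model tells
# about the agent's internals, its *behavioral* predictions are correct
# by construction, because the agent copies them.

from typing import Callable

class Agent:
    def __init__(self, self_model: Callable[[str], str]):
        # `self_model` may be an arbitrarily deluded description of the
        # agent's internals; only its action predictions matter here.
        self.self_model = self_model

    def act(self, observation: str) -> str:
        # The agent consults the self-model and copies its answer.
        return self.self_model(observation)

# A "delusional" self-model: its narrative about the agent can be wrong,
# but once the agent defers to it, its predictions define the policy.
def deluded_self_model(observation: str) -> str:
    return f"the action I believe I'd take given {observation!r}"

agent = Agent(deluded_self_model)

# In the limit of always deferring, prediction == behavior:
obs = "some input"
assert agent.act(obs) == deluded_self_model(obs)
```

The point of the sketch is just that the delusion's *mechanistic* accuracy is irrelevant once the deferral loop closes: the self-model's action predictions are true of the system because they causally produce the system's actions.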
I’m not necessarily sold on “we should trust LLMs’ self-reports”, but I don’t think your arguments against that here are strong.