I’ve found the AI Village amusing when I can catch glimpses of it, but I wasn’t aware of a regular digest. Is https://theaidigest.org/village/blog what you are referring to?
Domenic
These posts always leave me feeling a little melancholy that my life doesn’t seem to have that many challenges where thinking faster/better/harder/sooner would actually help.
Most of my waking hours are spent on my job, where cognitive performance is not at all the bottleneck. (I honestly believe that if you made me 1.5x “better at thinking”, this would not give a consistent boost in output as valued by the business. I’m a software engineer.) I have some intellectual spare-time hobbies, but the most demanding of them is Japanese studying, which is more about volume, exposure, and spaced repetition than clever strategies. I am intrigued by making myself more productive in my programming side projects, but I think the biggest force multiplier for me there is learning how to leverage AI agents more effectively. (Besides the raw time savings, rapid iteration speed can also lessen the need for thinking of the right solution the first time around.)
I can easily see how this would be an important skill for someone doing novel academic-ish research, however. And I wish some of the examples were about that, instead of Thinking Physics and video games!
It’s interesting to compare this to the other curated posts I got in my inbox over the last week, What is malevolence? and How will we update about scheming. Both of those (especially the former) I bounced off of due to length. But this one I stuck with for quite a while, before I started skimming in the worksheet section.
I think the instinct to apply a length filter before sending a post to many people’s inboxes is a good one. I just wish it were more consistently applied :)
Finding non-equilibrium quantum states would be evidence of pilot wave theory since they’re only possible in a pilot wave theory.
If you can find non-equilibrium quantum states, they are distinguishable. https://en.m.wikipedia.org/wiki/Quantum_non-equilibrium
(Seems pretty unlikely we’d ever be able to definitively say a state was non-equilibrium instead of some other weirdness, though.)
I can help confirm that your blind assumption is false. Source: my undergrad research was with a couple of the people who have tried hardest, which led to me learning a lot about the problem. (Ward Struyve and Samuel Colin.) The problem goes back to Bell and has been the subject of a dedicated subfield of quantum foundations scholars ever since.
This many years distant, I can’t give a fair summary of the actual state of things. But a possibly unfair summary based on vague recollections is: it seems like the kind of situation where specialists have something that kind of works, but people outside the field don’t find it fully satisfying. (Even people in closely adjacent fields, i.e. other quantum foundations people.) For example, one route I recall abandons using position as the hidden variable, which makes one question what the point was in the first place, since we no longer recover a simple manifest image where there is a “real” notion of particles with positions. And I don’t know whether the math fully worked out all the way up to the complexities of the standard model weakly coupled to gravity. (As opposed to, e.g., only working with spin-1/2 particles, or something.)
Now I want to go re-read some of Ward’s papers...
This is great; until Spotify is ready, this will be the best way to share on social media.
May I suggest adding lyrics, either in the description or as closed captions or both?
If you are willing to share, can you say more about what got you into this line of investigation, and what you were hoping to get out of it?
For my part, I don’t feel like I have many issues/baggage/trauma, so while some of the “fundamental debugging” techniques discussed around here (like IFS or meditation) seem kind of interesting, I don’t feel too compelled to dive in. Whereas, techniques like TYCS or jhana meditation seem more intriguing, as potential “power ups” from a baseline-fine state.
So I’m wondering if your baseline is more like mine, and you ended up finding fundamental debugging valuable anyway.
It seems we have very different abilities to understand Holtman’s work and find it intuitive. That’s fair enough! Are you willing to at least engage with my minimal-time-investment challenge?
The agent is indifferent between creating stoppable and unstoppable subagents, but it goes back to being corrigible in this way. The “emergent incentive” handwave is only necessary for the subagents working on sub-goals (section 8.4), which is something neither Soares et al. nor your post that we’re commenting on is prepared to tackle, although it would make for interesting followup work.
I suggest engaging with the simulator. It very clearly shows that, given the option of creating shutdown-resistant successor agents, the agent does not do so! (Figure 11) If you believe it doesn’t work, you must also believe there’s a bug in the simulation, or some mis-encoding of the problem in the simulation. Working that out, either by forking his code or by working out an example on paper, would be worthwhile. (Forking his code is not recommended, as it’s in Awk; I have an in-progress reimplementation in optimized-for-readability TypeScript which might be helpful if I get around to finishing it. But especially if you simplify the problem to a 2-step setting like your post, computing his correction terms on paper seems very doable.)
I agree with the critique that some patches are unsatisfying. I’m not sure how broadly you are applying your criticism, but to me the ones involving constant offsets (7.2 and 8.2) are not great. However, at least for 7.2, the paper clarifies what’s going on reasonably well: the patch is basically environment-dependent, and in the limit where your environment is unboundedly hostile (e.g., an agent controls unbounded utility and is willing to bribe you with it) you’re going to need an unbounded offset term.
I found that the paper’s proof was pretty intuitive and distilled. I think it might be for you as well if you did a full reading.
At a meta-level, I’d encourage you to be a bit more willing to dive into this work, possibly including the paper series it’s part of. Holtman has done some impressive work on formalizing the shutdown problem better than Soares et al., or this post we’re commenting on. He’s given not only rigorous mathematical proofs, but also a nice toy-universe simulation which makes the results concrete and testable. (Notably, the simulation helps make it obvious how Soares et al.’s approach has critical mathematical mistakes and cannot be implemented; see appendix C.) The followup papers, which I’m still working through, port the result to various other paradigms such as causal influence diagrams. Attempting to start this field over as if there’s been no progress on the shutdown problem since Soares et al. seems… wasteful at best, and hubristic at worst.
If you want to minimize time investment, then perhaps the following is attractive. Try to create a universe specification similar to that of Holtman’s paper, e.g. world state, available actions, and utility function before and after shutdown as a function of the world state, such that you believe that Holtman’s safety layer does not prevent the agent from taking the “create an unstoppable sub-agent” action. I’ll code it up, apply the correction term, and get back to you.
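To make the shape of that ask concrete, here is a rough TypeScript sketch of what I mean by a universe specification. All type and field names here are my own invention for illustration, not from Holtman’s paper:

```typescript
// A hypothetical sketch (my own names, not Holtman's) of the pieces such a
// universe specification would need: world state, actions, transitions, and
// utility functions before and after the shutdown button is pressed.

type WorldState = {
  buttonPressed: boolean;        // has the shutdown button been pressed?
  paperclips: number;            // production so far
  unstoppableSubagent: boolean;  // did we spawn a shutdown-resistant successor?
};

type Action = "produce" | "createUnstoppableSubagent" | "wait";

type UniverseSpec = {
  initialState: WorldState;
  actions: Action[];
  transition: (s: WorldState, a: Action) => WorldState;
  utilityBeforePress: (s: WorldState) => number; // while the button is unpressed
  utilityAfterPress: (s: WorldState) => number;  // once shutdown is requested
};

// An example instance: a paperclip maker that could entrench itself.
const spec: UniverseSpec = {
  initialState: { buttonPressed: false, paperclips: 0, unstoppableSubagent: false },
  actions: ["produce", "createUnstoppableSubagent", "wait"],
  transition: (s, a) => {
    switch (a) {
      case "produce":
        return { ...s, paperclips: s.paperclips + 1 };
      case "createUnstoppableSubagent":
        return { ...s, unstoppableSubagent: true };
      default:
        return s;
    }
  },
  utilityBeforePress: (s) => s.paperclips,
  // After a press, continued production is worthless, and an unstoppable
  // successor is actively bad:
  utilityAfterPress: (s) => (s.unstoppableSubagent ? -1 : 0),
};
```

If you can fill in a spec of roughly this shape where you expect the safety layer to fail, that’s exactly the artifact I’d need in order to code it up and apply the correction term.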
Are you aware of how Holtman solved MIRI’s formulation of the shutdown problem in 2019? https://arxiv.org/abs/1908.01695, my summary notes at https://docs.google.com/document/u/0/d/1Tno_9A5oEqpr8AJJXfN5lVI9vlzmJxljGCe0SRX5HIw/mobilebasic
Skimming through your proposal, I believe Holtman’s correctly-constructed utility function correction terms would work for the scenario you describe, but it’s not immediately obvious how to apply them once you jump to a subagent model.
This is a well-executed paper that indeed shakes some of my faith in ChatGPT/LLMs/transformers with its negative results.
I’m most intrigued by their negative result for GPT-4 prompted with a scratchpad. (B.5, figure 25.) This is something I would have definitely predicted would work. GPT-4 shows enough intelligence in general that I would expect it to be able to follow and mimic the step-by-step calculation abilities shown in the scratchpad, even if it were unable to figure out the result one- or few-shot (B.2, figure 15).
But, what does this failure mean? I’m not sure I understand the authors’ conclusions: they state (3.2.3) this “suggests that models are able to correctly perform single-step reasoning, potentially due to memorizing such single-step operations during training, but fail to plan and compose several of these steps for an overall correct reasoning.” I don’t see any evidence of that in the paper!
In particular, 3.2.3 and figure 7’s categorization of errors, as well as the theoretical results they discuss in section 4, give me the opposite impression. Basically they say that if you make a local error, it’ll propagate and screw you up. You can see, e.g., in figure 7’s five-shot GPT-4 example, how a single local error at graph layer 1 causes propagation error to start growing immediately. Later, more local errors kick in, but to me this is sort of understandable: once the calculation starts going off the rails, the model might not be in a good place to do even local reasoning.
I don’t see what any of this has to do with planning and composing! In particular I don’t see any measurement of something like “set up a totally wrong plan for multiplying numbers” or “fail to compose all the individual digit computation-steps into the final answer-concatenation step”. Such errors might exist, but the paper doesn’t give examples or any measurements of them. Its categorization of error types seems to assume that the model always produces a computation graph, which to me is pretty strong evidence of planning and composing abilities!
Stated another way: I suspect that if you eliminated all the local errors, accuracy would be good! So the question is: why is GPT-4 failing to multiply single-digit numbers sometimes, in the middle of these steps?
(It’s possible the answer lies in tokenization difficulties, but it seems unlikely.)
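To illustrate why I’d expect accuracy to recover if local errors were eliminated, here’s a toy sketch of my own (not the paper’s methodology, and simplified: it sums positional products rather than writing out explicit carry steps) showing how one bad single-digit product corrupts the final answer:

```typescript
// Long multiplication as a chain of single-digit operations. Injecting one
// "local error" (an off-by-one in a single digit product) is enough to make
// the final answer wrong: every later partial sum inherits the mistake.

function longMultiply(a: number, b: number, faultyStep = -1): number {
  const aDigits = [...String(a)].reverse().map(Number); // least-significant first
  const bDigits = [...String(b)].reverse().map(Number);
  let total = 0;
  let step = 0;
  for (let i = 0; i < bDigits.length; i++) {
    for (let j = 0; j < aDigits.length; j++) {
      let product = aDigits[j] * bDigits[i]; // the "memorized" single-step op
      if (step === faultyStep) product += 1; // one injected local error...
      total += product * 10 ** (i + j);      // ...propagates into the total
      step++;
    }
  }
  return total;
}
```

With two 5-digit operands that inner loop runs 25 times; a scratchpad transcript is only correct if all 25 products (plus the carries and final sum) are. So even a modest per-step error rate pushes end-to-end accuracy toward zero, which is consistent with the observed failures without implicating planning or composition.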
OK, now let’s look at it from another angle: how different is this from humans? What’s impressive to me about this result is that it is quite different. I was expecting to say something like, “oh, not every human will be able to get 100% accuracy on following the multiplication algorithm for 5-digit-by-5-digit numbers; it’s OK to expect some mistakes”. But, GPT-4 fails to multiply 5x5 digit numbers every time!! Even with a scratchpad! Most educated humans would get better than zero accuracy.
So my overall takeaway is that local errors are still too prevalent in these sorts of tasks. Humans don’t always make >=1 mistake on { one-digit multiplication, sum, mod 10, carry over, concatenation } when performing 5-digit by 5-digit multiplication, whereas GPT-4 supposedly does.
Am I understanding this correctly? Well, it’d be nice to reproduce their results to confirm. If they’re to be believed, I should be able to ask GPT-4 to do one of these multiplication tasks with a scratchpad, and always find an error in the middle. But when trying to reproduce their results, I ran into an issue of under-documented methodology (how did they compose the prompts?) and non-published data (what inaccurate things did the models actually say?). Filed on GitHub; we’ll see if they get back to me.
Regarding grokking, they attempt to test whether GPT-3 finetuned on these sorts of problems will exhibit grokking. However, I’m skeptical of this attempt: they trained for 60 epochs for zero-shot and 40 epochs with scratchpads. Whereas the original grokking paper used between 3,571 epochs and 50,000 epochs.
(I think epochs is probably a relevant measure here, instead of training steps. This paper does 420K and 30K steps whereas the original grokking paper does 100K steps, so if we were comparing steps it seems reasonable. But “number of times you saw the whole data set” seems more relevant for grokking, in my uninformed opinion!)
Has anyone actually seen LLMs (not just transformers) exhibit grokking? A quick search says no.
I’ve had a hard time connecting John’s work to anything real. It’s all over Bayes nets, with some theorems coming out of it that are apparently obviously true (https://www.lesswrong.com/posts/2WuSZo7esdobiW2mr/the-lightcone-theorem-a-better-foundation-for-natural?commentId=K5gPNyavBgpGNv4m3).
In contrast, look at work like Anthropic’s superposition solution, or the representation engineering paper from CAIS. If someone told me “I’m interested in identifying the natural abstractions AIs use when producing their output”, that is the kind of work I’d expect. It’s on actual LLMs! (Or at least “LMs”, for the Anthropic paper.) They identified useful concepts like “truth-telling” or “Arabic”!
In John’s work, his prose often promises he’ll point to useful concepts like different physics models, but the results instead seem to operate on the level of random variables and causal diagrams. I’d love to see any sign this work is applicable toward real-world AI systems, and can, e.g., accurately identify what abstractions GPT-2 or LLaMA are using.
Interesting post. One small area where I might have a useful insight:
A lot of online multiplayer games rest on the appeal of their character design. Think of Smash Bros, Overwatch, or League of Legends. Characters’ unique abilities give rise to a dense hypergraph of strategic relationships which players will want to learn in full.
But in these games, a character cannot have unique motivations. They’ll have a backstory that alludes to some, but in the game, that will be forgotten. Instead, every mind will be turned towards just one game and one goal: Kill the other team, whoever they are. MostPointsWins forbids the expression of the most interesting dimensions of personality.
So imagine how much richer expressions of character could be if you had this whole other dimension of gameplay design to work with. That would be cohabitive.

Role-playing games, including online multiplayer RPGs (“MMORPGs”), often achieve this. In SWTOR, when I play my Empire-hating fallen-to-the-dark-side Jedi Knight, versus my heart-of-gold bounty hunter, or my murderously-insane Sith Inquisitor, I make very different choices even when faced with the same content. This is most obviously manifest in the dialogue-tree choices for the main story, but extends to other aspects of gameplay as well: whether I take a stealth route or go on a murderous rampage; whether my character does all the side missions or jumps straight to the main objective; which companions I choose to bring along; and even what clothes I wear.
This role playing mostly does go out the window in the “most intense” parts of the game (PvP and raids). Although sometimes over voice chat we’ll briefly slip into character for fun, or compliment each other on a particularly on-point use of abilities. (“Just like a smuggler, to stun all the trash.”)
One key ingredient here is strong archetypes from an existing background (e.g. the Star Wars galaxy in my case, or the Warcraft universe in others, or maybe just fantasy in general). Another is the long-lasting investment and relationship one builds with their character, from design onward; I feel a lot more connection to my SWTOR characters than I do to the random personalities I pick up in any given Betrayal at House on the Hill session.
Anyway, maybe there’s something to learn here for your game design!
I wonder if more people would join you on this journey if you had more concrete progress to show so far?
If you’re trying to start something approximately like a new field, I think you need to be responsible for field-building. The best type of field-building is showing that the new field is not only full of interesting problems, but tractable ones as well.
Compare to some adjacent examples:
Eliezer had some moderate success building the field of “rationality”, mostly through explicit “social” field-building activities like writing the sequences or associated fanfiction, or spinning off groups like CFAR. There isn’t much to show in terms of actual results, IMO; we haven’t developed a race of Jeffreyssai supergeniuses who can solve quantum gravity in a month by sufficiently ridding themselves of cognitive biases. But the social field-building was enough to create a great internet social scene of like-minded people.
MIRI tried to kickstart a field roughly in the cluster of theoretical alignment research, focused around topics like “how to align AIXI”, decision theories, etc. In terms of community, there are a number of researchers who followed in these footsteps, mostly at MIRI itself to my knowledge, but also elsewhere. (E.g. I enjoy @Koen.Holtman’s followup work such as Corrigibility with Utility Preservation.) In terms of actual results, I think we see a steady stream of papers/posts showing slow-but-legible progress on various sub-agendas here: infra-bayesianism, agent foundations, corrigibility, natural abstractions, etc. Most (?) results seem to be self-published to miri.org, the Alignment Forum, or the arXiv, and either don’t attempt or don’t make it past peer review. So those who are motivated to join a field by legible incentives such as academic recognition and acceptance are often not along for the ride. But it’s still something.
“Mainstream” AI alignment research, as seen by the kind of work published by OpenAI, Anthropic, DeepMind, etc., has taken a much more conventional approach. People in this workstream are employed at large organizations that pay well; they publish in peer-reviewed journals and present at popular conferences. Their work often has real-world applications in aligning or advancing the capabilities of products people use.
In contrast, I don’t see any of this sort of field-building work from you for meta-philosophy. Your post history doesn’t seem to be trying to do social field-building, nor does it contain published results that could make others sit up and take notice of a tractable research agenda they could join. If you’d spent age 35-45 publishing a steady stream of updates and progress reports on meta-philosophy, I think you’d have gathered at least a small following of interested people, in the same way that the theoretical alignment research folk have. And if you’d used that time to write thousands of words of primers and fanfic, maybe you could get a larger following of interested bystanders. Maybe there’s even something you could have done to make this a serious academic field, although that seems pretty hard.
In short, I like reading what you write! Consider writing more of it, more often, as a first step toward getting people to join you on this journey.
I continue to be intrigued about the ways modern powerful AIs (LLMs) differ from the Bostrom/Yudkowsky theorycrafted AIs (generally, agents with objective functions, and sometimes specifically approximations of AIXI). One area I’d like to ask about is corrigibility.
From what I understand, various impossibility results on corrigibility have been proven. And yet, GPT-4 is quite corrigible. (At the very least, in the sense that if you go unplug it, it won’t stop you.) Has anyone analyzed which preconditions of the impossibility results have been violated by GPT-N? Do doomers have some prediction for how GPT-N for N >= 5 will suddenly start meeting those preconditions?
I am sympathetic to this viewpoint. However, I think there are large-enough gains to be had from “just” an AI that: matches genius-level humans; has N times larger working memory; thinks N times faster; has “tools” (like calculators or Mathematica or Wikipedia) integrated directly into its “brain”; and is infinitely copyable. That gets you to https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message territory, which is quite x-risky.
Working memory is a particularly powerful intuition pump here, I think. Given that you can hold 8 items in working memory, it feels clear that something which can hold 800 or 8 million such items would feel qualitatively impressive. Even if it’s just a quantitative scale-up.
You can then layer on more speculative things from the ability to edit the neural net. E.g., for humans, we are all familiar with flow states where we are at our best, most-well-rested, most insightful. It’s possible that if your brain is artificial, reproducing that on demand is easier. The closest thing we have to this today is prompt engineering, which is a very crude way of steering AIs to use the “smarter” path when it’s navigating through their stack of transformer layers. But the existence of such paths likely means we, or an introspective AI, could consciously promote their use.
I don’t think the lack of machine translations of AI alignment materials is holding the field back in Japan. Japanese people have an unlimited amount of that already available. “Doing it for them”, when their web browser already has the feature built in, seems honestly counterproductive, as it signals how little you’re willing to invest in the space.
I think it’s possible increasing the amount of human-translated material could make a difference. (Whether machines are good enough to aid such humans or not, is a question I leave to the professional translators.)
Yes, I would really appreciate that. I find this approach compelling in the abstract, but what does it actually cash out in?
My best guess is that it means lots of mechanistic interpretability research, identifying subsystems of LLMs (or similar) and trying to explain them, until eventually they’re made of less and less Magic. That sounds good to me! But what directions sound promising there? E.g. the only result in this area I’ve done a deep dive on, Transformers learn in-context by gradient descent, is pretty limited as it only gets a clear match for linear (!) single-layer (!!) regression models, not anything like an LLM. How much progress does Conjecture expect to really make? What are other papers our study group should read?
Echoing what others have said here, this article was quite well-written. It felt well suited for people who do not know much about the field, with good analogies, recaps of foundational concepts, and links to various fun events not everyone will have caught (e.g. DeepThink switching to Chinese, or Golden Gate Claude). But it did that without being grating toward those of us who have been following along and for whom much of this was review, which is especially impressive.
You might want to consider pitching this, or your future writing, to larger outlets than LessWrong! I imagine your writing would be a perfect fit for detail-loving places like Quanta or Asterisk, but maybe larger online outlets (The Verge? AnandTech? I don’t know what people read these days) would be interested.