AI Safety person currently working on multi-agent coordination problems.
Jonas Hallgren
I’m no pro at the creative stuff, but I’ve found that when I allow myself to have a fun zone where I just produce stuff whenever things appear, I seem to have generally better thoughts. (Similar to the more procedural virtue ethics models that you have.)
I spend at least 5-10 hours a week in this space and I think it is correlated with thinking that anything goes. I kind of have a separate evaluator system that I apply in other circumstances. No impact evaluation, just cool thoughts in this space. I also have a new saying that I like to follow: “if it’s fun, it’s fine.”
It also coincides nicely with a specific type of meditative skill, a version of the problem-solving walk described in the appendix of The Mind Illuminated, about holding different kinds of mental spaces for different purposes (see Appendix B here). Also, if you believe in a constructivist theory of emotion, then being more attuned with your emotions is also being more in tune with your research taste, as you notice smaller pointers to useful bits of information!
A bit random, but hopefully a bit relevant.
It is good that the tradition of Agent Foundations has had a fascination with crystals before as well. I in fact think that we need to embrace the crystal side of rationality more, and so I thank you for bringing this gem (pun intended) forth.
Interesting points!
I will admit to not being the most well-versed in crystals, so that might be true. I’m wondering if there’s some sort of thing where you can think about a specific ontology of the world as one structure and another ontology as another structure?
For example, we know that there are incompatible worldviews out there, and an easy example is politics. The ground state of the news is the same (the entropic solution), and the view that you pick up of that base state is going to be different depending on how you process it?
Processing information is therefore kind of like taking a soup of information and crystallizing it?
I like to think about utility functions and these crystalline surfaces and such as interpretive frames. I think it is within chapter 6 of Gödel, Escher, Bach that he talks about a jukebox and how it only knows how to play a specific disc by being preprogrammed with a tape reader. I think these crystals are the same, and that it is some sort of combination of value function and ontology that creates a way of thinking. It is a way of interpreting information that comes in, and it can only be shown through the specific type of crystalline structure that appears?
I think there’s something interesting here from an information theory perspective but I might just be overinterpreting and analogising a connection that isn’t there.
Fair points across the board. In retrospect I think I shouldn’t have released this post in the state it was in, but I wanted to have something more to bite into, as I felt there wasn’t enough in the original.
The audience I had in mind here was less RL-based utility learning setups and more of a focus on devinterp perspectives, but I don’t know that area well enough to write something good about it.
So lesson learnt, and thank you for the feedback.
Crystals in NNs: Technical Companion Piece
Have You Tried Thinking About It As Crystals?
If you want to look more into the symmetry learning direction, I like GDL (geometric deep learning) as a way of thinking about it:
(More canonical resource:)
http://geometricdeeplearning.com/
(My favourite explainer:)
https://arxiv.org/abs/2508.02723
I’m curious about the details of your model when it comes to long-time-horizon planning:
There is often a recursive structure to doing tasks that take months to years, of which a simplified version might look like:
Decompose the task into subtasks
Attempt to solve a subtask
If you succeed, go on to the next subtask
If you fail, either try again, i.e. (2), or revisit your overall plan, i.e. (1)
The skills needed to execute this loop well include:
Making and revising plans
Intuitions for promising directions
Task decomposition
Noticing and correcting mistakes
Let’s call this bucket of skills long-horizon agency skills. It seems like for long enough tasks, these are the primary skills determining success, and importantly they are applied recursively many times. Such that, improving at long-horizon tasks is mostly loaded on improving at long-horizon agency, while improving at short-horizon tasks is mostly loaded on improving at specialized knowledge.
I do understand that these are more the justifications for why you might extrapolate data in the way that you’re doing, yet I find myself a bit concerned with the lack of justification for this (in the post). This might just be for infohazard reasons, in which case, fair enough.
For example, I feel that this definition above applies to something like a bacterial colony developing antibiotic resistance:
It can make plans that maintain phenotypic diversity through bet-hedging: multiple strategies held in parallel. When antibiotics hit, resistant variants proliferate while sensitive variants die off. The population “revises its plan” through differential survival.
It essentially develops intuitions about where to go: evolved response patterns encode which resistance mechanisms work against which threats. These are “intuitions” (read: search strategies and priors) optimized over billions of years.
We have task decomposition, since subpopulations differentiate into functional roles: resistance expressers who bear metabolic costs, sensitive free-riders who benefit from herd protection, dormant persisters who survive through inactivity, and “scout” cells that test environmental conditions.
We also have error correction since failed strategies get pruned through death. Successful strategies get amplified through reproduction.
Now the above example is obviously not the thing that you’re trying to talk about. The point I’m trying to make is that your planning definition applies to a bacterial colony, and that it therefore is not specific enough?
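To make that concrete, here is a minimal sketch (entirely my own construction, with made-up function names and thresholds) of the quoted loop, written generically enough that a bacterial-colony-style instantiation of “decompose” and “attempt” satisfies it just as well as a deliberate planner would:

```python
import random

def solve_long_horizon(task, decompose, attempt, max_revisions=10, max_retries=3):
    """The quoted loop, stated generically:
    (1) decompose, (2) attempt each subtask, (3) retry on failure or revise the plan."""
    for _ in range(max_revisions):            # make / revise the overall plan
        subtasks = decompose(task)
        plan_worked = True
        for sub in subtasks:                  # attempt each subtask
            if not any(attempt(sub) for _ in range(max_retries)):
                plan_worked = False           # give up on this plan, replan
                break
        if plan_worked:
            return True                       # all subtasks solved
    return False

# A "bacterial colony" instantiation: decomposition is phenotypic bet-hedging,
# and "attempting" a subtask is just differential survival under a threshold.
def bacterial_decompose(task):
    return [random.random() for _ in range(5)]   # parallel strategies held at once

def bacterial_attempt(strategy, threshold=0.6):
    return strategy > threshold                  # survives the antibiotic or not

print(solve_long_horizon("resist antibiotic", bacterial_decompose, bacterial_attempt))
```

Nothing in the loop itself distinguishes deliberate planning from variation-plus-selection; all of that structure lives inside “decompose” and “attempt”, which is roughly my worry about the definition.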
In order to differentiate between a bacterial colony and a human, there is a set of specific properties that I feel need more discussion to make the model rigorous:
What about self-representations (e.g. boundaries) as computational shortcuts to model your action policies as consistent in your environment?
What about task compositionality (specialized tasks combining)?
What about the challenges of online learning and the specific computational laws that show up there?
What about the specialized learning apparatus that we have in our brains?
Maybe a bacterial colony and humans are on the same planning spectrum and there’s some sort of search-based version of the bitter lesson that says that “compute is all you need”, yet it feels like there are phase transitions in between bacterial colonies and humans, and that this is not a continuous model. Does compute give you self-representations? Does compute enable you to do online learning? Does compute + search give you the planning apparatus and memory bank that the brain has?
How do you know that 12+ hour tasks don’t require a set of representations that are not within what your planning model is based on? How do you know that this is not true for 48+ hour tasks?
To be clear, I applaud the effort of trying to forecast the future, and if you can convince me that I’m wrong here it will definitely shorten my timelines. It makes sense to try the most obvious thing first, and assuming a linear relationship seems like the most obvious thing. (Yet I still have the nagging suspicion that the basis of your model is wrong, as there are probably hidden phase transitions in planning function between a bacterial colony and a human.)
TL;DR
I guess the question I’m trying to ask is: What do you think the role of simulation and computation is for this field?
Longer:
Okay, this might be a stupid thought, but one could potentially consider MARL environments, for example https://github.com/metta-AI/metta (softmax), to be a sort of generator function for these sorts of reward functions?
Something something, it is easier to program constraints on the reward function and have gradient descent discover it than it is to fully generate it from scratch.
I think there’s mainly a lot of theory work that’s needed here, but there might be something to be said about having a simulation part as well, where you do some sort of combinatorial search for good reward functions?
(Yes, the thought that it will solve itself if we just bring it into a cooperative or similar MARL scenario and then do IRL on that is naive, but I think it might be an interesting strategy if we think about it as a combinatorial search problem that needs to satisfy certain requirements?)
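To gesture at what I mean by combinatorial search over reward functions, here’s a toy sketch (the feature names and constraints are entirely made up, and this has nothing to do with metta’s actual API): parameterize candidate reward functions as weights over features, then prune the space with hand-written constraints before any expensive MARL rollouts.

```python
import random

# Hypothetical features an agent's reward could weight in a MARL gridworld.
FEATURES = ["own_resources", "team_resources", "harm_to_others", "exploration"]

def sample_reward_weights():
    """One candidate reward function = one weight per feature."""
    return {f: random.uniform(-1, 1) for f in FEATURES}

def satisfies_constraints(w):
    """Constraints we program in by hand; rollouts / IRL would then only ever
    see candidates that already pass this cheap filter."""
    cooperative = w["team_resources"] > 0 and w["harm_to_others"] < 0
    not_degenerate = abs(w["own_resources"]) + abs(w["exploration"]) > 0.1
    return cooperative and not_degenerate

candidates = [sample_reward_weights() for _ in range(10_000)]
viable = [w for w in candidates if satisfies_constraints(w)]
print(f"{len(viable)} / {len(candidates)} candidate reward functions pass the constraints")
```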
Nor is this process about reality (as many delusional Buddhists seem to insist), but more like choosing to run a different OS on one’s hardware.
(I kind of wanted to give some nuance on the reality part from the OS-swapping perspective. You’re of course right that some overzealous people believe they’ve found god and similar, but I think there’s more nuance here.)
If we instead take your perspective of an OS swap, I would say it is a bit like switching from Windows to Linux because you get less bloatware. To be more precise, one of the main parts of the swap is the lessening of the entrenchment of your existing priors. It’s gonna take you a while to set up a good distro, but you will be less deluded as a consequence and also closer to “reality”, if reality is the ability to see what happens with the underlying bits in the system. As a consequence you can choose from more models and you start interpreting things more in real time, and thus you’re closer to reality: what is happening now rather than the story of your last 5 years.
Finally, on the pain of the swap: there are also more gradual forms of this; you can try out Ubuntu (mindfulness, loving-kindness) before switching over. Seeing through your existing stories can happen in degrees; you don’t have to become enlightened to enjoy the benefits?
Also, I think that terminology can lead to specific induced states as it primes your mind for certain things.
One of the annoying things with meditation is of course that there’s only n=1 primary experience, which makes it hard to talk about, yet from my perspective it seems a bit like insight cycling, the dark night of the soul, and the hell realms are something that can be related to a hyperstition or a specific way of practicing?
If you for example follow the Thai Forest tradition, Mahamudra, or Dzogchen (potentially Advaita, though I’m less certain), it seems that insights along those lines are more a consequence of not having established a strong enough 1-to-1 correspondence with loving awareness before doing intense concentration meditation? (Experience has always been happening, yet the basis for that experience might be different.)
It is a bit like the difference between dissolving into a warm open bath or a warm embrace or hug of the world versus seeing through the world to an abyss where there is no ground. That groundlessness seems to be shaped by what is there to meet it and so I’m a bit worried about the temporal cycling language as it seems to predicate a path on what has no ground?
I don’t really have a good solution here, as people seem to be going through the sort of experiences that you’re talking about, and it isn’t like I’ve not gotten depressive episodes after longer meditation experiences either. Yet I don’t know if I would call it a dark night of the soul, for it implies a necessity of identification with the suffering, and that is not what is primary? Language is a prior for experience, and so I would just use different language myself, but whatever.
Man, I’m noticing this is hard to put into words. Hopefully some of it made sense, and I appreciate the effort toward a more standardised cybernetic basis for talking about these things.
dissolution of desire. An altered trait where your brain’s reinforcement learning algorithm is no longer abstracted into desire-as-suffering.
Would you analogize this term to the insights of “dukkha”? I find an important thing here to be the equal taste of joy and sorrow from the perspective of dukkha and so it might be worth emphasising? (maybe I’m off with that though.)
Here’s an extension of what you said in terms of dullness and sharpness within attention based practices. (Partly to check that I understand)
Dullness = subcriticality and distance in cascading below the criticality line
Monkey mind = supercriticality and cascading above the criticality line (activates for whatever shows up)
If we look at the 10 stages of TMI (the 9-stage Elephant Path), the progression goes something like: distracted mind → subcriticality (stages 2-3) → practices to increase cascading in the brain (stages 4-5) → practices for attention to calibrate around the criticality line (stages 6-10)
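If it helps, here’s a toy branching-process sketch of that sub-/super-critical framing (my own mapping, with made-up offspring distributions and labels, not anything from your post): each active unit triggers some number of others with a mean branching ratio, and the three regimes fall out of where that ratio sits relative to 1.

```python
import random

def cascade_size(branching_ratio, cap=5_000):
    """Each active unit produces 0 or 2 'children' with mean = branching_ratio.
    Sub-critical (<1) cascades die out, super-critical (>1) cascades blow up,
    and ~1 hovers around the critical line."""
    active, total = 1, 1
    while active and total < cap:
        active = sum(2 if random.random() < branching_ratio / 2 else 0
                     for _ in range(active))
        total += active
    return total

for label, ratio in [("dullness (sub-critical)", 0.7),
                     ("calibrated attention (~critical)", 1.0),
                     ("monkey mind (super-critical)", 1.3)]:
    sizes = [cascade_size(ratio) for _ in range(200)]
    print(f"{label:32s} mean cascade size ~ {sum(sizes) / len(sizes):7.1f}")
```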
Also this is why the tip to meet your meditation freshly wherever it is appearing is important because it is a criticality tuning process that is different for everyone?
(I very much like this way of thinking about this, nice!)
Intuition Pump: The AI Society
Based on a true map of the territory. (I really like this advice; a good exploration strategy seems similar to the one about taking photographs: it is really just about taking a bunch of them and you’ll learn what works over time.)
I really appreciated this post.
I didn’t know that you had concepts for aliveness and boggling within the rationality sphere, as I find these two of the most precious states that I’ve been cultivating over the last couple of years, and they’ve always felt semi-orthogonal to more classic rationality (which I associate more with the betting, TDT, and deep empiricism stuff).
Meditation seems to bring aliveness, boggling, and focusing forth quite well, and I just really appreciate that they’re things you place high value on, as I find them some of the best ways of getting out of pre-existing frames. (Which for me seems like one of the best ways of becoming more rational.)
On character alignment for LLMs.
I would like to propose that we think of a John Rawls-style original position (https://en.wikipedia.org/wiki/Original_position) as one view when looking at character prompting for LLMs. More specifically, I would want you to imagine that you’re on a social network or similar, and that you’re put into a world with a mixture of AI and human systems: how do you program the AI in order to make the situation optimal? You’re a random person among all of the people, which means that some AIs are aligned to you and some are not. Most likely, the majority of AIs will be run by larger corporations, since the number of AIs you have will be proportional to the power you have.
How would you prompt each LLM agent? What are their important characteristics? What happens if they’re thought of as “tool-aligned”?
If we’re getting more internet-based over time and AI systems are more human-like in that they can flawlessly pass the Turing test, I think veil-of-ignorance-style thinking becomes more and more applicable.
Think more of how you would design a society of LLMs, and of what happens if the entire society of LLMs has this alignment rather than just the individual LLM.
This is a nice way to get around the problems raised in Andrew Critch’s post on consciousness as well, since it is a lot less conflationary.
Adults will pre-mortem plan by assuming most of their plans will fail. They will therefore have a dozen layers of backups and action plans prepared in advance. This is also so that other people can feel that they’re safe, because someone has already planned for this. (“Yeah, I knew this would happen” type of vibe.)
The question would then be “How will my first 10 layers of plans go wrong and how can I take this into account?”
A quick example might look like this:
Race dynamics will happen, therefore we need controls and coordination (90%)
Coordination will break down, therefore we need better systems (90%)
We will find these better systems, yet they will not be implemented by default (85%)
We will therefore have to develop new coordination systems, yet they can’t be too big since they won’t be implemented otherwise (80%)
We should therefore work on developing coordination systems, proving that they empirically work, and then scaling them
(Other plan maybe around tracking compute or other governance measures)
And I’ve now coincidentally arrived at the same place as Audrey Tang...
But we need more layers than this because systems will fail in various more ways as well!
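As a rough sanity check on why more layers are needed: if you read the percentages above as credences in each step of that chain and (heroically) assume independence, the whole chain only holds about 55% of the time. A quick sketch:

```python
# Credences taken from the layers above; independence is a big assumption.
steps = {
    "race dynamics -> need controls and coordination": 0.90,
    "coordination breaks down -> need better systems": 0.90,
    "better systems exist but aren't implemented by default": 0.85,
    "new systems must stay small enough to be adopted": 0.80,
}

p_chain_holds = 1.0
for claim, p in steps.items():
    p_chain_holds *= p

print(f"P(every step holds)          ~ {p_chain_holds:.2f}")      # ~ 0.55
print(f"P(the plan breaks somewhere) ~ {1 - p_chain_holds:.2f}")  # ~ 0.45
```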
Cool stuff, I like this direction. Some random thoughts from having thought about this before (not necessarily related to the BT method and similar, as that seems more like a nice way to potentially measure it):
In order to get a good predictive orthogonal basis of values in LLMs, I think you would need some sort of correlational clustering model, since you need somewhere to start.
We want a robust basis because the thing that partly makes value learning of AIs useful is knowing how they will act in the future. To know this we need to target robust features, and that is easier if we can establish a set of values that are as uncorrelated with each other as possible.
I think that shard theory makes sense for both humans and LLMs, and that we have specific values, like virtue-style thinking, that show up in different contexts, and so it is a good idea to potentially map them out.
As a consequence I think this paper by the Meaning Alignment Institute is quite undervalued because they construct a methodology to create a relational value graph that maps on really well to contextual value representations: https://arxiv.org/abs/2404.10636
The problem is of course that, due to character-level representations and simulacra, you will have multiple different characters, but it would be interesting to see the extent to which a character shows up underneath the existing system.
There are a bunch of existing human value maps that are just about coming up with names for things, mapping data onto an orthogonal basis, and then displaying that data, Big Five traits being one of them; there are also some policy-focused versions of these.
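To make the correlational clustering / orthogonal basis idea a bit more concrete, here is a minimal sketch (the value statements are made up and the embeddings are random placeholders; a real version would use an actual embedding model over elicited values, e.g. the MAI-style value cards):

```python
import numpy as np

# Hypothetical elicited value statements; embeddings are random stand-ins.
statements = [
    "Be honest even when it costs you",
    "Protect the vulnerable",
    "Keep your commitments",
    "Stay curious about other views",
    "Avoid causing harm",
    "Respect people's autonomy",
]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(statements), 32))

# Correlational structure between values: which statements co-vary.
corr = np.corrcoef(embeddings)

# PCA (via SVD) to get an approximately orthogonal "value basis".
centered = embeddings - embeddings.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
basis = vt[:3]                   # top 3 uncorrelated directions
coords = centered @ basis.T      # each statement expressed in that basis

print("pairwise correlations:\n", np.round(corr, 2))
print("coordinates in the 3-dim value basis:\n", np.round(coords, 2))
```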
Anyways, just a bunch of ideas here but it would be really cool if someone continued this line of work.