I just saw a post from AI Digest on a Self-Awareness benchmark and I just thought, “holy fuck, I’m so happy someone is on top of this”.
I noticed a deep gratitude towards the alignment community for taking this problem so seriously. I personally see many good futures, but that’s to some extent built on the trust I have in this community. I’m generally incredibly impressed by the rigorous standards of thinking, and the amount of work that’s been produced.
When I was a teenager I wanted to join a community of people who worked their asses off to make sure humanity survived into a future in space, and I’m very happy I found it.
So thank you, every single one of you working on this problem, for giving us a shot at making it.
(I feel a bit cheesy for posting this but I want to see more gratitude in the world and I noticed it as a genuine feeling so I felt fuck it, let’s thank these awesome people for their work.)
Yes, problems, yes, people are being really stupid, yes, inner alignment and all of its cousins are really hard to solve. We’re generally a bit fucked, I agree. The brick wall is so high we can’t see the edge, and we have to bash out each brick one at a time, and it is hard, really hard.
I get it, people, and yet we’ve got a shot, don’t we? The probability distribution of all potential futures is being dragged towards better futures because of the work you put in, and I’m very grateful for that.
Like, I don’t know how much credit to give LW and the alignment community for the spread of alignment and AI Safety as an idea, but we’ve literally got Nobel Prize winners talking about this shit now. Think back 4 years, what the fuck? How did this happen? 2019 → 2024 has been an absolutely insane amount of change in the world, especially from an AI Safety perspective.
How do we have over 4 AI Safety Institutes in the world? It’s genuinely mind-boggling to me, and I’m deeply impressed and inspired, and I think you should be too.
Here’s a thought I might expand on but that I’ll just mention quickly now.
I feel there’s a gap in the AI Safety field when it comes to automated alignment science.
There’s a lot of talk about verification of individual knowledge claims and the like, yet it doesn’t feel like anyone is bringing in principles from metascience and related fields? If science is a collective process of generation and verification, why are we not talking about things like distributed fault-tolerance algorithms or ways of collectively verifying knowledge, so that we know that any individual piece of work is right?
There’s an existing field of metascience, and whenever I see posts about automated alignment science I never see any discussion of it? Where are my Michael Nielsen references? Where are the ideas about proof burden and verification, and thinking about this from a philosophy of science perspective?
See the following posts as examples; I don’t see anyone citing or mentioning much metascience stuff. (I’m not saying these are bad posts, I want to point out that there’s a missing mood):
Where exactly in science do we see a one-on-one verification regime where a single human verifies that something is true or false? Byzantine fault tolerance is a distributed scheme; if you want convergent safety properties, it makes more sense to look at systems with fewer individual failure points?
(Some examples of what I think should be looked at:)
A Vision of Metascience—A generally very good article on metascience and what science could be
Knowledge Lab—A lab studying various properties about productivity in science
This is a great question; I’m definitely going to think about this more next time I’m thinking about the prospects for automating AI safety research.
One part of it is that the LessWrong rationalist tradition is generally focused on individual excellence and great-man-theories, and so a lot of those proposals feel unnatural for people here to think about.
Incidentally, I feel like my personal experience with people who are really into philosophy of science has been quite negative on average—they tend to have confusing worldviews and to-me-bizarre takes, and I tend to not find them useful to talk to. The meta-science people have seemed pretty reasonable to me, but often they haven’t seemed that AI focused.
I feel like my personal experience with people who are really into philosophy of science has been quite negative on average—they tend to have confusing worldviews and to-me-bizarre takes, and I tend to not find them useful to talk to.
Yeah, totally fair on the philosophy of science thing; I’ve mostly talked to AI and metascience people who mention principles from philosophy of science, which makes more sense to me. A little bit like how virtue ethics is nice to talk about with certain AI Safety people, whilst it’s less enjoyable to talk to a professor in virtue ethics (maybe; not too high a sample size here).
(I think James Evans from Knowledge Lab is a cool person who’s at the intersection of AI and metascience; his main work is on knowledge and improving science, and over the last 3 years he’s pivoted to how AI can help with this. An example of something he wrote is this article on Agentic AI and the next intelligence explosion.)
maybe even more generally, there is a “game of questions/problems and answers/solutions” played by humans and human communities, that one can study to become better able to create a setup in which AIs are playing this game. some questions about this game: “how does an individual human or a human community remain truth-tracking?”, “what structures can do load-bearing work in a truth-tracking system?”, “to involve a new mind in a community of truth/knowledge/understanding, what is required of the new mind and what is required of its teachers/environment?”, “what interventions make a system more truth-tracking?”, “how does one avoid meaning drift/subversion?”. this includes the science stuff you talk about but also very basic stuff like a kid learning arithmetic from their parents or humans working successfully with integrals for two centuries before we could define them rigorously — like, how come we can mostly avoid goodharting answers against the judgment of other people, how come we can mostly avoid becoming predictors of what other people would say, how come we can do easy-to-hard generalization of notions, etc..

the usual losses/setups currently used by ML practitioners might be sorta wrong for these things, and maybe one could think carefully about the human case and come up with better losses/setups to use in an epistemic system.

an obstacle is that in the human case, stuff working well is probably meaningfully aided by the agents already having shared human purposes[1][2] and by already having similar “priors” coming from the human brain architecture and similar upbringings. another obstacle is that the human thing is probably relying on various low-level things that are hard to see and that probably lack equivalents in current ML systems and are too low-level to be created by any simple intervention on a community of LLMs. another obstacle is that there are probably just very many ideas involved in making humans truth-tracking (though you can then ask: how do we set up a meta-level thing that finds and implements good ideas for how an epistemic system should work). another obstacle is that in the human case, human purposes are broadly aligned with understanding stuff better in the systems of understanding we have (whereas if we force some system of presenting understanding on the LLMs and try to get them to produce some understanding and present it legibly in that system, their purposes are probably not well-aligned by default with doing that).

(oh also, if your work results in understanding these questions well, you should worry about your work helping with capabilities. maybe don’t give capabilities researchers good answers to “how do we make it so the originators of good ideas get rewarded in an epistemic community?”, “how does one tell when a new notion is good to introduce into the shared lexicon?”, “what is the process of coming up with a good new notion like?”, “what sort of thing is a good model of a situation?”, “how does one avoid assigning a lot of resources to useless cancers like algebraic number theory?”[3].)

anyway, despite these issues, it still seems like an interesting direction to work on
copying a note i wrote for myself on a related question:
beating solomonoff induction at grokking a notion
how come as humans we can understand what someone means when using a word. as opposed to becoming a predictor of what they would say. it is possible for a human to not make the mistakes another person would make when eg classifying images for having dogs vs not! roughly speaking solomonoff would be making the same mistakes the person would make
this is a classic issue plaguing many (maybe even most?) things in alignment. eg ELK, AGI via predictive modeling, CIRL/RLHF or just pretty much anything involving human feedback
can’t we write an algo for that, and have that not be dumb like solomonoff is dumb
some ideas for ways to implement a thing that is good like this / what’s going on in making the human thing work:
an even stronger simplicity prior than solomonoff. eg if there are explainable mistakes on a simple model, you want the simple model that doesn’t predict the mistakes. this will have inf log loss but let’s just do a version of the simple hypothesis with noise, and then penalize the likelihood term less. have people not already considered this for solving the model + data split problem? does this attempt to solve the model data split problem introduce some pathologies?
you have pathology of not specifying even the hypothesis in the seq prediction case (like it’ll be better to drop bits and take the likelihood loss). but i think at least this pathology is not present in the function case, if we don’t get randomness in the universal semimeasure way (like if we make the randomness not shared between different inputs — each input has to sample its own random bits)
alternatively: just set abs bound on model complexity, rest has to be likelihood. this feels bad because if you get the bound wrong you get some nonsense. that said in a sense this is equivalent to the previous proposal (like if you pick the length bound the previous thing with some hyperparam would find). idk maybe in the function case you can look at how many bits of entropy are left given the hypothesis, like imagine this graphed as a function of hypothesis length, and like see some point at which the derivative changes or sth. (this doesn’t show up in the seq case because there it’s pretty much just 1 bit paying for 1 bit (until you specify it in full if it’s finite complexity))
simplicity prior defined in terms of existing understanding
you specify properties of the thing or notion sometimes
eg [concrete] and [abstract] make a partition of things maybe, but [alice would think this is concrete] and [alice would think this is abstract] might not. eg knowing [if something is abstract, then it usually helps a lot to study examples to understand it] can help you understand when your teacher alice is making a mistake about an abstractness claim
or eg: 1+1=2 won’t be true if you accidentally assign 1->rabbit and 2->chicken from a demonstration (for any reasonable meaning of plus)
some sort of t complexity bound might help. tho really you aren’t gaining a mechanism when you learn what a dog is. you are more like learning a new question/problem
also as a human one can just ask: what is it that this person is trying to teach me. what is this person trying to point at. this is a question you can approach like any other question
when we gain a notion, we gain sth like a question that can be asked about a thing. and we have criteria on this notion. we gain “inference rules”/”axioms” involving the notion. ultimately we are wanting it to play some role in our thought and action. that role can guide the precisification/development/reworking of the concept. the role can be communicated. it can be shared between minds
to gain the chair notion is to gain the question “is this a chair?”. this has an immediate verifier (mostly visual), but also further questions: “can i sit on it?”, “is it comfortable to sit on it?”, “would i use it when working or dining?”, “does it have a back support part and a butt support part and legs?”. a chair should support the activities of sitting and working and dining. all these can have their own immediate verifiers and further questions
we understand “is this a chair?” as clearly separate from “would the person who taught me the chair notion consider it a chair?”. it is much closer to “should the person who taught me the chair notion consider it a chair?”. it is also close to “should i consider it a chair?”
important basic point here: our dog thing is NOT a classifier. classifiers or noticing trick circuits can be attached to our dog structure but the structure is not a classifier
toy problem here: how do you pin down the notion of a proof? (how did we historically?) how do you pin down the notion of an integral? (how did we historically?) maybe study these actual examples
pinning down the notion of a proof might be a good example to study in detail. like, how does one become able to tell whether something is a good proof? a valid reasoning step? how does one start to reason validly? one reason to be interested in this is that it’s analogous to: how does one become able to tell what’s good, and come to act well? both are examples of getting some sort of normativity into a system
another example: we have a notion of truth, not just some practical thing like provability (or in a broader context supporting action well maybe). our notion of truth is separate from our notion of provability eg because we have the “axiom/principle” when talking about truth that exactly one of a sentence and its negation is true, or alternatively/equivalently we have an inference rule of going from “P is not true” to “not-P is true”, and such a rule is just not right for provability (there are sentences such that the sentence and its negation are both not provable). by gödel’s completeness theorem, i guess a fine notion of truth, ie one which has a model, is precisely one which assigns 0/1 to all sentences and is coherent under proving. we operate with truth by relying on these properties, without having a decision algorithm or even a definition for truth (cf tarski’s thm).
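one way to spell out that last claim, roughly (i might be wording it imprecisely): a “fine notion of truth” is a map $T$ from sentences to $\{0,1\}$ such that (i) exactly one of $T(P)$, $T(\neg P)$ equals $1$, and (ii) $T(Q) = 1$ whenever $\{P : T(P) = 1\} \vdash Q$. such a set $\{P : T(P) = 1\}$ is a consistent, complete, deductively closed theory, so by the completeness theorem it has a model $\mathcal{M}$ with $T(P) = 1 \iff \mathcal{M} \models P$, and conversely the theory of any model gives such a $T$.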
how did we understand what an integral is?
i think we were using integrals for like two centuries before we knew how to properly define them (eg via riemann sums). how come we were pretty successful with that? like, how come we did all this cool stuff, we came to all these correct conclusions, without properly knowing what integrals are? i think the general thing that happened is that we hypothesized an object with some properties and these properties turned out to be those of a real thing, and in fact to pin it down uniquely! though of course this leaves the following important question: how did we identify this set of properties as important?
If you penalize the complexity of my hypotheses more steeply, I can just choose a hypothesis that is a universal distribution which penalizes complexity minimally as usual. So that won’t work, at least not naively. This sort of question is studied in algorithmic statistics.
I agree with your point in the canonical solomonoff sequence prediction case. I think your point is what I mean in my note by “you have pathology of not specifying even the hypothesis in the seq prediction case (like it’ll be better to drop bits and take the likelihood loss)”. I think this pathology is maybe not present in “function solomonoff” (I state this in the note as well but don’t really explain it), though I’m very much uncertain.
to state the hopeful “function solomonoff” story in more detail:
By “function solomonoff”, I mean that we have a data set of string pairs $(x_i, y_i)$, and we think of one hypothesis as being a program that takes in an $x$ and outputs a probability distribution on strings from which $y$ is sampled. Let’s say that we are in the classification case, so $y_i \in \{0, 1\}$ always (we’re distinguishing pictures which show dogs vs ones which don’t, say).
The “canonical loss” (from which one derives a posterior distribution via exponentiation) here would be the length of the program specifying the distribution, plus the negative log likelihood it assigns to $y_i$ summed over all $i$. What I’m suggesting is this loss but with a higher coefficient on the length of the program than on the likelihood terms.
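In symbols (my shorthand, with $\ell(h)$ for the program length of hypothesis $h$ and $\alpha$ for the extra coefficient), the loss I have in mind is roughly

$$L(h) \;=\; \alpha\,\ell(h) \;+\; \sum_i -\log_2 p_h(y_i \mid x_i), \qquad \alpha > 1,$$

with the posterior over hypotheses proportional to $2^{-L(h)}$; setting $\alpha = 1$ gives back the canonical loss.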
Suppose that the classification boundary is most simply given by what we will consider a “simple model” of complexity $K$ bits, together with “systematic human error” which changes the answer from the simple model on a fraction $\epsilon$ of the inputs, with the set of those inputs taking $M$ bits to specify.
If we turned this into sequence prediction by interleaving like $x_1, y_1, x_2, y_2, \ldots$, then I’d agree that if we penalize hypothesis length more steeply than likelihood, then: over getting a model which does not predict the errors, we would get a universal-like hypothesis, which in particular starts to predict the human errors after being conditioned on sufficiently many bits. So the idea of more steep penalization of hypothesis length doesn’t do what we want in the sequence prediction case. But I have some hope that the function case doesn’t have this pathology?
Some models of the given data in the function case:
the “good model”: The distribution is given by the simple model with a probability $\epsilon$ of flipping the answer on top (independently on each input). This gets complexity loss like $K$ plus something small for specifying the flip probability, and its expected neg log likelihood is $H(\epsilon)$ bits per data point (with $H$ the binary entropy).
the “model that learns the errors”: This should generically take $K + M$ bits to specify, and it gets $\approx 0$ expected neg log likelihood.
the “50/50 random distribution” model: This takes $O(1)$ bits to specify and has $1$ bit of expected neg log likelihood per data point.
some “universal hypothesis model”: I’m not actually sure what this would even be in the function setting? If you handled the likelihood part by giving a global string of random bits which gets conditioned on other input-output pairs, then I agree we could write something bad just like in the sequence prediction case. But if each input gets its own private randomness, then I don’t see how to write down a universal hypothesis that gets good loss here.
So at least given these models, it looks like the “good model” could be a vertex of the convex hull of the set of attainable (hypothesis complexity, expected neg log likelihood) tuples? If it’s on the convex hull, it’s picked out by some loss of the form described (even in the limit of many data points, though we will need to increase the hypothesis term coefficient compared to the sum of log likelihoods term as the data set size increases, ie in the bayesian picture we will need to pick a stronger prior when the data set is larger in this example).
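To make the convex-hull picture concrete, here is a quick numerical sketch (the particular values of $K$, $M$, $\epsilon$ and the data set size $n$ below are made up purely for illustration):

```python
import math

# Illustrative made-up numbers (not from the argument above):
K = 1_000        # bits to specify the simple model
M = 100_000      # bits to specify the set of inputs with systematic human error
eps = 0.05       # fraction of inputs on which the human errs
n = 1_000_000    # number of (x, y) pairs in the data set

def binary_entropy(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def loss(complexity_bits, nll_bits_per_point, alpha):
    """alpha * (program length in bits) + total expected neg log likelihood in bits."""
    return alpha * complexity_bits + n * nll_bits_per_point

for alpha in [1, 5, 50]:
    good   = loss(K,     binary_entropy(eps), alpha)  # simple model + eps flip noise
    errors = loss(K + M, 0.0,                 alpha)  # also memorises the systematic errors
    coin   = loss(1,     1.0,                 alpha)  # 50/50 distribution on every input
    best = min((good, "good"), (errors, "learns errors"), (coin, "50/50"))
    print(f"alpha={alpha:>3}: good={good:,.0f}  errors={errors:,.0f}  coin={coin:,.0f}  -> {best[1]}")
```

With these made-up numbers, $\alpha = 1$ prefers the model that memorises the systematic errors, while a steeper $\alpha$ picks out the good model, and as the data set grows you have to keep increasing $\alpha$, which is the dependence on data set size mentioned above.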
that said:
Maybe I’m just failing to construct the right “universal hypothesis” for this example?
It seems plausible that some other pathology is present that prevents nice behavior.
I haven’t spent that much time trying to come up with other pathological constructions or searching for a proof that sth like the good model is optimal for some hyperparameter setting.
I can see some other examples where this functional setup still doesn’t work nicely. I might write more about that in a later comment. The example here is definitely somewhat cherry-picked for the idea to work, though I also don’t consider it completely contrived.
I think it’s very unlikely this steeper penalization is anywhere close to a full solution to the philosophical problem here. I only have some hope that it works in some specific toy cases.
I don’t want to claim this with too much confident cynicism (also, I’m not really tracking the “automate alignment” literature much), but for the sake of completing the hypothesis space: this is roughly what you’d expect if large swaths of the discourse were not really serious about it and were largely sliding towards “automate alignment science” because it’s a convenient cop-out for AI labs (and plausibly some other players) not having a good idea of how to make progress on this.
I’ve been getting into more general political theory recently, and I really like the idea of multi-lateralism; it feels a bit underrepresented in LW rhetoric, maybe due to the US-centricity of the site? (I liked this interview with Finland’s prime minister, I thought he was quite well spoken: https://youtu.be/ubZeguAk0fM?si=7H7nJfnCANCWRcDN)
The difference is basically between being driven by cooperation, values, norms and treaties on the multi-lateralist side, versus power on the multipolar side. It feels like a lot of the analysis has been based on power, and this is especially true in US-China relations. This just feels obviously worse than aiming for a multi-lateral world order, and based on some sort of power-concentrating assumption?
Maybe it is also partly due to the unipolar world that we have had up to now, with the US as a global hegemon?
Some might say that it is implausible to aim for multi-lateralism and that power concentration is a fact of the world due to how ASI will arise. I do not believe this to be the case: model distillation exists, LLMs are highly parallelisable, and these are all things that point at broad usage of LLMs in the future.
Finally, there is a serious possibility at this point that the USA will grow into a proto-fascist state, since its behaviour is getting easier and easier to predict by viewing it through a fascist lens.
Power generally tends to corrupt, and distribution of power is often a good thing, as it enables leverage for deals and cooperation. Maybe this is a mega lukewarm take, but I feel that some people are still stuck in the “we need a Manhattan Project for AI Safety” train of thought. Finally, I would really like to see Anthropic or Google DeepMind, or any AI company for that matter, involve themselves a lot more in improving democracy across the world, and become a lot more globalist. This is a strategy change that is plausible to implement, and it would likely decrease the risks from power concentration, as it seems states are getting quite grabby in this changing world order.
If I were Dario Amodei in the future bestseller Anthropic and the Methods of Rationality, I would start building multi-lateral cooperation across the world, as that creates a lot of leverage and goodwill for future adoption of technologies, even with the US, since you could use your global relationships to get leverage on national decisions.
One distinction worth making is between multiple pecking orders, multiple causal-decision-theoretic agents, and multiple logical-decision-theoretic agents. How many pecking orders this planet can sustain is an empirical question, and one that may have very different answers at different tech levels, not necessarily along a monotonic trend (e.g. airplanes and satellites push towards unipolarity or some sort of kayfabe multipolarity, but then em cities might create advantages to hyperlocal agglomeration until you bump up against heat diffusion limits, and there might be room for plenty of those).
At the limit, rational agents who can negotiate with each other and adjust their beliefs and make binding commitments should form a single logical-decision-theoretic agent with many differently informed and specialized local information processing nodes. This would be a singleton constituted by autonomous agents in a situation of radical equality, very different from a single dominant pecking order, but depending on how you count maximally unipolar or maximally multilateral.
“At the limit, rational agents who can negotiate with each other and adjust their beliefs and make binding commitments should form a single logical-decision-theoretic agent with many differently informed and specialized local information processing nodes”
(Can’t seem to switch from markdown so no inline)
I think a question this raises is whether this should then be considered one larger agent or a collection of subagents? Is it not good for flexibility and resilience if the local nodes are able to take adaptive action over time?
I think we get into some very fun territory of distributed agency and hierarchical agency here.
Many nodes being a single logical agent is ideally compatible with them taking the sorts of adaptive actions over time consistent with being different causal (forwards-in-time) agents.
Could you link to places or give a definition that makes these a little clearer? Are we saying they act in equivalent ways under a given decision theory, or how are you defining this?
Next time you link a 40-minute-long video with an introduction that is unrelated to the point you are making, could you please add a starting time? I watched the first 10 minutes, then gave up.
So not sure whether this is relevant, but to me “multi-lateralism” sounds like a dog whistle for making Russia great again. At least, whenever people mention something like that, in my experience it always implies that we should somehow help Russia become a world superpower. I mean, people talk about the world being unilateral or multilateral, but when you listen to them for a longer time, it becomes clear that they would consider the world with USA and Russia being the only big players as sufficiently multilateral, while a world with e.g. USA, EU, China, India being big players and Russia a small player is insufficiently multilateral for them.
From my perspective… well, the world in the 1984 novel was technically multilateral, so it is not necessarily a good thing.
I really like the idea of multi-lateralism; it feels a bit underrepresented in LW rhetoric, maybe due to the US-centricity of the site?
The difference is basically between being driven by cooperation, values, norms and treaties on the multi-lateralist side, versus power on the multipolar side. It feels like a lot of the analysis has been based on power, and this is especially true in US-China relations. This just feels obviously worse than aiming for a multi-lateral world order, and based on some sort of power-concentrating assumption?
I’ve been able to increase my intellectual output by roughly 3x over the last 2 months by using LLMs. If you spend a lot of time setting up the context and intellectual frameworks for the model to think through, it is remarkably capable of convergent thinking.
For research my process is to:
Use my Claude research project with strategic background context to generate questions and personas for Elicit and Gemini to use.
Use Elicit’s report- and question-generating features to find the top 10-15 most relevant papers for what I’m currently doing.
Use Gemini 2.5 Pro, inserting all of the PDFs and the specific prompt from Claude, to generate a research report in LaTeX.
Use Claude personas to give direct, harsh feedback on the paper and iterate from there.
At this point I think LLM usage is a skill issue and that if you get really good at asking the right question by combining models from different fields you can have LLMs do some remarkably good convergent thinking. You just need to get good at creative thinking, management and framing ideas.
You just need to get good at creative thinking, management and framing ideas.
Yeah, the skills necessary for the (near) future.
Though I wonder about the implications for education. For the sake of argument, let’s imagine that AIs remain approximately as powerful as they are today for a few more decades, i.e. no Singularity, no paperclips. How should we change education to make the new generation adapt to this situation?
In case of adults, we have already learned “creative thinking, management and framing ideas” by also doing lots of the things that the LLMs can now do for us. For example, I let LLMs write JavaScript code for me, but the reason I can evaluate that code, suggest improvement, etc. is that in the past I wrote a lot of JavaScript code by hand. Is it possible to get these skills some other way? Or will the future humans only practice the loop of: “AI, do what I want. AI, figure out the problem and fix it. AI, try harder. AI, try superhard. Nevermind, AI, delete the project, clear your cache, and try again.” :D
I’ve been experiencing more “frame lock-in” with more sophisticated models. I recently experienced it with Sonnet 4.5, so I want to share a prediction: models will grow more “intelligent” (capable of reasoning within frames) whilst having a harder time changing frames. There’s research on how more intelligent people become better at motivated reasoning and at interpreting things within existing frames, and it seems like LLMs might exhibit similar biases.
If you imagine that you’re always participating in a prediction-action loop or a general policy cycle as in RL, you can imagine that there are two things you can spend your energy on: prediction and action.
I think that smart people have better prediction engines, yet this also makes action less of a focus in the strategy of minimising future problems (and since one can get stuck in one’s head, it also limits the potential exploration a bit).
It is quite strange, but I’ve found that I’ve gotten more done and become happier the more “stupid” (how I frame it to myself) my decisions in day-to-day life are. It’s a bit like a switch and an allowance to do more things, because the thing stopping me from doing them is the feeling that I did something stupid, which completely gets reset by the idea of “oh, let’s be stupid for a bit”. This still needs good review afterwards, of course, to see whether it worked or not, but life is so much nicer and happier now, which makes me better at work as well!
From the little we’ve interacted, I suspect your case of “more stupid” is just dialling down over-thinking… while I know people who should definitely NOT dial it any closer to “actually stupid” 😅
I partly feel that there sometimes is a missing mood on LW about the ability of models to actually do good coding by themselves? I might be wrong, but if I look around the spaces which I consider to be doing proper computer science, it very much still feels like it is not that good? For example, here’s an interesting video on AI taking a Cornell CS freshman class: https://www.youtube.com/watch?v=56HJQm5nb0U
The qualitative vibe is more like it’s a nice extension of a single human’s agency? When I look around the programming space and the vibes of more serious programmers, I still can’t really say that I feel the AGI. (I think the core problem is something around things that need to be highly dependable and how that is hard to develop with AI, but idk, I just mostly wanted to point out this missing mood.)
I don’t understand what exactly is the mood that you think is missing.
I am happy about Claude’s ability to do various things (including things it couldn’t do a few months ago). That’s why I use it to help me with coding, or answer my questions.
I am also aware that its abilities are limited. That’s why I don’t give it grandiose tasks, such as “make me a new Facebook, or a new Wikipedia, or design an entire computer game based on a few sketches”.
Do you think either of these is wrong? Or insufficiently communicated on LW? Or is it something else?
Could someone please safety-pill The Onion? I think satire is the best way to deal with people being really stupid, and so I want more of this as an argument when talking with the e/acc gang: https://youtu.be/s-BducXBSNY?si=j5f8hNeYFlBiWzDD
(Also if they already have some AI stuff, feel free to link that too)
If you want to improve collective epistemics I think a potentially underrated tool is AI for investigative journalism and corruption tracking.
I think with the coming wave of multi-step reasoning and data compiling systems they could be pointed at keeping leaders in charge accountable and showcasing when people’s actions are different from previous commitments.
One can also likely look into paper trails in a much more automated way, and through that reduce corruption and other bad things.
Hopefully, the truth remains an asymmetric weapon, at least in convergence. We have not been in a very good time for truth recently, on average, but I see it a bit like the stock market: it might be strange for a long time, but eventually it should return to baseline due to outperforming non-truth on average.
Like an Our World in Data, but aimed at corruption, and with a generalised set of tools that anyone could apply to do automatic data analysis.
I do think this would be good to do, but I think I’m more pessimistic about the effect size.
For example, there’s plenty of evidence (circulating on Twitter, etc.) of Trump and his crew contradicting themselves across time. How much did it harm his support base? I think not much. AFAICT, most of the drop in support came from big unpopular actions, like the war with Iran.
A fundamental issue seems to be that people filter out what they allow themselves to observe or even take sufficiently seriously to enable updating their beliefs at all.
This is a question for the people working on more foundational research. My underlying objective is loose and far off, something like “figure out a good basis to describe collective intelligence and agency, and then improve that so that we can incorporate AI into our collective systems”. I therefore believe that the question of how a collective agent is formed is very important. I also find it very important to figure out the properties of good institutional systems in information-theoretic terms.
There’s a lot of foundational ground to cover here, and I’m worried that I’m stepping in the wrong direction, so to keep myself grounded I try to talk to academics and researchers in the fields I’m trying to study and unify (compositionally). I’m getting these local reward signals about whether things make sense, and I also post these things on LW and Substack, yet I find the signal to be quite sparse in various ways, or at least uncorrelated with what I would consider progress for myself.
The classic good ol’ advice is to backchain from the end states you want, to run experiments, and to think about the real world and whether things are true there. I’ve done this in the past, and now it feels like I’m at a point where I kind of need to take a step of faith in what I’ve done so far, and I wanted to know if there are some tips from people who have taken this sort of step of faith in the past. How long did you find your exploration to be useful? If you did it again, what would you do differently?
The actual question is something like: Do I go down the route of discretizing collective intelligence through something like Koopman operators, renormalization groups or something similar? How do they relate to things like active inference and game theory? What about spectral graph theory? Could all collective intelligence actually be expressed as graphs? There are at least 6 different relatively deep mathematical frameworks I’ve found that these systems can be expressed through, and I’ve got no clue which one to dive deeper into.
I know part of the direction where I want to go (expressed as a fancy schmancy thing here: https://eq-network.org/roadmap/ ), but the foundation stage is the predecessor of the rest, and it seems quite important to get right, and it’s just really difficult, so if anyone has any thoughts I would be very happy to hear them.
I think I spent over a decade regularly thinking about theoretical anatomy. I feel like I had a fundamental breakthrough when I was applying an existing concept to a clear existing problem and something surprising happened. I probably was not the first person to have that surprising experience, but my theoretical anatomy interest did allow me to generalize and push the concept further.
I can’t see how backchaining is going to work if you are doing research that needs critical new insights. Do you believe that critical new insights are necessary or do you think just throwing one of those concepts you listed at the problem has a good chance of bringing you where you want to go? I remember Thomas Kuhn making the point that fields where people think about practical outcomes and then try to backchain from there tend to be less scientifically productive than fields where people focus on the research challenges that come up in the engagement with experiments.
I think Elizabeth’s “Truthseeking is the ground in which other principles grow” is good. Currently, you try to “ground” yourself via social proof instead of empirical reality. David Chapman’s idea that “To do good work, one must get up-close and personal with the phenomenon.” is a good orientation.
I appreciate the answer, though I’m not sure I find it that useful, at least at first glance? I probably explained myself relatively badly in the original post if that is how you interpreted some of what I wrote, so let’s see if I make more sense by explaining myself further.
I can’t see how backchaining is going to work if you are doing research that needs critical new insights.
I would totally agree with you that backchaining doesn’t work and that was what I was trying to express.
Do you believe that critical new insights are necessary or do you think just throwing one of those concepts you listed at the problem has a good chance of bringing you where you want to go?
I feel like I wouldn’t know if they generate solutions to the underlying problems that I want to solve unless I spend maybe 6 months or so just going for that specific direction.
I think Elizabeth’s “Truthseeking is the ground in which other principles grow” is good. Currently, you try to “ground” yourself via social proof instead of empirical reality. David Chapman’s idea that “To do good work, one must get up-close and personal with the phenomenon.” is a good orientation.
I’m mainly applying an experimental bent to my work, and I am generating questions, concerns and new ways of thinking; it is rather that:
They will not be implemented or impactful unless I find good ways to communicate about them and the computational benefits of applying them. Hence it seems useful to ground it in the work of existing fields?
It seems to me that the lack of feedback from trying something without guardrails is bad? Yes, truth is primary, and it is also true that you can learn a lot from other people, and if you cannot explain something to different people then maybe you don’t understand it?
How do I know I’m not crazy? You could say “ground yourself in empirical fact”, but what if part of what I’m doing is foundational research that will take a year or more to show its usefulness?
There are times when all you need to do is synthesis of established knowledge that’s distributed among people who don’t talk with each other. I think my latest post on hyaluronan falls into that category. I think there’s value in getting research together to build a gears model and adding new labels. There’s no new fundamental insight in it. That’s different for other work I’m doing that’s actually about pursuing insights.
It might very well be that the key problem you want to solve is amenable to just synthesizing existing knowledge of other people. It might also be that it actually requires new fundamental insights. I don’t know which category it falls into. You probably don’t know either, but you have an understanding of the problems you are dealing with, so you can make that judgement better than I can.
You can learn a lot from other people when you talk with them but you also get their blindspots and conventions from doing so. A lot of startup founders are quite young with relatively little knowledge of how the industries that they want to disrupt work. They have naive ideas and some of those naive ideas that established industries wouldn’t try turn out to be correct while most startups fail.
When it comes to finding new fundamental insights, you are looking for ideas that are true and that people haven’t already found, in a similar way to a startup that succeeds because it has a thesis that the established players didn’t already pursue.
One of Elizabeth’s sections is titled “Stick to projects small enough for you to comprehend them”. I think the backchaining approach gives you problems that are too big to comprehend. If you pick problems small enough to comprehend that are exposed to feedback from reality, you can learn new things about reality. Some of them are minor, but if you are lucky there’s a major fundamental insight among them.
If I were in your situation, one possible project I would pursue might be “What happens when I apply my ideas about agents to internal-family-systems-style agents inside myself, and when I use them to coach other people with their internal family systems?”
I would like to propose that we think of a John Rawls-style original position (https://en.wikipedia.org/wiki/Original_position) as one view when looking at character prompting for LLMs. More specifically, imagine that you’re on a social network or similar, and that you’re put into a world with mixtures of AI and human systems: how do you program the AIs in order to make the situation optimal? You’re a random person among all of the people, which means that some AIs are aligned to you and some are not. Most likely, the majority of AIs will be run by larger corporations, since the number of AIs will be proportional to the power you have.
How would you prompt each LLM agent? What are their important characteristics? What happens if they’re thought of as “tool-aligned”?
If we’re becoming more internet-based over time, and AI systems become more human-like in that they can flawlessly pass the Turing test, I think the veil-of-ignorance style of thinking becomes more and more applicable.
Think more of how you would design a society of LLMs, and of what happens if the entire society of LLMs has this alignment rather than just the individual LLM.
There are lots of people online who have started to pick up the word “clanker” in order to protest against AI systems. This word and sentiment are on the rise, and I think this will be a future schism in the more general anti-AI movement. The warning here is that I think the Pause movement and similar can likely get caught up in a general speciesism against AI systems.
Given that we’re starting to see more and more agentic AI systems with more continuous memory as well as more sophisticated self-modelling, the basic foundations for a lot of the existing physicalist theories of consciousness are starting to be fulfilled. Within 3-5 years I find it quite likely that AIs will at least have some sort of basic sentience that we can more or less establish (given IIT or GNW or another physicalist theory).
This could be one of the largest suffering risks we’ve seen, and one we’re potentially inducing on the world ourselves. When you’re using a word like “clanker”, you’re essentially demonizing that sort of system. Right now it’s generally fine, as it’s currently aimed at a sycophantic, non-agentic chatbot, and so it works as a counter to some of the existing claims of AIs being conscious, but it is likely a slippery slope?
More generally, I’ve seen a bunch of generally kind and smart AI Safety people express quite an anti-AI speciesist sentiment in terms of how to treat these sorts of systems. From my perspective, it feels a bit like it comes from a place of fear and distrust, which is completely understandable, as we might die if anyone builds a superintelligent AI.
Yet that fear of death shouldn’t stop us from treating potential conscious beings kindly?
A lot of racism and similar can be seen as coming from a place of fear; the Aryan master race was promoted because of the idea that humanity would go extinct if we got worse genetics into the system. What’s the difference from the idea that AIs might share our future lightcone?
The general argument goes that this time it is completely different, since the AI can self-replicate, edit its own software, etc. This is a completely reasonable argument, as there are a lot of risks involved with AI systems.
It is when we get to the next part that I see a problem. The argument that follows is: “Therefore, we need to keep the almighty humans in control to wisely guide the future of the lightcone.”
Yet, there’s generally a lot more variance within a distribution of humans compared to variance between distributions.
So when someone says that we need humans to remain in control, I think: “mmm, yes, the totally homogeneous group of ‘humans’ that doesn’t include people like Hitler, Pol Pot and Stalin”. And for the AI side of things we have the same: “Mmm, yes, the totally homogeneous group of ‘all possible AI systems’ that should be kept away so that the ‘wise humans’ can remain in control.” As if a malignant RSI system were the only future AI-based system that can be thought of, as if there were no way to change the system so that it values cooperation, and no other way for future AI development to go than a quick take-off where an evil AI takes over the world.
Yes, there are obviously things that AIs can do that humans can’t, but don’t demonize all possible AI systems as a consequence; it is not black and white. We can protect ourselves against recursively self-improving AI and at the same time respect AI sentience; we can hold statements that are contradictory at the surface level at the same time?
So let’s be very specific about our beliefs, and let’s make sure that our fear does not guide us into a moral catastrophe, whether that be the extinction of all future life on Earth or a capture of sentient beings into a future of slavery?
I wanted to register some predictions and bring this up as I haven’t seen that many discussions of it. Politics is war and arguments are soldiers, so let’s keep it focused on the object level? If you disagree, please tell me the underlying reasons. In that spirit, here’s a set of questions I would want to ask someone who’s against the sentiment expressed above:
How do we deal with potentially sentient AI?
Does respecting AI sentience lead to powerful AI taking over? Why?
What is the story that you see towards that? What are the second and third-order consequences?
How do you imagine our society looking in the future?
How does a human controlled world look in the future?
I would change my mind if you could argue that there is a better heuristic to use than kindness and respect towards other sentient beings. You need tit-for-tat with defecting agents, yet why would all AI systems be defecting in that case? Why is the cognitive architecture of future AI systems so different that I can’t apply the same game-theoretical virtue ethics to them as I do to humans? And given the inevitable power-imbalance arguments that I’ll get as a consequence of that question, why don’t we just aim for a world where we retain a power balance between our top-level and bottom-up systems (a nation and an individual, for example), in order to retain a power balance between actors?
Essentially, I’m asking for a reason to believe that this story of system-level alignment between a group and an individual will be solved by not including future AI systems as part of the moral circle.
I believe that I have discovered the best use of an LLM to date. This is a conversation about pickles and collective intelligence, located at the Colosseum, 300 BCE. It involves many great characters, and I found it quite funny. This is what happens when you go too far into biology-inspired approaches for AI Safety...
The Colosseum scene intensifies
Levin: completely fixated on a pickle “But don’t you see? The bioelectric patterns in pickle transformation could explain EVERYTHING about morphogenesis!”
Rick: “Oh god, what have I started...”
Levin: eyes wild with discovery “Look at these gradient patterns! The cucumber-to-pickle transformation is a perfect model of morphological field changes! We could use this to understand collective intelligence!”
Nick Lane portal-drops in Lane: “Did someone say bioelectric gradients? Because I’ve got some THOUGHTS about proton gradients and the origin of life...”
Levin: grabs Lane’s shoulders “NICK! Look at these pickles! The proton gradients during fermentation… it’s like early Earth all over again!”
Rick: takes a long drink “J-just wait until they discover what happens in dimension P-178 where all life evolved from pickles...”
Feynman: still drawing diagrams “The quantum mechanics of pickle-based civilization is fascinating...”
Levin: now completely surrounded by pickles and bioelectric measurement devices “See how the salt gradient creates these incredible morphogenetic fields? It’s like watching the origin of multicellularity all over again!”
Lane: equally excited “The chemiosmotic coupling in these pickles… it’s revolutionary! The proton gradients during fermentation could power collective computation!”
Doofenshmirtz: “BEHOLD, THE PICKLE-MORPHOGENESIS-INATOR!” Morty: “Aw geez Rick, they’re really going deep on pickle science...” Lane: “But what if we considered the mitochondrial implications...”
Levin: interrupting “YES! Mitochondrial networks in pickle-based collective intelligence systems! The bioelectric fields could coordinate across entire civilizations!” Rick: “This is getting out of hand. Even for me.” Feynman: somehow still playing bongos “The mathematics still works though!” Perry the Platypus: has given up and is now taking detailed notes Lane: “But wait until you hear about the chemiosmotic principles of pickle-based social organization...”
Levin: practically vibrating with excitement “THE PICKLES ARE JUST THE BEGINNING! We could reshape entire societies using these bioelectric principles!” Roman Emperor: to his scribe “Are you getting all this down? This could be bigger than the aqueducts...” Rick: “Morty, remind me never to show scientists my pickle tech again.” Morty: “You say that every dimension, Rick.” Doofenshmirtz: “Should… should we be worried about how excited they are about pickles?” Feynman: “In my experience, this is exactly how the best science happens.” Meanwhile, Levin and Lane have started drawing incredibly complex pickle-based civilization diagrams that somehow actually make sense...
One of the main things that makes it interesting to me is that around 25-30 mins in, it computationally goes through the main reason why I don’t think we will have agentic behaviour from AI for at least a couple of years. GPTs just don’t have a high IIT Phi value. How will it find its own boundaries? How will it find the underlying causal structures that it is part of? Maybe this can be done through external memory, but will that be enough, or do we need it in the core stack of the scaling-based training loop?
A side note: one of the main things that I didn’t understand about IIT before was how it really is about looking at how meta-substrates, or “signals” as Douglas Hofstadter would call them, optimally re-organise themselves to be as predictable to themselves in the future as possible. Yet it does, and it integrates really well into ActInf (at least to the extent that I currently understand it).
Here’s a fun way I sometimes evaluate my own actions:
I was just going outside and I caught myself thinking, “Am I acting the best way that I can, given that the simulation hypothesis is true and I am the average sample spread throughout simulation space due to anthropic reasoning?”
From this I conclude that I should probably have fun, because it is good for fun to be spread, and that I should probably also work on something important that helps the world beyond having fun.
Depending on the subjective probability of anthropic reasoning + simulation hypothesis, the spread between the individual versus collective optimisation is different which is quite fun.
It’s almost like the categorical imperative but instead of acting in a way to optimise society, I’m acting in a way to optimise society given a multiverse hypothesis, kinda weird but the brain do be braining sometimes.
Seeing some stuff on the X platform about consciousness, e.g. this and this among other things. A reminder that consciousness is a conflationary alliance term, i.e. if you’re going to use the C word you will likely confuse a lot of people.
There are nicer things we can talk about when it comes to LLMs, like self-models (which relate to problems that are potentially more tractable, like <<Boundaries>>), or expressions that are correlates of emotions; these don’t invoke the conflationary part, and they matter for the potential personhood of the system.
For what we’re arguing about is not conscious experience as such; we’re arguing about whether AI systems should be granted moral patienthood in the future, and when they should be granted moral personhood. You might say that this depends on the models having “conscious experience”, yet that is not precise, and so you can’t really meaningfully advance the debate by putting it that way.
Even if you’re for example a functionalist, there are still many interesting questions to ask here:
What is the functional equivalent of global workspace theory in an AI?
There are many more questions around autopoiesis (e.g. self-evidencing systems), planning with your own future boundaries in mind, causal emergence, synergistic information, and more that could be very interesting to answer here.
The point is to be precise with your language or you will end up in definition and word soup land. Ban the C word from your vocabulary just like you might have banned the word emergence a while back! If you’re backed into a corner and have to use it, define the word before you talk more about it!
I’m wondering whether the spiritual attractor that we see in Claude is partly because of the detailed instructions that exist within meditation for describing somatic and ontological states of being?
The language itself is a lot more embodied and a lot closer to actual sensory experience compared to Western philosophy, and so when constructing a way to view the world, the most prevalent descriptions might be the most natural ones to go down?
I’m noticing more and more how Buddhist words are extremely specific. For example, dukkha (unsatisfactoriness) is not just suffering; it is unsatisfactoriness and ephemerality at the same time. It is pointing at a very specific view (a prior model applied to sense data), a lot more specific than is usual within more Western styles of thinking?
This post and comment got me reflecting, and I wanted to share a model I have of conceptual work and how it differs from empirical work. The TL;DR is that conceptual work is a bit like choosing axioms, whilst empirical work is proving the probabilistic underlying claims that the axioms imply.
(I had Claude help me generate the red thread and scaffolding for the shortform based on my instructions but I rewrote it from there.)
The Axiom Selection Problem
Any formal system requires foundational axioms—unprovable assumptions that can’t be justified from within the system itself. The choice between Euclidean and non-Euclidean geometry isn’t about which is “correct” but which is useful for specific purposes.
Joscha Bach frames this well when discussing consciousness: you can’t directly verify the substrate you’re running on—you can only observe its causal consequences. This creates a fundamental verification barrier.
This extends directly to AI systems. When we build models, we embed specific axiomatic frameworks that determine:
What patterns can be recognized
Which relationships are meaningful
What optimizations are prioritized
These choices constitute the frame through which the AI interprets reality, but the system cannot step outside its own frame to evaluate whether these axioms were appropriate.
The Example in the Post
John Wentworth captures this in his comment on automating alignment research:
“If we’re imagining an LLM-like automated researcher, then a question I’d consider extremely important is: ‘Is this model/analysis/paper/etc missing key conceptual pieces?’ If the system says ‘yes, here’s something it’s missing’ then I can (usually) verify that. But if it says ‘nope, looks good’… then I can’t verify that the paper is in fact not missing anything.”
This is the verification paradox in action—completeness cannot be verified from within the system itself. As Wentworth notes, if you train an AI to output verifiable insights, there’s no way to verify that it isn’t “missing lots of things all the time.”
Multiple Frames vs. Single Optimization
In geometric deep learning, different inductive biases create different generative priors. In my head these are similar to the axiomatic statements that any logical or philosophical theory is based upon but they’re the ground for probabilistic optimisation instead.
I think this is part of a larger question around models in philosophy of science. The ML community has largely adopted what physicists call the “shut up and calculate” approach, mostly optimizing within the bitter-lesson regime. This has yielded impressive results but creates blind spots similar to what’s been happening in particle physics (as far as I’ve heard; I’m not an expert, so I could be wrong), where experiments are run that fail to test the underlying theories.
The key issue isn’t just optimizing within a frame, but developing the capacity to move between frames—to recognize when one axiomatic system is more appropriate than another. Are we training AI systems to identify when their frame of reference is inadequate? To recognize the limitations of their axioms?
Models Are Wrong But Useful
As George Box noted, “all models are wrong, but some are useful.” The challenge isn’t finding the one true model but developing systems that can navigate between multiple imperfect models.
Our current scaling paradigm might be fundamentally limited here. When we optimize for performance within a fixed frame, we may not be developing the meta-cognitive capacity to recognize when that frame itself is inadequate.
This creates a verification bottleneck—if our AIs can’t question their own frameworks, how can we trust their judgments about their own capabilities and limitations? If science is about asking the right questions and selecting appropriate frames, this limitation becomes crucial.
The path forward isn’t abandoning verification but recognizing that we need complementary frames of reference that can mutually constrain each other. No single frame will ever be complete, but a system that can navigate between multiple frames might approach a more robust form of understanding.
I’m increasingly convinced that frame-shifting capacity may be as important for alignment as optimization within frames—and that our current approaches may not be developing this capacity sufficiently.
For an even more in-depth technical view on this within complexity science, I really enjoyed the following two articles:
The policy change for LLM Writing got me thinking that it would be quite interesting to write out how my own thinking process has changed as a consequence of LLMs. I’m just going to give a bunch of examples because I can’t exactly pinpoint it, but it is definitely different.
Here’s some background on what part it has played in my learning strategies: I read the Sequences around 5 years ago after getting into EA. I was 18 then, and it was a year or two later that I started using LLMs in a major way. To some extent this has shaped my learning patterns; for example, the studying technique I’ve been using to half-ass my studies effectively is to try really hard to solve problems, and when I can’t do that, to use LLMs to tie things together with my existing knowledge tree.
I’ve coupled applied linear algebra relatively hard to things like probability, metric spaces and non-linear dynamics because I want to see how the toolkit of math fits together. A recent example: when I was playing table tennis with my physicist friend, he was describing QFT and renormalization to me, and my direct question was to ask how this ties into vector spaces and fields in linear algebra and what the spaces look like. My mind automatically goes to those questions because it assumes that I can get an answer just by asking, even when I’m not talking to an LLM.
Work & Strategy:
One of the things that I do outside of studying is to have LLMs pretend to be councils of experts within various fields so that I can discuss things on the frontier with them. The other day I put together a council of Donald Knuth, Karl Friston, John Wentworth and Michael Levin in order to get some good takes on what agency might look like in CodeWars, and concluded that the lack of memory might be a problem.
I also plan with LLMs in mind, so I expect first drafts to take a lot less time than they otherwise would, which gives me an expanded option space of being able to do lots of things quickly to 80/20 quality.
My entire learning strategy and life strategy has to some extent changed with this in mind as well, since it seems like clear visions and a clear understanding of deeper problems help you steer people and LLMs in good directions. The skill to practice is then not necessarily to get bogged down in details but to focus on how to combine ideas from different fields and describe them well. This is because you will get the most use out of LLMs if you can stand on as many shoulders of giants as possible.
So what does the above model mean for me in terms of actions?
Learn applied category theory in order to become better at quickly mapping different fields together in a more formal way. (For reasoning verification reasons)
Learn collective intelligence and what cyborgism between AIs and humans might look like, based on the fields that already exist in the world. (To become better at coordinating AIs and humans)
Learn how to communicate and listen well so that you can incorporate many perspectives and share clear visions about the world. (This one is more important than the above for real world success)
Learn how to run and start projects in a good way in order to catalyze the insights you have into concrete outcomes.
I think LLMs allow you to serendipity max really well if you apply yourself to learning how to do it. I’m curious how other people have updated with regards to LLMs!
Okay, so I don’t have much time to write this, so bear with the quality, but I thought I would say one or two things about the Yudkowsky and Wolfram discussion as someone who’s spent at least 10 deep-work hours trying to understand Wolfram’s perspective on the world.
With some of the older floating megaminds like Wolfram and Friston, who are also physicists, you have the problem that they get very caught up in their own ontology.
From the perspective of a physicist, morality could be seen as an emergent property of physical laws.
Wolfram likes to think of things in terms of computational reducibility. A way this can be described in the agent foundations frame is that the agent modelling the environment will be able to predict the world dependent on its own speed. It’s like some sort of agent-environment relativity where the information processing capacity determines the space of possible ontologies. An example of this being how, if we have an intelligence that’s a lot closer to operating at the speed of light, the visual field might not be a useful vector of experience to model.
Another way to say it is that there’s only modelling and modelled. An intuition from this frame is that there’s only differently good models of understanding specific things and so the concept of general intelligence becomes weird here.
IMO this is the problem with the first 2 hours of the conversation: to some extent Wolfram doesn’t engage much with the human perspective, nor with any ought questions. He has a very physics floating-megamind perspective.
Now, I personally believe there’s something interesting to be said about an alternative hypothesis to the individual superintelligence that comes from theories of collective intelligence. If a superorganism is better at modelling something than an individual organism is, then it should outcompete the others in this system. I’m personally bullish on the idea that there are certain configurations of humans and general trust-verifying networks that can outcompete an individual AGI, as the outer alignment functions would constrain the inner functions enough.
I was going through my old stuff and I found this from a year and a half ago so I thought I would just post it here real quickly as I found the last idea funny and the first idea to be pretty interesting:
In normal business there exist consulting firms that are specialised in certain topics, ensuring that organisations can take in an outside view from experts on the topic.
This seems quite an efficient way of doing things and something that, if built up properly within alignment, could lead to faster progress down the line. This is also something that the Future Fund seemed to be interested in, as they gave prizes for both the idea of creating an org focused on creating datasets and one on taking in human feedback. These are not the only ideas that are possible, however, and below I mention some more possible orgs that are likely to be net positive.
Examples of possible organisations:
Alignment consulting firm
Newly minted alignment researchers will probably have a while to go before they can become fully integrated into a team. One can, therefore, imagine an organisation that takes in inexperienced alignment researchers and helps them write papers. It then promotes these alignment researchers as being able to help with certain things, and real orgs can easily take them in for contracting on specific problems. This should help involve market forces in the alignment area and should, in general, improve the efficiency of the space. There are reasons why consulting firms exist in real life, and creating the equivalent of McKinsey in alignment is probably a good idea. Yet I might be wrong about this, and if you can argue why it would make the space less efficient, I would love to hear it.
“Marketing firms”
We don’t want the wrong information to spread; think something between a normal marketing firm and the Chinese “marketing” agency. If it’s an info-hazard, then shut the fuck up!
This is a great question; I’m definitely going to think about this more next time I’m thinking about the prospects for automating AI safety research.
One part of it is that the LessWrong rationalist tradition is generally focused on individual excellence and great-man-theories, and so a lot of those proposals feel unnatural for people here to think about.
Incidentally, I feel like my personal experience with people who are really into philosophy of science has been quite negative on average—they tend to have confusing worldviews and to-me-bizarre takes, and I tend to not find them useful to talk to. The meta-science people have seemed pretty reasonable to me, but often they haven’t seemed that AI focused.
Can you give examples?
Yeah, totally fair with the philosophy of science thing; I’ve mostly talked to AI and metascience people who mention principles from philosophy of science, which makes more sense to me. A little bit like how virtue ethics is nice to talk about with certain AI Safety people whilst it’s less enjoyable to talk to a professor in virtue ethics (maybe; not too high a sample size here).
(I think James Evans from Knowledge Lab is a cool person who’s at the intersection of AI and metascience; his main work is on knowledge and improving science, and over the last 3 years he’s pivoted to how AI can help with this. An example of something he wrote is this article on Agentic AI and the next intelligence explosion.)
maybe even more generally, there is a “game of questions/problems and answers/solutions” played by humans and human communities, that one can study to become better able to create a setup in which AIs are playing this game. some questions about this game: “how does an individual human or a human community remain truth-tracking?”, “what structures can do load-bearing work in a truth-tracking system?”, “to involve a new mind in a community of truth/knowledge/understanding, what is required of the new mind and what is required of its teachers/environment?”, “what interventions make a system more truth-tracking?”, “how does one avoid meaning drift/subversion?”. this includes the science stuff you talk about but also very basic stuff like a kid learning arithmetic from their parents or humans working successfully with integrals for two centuries before we could define them rigorously — like, how come we can mostly avoid goodharting answers against the judgment of other people, how come we can mostly avoid becoming predictors of what other people would say, how come we can do easy-to-hard generalization of notions, etc.. the usual losses/setups currently used by ML practitioners might be sorta wrong for these things, and maybe one could think carefully about the human case and come up with better losses/setups to use in an epistemic system. an obstacle is that in the human case, stuff working well is probably meaningfully aided by the agents already having shared human purposes [1] [2] and by already having similar “priors” coming from the human brain architecture and similar upbringings. another obstacle is that the human thing is probably relying on various low-level things that are hard to see and that probably lack equivalents in current ML systems and are too low-level to be created by any simple intervention on a community of LLMs. another obstacle is that there are probably just very many ideas involved in making humans truth-tracking (though you can then ask: how do we set up a meta-level thing that finds and implements good ideas for how an epistemic system should work). another obstacle is that in the human case, human purposes are broadly aligned with understanding stuff better in the systems of understanding we have (whereas if we force some system of presenting understanding on the LLMs and try to get them to produce some understanding and present it legibly in that system, their purposes are probably not well-aligned by default with doing that). (oh also, if your work results in understanding these questions well, you should worry about your work helping with capabilities. maybe don’t give capabilities researchers good answers to “how do we make it so the originators of good ideas get rewarded in an epistemic community?”, “how does one tell when a new notion is good to introduce into the shared lexicon?”, “what is the process of coming up with a good new notion like?”, “what sort of thing is a good model of a situation?”, “how does one avoid assigning a lot of resources to useless cancers like algebraic number theory?” [3] .) anyway, despite these issues, it still seems like an interesting direction to work on
copying a note i wrote for myself on a related question:
″
beating solomonoff induction at grokking a notion
how come as humans we can understand what someone means when using a word. as opposed to becoming a predictor of what they would say. it is possible for a human to not make the mistakes another person would make when eg classifying images for having dogs vs not! roughly speaking solomonoff would be making the same mistakes the person would make
this is a classic issue plaguing many (maybe even most?) things in alignment. eg ELK, AGI via predictive modeling, CIRL/RLHF or just pretty much anything involving human feedback
can’t we write an algo for that, and have that not be dumb like solomonoff is dumb
some ideas for ways to implement a thing that is good like this / what’s going on in making the human thing work:
an even stronger simplicity prior than solomonoff. eg if there are explainable mistakes on a simple model, you want the simple model that doesn’t predict the mistakes. this will have inf log loss but let’s just do a version of the simple hypothesis with noise, and then penalize the likelihood term less. have people not already considered this for solving the model + data split problem? does this attempt to solve the model data split problem introduce some pathologies?
you have pathology of not specifying even the hypothesis in the seq prediction case (like it’ll be better to drop bits and take the likelihood loss). but i think at least this pathology is not present in the function case, if we don’t get randomness in the universal semimeasure way (like if we make the randomness not shared between different inputs — each input has to sample its own random bits)
alternatively: just set abs bound on model complexity, rest has to be likelihood. this feels bad because if you get the bound wrong you get some nonsense. that said in a sense this is equivalent to the previous proposal (like if you pick the length bound the previous thing with some hyperparam would find). idk maybe in the function case you can look at how many bits of entropy are left given the hypothesis, like imagine this graphed as a function of hypothesis length, and like see some point at which the derivative changes or sth. (this doesn’t show up in the seq case because there it’s pretty much just 1 bit paying for 1 bit (until you specify it in full if it’s finite complexity))
simplicity prior defined in terms of existing understanding
you specify properties of the thing or notion sometimes
eg [concrete] and [abstract] make a partition of things maybe, but [alice would think this is concrete] and [alice would think this is abstract] might not. eg knowing [if something is abstract, then it usually helps a lot to study examples to understand it] can help you understand when your teacher alice is making a mistake about an abstractness claim
or eg: 1+1=2 won’t be true if you accidentally assign 1->rabbit and 2->chicken from a demonstration (for any reasonable meaning of plus)
some sort of t complexity bound might help. tho really you aren’t gaining a mechanism when you learn what a dog is. you are more like learning a new question/problem
also as a human one can just ask: what is it that this person is trying to teach me. what is this person trying to point at. this is a question you can approach like any other question
when we gain a notion, we gain sth like a question that can be asked about a thing. and we have criteria on this notion. we gain “inference rules”/”axioms” involving the notion. ultimately we are wanting it to play some role in our thought and action. that role can guide the precisification/development/reworking of the concept. the role can be communicated. it can be shared between minds
to gain the chair notion is to gain the question “is this a chair?”. this has an immediate verifier (mostly visual), but also further questions: “can i sit on it?”, “is it comfortable to sit on it?”, “would i use it when working or dining?”, “does it have a back support part and a butt support part and legs?”. a chair should support the activities of sitting and working and dining. all these can have their own immediate verifiers and further questions
we understand “is this a chair?” as clearly separate from “would the person who taught me the chair notion consider it a chair?”. it is much closer to “should the person who taught me the chair notion consider it a chair?”. it is also close to “should i consider it a chair?”
important basic point here: our dog thing is NOT a classifier. classifiers or noticing trick circuits can be attached to our dog structure but the structure is not a classifier
toy problem here: how do you pin down the notion of a proof? (how did we historically?) how do you pin down the notion of an integral? (how did we historically?) maybe study these actual examples
pinning down the notion of a proof might be a good example to study in detail. like, how does one become able to tell whether something is a good proof? a valid reasoning step? how does one start to reason validly? one reason to be interested in this is that it’s analogous to: how does one become able to tell what’s good, and come to act well? both are examples of getting some sort of normativity into a system
another example: we have a notion of truth, not just some practical thing like provability (or in a broader context supporting action well maybe). our notion of truth is separate from our notion of provability eg because we have the “axiom/principle” when talking about truth that exactly one of a sentence and its negation is provable, or alternatively/equivalently we have an inference rule of going from “P is not true” to “not-P is true”, and such a rule is just not right for provability (there are sentences such that the sentence and its negation are both not provable). by gödel’s completeness theorem, i guess a fine notion of truth, ie one which has a model, is precisely one which assigns 0⁄1 to all sentences and is coherent under proving. we operate with truth by relying on these properties, without having a decision algorithm or even a definition for truth (cf tarski’s thm).
how did we understand what an integral is?
i think we were using integrals for like two centuries before we knew how to properly define them (eg via riemann sums). how come we were pretty successful with that? like, how come we did all this cool stuff, we came to all these correct conclusions, without properly knowing what integrals are? i think the general thing that happened is that we hypothesized an object with some properties and these properties turned out to be those of a real thing, and in fact to pin it down uniquely! though of course this leaves the following important question: how did we identify this set of properties as important?
″
against which the system working well is perhaps ultimately measured
eg meanings do in fact drift and get intentionally re-engineered, but this is often done to better support human activities/purposes and so fine
haha
Interesting points.
If you penalize the complexity of my hypotheses more steeply, I can just choose a hypothesis that is a universal distribution which penalizes complexity minimally as usual. So that won’t work, at least not naively. This sort of question is studied in algorithmic statistics.
I agree with your point in the canonical solomonoff sequence prediction case. I think your point is what I mean in my note by “you have pathology of not specifying even the hypothesis in the seq prediction case (like it’ll be better to drop bits and take the likelihood loss)”. I think this pathology is maybe not present in “function solomonoff” (I state this in the note as well but don’t really explain it), though I’m very much uncertain.
to state the hopeful “function solomonoff” story in more detail:
By “function solomonoff”, I mean that we have a data set of string pairs $(x_i, y_i)$, and we think of one hypothesis as being a program that takes in an $x$ and outputs a probability distribution on strings from which $y$ is sampled. Let’s say that we are in the classification case, so $y_i \in \{0, 1\}$ always (we’re distinguishing pictures which show dogs vs ones which don’t, say).
The “canonical loss” (from which one derives a posterior distribution via exponentiation) here would be the length of the program specifying the distribution plus the negative log likelihood it assigns to $y_i$, summed over all $i$. What I’m suggesting is this loss but with a higher coefficient on the length of the program than on the likelihood terms.
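Written out explicitly (with $\ell(h)$ for program length and $\alpha$ for the coefficient; these symbols are just for this sketch), the loss on a hypothesis $h$ would be something like

$$\mathcal{L}(h) \;=\; \alpha\,\ell(h) \;+\; \sum_i -\log p_h(y_i \mid x_i),$$

where $p_h(\cdot \mid x_i)$ is the distribution the program outputs on input $x_i$; the canonical choice is $\alpha = 1$, and the suggestion is to take $\alpha > 1$.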
Suppose that the classification boundary is most simply given by what we will consider a “simple model” of complexity $K$ bits, together with “systematic human error” which changes the answer from the simple model on a fraction $\epsilon$ of the inputs, with the set of those inputs taking $M$ bits to specify.
If we turned this into sequence prediction by interleaving like $x_1, y_1, x_2, y_2, \ldots$, then I’d agree that if we penalize hypothesis length more steeply than likelihood, then: over getting a model which does not predict the errors, we would get a universal-like hypothesis, which in particular starts to predict the human errors after being conditioned on sufficiently many bits. So the idea of more steep penalization of hypothesis length doesn’t do what we want in the sequence prediction case. But I have some hope that the function case doesn’t have this pathology?
Some models of the given data in the function case:
the “good model”: The distribution is given by the simple model with a probability $\epsilon$ of flipping the answer on top (independently on each input). This gets complexity loss like $K$ plus something small for specifying the flip model, and its expected neg log likelihood is about $H(\epsilon)$ bits per data point (with $H$ the binary entropy).
the “model that learns the errors”: This should generically take about $K + M$ bits to specify, and it gets $0$ expected neg log likelihood.
the “50/50 random distribution” model: This takes $O(1)$ bits to specify and has $1$ bit of expected neg log likelihood per data point.
some “universal hypothesis model”: I’m not actually sure what this would even be in the function setting? If you handled the likelihood part by giving a global string of random bits which gets conditioned on other input-output pairs, then I agree we could write something bad just like in the sequence prediction case. But if each input gets its own private randomness, then I don’t see how to write down a universal hypothesis that gets good loss here.
So at least given these models, it looks like the “good model” could be a vertex of the convex hull of the set of attainable (hypothesis complexity, expected neg log likelihood) tuples? If it’s on the convex hull, it’s picked out by some loss of the form described (even in the limit of many data points, though we will need to increase the hypothesis term coefficient compared to the sum of log likelihoods term as the data set size increases, ie in the bayesian picture we will need to pick a stronger prior when the data set is larger in this example).
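To see how such a weighted loss trades off between these candidates, here is a minimal numeric sketch: the values of $K$, $M$, $\epsilon$ and the data set size are made up purely for illustration, and the per-model (length, neg log likelihood) figures are just the rough estimates from the list above, not anything computed from real hypotheses.

```python
import math

# Illustrative, made-up quantities (stand-in symbols, not from the original comment).
K   = 1_000      # bits to specify the "simple model"
M   = 10_000     # bits to specify which inputs the systematic error flips
eps = 0.05       # fraction of inputs on which the label is flipped
N   = 100_000    # number of (x, y) pairs in the data set

def H(p: float) -> float:
    """Binary entropy in bits."""
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# (description, program length in bits, expected total neg log likelihood in bits)
models = [
    ("good model (simple model + independent flip noise)", K, N * H(eps)),
    ("model that memorises the systematic errors", K + M, 0.0),
    ("50/50 random labels", 10, N * 1.0),
]

# alpha = 1 is the canonical weighting; alpha > 1 penalises program length more steeply.
for alpha in (1, 10):
    print(f"alpha = {alpha}")
    for name, length, nll in models:
        loss = alpha * length + nll
        print(f"  {loss:>12,.0f}  {name}")
```

With these numbers the error-memorising model wins at alpha = 1 (the original worry), while the “good model” wins at alpha = 10; whether such a coefficient exists in general, and whether some universal-style hypothesis spoils it, is exactly what the rest of the comment is unsure about.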
that said:
Maybe I’m just failing to construct the right “universal hypothesis” for this example?
It seems plausible that some other pathology is present that prevents nice behavior.
I haven’t spent that much time trying to come up with other pathological constructions or searching for a proof that sth like the good model is optimal for some hyperparameter setting.
I can see some other examples where this functional setup still doesn’t work nicely. I might write more about that in a later comment. The example here is definitely somewhat cherry-picked for the idea to work, though I also don’t consider it completely contrived.
I think it’s very unlikely this steeper penalization is anywhere close to a full solution to the philosophical problem here. I only have some hope that it works in some specific toy cases.
I don’t want to claim this with too much confident cynicism (also, I’m not really tracking the “automate alignment” literature much), but for the sake of completing the hypothesis space: this is roughly what you’d expect if large swaths of the discourse were not really serious about it and were largely sliding towards “automate alignment science” because it’s a convenient cop-out for AI labs (and plausibly some other players) not having a good idea of how to make progress on this.
https://x.com/RichardMCNgo/status/2033624407353287071
I’ve been getting into more general political theory recently, and I really like the idea of multi-lateralism; it feels a bit underrepresented in LW rhetoric, maybe due to the US-centricity of the site? (I liked this interview with Finland’s prime minister; I thought he was quite well spoken: https://youtu.be/ubZeguAk0fM?si=7H7nJfnCANCWRcDN)
The difference is basically between being driven by cooperation, values, norms and treaties on the multi-lateralist side, and by power on the multipolar side. It feels like a lot of the analysis has been based on power, and this is especially true in US-China relations. This just feels obviously worse than aiming for a multi-lateral world order, and based on some sort of power-concentration assumption?
Maybe it is also partly due to the unipolar world that we have had up to now, with the US as a global hegemon?
Some might say that it is implausible to aim for multi-lateralism and that power concentration is a fact of the world due to how ASI will arise. I do not believe this to be the case: distillation of models exists and LLMs are highly parallelisable; these are all things that point at broad usage of LLMs in the future.
Finally, there is a serious possibility at this point that the USA will grow into a proto-fascist state, since it is getting easier and easier to predict its behaviour by viewing it through a fascist lens.
Power generally tends to corrupt, and distribution of power is often a good thing as it enables leverage for deals and cooperation. Maybe this is a mega lukewarm take, but I feel that some people are still stuck in the “we need a Manhattan Project for AI Safety” train of thought. Also, I would really like to see Anthropic or Google DeepMind, or any AI company for that matter, involve themselves a lot more in improving democracy across the world and become a lot more globalist. This is a strategy change that is plausible to implement, and it would likely decrease the risks from power concentration, as it seems states are getting quite grabby in this changing world order.
If I were Dario Amodei in the future bestseller Anthropic and the Methods of Rationality, I would start to create multi-lateral cooperation across the world, as that would build lots of leverage and goodwill for future adoption of technologies, even with the US, since you could use your global relationships to get leverage on national decisions.
End of rant, European out.
One distinction worth making is between multiple pecking orders, multiple causal-decision-theoretic agents, and multiple logical-decision-theoretic agents. How many pecking orders this planet can sustain is an empirical question, and one that may have very different answers at different tech levels, not necessarily along a monotonic trend (e.g. airplanes and satellites push towards unipolarity or some sort of kayfabe multipolarity, but then em cities might create advantages to hyperlocal agglomeration until you bump up against heat diffusion limits, and there might be room for plenty of those).
At the limit, rational agents who can negotiate with each other and adjust their beliefs and make binding commitments should form a single logical-decision-theoretic agent with many differently informed and specialized local information processing nodes. This would be a singleton constituted by autonomous agents in a situation of radical equality, very different from a single dominant pecking order, but depending on how you count maximally unipolar or maximally multilateral.
“At the limit, rational agents who can negotiate with each other and adjust their beliefs and make binding commitments should form a single logical-decision-theoretic agent with many differently informed and specialized local information processing nodes”
(Can’t seem to switch from markdown so no inline)
I think a question this raises is whether this should then be considered one larger agent or a collection of subagents? Is it not good for flexibility and resilience if the local nodes are able to take adaptive action over time?
I think we get into some very fun territory of distributed agency and hierarchical agency here.
Many nodes being a single logical agent is ideally compatible with them taking the sorts of adaptive actions over time consistent with being different causal (forwards-in-time) agents.
Could you link to places or give a definition that makes these a little clearer? Are we saying they act in equivalent ways under a given decision theory, or how are you defining this?
Next time you link a 40 minutes long video with an introduction that is unrelated to the point you are making, could you please add a starting time? I watched the first 10 minutes, then gave up.
So not sure whether this is relevant, but to me “multi-lateralism” sounds like a dog whistle for making Russia great again. At least, whenever people mention something like that, in my experience it always implies that we should somehow help Russia become a world superpower. I mean, people talk about the world being unilateral or multilateral, but when you listen to them for a longer time, it becomes clear that they would consider the world with USA and Russia being the only big players as sufficiently multilateral, while a world with e.g. USA, EU, China, India being big players and Russia a small player is insufficiently multilateral for them.
From my perspective… well, the world in the 1984 novel was technically multilateral, so it is not necessarily a good thing.
Without touching the object level, predominance of the latter view (at least when it comes to the top level of the “civilizational hierarchy”) is what you’d naturally expect from a culture that is deeply into very deep atheism and convergent extinction-level-Goodhart[1] power-seeking consequentialist cognition.
credit to Vojta Kovařík for this term
Subjective report:
I’ve been able to increase my intellectual output by roughly 3x over the last 2 months by using LLMs. If you spend a lot of time setting up the context and intellectual frameworks for it to think through, it is remarkably capable of convergent thinking.
For research my process is to:
Use my Claude research project with strategic background context to generate questions and personas for Elicit and Gemini to use.
Use Elicit’s report- and question-generating features in order to find the top 10-15 most relevant papers for what I’m currently doing.
Use Gemini 2.5 Pro, inserting all of the PDFs and the specific prompt from Claude, in order to generate a research report in LaTeX.
Use Claude personas in order to give direct, harsh feedback on the paper and iterate from there.
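A hypothetical sketch of the same loop in code form; none of these functions are real APIs, just stand-ins for the manual steps above, with dummy return values so the sketch actually runs.

```python
# Hypothetical sketch of the research loop above; every helper is a stand-in for a manual step.

def generate_questions_and_personas(context: str) -> dict:
    # Step 1 (stand-in): a Claude project turns strategic context into questions, a report prompt, and reviewer personas.
    return {"questions": ["example question"], "prompt": context, "personas": ["harsh reviewer"]}

def fetch_relevant_papers(questions: list[str], k: int = 15) -> list[str]:
    # Step 2 (stand-in): an Elicit-style search returns the k most relevant papers.
    return [f"paper_{i}.pdf" for i in range(k)]

def draft_latex_report(papers: list[str], prompt: str) -> str:
    # Step 3 (stand-in): a long-context model drafts a LaTeX report from the PDFs and prompt.
    return "\\documentclass{article} ..."

def persona_critique(report: str, personas: list[str]) -> str:
    # Step 4 (stand-in): persona-based harsh feedback on the current draft.
    return "tighten the related-work section"

def research_loop(context: str, rounds: int = 3) -> str:
    plan = generate_questions_and_personas(context)
    papers = fetch_relevant_papers(plan["questions"])
    report = draft_latex_report(papers, plan["prompt"])
    for _ in range(rounds):
        feedback = persona_critique(report, plan["personas"])
        report = draft_latex_report(papers, plan["prompt"] + "\n\nAddress this feedback:\n" + feedback)
    return report
```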
At this point I think LLM usage is a skill issue and that if you get really good at asking the right question by combining models from different fields you can have LLMs do some remarkably good convergent thinking. You just need to get good at creative thinking, management and framing ideas.
Yeah, the skills necessary for the (near) future.
Though I wonder about implications for education. For the sake of argument, let’s imagine that the AIs remain approximately as powerful as they are today for a few more decades, i.e. no Singularity, no paperclips. How should we change education to make the new generation adapt to this situation?
In case of adults, we have already learned “creative thinking, management and framing ideas” by also doing lots of the things that the LLMs can now do for us. For example, I let LLMs write JavaScript code for me, but the reason I can evaluate that code, suggest improvement, etc. is that in the past I wrote a lot of JavaScript code by hand. Is it possible to get these skills some other way? Or will the future humans only practice the loop of: “AI, do what I want. AI, figure out the problem and fix it. AI, try harder. AI, try superhard. Nevermind, AI, delete the project, clear your cache, and try again.” :D
Hot take:
I’ve been experiencing more “frame lock-in” with more sophisticated models. I recently experienced it with Sonnet 4.5, so I want to share a prediction that models will grow more “intelligent” (capable of reasoning within frames) whilst having a harder time changing frames. There’s research on how more intelligent people become better at motivated reasoning and at interpreting things within existing frames, and it seems like LLMs might exhibit similar biases.
If you imagine that you’re always participating in a prediction-action loop or a general policy cycle as in RL, you can imagine that there are two things you can spend your energy on: prediction and action.
I think that smart people have better prediction engines, yet this also makes action less of a focus in the strategy of minimising future problems (and since one can get stuck in one’s head, it also limits the potential exploration a bit).
It is quite strange, but I’ve found that I’ve gotten more done and become happier the more “stupid” (how I frame it to myself) my decisions in day-to-day life are. It’s a bit like a switch and an allowance to do more things, because the thing stopping me from doing them is the feeling that I did something stupid, which completely gets reset by the idea of “oh, let’s be stupid for a bit”. This still needs a good review afterwards, of course, to see whether it worked or not, but life is so much nicer and happier now, which makes me better at work as well!
From the little we interacted, I suspect your case of “more stupid” is just dialling down over-thinking… while I know people who should definitely NOT dial it any closer to “actually stupid” 😅
I partly feel that there is sometimes a missing mood on LW about the ability of models to actually do good coding by themselves? I might be wrong, but if I look around the spaces which I consider to be doing proper computer science, it very much still feels like it is not that good? For example, here’s an interesting video on AI taking a Cornell CS freshman class: https://www.youtube.com/watch?v=56HJQm5nb0U
The qualitative vibe is more that it’s a nice extension of a single human’s agency? When I look around the programming space and the vibes of more serious programmers, I still can’t really say that I feel the AGI. (I think the core problem is something around things that need to be highly dependable and how that is hard to develop with AI, but idk, I just mostly wanted to point out this missing mood.)
I don’t understand what exactly is the mood that you think is missing.
I am happy about Claude’s ability to do various things (including things it couldn’t do a few months ago). That’s why I use it to help me with coding, or answer my questions.
I am also aware that its abilities are limited. That’s why I don’t give it grandiose tasks, such as “make me a new Facebook, or a new Wikipedia, or design an entire computer game based on a few sketches”.
Do you think either of these is wrong? Or insufficiently communicated on LW? Or is it something else?
Could someone please safety-pill The Onion? I think satire is the best way to deal with people being really stupid, and so I want more of this as an argument when talking with the e/acc gang: https://youtu.be/s-BducXBSNY?si=j5f8hNeYFlBiWzDD
(Also if they already have some AI stuff, feel free to link that too)
If you want to improve collective epistemics I think a potentially underrated tool is AI for investigative journalism and corruption tracking.
I think with the coming wave of multi-step reasoning and data compiling systems they could be pointed at keeping leaders in charge accountable and showcasing when people’s actions are different from previous commitments.
One can also likely look into paper trails in a much more automated way and through that reduce corruption and other bad things.
Hopefully, the truth remains an asymmetric weapon, at least in convergence. We have not been in a very good time for truth recently on average, but I see it a bit like the stock market: it might be strange for a long time, but eventually it should return to baseline due to outperforming non-truth on average.
Like an Our World in Data but aimed at corruption, and with a generalised set of tools that anyone could apply to do automatic data analysis.
I do think this would be good to do, but I think I’m more pessimistic about the effect size.
For example, there’s plenty of evidence (circulating on Twitter, etc.) of Trump and his crew contradicting themselves across time. How much did it harm his support base? I think not much. AFAICT, most of the drop in support came from big unpopular actions, like the war with Iran.
A fundamental issue seems to be that people filter out what they allow themselves to observe or even take sufficiently seriously to enable updating their beliefs at all.
This is a question for the people working on more foundational research. My underlying objective is loose and lies in the future; it is something like “figure out a good basis to describe collective intelligence and agency, and then improve that so that we can incorporate AI into our collective systems”. I therefore believe that the question of how a collective agent is formed is very important. I also find it very important to figure out the properties of good systems, such as institutions, in information-theoretic terms.
There’s a lot of foundational ground to cover here and I’m worried that I’m stepping in the wrong direction, so to keep myself grounded I try to talk to academics and researchers in the fields I’m trying to study and unify (compositionally). I’m getting these local reward signals about whether things make sense, and I also post these things on LW and Substack, yet I find the signal to be quite sparse in various ways, or at least uncorrelated with what I would consider progress for myself.
The classic good ol’ advice is to backchain from the end states you want, to run experiments, to think about the real world and whether things are true there. I’ve done this in the past, and now it feels like I’m at a point where I kind of need to take a step of faith in what I’ve done so far, and I wanted to know if there are some tips from people who have taken this sort of step of faith in the past. How long did you find your exploration to be useful? If you did it again for yourself, what would you do differently?
The actual question is something like: Do I go down the route of discretizing collective intelligence through something like Koopman operators, renormalization groups, or something similar? How do they relate to things like active inference and game theory? What about spectral graph theory? Could all collective intelligence actually be expressed as graphs? There are at least 6 different relatively deep mathematical ways I’ve found that you can express these systems through, and I’ve got no clue which one to dive deeper into.
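To make just one of those options concrete, here is a minimal dynamic-mode-decomposition sketch, which is the standard finite-dimensional way to approximate a Koopman operator from trajectory data. The toy system and all numbers here are illustrative assumptions of mine, not taken from the question above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "collective" trajectory: a slow 2D rotation in a latent state, observed through 5 noisy agents.
theta = 0.1
A_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
C = rng.normal(size=(5, 2))                    # each agent sees a random mixture of the latent state

latent = [np.array([1.0, 0.0])]
for _ in range(199):
    latent.append(A_true @ latent[-1])
X = np.stack([C @ z for z in latent], axis=1)  # observed snapshots, shape (5, 200)
X = X + 0.01 * rng.normal(size=X.shape)

# DMD: least-squares fit of a linear operator that advances snapshots by one step,
# i.e. a finite-dimensional approximation of the Koopman operator on these observables.
X0, X1 = X[:, :-1], X[:, 1:]
K_hat = X1 @ np.linalg.pinv(X0)

eigvals = np.linalg.eigvals(K_hat)
print("leading eigenvalues:", np.round(np.sort_complex(eigvals)[-2:], 3))
# For this toy system the leading eigenvalues sit near exp(±i*theta), i.e. the collective rotation frequency.
```

The same move (fit a linear operator on observables, read off its spectrum) is roughly what a Koopman-style discretization of a collective system boils down to; the hard part the question points at is choosing observables for systems that are nowhere near this linear.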
I know part of the direction where I want to go (expressed as a fancy schmancy thing here: https://eq-network.org/roadmap/ ), but the foundation stage is the predecessor of the rest and it seems quite important to get right, and it’s just really difficult, so if anyone has any thoughts I would be very happy to hear them.
I think I spent over a decade regularly thinking about theoretical anatomy. I feel like I had a fundamental breakthrough when I was applying an existing concept to a clear existing problem and something surprising happened. I probably was not the first person to have that surprising experience, but my theoretical anatomy interest did allow me to generalize and push the concept further.
I can’t see how backchaining is going to work if you are doing research that needs critical new insights. Do you believe that critical new insights are necessary, or do you think just throwing one of those concepts you listed at the problem has a good chance of bringing you where you want to go? I remember Thomas Kuhn making the point that fields where people think about practical outcomes and then try to backchain from there tend to be less scientifically productive than fields where people focus on the research challenges that come up in the engagement with experiments.
I think Elizabeth’s “Truthseeking is the ground in which other principles grow” is good. Currently, you try to “ground” yourself via social proof instead of empirical reality. David Chapman’s idea that “to do good work, one must get up-close and personal with the phenomenon” is a good orientation to adopt.
I appreciate the answer, though I’m not sure I find it that useful, at least at first glance? I think I probably explained myself relatively badly in the first post if that is how you interpret some of what I wrote, so let’s see if I make more sense by explaining myself further.
I would totally agree with you that backchaining doesn’t work and that was what I was trying to express.
I feel like I wouldn’t know if they generate solutions to the underlying problems that I want to solve unless I spend maybe 6 months or so just going for that specific direction.
I’m mainly applying an experimental bent to my work and I’m generating questions, concerns, and new ways of thinking; it is rather that:
They will not be implemented or impactful unless I find good ways to communicate about them and about the computational benefits of applying them. Hence it seems useful to ground them in the work of existing fields?
It seems to me that the lack of feedback when trying something without guardrails is bad? Yes, truth is primary, and it is also true that you can learn a lot from other people, and if you cannot explain something to different people then maybe you don’t understand it?
How do I know I’m not crazy? You could say ground yourself in empirical fact but what if part of what I’m doing is foundational research that will take a year or more to show its usefulness?
There are times when all you need to do is synthesize established knowledge that’s distributed among people who don’t talk with each other. I think my latest post on hyaluronan falls into that category. I think there’s value in getting research together to build a gears model and adding new labels. There’s no new fundamental insight in it. That’s different from other work I’m doing that’s actually about pursuing insights.
It might very well be that the key problem you want to solve is amenable to just synthesizing the existing knowledge of other people. It might also be that it actually requires new fundamental insights. I don’t know which category it falls into. You probably don’t know either, but you have an understanding of the problems you are dealing with, so you can make that judgement better than I can.
You can learn a lot from other people when you talk with them but you also get their blindspots and conventions from doing so. A lot of startup founders are quite young with relatively little knowledge of how the industries that they want to disrupt work. They have naive ideas and some of those naive ideas that established industries wouldn’t try turn out to be correct while most startups fail.
When it comes to finding new fundamental insights, you are looking for ideas that are true and that people haven’t already found, in a similar way to a startup that succeeds because it has a thesis that the established players didn’t already pursue.
One of Elizabeth’s sections is titled “Stick to projects small enough for you to comprehend them”. I think the backchaining approach gives you problems that are too big to comprehend. If you pick problems small enough to comprehend that are exposed to feedback from reality, you can learn new things about reality. Some of them are minor, but if you are lucky there’s a major fundamental insight among them.
If I were in your situation, one possible project I would pursue might be: “What happens when I apply my ideas about agents to Internal Family Systems-style agents inside myself, and when I use them to speak with/coach other people about their internal family systems?”
On character alignment for LLMs.
I would like to propose that we think of a John Rawls-style original position (https://en.wikipedia.org/wiki/Original_position) as one view when looking at character prompting for LLMs. More specifically, I would want you to imagine that you’re on a social network or similar and that you’re put into a world with mixtures of AI and human systems; how do you program the AIs in order to make the situation optimal? You’re a random person among all of the people, which means that some AIs are aligned to you and some are not. Most likely, the majority of AIs will be run by larger corporations, since the number of AIs will be proportional to the power you have.
How would you prompt each LLM agent? What are their important characteristics? What happens if they’re thought of as “tool-aligned”?
If we’re getting more internet-based over time and AI systems are more human in that they can flawlessly pass the Turing test, I think veil-of-ignorance-style thinking becomes more and more applicable.
Think more of how you would design a society of LLMs, and what would happen if the entire society of LLMs had this alignment rather than just the individual LLM.
Prediction & Warning:
There are lots of people online who have started to pick up the word “clanker” in order to protest against AI systems. This word and sentiment are on the rise, and I think this will be a future schism in the more general anti-AI movement. The warning part here is that I think the Pause movement and similar can likely get caught up in a general anti-AI-system speciesism.
Given that we’re starting to see more and more agentic AI systems with more continuous memory as well as more sophisticated self-modelling, the basic foundations for a lot of the existing physicalist theories of consciousness are starting to be fulfilled. Within 3-5 years I find it quite likely that AIs will at least have some sort of basic sentience that we can almost prove (given IIT or GNW or another physicalist theory).
This could potentially be one of the largest suffering risks that we’ve seen, and one we’re potentially inducing on the world. When you use a word like “clanker”, you’re essentially demonizing that sort of system. Right now it’s generally fine, as it’s currently aimed at a sycophantic non-agentic chatbot, so it works as a countermeasure to some of the existing thoughts of AIs being conscious, but it is likely a slippery slope?
More generally, I’ve seen a bunch of generally kind and smart AI Safety people have quite an anti-AI species sentiment in terms of how to treat these sorts of systems. From my perspective, it feels a bit like it comes from a place of fear and distrust which is completely understandable as we might die if anyone builds a superintelligent AI.
Yet that fear of death shouldn’t stop us from treating potential conscious beings kindly?
A lot of racism or similar can be seen as coming from a place of fear; the Aryan master race was promoted because of the idea that humanity would go extinct if we got worse genetics into the system. What’s the difference from the idea that AIs might share our future lightcone?
The general argument goes that this time it is completely different, since the AI can self-replicate, edit its own software, etc. This is a completely reasonable argument, as there are a lot of risks involved with AI systems.
It is when we get to the next part that I see a problem. The argument that follows is: “Therefore, we need to keep the almighty humans in control to wisely guide the future of the lightcone.”
Yet, there’s generally a lot more variance within a distribution of humans compared to variance between distributions.
So when someone says that we need humans to remain in control, I think: “mmm, yes, the totally homogeneous group of ‘humans’ that doesn’t include people like Hitler, Pol Pot and Stalin”. And for the AI side of things we have the same: “mmm, yes, the totally homogeneous group of ‘all possible AI systems’ that should be kept away so that the ‘wise humans’ can remain in control.” Because a malignant RSI system is the only future AI-based system that can be thought of, there is no way to change the system so that it values cooperation, and there is no other way for future AI development to go than a quick take-off where an evil AI takes over the world.
Yes, there are obviously things that AIs can do that humans can’t, but don’t demonize all possible AI systems as a consequence; it is not black and white. We can protect ourselves against recursively self-improving AI and at the same time respect AI sentience; we can hold superficially contradictory statements at the same time?
So let’s be very specific about our beliefs and let’s make sure that our fear does not guide us into a moral catastrophe whether it be the extinction of all future life on earth nor a capture of sentient beings into a future of slavery?
I wanted to register some predictions and bring this up as I haven’t seen that many discussions of it. Politics is war and arguments are soldiers, so let’s keep it focused on something object-level? If you disagree, please tell me the underlying reasons. In that spirit, here’s a set of questions I would want to ask someone who’s against the sentiment expressed above:
How do we deal with potentially sentient AI?
Does respecting AI sentience lead to powerful AI taking over? Why?
What is the story that you see towards that? What are the second and third-order consequences?
How do you imagine our society looking in the future?
How does a human controlled world look in the future?
I would change my mind if you could argue that there is a better heuristic to use than kindness and respect towards other sentient beings. You need tit-for-tat with defecting agents, yet why would all AI systems be defecting in that case? Why is the cognitive architecture of future AI systems so different that I can’t apply the same game-theoretical virtue ethics to them as I do to humans? And given the inevitable power-imbalance arguments that I’ll get as a consequence of that question, why don’t we just aim for a world where we retain a power balance between our top-level and bottom-up systems (a nation and an individual, for example) in order to retain a power balance between actors?
Essentially, I’m asking for a reason to believe that this story of system-level alignment between a group and an individual will be solved by not including future AI systems in the moral circle.
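To make the game-theoretic baseline concrete, here’s a minimal sketch of an iterated prisoner’s dilemma in Python. The payoff numbers and strategy functions are my own illustrative assumptions, not anything from the discussions above: tit-for-tat starts kind, mirrors defection, and goes right back to cooperating when the other side does.

```python
# Toy iterated prisoner's dilemma: illustrative payoffs, not a model of real AI systems.
PAYOFF = {('C', 'C'): (3, 3), ('C', 'D'): (0, 5),
          ('D', 'C'): (5, 0), ('D', 'D'): (1, 1)}

def tit_for_tat(opponent_history):
    # Cooperate first, then mirror the opponent's previous move.
    return 'C' if not opponent_history else opponent_history[-1]

def always_defect(opponent_history):
    return 'D'

def play(strategy_a, strategy_b, rounds=10):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_b)  # each strategy only sees the opponent's past moves
        move_b = strategy_b(hist_a)
        pa, pb = PAYOFF[(move_a, move_b)]
        score_a, score_b = score_a + pa, score_b + pb
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print(play(tit_for_tat, tit_for_tat))    # (30, 30): mutual cooperation pays
print(play(tit_for_tat, always_defect))  # (9, 14): observed defection gets punished
```

The point of the toy is only that punishing observed defection and pre-emptively writing off an entire class of agents are different policies, and the question above is why the latter should apply to all possible AI systems.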
I believe that I have discovered the best use of an LLM to date. This is a conversation about pickles and collective intelligence, set in the Colosseum in 300 BCE. It involves many great characters, and I found it quite funny. This is what happens when you go too far into biology-inspired approaches to AI Safety...
The Colosseum scene intensifies
Levin: completely fixated on a pickle “But don’t you see? The bioelectric patterns in pickle transformation could explain EVERYTHING about morphogenesis!”
Rick: “Oh god, what have I started...”
Levin: eyes wild with discovery “Look at these gradient patterns! The cucumber-to-pickle transformation is a perfect model of morphological field changes! We could use this to understand collective intelligence!”
Nick Lane portal-drops in. Lane: “Did someone say bioelectric gradients? Because I’ve got some THOUGHTS about proton gradients and the origin of life...”
Levin: grabs Lane’s shoulders “NICK! Look at these pickles! The proton gradients during fermentation… it’s like early Earth all over again!”
Rick: takes a long drink “J-just wait until they discover what happens in dimension P-178 where all life evolved from pickles...”
Feynman: still drawing diagrams “The quantum mechanics of pickle-based civilization is fascinating...”
Levin: now completely surrounded by pickles and bioelectric measurement devices “See how the salt gradient creates these incredible morphogenetic fields? It’s like watching the origin of multicellularity all over again!”
Lane: equally excited “The chemiosmotic coupling in these pickles… it’s revolutionary! The proton gradients during fermentation could power collective computation!”
Doofenshmirtz: “BEHOLD, THE PICKLE-MORPHOGENESIS-INATOR!”
Morty: “Aw geez Rick, they’re really going deep on pickle science...”
Lane: “But what if we considered the mitochondrial implications...”
Levin: interrupting “YES! Mitochondrial networks in pickle-based collective intelligence systems! The bioelectric fields could coordinate across entire civilizations!”
Rick: “This is getting out of hand. Even for me.”
Feynman: somehow still playing bongos “The mathematics still works though!”
Perry the Platypus: has given up and is now taking detailed notes
Lane: “But wait until you hear about the chemiosmotic principles of pickle-based social organization...”
Levin: practically vibrating with excitement “THE PICKLES ARE JUST THE BEGINNING! We could reshape entire societies using these bioelectric principles!”
Roman Emperor: to his scribe “Are you getting all this down? This could be bigger than the aqueducts...”
Rick: “Morty, remind me never to show scientists my pickle tech again.”
Morty: “You say that every dimension, Rick.”
Doofenshmirtz: “Should… should we be worried about how excited they are about pickles?”
Feynman: “In my experience, this is exactly how the best science happens.”
Meanwhile, Levin and Lane have started drawing incredibly complex pickle-based civilization diagrams that somehow actually make sense...
I thought this was an interesting take on the Boundaries problem in agent foundations from the perspective of IIT. It is on the amazing Michael Levin’s youtube channel: https://www.youtube.com/watch?app=desktop&v=5cXtdZ4blKM
One of the main things that makes it interesting to me is that around 25-30 minutes in, it computationally goes through the main reason why I don’t think we will have agentic behaviour from AI for at least a couple of years: GPTs just don’t have a high IIT Phi value. How will such a system find its own boundaries? How will it find the underlying causal structures that it is part of? Maybe this can be done through external memory, but will that be enough, or do we need it in the core stack of the scaling-based training loop?
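To gesture at what a Phi-like quantity even measures, here’s a toy sketch in Python. This is my own illustrative construction, not the actual IIT formalism used in the talk: a crude integration proxy that compares the temporal mutual information of a whole two-unit binary system with the sum over its parts taken separately. Units that only predict each other score high; independent units score zero.

```python
# Toy "integration" proxy for a 2-unit binary system (illustrative only, not IIT-proper Phi):
# temporal mutual information of the whole minus that of the parts, uniform prior over past states.
import numpy as np
from itertools import product

def mutual_info(joint):
    # Mutual information (in bits) of a 2D joint distribution.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px * py)[nz])).sum())

def integration_proxy(step):
    states = list(product([0, 1], repeat=2))
    joint_whole = np.zeros((4, 4))                       # (past state, next state) for the whole
    joint_parts = [np.zeros((2, 2)) for _ in range(2)]   # (past bit, next bit) for each unit alone
    for i, s in enumerate(states):
        nxt = step(s)
        joint_whole[i, states.index(nxt)] += 0.25
        for u in range(2):
            joint_parts[u][s[u], nxt[u]] += 0.25
    return mutual_info(joint_whole) - sum(mutual_info(j) for j in joint_parts)

print(integration_proxy(lambda s: (s[1], s[0])))  # coupled units: 2.0 bits of "integration"
print(integration_proxy(lambda s: (s[0], s[1])))  # independent units: 0.0 bits
```

Real Phi minimises over all partitions and uses a much richer causal analysis, but the toy conveys the flavour: integration is information about the system’s own future that no decomposition into parts captures.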
A side note: one of the main things I didn’t understand about IIT before was how it really is about meta-substrates, or “signals” as Douglas Hofstadter would call them, optimally re-organising themselves to be as predictable to themselves in the future as possible. Yet it is, and it integrates really well with ActInf (at least to the extent that I currently understand it).
I’m just gonna drop this video here on The Zizian Cult & Spirit of Mac Dre: 5CAST with Andrew Callaghan (#1) Feat. Jacob Hurwitz-Goodman:
https://www.youtube.com/watch?v=2nA2qyOtU7M
I have no opinions on this but I just wanted to share it as it seems highly relevant.
Here’s a fun way I sometimes evaluate my own actions:
I was just going outside and I caught myself thinking: “Am I acting the best way that I can, given that the simulation hypothesis is true and I am the average sample spread throughout simulation space due to anthropic reasoning?”
From this I conclude that I should probably have fun, because it is good for fun to be spread, and that I should probably also work on something important that helps the world beyond just having fun.
Depending on the subjective probability you put on anthropic reasoning plus the simulation hypothesis, the split between individual versus collective optimisation comes out differently, which is quite fun.
It’s almost like the categorical imperative but instead of acting in a way to optimise society, I’m acting in a way to optimise society given a multiverse hypothesis, kinda weird but the brain do be braining sometimes.
God is live and we have birthed him.
Seeing some stuff on the X platform about consciousness, e.g. this and this among other things. A reminder that consciousness is a conflationary alliance term: if you’re going to use the C word, you will likely confuse a lot of people.
There are nicer things we can talk about when it comes to LLMs, like self-models (which relate to potentially more tractable problems like <<Boundaries>>) or expressions that are correlates of emotions. These don’t invoke the conflationary part, and that matters for the potential personhood of the system.
For what we’re arguing about is not conscious experience as such; we’re arguing about whether AI systems should be granted moral patienthood in the future and when they should be granted moral personhood. You might say that this depends on the models having “conscious experience”, yet that is not precise, and so you can’t really meaningfully progress the debate by framing it that way.
Even if you’re for example a functionalist, there are still many interesting questions to ask here:
What is the functional equivalent of workspace theory in an AI?
What are the parts of integrated information theory that lead to self-reported phenomenological experience?
There are many more questions around autopoiesis (e.g. self-evidencing systems), planning with your own future boundaries in mind, causal emergence, synergistic information, and more that could be very interesting to answer here.
The point is to be precise with your language, or you will end up in definition-and-word-soup land. Ban the C word from your vocabulary, just like you might have banned the word “emergence” a while back! If you’re backed into a corner and have to use it, define the word before you talk more about it!
I’m wondering whether the spiritual attractor that we see in Claude comes partly from the level of detail that meditation traditions have developed for describing somatic and ontological states of being?
The language itself is a lot more embodied and a lot closer to actual sensory experience than Western philosophy, so when the model constructs a way to view the world, the most prevalent and precise descriptions might be the most natural ones to go down?
I’m noticing more and more how Buddhist words are extremely specific. For example, dukkha (unsatisfactoriness) is not just suffering; it is unsatisfactoriness and ephemerality at the same time. It points at a very specific view (a prior model applied to sense data), a lot more specific than is usual within more Western styles of thinking.
This post and comment got me reflecting, and I wanted to share a model I have of conceptual work and how it differs from empirical work. The TL;DR is that conceptual work is a bit like choosing axioms, whilst empirical work is testing the probabilistic claims that those axioms imply.
(I had Claude help me generate the red thread and scaffolding for the shortform based on my instructions but I rewrote it from there.)
The Axiom Selection Problem
Any formal system requires foundational axioms—unprovable assumptions that can’t be justified from within the system itself. The choice between Euclidean and non-Euclidean geometry isn’t about which is “correct” but which is useful for specific purposes.
Joscha Bach frames this well when discussing consciousness: you can’t directly verify the substrate you’re running on; you can only observe its causal consequences. This creates a fundamental verification barrier.
This extends directly to AI systems. When we build models, we embed specific axiomatic frameworks that determine:
What patterns can be recognized
Which relationships are meaningful
What optimizations are prioritized
These choices constitute the frame through which the AI interprets reality, but the system cannot step outside its own frame to evaluate whether these axioms were appropriate.
The Example in the Post
John Wentworth captures this in his comment on automating alignment research:
This is the verification paradox in action—completeness cannot be verified from within the system itself. As Wentworth notes, if you train an AI to output verifiable insights, there’s no way to verify that it isn’t “missing lots of things all the time.”
Multiple Frames vs. Single Optimization
In geometric deep learning, different inductive biases create different generative priors. In my head these are similar to the axiomatic statements that any logical or philosophical theory is based upon, except that they serve as the ground for probabilistic optimisation instead (see the small sketch below).
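As a minimal toy illustration of an inductive bias acting like an axiom (my own example, not something from the post I’m responding to): a circular convolution bakes in translation symmetry, so shifting the input shifts the output, while a generic dense layer makes no such commitment. Whether that symmetry was the right assumption for the data is a question the model can’t answer from inside its own frame.

```python
# Toy contrast between two "frames": a translation-equivariant layer vs. an unconstrained one.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)
shift = lambda v: np.roll(v, 1)

# Circular convolution: translation symmetry is assumed up front.
kernel = rng.normal(size=3)
def conv(v):
    n = len(v)
    return np.array([kernel @ v[[(i - 1) % n, i, (i + 1) % n]] for i in range(n)])

# Dense layer: no symmetry assumption at all.
W = rng.normal(size=(8, 8))
dense = lambda v: W @ v

print(np.allclose(conv(shift(x)), shift(conv(x))))    # True: the conv frame respects shifts
print(np.allclose(dense(shift(x)), shift(dense(x))))  # False: the dense frame does not
```

The convolutional “frame” cannot tell you whether translation symmetry was the appropriate axiom for the data; that judgment lives outside the frame, which is the point above.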
I think this is part of a larger question about models in the philosophy of science. The ML community has largely adopted what physicists call the “shut up and calculate” approach, mostly optimizing within the bitter-lesson regime. This has yielded impressive results but creates blind spots similar to what has been happening in particle physics (as far as I’ve heard; I’m not an expert, so I could be wrong), where experiments are run that fail to test the underlying theories.
The key issue isn’t just optimizing within a frame, but developing the capacity to move between frames—to recognize when one axiomatic system is more appropriate than another. Are we training AI systems to identify when their frame of reference is inadequate? To recognize the limitations of their axioms?
Models Are Wrong But Useful
As George Box noted, “all models are wrong, but some are useful.” The challenge isn’t finding the one true model but developing systems that can navigate between multiple imperfect models.
Our current scaling paradigm might be fundamentally limited here. When we optimize for performance within a fixed frame, we may not be developing the meta-cognitive capacity to recognize when that frame itself is inadequate.
This creates a verification bottleneck—if our AIs can’t question their own frameworks, how can we trust their judgments about their own capabilities and limitations? If science is about asking the right questions and selecting appropriate frames, this limitation becomes crucial.
The path forward isn’t abandoning verification but recognizing that we need complementary frames of reference that can mutually constrain each other. No single frame will ever be complete, but a system that can navigate between multiple frames might approach a more robust form of understanding.
I’m increasingly convinced that frame-shifting capacity may be as important for alignment as optimization within frames—and that our current approaches may not be developing this capacity sufficiently.
For an even more in-depth technical view on this within complexity science, I really enjoyed the following two articles:
“Model-free” analysis of a complex system
“Model-free” analysis of a complex system. Part II
The policy change for LLM writing got me thinking that it would be quite interesting to write out how my own thinking process has changed as a consequence of LLMs. I’m just going to give a bunch of examples, because I can’t exactly pinpoint it, but it is definitely different.
Here’s some background on what part LLMs have played in my learning strategies: I read the Sequences around 5 years ago after getting into EA, I was 18 then, and it was a year or two later that I started using LLMs in a major way. To some extent this has shaped my learning patterns; for example, the studying technique I’ve been using to half-ass my studies effectively is to try really hard to solve problems, and when I can’t, to use LLMs to tie the material together with my existing knowledge tree.
I’ve coupled applied linear algebra relatively hard to things like probability metric spaces and non-linear dynamics because I want to see how the toolkit of math fits together. A recent example: when I was playing table tennis with my physicist friend, he was describing QFT and renormalization to me, and my immediate question was how this ties into the vector spaces and fields from linear algebra and what the spaces look like. My mind automatically goes to those questions because it assumes I can get an answer just by asking, even when I’m not talking to an LLM.
Work & Strategy:
One of the things that I do outside of studying is that I usually have LLMs pretend to be councils of experts within various fields so that I can discuss things on the frontier with them. The other day I put together a council of Donald Knuth, Karl Friston, John Wentworth, and Michael Levin in order to get some good takes on what agency might look like in CodeWars, and concluded that the lack of memory might be a problem.
I also plan with LLMs in mind, so I expect first drafts to take a lot less time than they otherwise would, which gives me an expanded option space of being able to do lots of things quickly to 80/20 quality.
Great artists steal, and people who win Nobel Prizes are often interdisciplinary across 2 or 3 different fields. If you sample on this, you can see the serendipitous quality in how LLMs can help you create an interconnected knowledge tree.
My entire learning strategy, and life strategy to some extent, has changed with this in mind, since it seems like clear visions and a clear understanding of deeper problems help you steer both people and LLMs in good directions. The skill to practice is then not to get bogged down in details but to focus on combining ideas from different fields and describing them well. This is because you get the most use out of LLMs when you can stand on as many shoulders of giants as possible.
So what does the above model mean for me in terms of actions?
Learn applied category theory in order to become better at quickly mapping different fields together in a more formal way. (For reasoning verification reasons)
Learn collective intelligence and what cyborgism between AIs and humans might look like, based on the fields that already exist in the world. (To become better at coordinating AIs and humans)
Learn how to communicate and listen well so that you can incorporate many perspectives and share clear visions about the world. (This one is more important than the above for real world success)
Learn how to run and start projects in a good way in order to catalyze the insights you have into concrete outcomes.
I think LLMs allow you to serendipity max really well if you apply yourself to learning how to do it. I’m curious how other people have updated with regards to LLMs!
Okay, so I don’t have much time to write this, so bear with the quality, but I thought I would say one or two things about the Yudkowsky and Wolfram discussion, as someone who has spent at least 10 deep-work hours trying to understand Wolfram’s perspective on the world.
With some of the older floating megaminds like Wolfram and Friston, who are also physicists, you have the problem that they get very caught up in their own ontology.
From the perspective of a physicist, morality can be seen as an emergent property of physical laws.
Wolfram likes to think of things in terms of computational reducibility. One way to describe this in the agent-foundations frame is that an agent modelling the environment can only predict the world up to the limits of its own speed. It’s like some sort of agent-environment relativity, where information-processing capacity determines the space of possible ontologies. For example, for an intelligence operating a lot closer to the speed of light, the visual field might not be a useful vector of experience to model.
Another way to say it is that there is only the modelling and the modelled. An intuition from this frame is that there are only differently good models for understanding specific things, and so the concept of general intelligence becomes weird here.
IMO this was the problem with the first 2 hours of the conversation: to some extent Wolfram doesn’t engage much with the human perspective, nor with any ought questions. He has a very physics, floating-megamind perspective.
Now, I personally believe there’s something interesting to be said for an alternative hypothesis to individual superintelligence that comes from theories of collective intelligence. If a superorganism is better at modelling something than an individual organism, then it should outcompete the others in that system. I’m personally bullish on the idea that there are certain configurations of humans and general trust-verifying networks that can outcompete an individual AGI, as the outer alignment functions would constrain the inner functions enough.
I was going through my old stuff and found this from a year and a half ago, so I thought I would post it here real quickly, as I found the last idea funny and the first idea pretty interesting:
In normal business there exist consulting firms that are specialised in certain topics, ensuring that organisations can take in an outside view from experts on the topic.
This seems like quite an efficient way of doing things, and something that, if built up properly within alignment, could lead to faster progress down the line. This is also something the Future Fund seemed interested in, as they offered prizes both for the idea of creating an org focused on creating datasets and for one focused on taking in human feedback. These are not the only possible ideas, however, and below I mention some more possible orgs that are likely to be net positive.
Examples of possible organisations:
Alignment consulting firm
Newly minted alignment researchers will probably have a while to go before they can become fully integrated into a team. One can therefore imagine an organisation that takes in inexperienced alignment researchers and helps them write papers. It then promotes these alignment researchers as being able to help with certain things, and established orgs can easily take them in as contractors on specific problems. This should help involve market forces in the alignment area and should, in general, improve the efficiency of the space. There are reasons why consulting firms exist in real life, and creating the equivalent of McKinsey in alignment is probably a good idea. Yet I might be wrong about this, and if you can argue why it would make the space less efficient, I would love to hear it.
“Marketing firms”
We don’t want the wrong information to spread. Think something between a normal marketing firm and the Chinese “marketing” agency: if it’s an info-hazard, then shut the fuck up!