Jonas Hallgren comments on Jonas Hallgren’s Shortform

Jonas Hallgren 5 Apr 2026 10:10 UTC
32 points
8
Here’s a thought I might expand on but that I’ll just mention quickly now.

I feel there’s a gap in the AI Safety field when it comes to automated alignment science.
There’s a lot of talk about verification of individual knowledge claims and similar yet it doesn’t feel like anyone is bringing principles from metascience and similar? If science is a collective process of generation and verification why are we not talking about things like generating distributed fault tolerance algorithms or ways of collectively verifying knowledge so that we know that any individual piece of work is right?
There’s an existent field of meta science and whenever I see posts about automated alignment science I never see any discussion of this? Where are my Michael Nielsen references? Where are the ideas about proof burden and verification and thinking about this from a philosophy of science perspective?
See the following posts as examples, I don’t see anyone citing or mentioning much metascience stuff. (I’m not saying these are bad posts, I want to point out that there’s a missing mood):
Where exactly in science do we see a 1 on 1 verification regime where a singular human verifies the fact that something is true or false? Byzantine fault tolerance is a distributed scheme if you want convergent safety properties it makes more sense to look at systems with less individual failure points?
(Some stuff that examplify what I think should be looked at: )
- A Vision of Metascience—A generally very good article on metascience and what science could be
- Knowledge Lab—A lab studying various properties about productivity in science
- Buck 5 Apr 2026 15:51 UTC
  10 points
  2
  Parent
  This is a great question; I’m definitely going to think about this more next time I’m thinking about the prospects for automating AI safety research.
  One part of it is that the LessWrong rationalist tradition is generally focused on individual excellence and great-man-theories, and so a lot of those proposals feel unnatural for people here to think about.
  Incidentally, I feel like my personal experience with people who are really into philosophy of science has been quite negative on average—they tend to have confusing worldviews and to-me-bizarre takes, and I tend to not find them useful to talk to. The meta-science people have seemed pretty reasonable to me, but often they haven’t seemed that AI focused.
  - Mateusz Bagiński 6 Apr 2026 10:21 UTC
    3 points
    0
    Parent
    I feel like my personal experience with people who are really into philosophy of science has been quite negative on average—they tend to have confusing worldviews and to-me-bizarre takes, and I tend to not find them useful to talk to.
    Can you give examples?
  - Jonas Hallgren 6 Apr 2026 10:16 UTC
    2 points
    0
    Parent
    Yeah totally fair with the philosophy of science thing, I’ve more talked to AI and Metascience people who mention principles from philosophy of science which makes more sense to me. A little bit how virtue ethics is nice to talk about with certain AI Safety people whilst it’s less enjoyable to talk to a professor in virtue ethics (maybe, not too high sample size here).
    (I think James Evans from knowledge lab is a cool person who’s at the intersection of AI and metascience, his main work is on knowledge and improving science and he’s over the last 3 years pivoted to how AI can help with this. An example of something he wrote is this article on Agentic AI and the next intelligence explosion)
- Kaarel 5 Apr 2026 11:04 UTC
  8 points
  2
  Parent
  maybe even more generally, there is a “game of questions/problems and answers/solutions” played by humans and human communities, that one can study to become better able to create a setup in which AIs are playing this game. some questions about this game: “how does an individual human or a human community remain truth-tracking?”, “what structures can do load-bearing work in a truth-tracking system?”, “to involve a new mind in a community of truth/knowledge/understanding, what is required of the new mind and what is required of its teachers/environment?”, “what interventions make a system more truth-tracking?”, “how does one avoid meaning drift/subversion?”. this includes the science stuff you talk about but also very basic stuff like a kid learning arithmetic from their parents or humans working successfully with integrals for two centuries before we could define them rigorously — like, how come we can mostly avoid goodharting answers against the judgment of other people, how come we can mostly avoid becoming predictors of what other people would say, how come we can do easy-to-hard generalization of notions, etc.. the usual losses/setups currently used by ML practitioners might be sorta wrong for these things, and maybe one could think carefully about the human case and come up with better losses/setups to use in an epistemic system. an obstacle is that in the human case, stuff working well is probably meaningfully aided by the agents already having shared human purposes ^[1] ^[2] and by already having similar “priors” coming from the human brain architecture and similar upbringings. another obstacle is that the human thing is probably relying on various low-level things that are hard to see and that probably lack equivalents in current ML systems and are too low-level to be created by any simple intervention on a community of LLMs. another obstacle is that there are probably just very many ideas involved in making humans truth-tracking (though you can then ask: how do we set up a meta-level thing that finds and implements good ideas for how an epistemic system should work). another obstacle is that in the human case, human purposes are broadly aligned with understanding stuff better in the systems of understanding we have (whereas if we force some system of presenting understanding on the LLMs and try to get them to produce some understanding and present it legibly in that system, their purposes are probably not well-aligned by default with doing that). (oh also, if your work results in understanding these questions well, you should worry about your work helping with capabilities. maybe don’t give capabilities researchers good answers to “how do we make it so the originators of good ideas get rewarded in an epistemic community?”, “how does one tell when a new notion is good to introduce into the shared lexicon?”, “what is the process of coming up with a good new notion like?”, “what sort of thing is a good model of a situation?”, “how does one avoid assigning a lot of resources to useless cancers like algebraic number theory?” ^[3] .) anyway, despite these issues, it still seems like an interesting direction to work on
  
  copying a note i wrote for myself on a related question:
  
  ″
  
  beating solomonoff induction at grokking a notion
  - how come as humans we can understand what someone means when using a word. as opposed to becoming a predictor of what they would say. it is possible for a human to not make the mistakes another person would make when eg classifying images for having dogs vs not! roughly speaking solomonoff would be making the same mistakes the person would make
    this is a classic issue plaguing many (maybe even most?) things in alignment. eg ELK, AGI via predictive modeling, CIRL/RLHF or just pretty much anything involving human feedback
  - can’t we write an algo for that, and have that not be dumb like solomonoff is dumb
  - some ideas for ways to implement a thing that is good like this / what’s going on in making the human thing work:
    an even stronger simplicity prior than solomonoff. eg if there are explainable mistakes on a simple model, you want the simple model that doesn’t predict the mistakes. this will have inf log loss but let’s just do a version of the simple hypothesis with noise, and then penalize the likelihood term less. have people not already considered this for solving the model + data split problem? does this attempt to solve the model data split problem introduce some pathologies?
    you have pathology of not specifying even the hypothesis in the seq prediction case (like it’ll be better to drop bits and take the likelihood loss). but i think at least this pathology is not present in the function case, if we don’t get randomness in the universal semimeasure way (like if we make the randomness not shared between different inputs — each input has to sample its own random bits)
    alternatively: just set abs bound on model complexity, rest has to be likelihood. this feels bad because if you get the bound wrong you get some nonsense. that said in a sense this is equivalent to the previous proposal (like if you pick the length bound the previous thing with some hyperparam would find). idk maybe in the function case you can look at how many bits of entropy are left given the hypothesis, like imagine this graphed as a function of hypothesis length, and like see some point at which the derivative changes or sth. (this doesn’t show up in the seq case because there it’s pretty much just 1 bit paying for 1 bit (until you specify it in full if it’s finite complexity))
    
    simplicity prior defined in terms of existing understanding
    you specify properties of the thing or notion sometimes
    eg [concrete] and [abstract] make a partition of things maybe, but [alice would think this is concrete] and [alice would think this is abstract] might not. eg knowing [if something is abstract, then it usually helps a lot to study examples to understand it] can help you understand when your teacher alice is making a mistake about an abstractness claim
    or eg: 1+1=2 won’t be true if you accidentally assign 1->rabbit and 2->chicken from a demonstration (for any reasonable meaning of plus)
    
    some sort of t complexity bound might help. tho really you aren’t gaining a mechanism when you learn what a dog is. you are more like learning a new question/problem
    also as a human one can just ask: what is it that this person is trying to teach me. what is this person trying to point at. this is a question you can approach like any other question
    when we gain a notion, we gain sth like a question that can be asked about a thing. and we have criteria on this notion. we gain “inference rules”/”axioms” involving the notion. ultimately we are wanting it to play some role in our thought and action. that role can guide the precisification/development/reworking of the concept. the role can be communicated. it can be shared between minds
    to gain the chair notion is to gain the question “is this a chair?”. this has an immediate verifier (mostly visual), but also further questions: “can i sit on it?”, “is it comfortable to sit on it?”, “would i use it when working or dining?”, “does it have a back support part and a butt support part and legs?”. a chair should support the activities of sitting and working and dining. all these can have their own immediate verifiers and further questions
    we understand “is this a chair?” as clearly separate from “would the person who taught me the chair notion consider it a chair?”. it is much closer to “should the person who taught me the chair notion consider it a chair?”. it is also close to “should i consider it a chair?”
  important basic point here: our dog thing is NOT a classifier. classifiers or noticing trick circuits can be attached to our dog structure but the structure is not a classifier
  
  toy problem here: how do you pin down the notion of a proof? (how did we historically?) how do you pin down the notion of an integral? (how did we historically?) maybe study these actual examples
  
  pinning down the notion of a proof might be a good example to study in detail. like, how does one become able to tell whether something is a good proof? a valid reasoning step? how does one start to reason validly? one reason to be interested in this is that it’s analogous to: how does one become able to tell what’s good, and come to act well? both are examples of getting some sort of normativity into a system
  
  another example: we have a notion of truth, not just some practical thing like provability (or in a broader context supporting action well maybe). our notion of truth is separate from our notion of provability eg because we have the “axiom/principle” when talking about truth that exactly one of a sentence and its negation is provable, or alternatively/equivalently we have an inference rule of going from “P is not true” to “not-P is true”, and such a rule is just not right for provability (there are sentences such that the sentence and its negation are both not provable). by gödel’s completeness theorem, i guess a fine notion of truth, ie one which has a model, is precisely one which assigns ⁰⁄₁ to all sentences and is coherent under proving. we operate with truth by relying on these properties, without having a decision algorithm or even a definition for truth (cf tarski’s thm).
  
  how did we understand what an integral is?
  
  i think we were using integrals for like two centuries before we knew how to properly define them (eg via riemann sums). how come we were pretty successful with that? like, how come we did all this cool stuff, we came to all these correct conclusions, without properly knowing what integrals are? i think the general thing that happened is that we hypothesized an object with some properties and these properties turned out to be those of a real thing, and in fact to pin it down uniquely! though of course this leaves the following important question: how did we identify this set of properties as important?
  
  ″
  1. ↩︎
    against which the system working well is perhaps ultimately measured
  2. ↩︎
    eg meanings do in fact drift and get intentionally re-engineered, but this is often done to better support human activities/purposes and so fine
  3. ↩︎
    haha
  - Cole Wyeth 6 Apr 2026 0:15 UTC
    4 points
    0
    Parent
    Interesting points.
    If you penalize the complexity of my hypotheses more steeply, I can just choose a hypothesis that is a universal distribution which penalizes complexity minimally as usual. So that won’t work, at least not naively. This sort of question is studied in algorithmic statistics.
    - Kaarel 6 Apr 2026 8:25 UTC
      4 points
      0
      Parent
      I agree with your point in the canonical solomonoff sequence prediction case. I think your point is what I mean in my note by “you have pathology of not specifying even the hypothesis in the seq prediction case (like it’ll be better to drop bits and take the likelihood loss)”. I think this pathology is maybe not present in “function solomonoff” (I state this in the note as well but don’t really explain it), though I’m very much uncertain.
      
      to state the hopeful “function solomonoff” story in more detail:
      
      By “function solomonoff”, I mean that we have a data set of string pairs , and we think of one hypothesis as being a program that takes in an and outputs a probability distribution on strings from which is sampled. Let’s say that we are in the classification case so always (we’re distinguishing pictures which show dogs vs ones which don’t, say).
      The “canonical loss” (from which one derives a posterior distribution via exponentiation) here would be the length of the program specifying the distribution plus the negative log likelihood assigned to summed over all . What I’m suggesting is this loss but with a higher coefficient on the length of the program than on the likelihood terms.
      Suppose that the classification boundary is most simply given by what we will consider a “simple model” of complexity bits, together with “systematic human error” which changes the answer from the simple model on a fraction of the inputs, with the set of those inputs taking bits to specify.
      If we turned this into sequence prediction by interleaving like , then I’d agree that if we penalize hypothesis length more steeply than likelihood, then: over getting a model which does not predict the errors, we would get a universal-like hypothesis, which in particular starts to predict the human errors after being conditioned on sufficiently many bits. So the idea of more steep penalization of hypothesis length doesn’t do what we want in the sequence prediction case. But I have some hope that the function case doesn’t have this pathology?
      Some models of the given data in the function case:
      the “good model”: The distribution is given by the simple model with a probability of flipping the answer on top (independently on each input). This gets complexity loss like plus something small for specifying the flip model, and its expected neg log likelihood is .
      the “model that learns the errors”: This should generically take bits to specify, and it gets expected neg log likelihood.
      the “50/50 random distribution” model: This takes bits to specify and has bit of expected neg log likelihood.
      some “universal hypothesis model”: I’m not actually sure what this would even be in the function setting? If you handled the likelihood part by giving a global string of random bits which gets conditioned on other input-output pairs, then I agree we could write something bad just like in the sequence prediction case. But if each input gets its own private randomness, then I don’t see how to write down a universal hypothesis that gets good loss here.
      
      So at least given these models, it looks like the “good model” could be a vertex of the convex hull of the set of attainable (hypothesis complexity, expected neg log likelihood) tuples? If it’s on the convex hull, it’s picked out by some loss of the form described (even in the limit of many data points, though we will need to increase the hypothesis term coefficient compared to the sum of log likelihoods term as the data set size increases, ie in the bayesian picture we will need to pick a stronger prior when the data set is larger in this example).
      
      that said:
      
      Maybe I’m just failing to construct the right “universal hypothesis” for this example?
      It seems plausible that some other pathology is present that prevents nice behavior.
      I haven’t spent that much time trying to come up with other pathological constructions or searching for a proof that sth like the good model is optimal for some hyperparameter setting.
      
      I can see some other examples where this functional setup still doesn’t work nicely. I might write more about that in a later comment. The example here is definitely somewhat cherry-picked for the idea to work, though I also don’t consider it completely contrived.
      I think it’s very unlikely this steeper penalization is anywhere close to a full solution to the philosophical problem here. I only have some hope that it works in some specific toy cases.
- Mateusz Bagiński 6 Apr 2026 10:30 UTC
  5 points
  1
  Parent
  I don’t want to claim this with too much confident cynicism (also, I’m not really tracking the “automate alignment” literature much), but for the sake of completing the hypothesis space: this is roughly what you’d expect if large swaths of the discourse were not really serious about it and were largely sliding towards “automate alignment science” because it’s a convenient cop-out for AI labs (and plausibly some other players) not having a good idea of how to make progress on this.
  https://x.com/RichardMCNgo/status/2033624407353287071

Jonas Hallgren comments on Jonas Hallgren’s Shortform

beating solomonoff induction at grokking a notion

how did we understand what an integral is?