Kaarel comments on kh’s Shortform

Kaarel 17 Jun 2026 5:37 UTC
96 points
0
notes!

I’ve just posted a repo with a bunch of my notes from 2023–2026, mostly on topics with relevance to AI alignment; see here for more meta information. An assortment of 101 items from the vault:
- Lucius Bushnaq 17 Jun 2026 20:26 UTC
  11 points
  0
  Parent
  In a sense, it is extremely natural and obvious that any system handling sophisticated problems will be doing different things when handling different problems! But there is also a starting point from which this can be somewhat surprising: if you think of a neural net as a circuit (either just manifestly, or under some translation), then maybe you’d expect the same variables to be computed on each forward pass? It could be helpful here to consider how a Turing machine with a runtime bound can always be unrolled into a circuit that simulates [the contents of its tape and the position of its pointer] at all time steps.⁶ Whether tape cell 13 has a 0 or a 1 written on it at time step 42 is in one sense the same variable on any input, but in another sense it can easily represent very different variables of the program on different inputs.
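A minimal sketch of the unrolling idea, assuming a hypothetical single-tape machine given as a transition table (the names `unroll` and `FLIP` are illustrative and not from the notes). Every (time step, tape cell) entry of the recorded history plays the role of one fixed wire of the unrolled circuit:

```python
from typing import Dict, List, Tuple

# (state, symbol) -> (new_state, written_symbol, head_move); hypothetical encoding
Transition = Dict[Tuple[str, str], Tuple[str, str, int]]


def unroll(delta: Transition, tape_input: str, steps: int,
           start: str = "q0", blank: str = "_") -> List[Tuple[str, int, str]]:
    """Run the machine for exactly `steps` steps and record
    (state, head position, tape contents) at every time step."""
    # The head moves at most one cell per step, so this much tape suffices.
    tape = list(tape_input) + [blank] * (steps + 1)
    state, head = start, 0
    history = [(state, head, "".join(tape))]
    for _ in range(steps):
        key = (state, tape[head])
        if key in delta:  # a missing entry means the machine has halted
            state, tape[head], move = delta[key]
            head = max(0, head + move)
        history.append((state, head, "".join(tape)))
    return history


# Toy machine: flip every bit while moving right, halt on the first blank.
FLIP: Transition = {
    ("q0", "0"): ("q0", "1", +1),
    ("q0", "1"): ("q0", "0", +1),
}

for state, head, tape in unroll(FLIP, "01", steps=3):
    print(state, head, tape)
```

The wire "cell 1 at time step 2" exists for every input, but in a richer machine what it encodes (a loop counter, a partial result, scratch space) depends on which branch of the program the input triggered.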
  Two remarks on how the current (March 2026) field trying to understand what AIs are doing relates to the issue of an AI doing different things on different inputs:
  The currently prevailing view in interpretability allows for this to some extent: it is common to think of a big transformer language model as doing various different things depending on the context/input. But the prevailing view still takes there to ultimately be some pre-determined finite list of variables (I mean: corresponding to SAE features) that could be getting determined in a model, and I think this is probably a defect of that view, because a system solving an open-ended variety of complicated problems should be able to determine [what auxiliary problems to solve]/[what auxiliary questions to answer] on the fly.⁷ (I should note: maybe it is not clear that a forward pass of a transformer is sophisticated enough for this to be true of it?)
  As one of the “finite list of variables” people^[1]: This is because at the moment, I primarily want to find and understand the variables underlying the general mechanisms which AIs use to have many different kinds of productive thoughts in the first place. I am not particularly trying to find and understand variables defined only within the causal structure of these thoughts. I believe the former might indeed be described as a pre-determined finite list of variables. I agree the latter can’t be, at least not usefully.^[2]
  To use your analogy: I think of myself as trying to understand something like the basic makeup of a UTM, figuring out the tapes, heads, registers, tables and so on. I am not yet trying to say very much about the inner structures of the many different programs that could be run on that UTM.
  I agree that some “finite list of variables” people seem to me to not distinguish between these different levels. I think that this is probably a mistake.
  1. ^
    Loosely speaking.
  2. ^
    With a finite context window and finite external memory there is technically a ceiling on how many different thoughts an AI is capable of having.
- Caleb Biddulph 17 Jun 2026 6:48 UTC
  8 points
  2
  Parent
  I am generally happy about people publishing their thoughts about AI alignment, even if they’re unpolished, so thanks for doing this! But as-is, this is so many links that I don’t know where to start, and my first instinct is to scroll past it and probably never read any of it. Are there a few notes that you recommend looking at, which you’re particularly proud of or think may be particularly useful?
  - Kaarel 17 Jun 2026 6:59 UTC
    6 points
    0
    Parent
    I generally recommend looking first at the presentation slides and then under “uncategorized AI safety”. Specifically, I suggest Model-wise thinking.
- Gurkenglas 17 Jun 2026 17:28 UTC
  6 points
  4
  Parent
  scaling laws satisfied by mathematical things
  how does proof length scale with statement length?
  uncomputably fast in the average case:
  Suppose there were a computable bound on how quickly average proof length of provable statements scales. Then we could build a halting oracle: Given a program P, check each string in order of length for whether it is a proof that P halts. If it doesn’t halt, eventually the bound tells you that you can stop looking. If it does halt, this is provable by exhibiting a computation trace.
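A minimal sketch of the search in this argument, under loud assumptions: `toy_checker` and `toy_bound` are hypothetical stand-ins for a real proof checker and for the supposed computable bound, and statements and proofs are plain strings over a two-letter alphabet. The point is only that a computable bound turns proof search into a terminating decision procedure:

```python
from itertools import product
from typing import Callable


def decide_provable(statement: str,
                    proof_length_bound: Callable[[int], int],
                    is_valid_proof: Callable[[str, str], bool],
                    alphabet: str = "01") -> bool:
    """Check every candidate proof, shortest first, up to the bound for
    this statement length. Returning False is only justified if the bound
    really caps the proof length of every provable statement."""
    limit = proof_length_bound(len(statement))
    for length in range(limit + 1):
        for symbols in product(alphabet, repeat=length):
            if is_valid_proof("".join(symbols), statement):
                return True
    return False


# Hypothetical toy proof system: a statement is provable iff it starts
# with "1", and its only proof is the statement followed by its reverse.
def toy_checker(proof: str, statement: str) -> bool:
    return statement.startswith("1") and proof == statement + statement[::-1]


def toy_bound(n: int) -> int:
    return 2 * n  # a correct worst-case bound for the toy system only


print(decide_provable("10", toy_bound, toy_checker))  # provable
print(decide_provable("01", toy_bound, toy_checker))  # not provable
```

With "P halts" as the statement and a genuine proof checker, the same loop would decide halting, which is why no computable bound can exist for a sufficiently strong proof system.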
  - Kaarel 17 Jun 2026 18:18 UTC
    3 points
    0
    Parent
    oh right, because a computable bound on the average proof length would imply a computable worst case bound as well (since there are only finitely many statements of any given length), good point! I guess two remaining directions here are:
    
    can we say some more stuff about what the distribution of proof lengths is like?
    is there an interesting scaling law for the statements a reasonable mathematical community (such as the human one) actually proves? (I think sth like this is what I’m most interested in here)
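One way to spell out the step from an average-case bound to a worst-case bound, as a sketch under assumed notation (none of these symbols appear in the thread): statements are strings over an alphabet of size $k$, $P_n$ is the set of provable statements of length $n$, $\ell(s)$ is the length of the shortest proof of $s$, and $A(n)$ is the supposed computable bound on the average.

```latex
\frac{1}{|P_n|} \sum_{s \in P_n} \ell(s) \le A(n)
\quad\Longrightarrow\quad
\max_{s \in P_n} \ell(s)
\;\le\; \sum_{s \in P_n} \ell(s)
\;\le\; |P_n| \, A(n)
\;\le\; k^{n} A(n).
```

Since $k^{n} A(n)$ is computable whenever $A$ is, a computable average-case bound would yield a computable worst-case bound.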
    - Gurkenglas 17 Jun 2026 19:34 UTC
      7 points
      0
      Parent
      Have some scatterplots for mathlib4!
      - Mateusz Bagiński 23 Jun 2026 12:00 UTC
        2 points
        0
        Parent
        Maybe I’m reading this wrong, but how are you getting so many proofs with a length between 1 and 2 tokens? Is it `trivial` sort of stuff? Even then, though, I don’t understand how you get a fractional length.
          - Gurkenglas 23 Jun 2026 12:29 UTC
            2 points
            0
            Parent
            Here’s an example. You can just click/scroll around in the files to get an impression of how they look. https://github.com/leanprover-community/mathlib4/blob/master/Mathlib/Logic/Function/Basic.lean#L457
            Each coordinate is increased by a random number between 0 and 1 in order to make the points more visible.
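A minimal sketch of the jitter described here, assuming the data is a list of integer (statement length, proof length) pairs; the function name `jitter` and the sample points are hypothetical, and this is not the script that produced the plots:

```python
import random
from typing import List, Optional, Tuple


def jitter(points: List[Tuple[int, int]],
           rng: Optional[random.Random] = None) -> List[Tuple[float, float]]:
    """Add an independent uniform offset from [0, 1) to each coordinate so
    that points sharing the same integer coordinates no longer overlap."""
    rng = rng or random.Random()
    return [(x + rng.random(), y + rng.random()) for x, y in points]


# Three proofs with identical integer coordinates become three visible dots.
sample = [(5, 1), (5, 1), (5, 1)]
for x, y in jitter(sample, random.Random(0)):
    print(f"{x:.2f} {y:.2f}")
```

Under this scheme a plotted length between 1 and 2 corresponds to a true length of 1, which accounts for the fractional values.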
- Q Home 18 Jun 2026 6:05 UTC
  3 points
  0
  Parent
  Was pleasantly surprised to see so much writing about non-prosaic alignment.
  I generally like the idea that good concepts are concepts that make a bunch of useful inferences easy. (c.) good concepts are concepts that support inferences
  Let’s gooooooo! I’ve been saying the same thing since forever. I mean 8 months ago (see “Lead 1”).
  I also believe the same idea is important for defining optimization. You can try a definition like this:
  Definition 1. “Optimization is the ability to achieve more (F) with less (G), where G is a simpler function/algorithm which allows one to make cheap but important inferences about a more complex function/algorithm F, and computing G many times significantly simplifies computing F.”
  (This is an attempt to formalize the intuition that any algorithm for solving a complicated problem has to have “the main subroutine” which does most of the work and reveals something non-trivial about the algorithm.)
  Lookup tables and chaotic algorithms which are successful by accident^[1] (like Rube Goldberg machines) are ruled out by this definition. Atoms in a rock are not an optimizer because they achieve less with more (all the complex quantum-mechanical dynamics create a mere unmoving rock which is easily modeled by only a couple of variables). Ditto a bottle cap.
  If you’re interested, here’s more thoughts about my definition/metric of optimization:
  I think one of the big conceptual problems with my definition is that it doesn’t treat non-algorithmic knowledge as optimization/intelligence.
  Consider an evaluation function updating on experience. Intuitively, this function becomes smarter with time. Yet my metric won’t detect the increase in knowledge as an increase in optimization/intelligence, because the algorithm didn’t become more complicated or more algorithmically efficient.
  Another example is a human becoming wiser with experience (without their thinking becoming more complicated).
  However, there are also heuristics successful at predicting reality due to luck (= lucking into a simple region of reality) or due to memorization (= using spurious correlations to compress a lookup table of reality). Some heuristics feel like “spurious patterns in the world”, other heuristics feel like “wisdom” or “deep knowledge”. What’s the difference?
  Definition 2. One attempt (very fresh, not at all well thought-out) to define “deep” knowledge:
  We have an algorithm A predicting X.
  It’s good at predicting only simple/natural instances of X.
  A is similar to A’ which becomes gradually better at predicting only simple/natural instances of X.^[2]
  The idea is that deep knowledge has to be about something natural and obtained in a natural way (= due to some bias which is generally useful only in natural situations).
  you could hope to understand a neural net in terms of what problems it is solving / what questions it is answering. like, on the way to doing the thing it is supposed to be doing, it might be solving some subproblems (c.) understanding a NN in terms of what it is doing
  I’m surprised you didn’t bring up something like the above here. You already defined abstractions as things helping to make inferences. Why not define algorithm-related abstractions the same way? (Maybe you’ve considered it but didn’t think it is useful.)
  What the heck is it for a sentence to mean anything (for sentences which aim to talk about the actual real world, let’s say)? (c.) between verificationism and holism
  What if we keep milking the “abstractions are things which make inferences cheap” idea and say something like...
  Definition 3. “The meaning of a linguistic unit is the simplest thing (M) which can be used to make a large amount of cheap and correct inferences about everything the unit implies or is implied by.”^[3]
  Supposed consequences of this view:
  - You can have M (= meaning) in your head even if you didn’t make all those cheap inferences.
  - People with very different levels of language experience can have similar M.
  - M has no specific translation into any language.
  - M will tend to be compositional, because it’s one of the simplest ways to get many cheap inferences.
  - M both depends on the entirety of language and can be talked about separately.
  P.S.
  Ben Finegold reference? You’re deep in chess. I’m a chess player too. My meta is to play crappy openings which minimize “the variance of responses by an average opponent”. I also had a research project about the longest non-trivial middlegames (scroll down for interactive examples).
  1. ^
    Is it even possible? A chaotic algorithm which outputs the correct answer most of the time “by accident”? Sounds like an oxymoron.
  2. ^
    The definition will have to deal with algorithms pretending (sandbagging) to only work on simple/natural instances of X.
  3. ^
    Warning: haven’t thought this through at all. It probably should be combined with Definition 2.
