In a sense, it is extremely natural and obvious that any system handling sophisticated problems will be doing different things when handling different problems! But there is also a starting point from which this can be somewhat surprising: if you think of a neural net as a circuit (either just manifestly, or under some translation), then maybe you’d expect the same variables to be computed on each forward pass? It could be helpful here to consider how a Turing machine with a runtime bound can always be unrolled into a circuit that simulates [the contents of its tape and the position of its pointer] at all time steps.[6] Whether tape cell 13 has a 0 or a 1 written on it at time step 42 is in one sense the same variable on any input, but in another sense it can easily represent very different variables of the program on different inputs.
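A minimal sketch of the unrolling idea, under stated assumptions: the function `unroll`, the toy machine `FLIP`, and the convention that the head is clamped at the left end of the tape are all made up for illustration and do not come from the note. The point is only that "cell i at time step t" is a fixed slot in the tableau on every input, while what that slot means for the computation depends on the input.

```python
from typing import Dict, List, Tuple

# Transition table: (state, symbol) -> (new_state, written_symbol, head_move)
Delta = Dict[Tuple[str, str], Tuple[str, str, int]]

def unroll(delta: Delta, tape_input: str, steps: int,
           start: str = "q0", blank: str = "_") -> List[Tuple[str, int, str]]:
    """Return the tableau: one (state, head, tape) row per time step 0..steps."""
    # The head moves at most one cell per step, so this width is always enough.
    width = len(tape_input) + steps + 1
    tape = list(tape_input) + [blank] * (width - len(tape_input))
    state, head = start, 0
    rows = [(state, head, "".join(tape))]
    for _ in range(steps):
        key = (state, tape[head])
        if key in delta:
            state, tape[head], move = delta[key]
            head = max(0, head + move)  # assumption: clamp at the left end
        # If no rule applies, the machine has halted and the row simply repeats.
        rows.append((state, head, "".join(tape)))
    return rows

# Hypothetical toy machine: flip bits while moving right, halt on a blank.
FLIP: Delta = {
    ("q0", "0"): ("q0", "1", 1),
    ("q0", "1"): ("q0", "0", 1),
}

rows = unroll(FLIP, "0110", steps=3)
cell_2_at_step_3 = rows[3][2][2]  # the "same variable" for every input
```

Here `rows[t][2][i]` plays the role of "tape cell i at time step t": the index is identical across inputs, but on the input `"1"` the machine has already halted by step 3, so the same slot carries a very different role in the computation.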
Two remarks on how the current (March 2026) field trying to understand what AIs are doing relates to the issue of an AI doing different things on different inputs:
The currently prevailing view in interpretability allows for this to some extent: it is common to think of a big transformer language model as doing various different things depending on the context/input. But the prevailing view still takes there to ultimately be some pre-determined finite list of variables (I mean: corresponding to SAE features) that could be getting determined in a model, and I think this is probably a defect of that view, because a system solving an open-ended variety of complicated problems should be able to determine [what auxiliary problems to solve]/[what auxiliary questions to answer] on the fly.[7] (I should note: maybe it is not clear that a forward pass of a transformer is sophisticated enough for this to be true of it?)
how does proof length scale with statement length?
you could hope to understand a neural net in terms of what problems it is solving / what questions it is answering. like, on the way to doing the thing it is supposed to be doing, it might be solving some subproblems (c.) understanding a NN in terms of what it is doing
What the heck is it for a sentence to mean anything (for sentences which aim to talk about the actual real world, let’s say)? (c.) between verificationism and holism
notes!
I’ve just posted a repo with a bunch of my notes from 2023–2026, mostly on topics with relevance to AI alignment; see here for more meta information. An assortment of 101 items from the vault:
AI safety presentation slides (or really self-contained slideuments):
Variants of the alignment problem (at the AFFINE Superintelligence Alignment Seminar)
Verification-based alignment schemes (at the AFFINE Superintelligence Alignment Seminar)
Inspection of thinking as a source of ideas for understanding minds (at the AFFINE Superintelligence Alignment Seminar)
Model-wise thinking (at the AFFINE Superintelligence Alignment Seminar)
Intro to AGI safety (at the University of Tartu)
Impact cases for guardrails/monitoring/verification (at Mila)
Why fear a math AI? (at Mila)
on understanding:
good concepts are concepts that support inferences
predicting specific things
model-thinking in mathematics
how categoricity (or universal properties) relate to the implementation of mental structures
gaining the integral notion
cheap inferences
turning activity-situations into abstract games
a stylized genealogy of sonic concepts
condensation:
condensation coefficients from scale-freeness
condensation coefficients compatible with the loss being a positive combination of subset losses
linear condensation
limitations of condensation
condensation coding
valuing, ethics, metaethics:
mundane malignity
it is natural for a mind to remake the world
terminality causes unmooredness
seeing (meta)ethics clearer via mathematical logic
centering ethics on coming up with ways to act
there are many good and important value-flavored structures
the epistemics-ethics bridge
in one sense, human values are very complicated; in another sense, human values are very simple
argument against universal value extraction device
an illustration of the role of understanding-machinery in value development
on good reflection processes
trajectories of moral-reflective flight — an alternative to reflective equilibrium
some thoughts on consequentialist ethics in large worlds and an ethical problem relating to existence
examples of structure-making and why copy-making?
on pursuing ideals
ideal induction math:
polytime solomonoff
solomonoff function induction v1, solomonoff function induction v2
solomonoff axiom induction
circuit occam learning
for circuit induction, regret and complexity are polynomially related
do bayes-NNs have the kolmogorov prior? (answer: no)
a silly NNbayes generalization result
on ideal inductions and attempts to use them to handle the AGI problem in principle:
could one do science with solomonoff on text?
a better attempt at getting science out of solomonoff
solomonoff function induction gives at least some const probability to bad behavior on ood inputs
solomonoff induction is weird
which input-output pairs does solomonoff induction need to see to generalize correctly?
beating solomonoff induction at grokking a notion
solomonoff induction is not the paragon of generalization
philosophy of language, philosophy of science, epistemology:
between verificationism and holism
some disanalogies between solomonoff induction and science
some considerations with an eye toward specifying a picture of meaning
compositional inference warranting
the correct theory of semantics
formalizing philosophy
assigning probabilities in a conception of the world you know to be inadequate
interpretability:
understanding a NN in terms of what it is doing
merge machines
readability of independent predicates implies steering vector existence and a sort of linear decomposability
some superposition geometry math
a memorizing construction
hyperparams we’d want to know for a theory of interpretability that studying toy models could help determine
a remark on the value of philosophy for understanding AIs
metaphysics:
a toy setting for studying some technological phenomena
human math and alien math are pretty orthogonal
scaling laws satisfied by mathematical things
a world of purposes
standard physics implies infinite computation causally downstream of us along certain extremely low (and decaying) amplitude quantum branches
wherefore discreteness and emergence of structures
the anima mundi analogy
on human futures, facing AGI:
instead of building a very different artificial system which is smarter than us, let’s just become smarter (tentatively indefinitely)
what is weltgeistbehandlung?
protection via instrumentalization / instrumentalizing the good
if you’re trying to make a super-human non-human system, you should still have making humanity a top understander again in mind
we should grow immensely but with great care
a human future isn’t petty
uncategorized AI safety:
outside-view arguments for (and against) AI doom
on a lab preparing for RSI
RSI structure hyperparams
a statement of the alignment problem
an argument for faster takeoff
some good questions in conceptual alignment
cases of verification
shortlist of approaches to getting research papers given a hypercomputer
mind uploads with ML (and specifically prediction)
some ideas for safe self-improvement
safe self-improvement is much easier than the AGI problem we face
a list of alignment training tricks
intention-setting speech for AFFINE seminar
some hyperparam changes for the field of non-prosaic alignment
varia:
be careful about internal deference
estonian word order
suboptimalities in written language
a look at russell’s paradox
many diagonalizations under one roof
a cluster of probability problems in R^n and probability that a random subspace contains a positive vector
As one of the “finite list of variables” people[1]: This is because at the moment, I primarily want to find and understand the variables underlying the general mechanisms which AIs use to have many different kinds of productive thoughts in the first place. I am not particularly trying to find and understand variables defined only within the causal structure of these thoughts. I believe the former might indeed be described as a pre-determined finite list of variables. I agree the latter can’t be, at least not usefully.[2]
To use your analogy: I think of myself as trying to understand something like the basic makeup of a UTM, figuring out the tapes, heads, registers, tables and so on. I am not yet trying to say very much about the inner structures of the many different programs that could be run on that UTM.
I agree that some “finite list of variables” people seem to me to not distinguish between these different levels. I think that this is probably a mistake.
Loosely speaking.
With a finite context window and finite external memory there is technically a ceiling on how many different thoughts an AI is capable of having.
I am generally happy about people publishing their thoughts about AI alignment, even if they’re unpolished, so thanks for doing this! But as-is, this is so many links that I don’t know where to start, and my first instinct is to scroll past it and probably never read any of it. Are there a few notes that you recommend looking at, which you’re particularly proud of or think may be particularly useful?
I generally recommend looking first at the presentation slides and then under “uncategorized AI safety”. Specifically, I suggest Model-wise thinking.
uncomputably fast in the average case:
Suppose there were a computable bound on how quickly average proof length of provable statements scales. Then we could build a halting oracle: Given a program P, check each string in order of length for whether it is a proof that P halts. If it doesn’t halt, eventually the bound tells you that you can stop looking. If it does halt, this is provable by exhibiting a computation trace.
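The search described above can be sketched as follows. This is an illustration under loud assumptions, not a real halting oracle: `search_proof`, `is_proof`, and `bound` are hypothetical names, `bound` is taken to be a worst-case bound on proof length as a function of statement length, and the "proof system" is a toy in which a proof that a number is composite is one of its nontrivial divisors. For the toy system such a bound exists; the argument in the comment is that for statements of the form "P halts" no computable bound can exist, since this same loop would then decide halting.

```python
from itertools import product
from typing import Callable, Optional

def search_proof(statement: str,
                 is_proof: Callable[[str, str], bool],
                 bound: Callable[[int], int],
                 alphabet: str) -> Optional[str]:
    """Try every string in order of length, up to bound(len(statement)).

    Returns the first proof found, or None once the bound is exhausted.
    """
    max_len = bound(len(statement))
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if is_proof(candidate, statement):
                return candidate
    # The bound promises no longer proof exists, so we can stop looking.
    return None

def divisor_is_proof(candidate: str, statement: str) -> bool:
    """Toy checker (assumption): a proof that n is composite is a divisor d with 1 < d < n."""
    n, d = int(statement), int(candidate)
    return 1 < d < n and n % d == 0

DIGITS = "0123456789"
proof_91 = search_proof("91", divisor_is_proof, lambda n: n, DIGITS)
proof_97 = search_proof("97", divisor_is_proof, lambda n: n, DIGITS)
```

In the toy, a divisor never needs more digits than the number itself, so the identity function is a valid `bound`; `proof_97` comes back as `None` because 97 is prime and the search is allowed to give up.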
oh right, because a computable bound on the average proof length would imply a computable worst-case bound as well (since there are only finitely many statements of length n), good point! I guess two remaining directions here are:
can we say some more stuff about what the distribution of proof lengths is like?
is there an interesting scaling law for the statements a reasonable mathematical community (such as the human one) actually proves? (I think sth like this is what I’m most interested in here)
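The step from an average-case bound to a worst-case bound in the reply above can be spelled out roughly as follows. The notation is an assumption introduced here for illustration: $\Sigma$ is the alphabet, $S_n$ the set of provable statements of length $n$, $\ell(s)$ the length of the shortest proof of $s$, and $B(n)$ the supposed computable bound on the average.

```latex
% Assumed hypothesis: the average shortest-proof length is computably bounded.
\[
  \frac{1}{|S_n|} \sum_{s \in S_n} \ell(s) \;\le\; B(n).
\]
% Since every \ell(s) is nonnegative, each term is at most the whole sum:
\[
  \max_{s \in S_n} \ell(s)
  \;\le\; \sum_{s \in S_n} \ell(s)
  \;\le\; |S_n| \, B(n)
  \;\le\; |\Sigma|^{n} \, B(n).
\]
```

So $|\Sigma|^{n} B(n)$ would be a computable worst-case bound, which is what the halting-oracle argument needs.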
Have some scatterplots for mathlib4!
Maybe I’m reading this wrong, but how are you getting so many proofs with a length between 1 and 2 tokens? Is it `trivial` sort of stuff? Even then, though, I don’t understand how you get a fractional length.
Here’s an example. You can just click/scroll around in the files to get an impression of how they look. https://github.com/leanprover-community/mathlib4/blob/master/Mathlib/Logic/Function/Basic.lean#L457
Each coordinate is increased by a random number between 0 and 1 in order to make the points more visible.
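A minimal sketch of that kind of jitter, assuming the points are plain (statement length, proof length) pairs of integers; the function name `jitter` and the sample data are made up for illustration and are not taken from the actual plotting script.

```python
import random
from typing import List, Tuple

def jitter(points: List[Tuple[int, int]],
           rng: random.Random) -> List[Tuple[float, float]]:
    """Shift each coordinate by an independent uniform draw from [0, 1)."""
    return [(x + rng.random(), y + rng.random()) for x, y in points]

# Three proofs of length 1 would otherwise be drawn on top of each other.
raw = [(4, 1), (4, 1), (4, 1), (7, 2)]
spread = jitter(raw, random.Random(0))
```

With this convention a proof of length 1 is plotted somewhere in [1, 2), which would account for the apparently fractional lengths.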
Was pleasantly surprised to see so much writing about non-prosaic alignment.
Let’s gooooooo! I’ve been saying the same thing since forever. I mean 8 months ago (see “Lead 1”).
I also believe the same idea is important for defining optimization. You can try a definition like this:
Definition 1. “Optimization is the ability to achieve more (F) with less (G), where G is a simpler function/algorithm which allows one to make cheap but important inferences about a more complex function/algorithm F, and computing G many times significantly simplifies computing F.”
(This is an attempt to formalize the intuition that any algorithm for solving a complicated problem has to have “the main subroutine” which does most of the work and reveals something non-trivial about the algorithm.)
Lookup tables and chaotic algorithms which are successful by accident[1] (like Rube Goldberg machines) are ruled out by this definition. Atoms in a rock are not an optimizer because they achieve less with more (all the complex quantum mechanical dynamics create a mere unmoving rock which is easily modeled by only a couple of variables). Ditto a bottle cap.
If you’re interested, here are more thoughts about my definition/metric of optimization:
more thoughts
I think one of the big conceptual problems with my definition is that it doesn’t treat non-algorithmic knowledge as optimization/intelligence.
Consider an evaluation function updating on experience. Intuitively, this function becomes smarter with time. Yet my metric won’t detect the increase in knowledge as an increase in optimization/intelligence, because the algorithm didn’t become more complicated or more algorithmically efficient.
Another example is a human becoming wiser with experience (without their thinking becoming more complicated).
However, there are also heuristics successful at predicting reality due to luck (= lucking into a simple region of reality) or due to memorization (= using spurious correlations to compress a lookup table of reality). Some heuristics feel like “spurious patterns in the world”, other heuristics feel like “wisdom” or “deep knowledge”. What’s the difference?
Definition 2. One attempt (very fresh, not at all well thought-out) to define “deep” knowledge:
We have an algorithm A predicting X.
It’s good at predicting only simple/natural instances of X.
A is similar to A’ which becomes gradually better at predicting only simple/natural instances of X.[2]
The idea is that deep knowledge has to be about something natural and obtained in a natural way (= due to some bias which is generally useful only in natural situations).
I’m surprised you didn’t bring up something like the above here. You already defined abstractions as things helping to make inferences. Why not define algorithm-related abstractions the same way? (Maybe you’ve considered it but didn’t think it is useful.)
What if we keep milking the “abstractions are things which make inferences cheap” idea and say something like...
Definition 3. “The meaning of a linguistic unit is the simplest thing (M) which can be used to make a large number of cheap and correct inferences about everything the unit implies or is implied by.”[3]
Supposed consequences of this view:
You can have M (= meaning) in your head even if you didn’t make all those cheap inferences.
People with very different levels of language experience can have similar M.
M has no specific translation into any language.
M will tend to be compositional, because it’s one of the simplest ways to get many cheap inferences.
M both depends on the entirety of language and can be talked about separately.
P.S.
Ben Finegold reference? You’re deep in chess. I’m a chess player too. My meta is to play crappy openings which minimize “the variance of responses by an average opponent”. I also had a research project about the longest non-trivial middlegames (scroll down for interactive examples).
Is it even possible? A chaotic algorithm which outputs the correct answer most of the time “by accident”? Sounds like an oxymoron.
The definition will have to deal with algorithms pretending (sandbagging) to only work on simple/natural instances of X.
Warning: haven’t thought this through at all. It probably should be combined with Definition 2.