I’m a PhD student at the University of Amsterdam. I have research experience in multivariate information theory and equivariant deep learning, and I recently got very interested in AI alignment. https://langleon.github.io/
Leon Lang
Uncharitable Summary
Most likely something in the intuitions got lost in transmission when I read this text, but the mathematics itself seems pretty tautological to me (nevertheless, I found it interesting, since tautologies can have interesting structure! The proof itself was not trivial to me!).
Here is my uncharitable summary:
Assume you have a Markov chain M_0 → M_1 → M_2 → … → M_n → … of variables in the universe. Assume you know M_n and want to predict M_0. The Telephone theorem says two things:
You don’t need to keep all information about M_n to predict M_0 as well as possible. It’s enough to keep the conditional distribution P(M_0 | M_n).
Additionally, in the long run, these conditional distributions grow closer and closer together: P(M_0 | M_n) ~ P(M_0 | M_{n+1})
That’s it. The first statement is tautological, and the second states that you cannot keep losing information. At some point, your uncertainty in M_0 stabilizes.
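To spell out the second statement in standard information-theoretic terms (my gloss; the post itself argues via conditional distributions instead): by the data processing inequality,

$$I(M_0; M_{n+1}) \le I(M_0; M_n) \quad \text{for all } n,$$

so the sequence $I(M_0; M_n)$ is non-increasing and bounded below by $0$, hence it converges. This is the precise sense in which "you cannot keep losing information".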
Further Thoughts
I think what John actually wants to claim is that the conditional probabilities can be replaced by something else that carries information at a distance and stabilizes over time. Something like:
Average measurements
Pressure/temperature of a gas
Volume of a box
Length of a message
…
These things could then serve as “sufficient statistics” that contain everything one needs for making predictions. I have no idea how one would go about finding such conserved quantities in general systems. John also makes a related remark:
“(Side note: the previous work already suggested conditional probability distributions as the type-signature of abstractions, but that’s quite general, and therefore not very efficient to work with algorithmically. Estimates-of-deterministic-constraints are a much narrower subset of conditional probability distributions.)”
The math in the proof was made easier by assuming that information at a distance stabilizes exactly at some point. In reality, it may converge slowly without ever becoming constant. For this case, the text does not yet contain a proof or precise formulation.
John claims: “I believe that most abstractions used by humans in practice are summaries of information relevant at a distance. The theorems in this post show that those summaries are estimates/distributions of deterministic (in the limit) constraints in the systems around us.” This confuses me. It seems to claim that we can also use summaries of very close-by objects to make predictions at arbitrary distance. However, the mathematics doesn’t show this: it only considers varying the “sender” of information, not varying the “receiver” (which in the case of the theorem is M_0!). If you want to make predictions about arbitrarily far-away, different things in the universe, then it’s unclear whether you can throw away any information about close-by things. (But maybe I misunderstood the text here?)
A somewhat more random comment:
I disagree with the claim that the intuitions behind information diagrams fall apart at higher degrees: if you’re “just fine” with negative information, then you can intersect arbitrarily many circles and get additivity rules for information terms. I actually wrote a paper about this, including how one can do this for other information quantities like Kolmogorov complexity and Kullback-Leibler divergence. What’s problematic about this is not the mathematics of intersecting circles, but that we largely don’t have good real-world interpretations and use cases for it.
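For concreteness, the three-circle intersection term is the interaction information (standard definition; the XOR example is the usual illustration and is my addition here):

$$I(X;Y;Z) = I(X;Y) - I(X;Y \mid Z),$$

which can be negative: for $X, Y$ independent fair coins and $Z = X \oplus Y$, we have $I(X;Y) = 0$ but $I(X;Y \mid Z) = 1$ bit, so $I(X;Y;Z) = -1$ bit.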
How, if at all, does your alignment approach deal with deceptive alignment?
Summary
I summarize this post in a slightly reverse order. In AI alignment, one core question is how to think about utility maximization. What are agents doing that maximize utility? How does embeddedness play into this? What can we prove about such agents? Which types of systems become maximizers of utility in the first place?
This article reformulates expected utility maximization in equivalent terms in the hopes that the new formulation makes answering such questions easier. Concretely, a utility function u is given, and the goal of a u-maximizer is to change the distribution M1 over world states X in such a way that E_M1[u(X)] is maximized. Now, assuming that the world is finite (an assumption John doesn’t mention but is discussed in the comments), one can find a>0, b such that a*u(X) + b = log P(X | M2) for some distribution/model of the world M2. Roughly, M2 assigns high probability to states X that have high utility u(X).
Then the equivalent goal of the u-maximizer becomes changing M1 such that E_M1[-log P(X | M2)] becomes minimal, which means minimizing H(X | M1) + D_KL(M1 || M2). The entropy H(X | M1) cannot be influenced in our world (due to thermodynamics) or can, by a mathematical trick, be assumed to be fixed, meaning that the problem reduces to just minimizing the KL divergence D_KL(M1 || M2). Another way of saying this is that we want to minimize the average number of bits required to describe the world state X when using the Shannon-Fano code of M2. A final tangential claim is that for powerful agents/optimization processes, the initial M1 with which the world starts shouldn’t matter so much for the achieved end result of this process.
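To make the identity explicit (a standard decomposition, written in the post’s notation):

$$\mathbb{E}_{M_1}[-\log P(X \mid M_2)] = \underbrace{\mathbb{E}_{M_1}[-\log P(X \mid M_1)]}_{H(X \mid M_1)} + \underbrace{\mathbb{E}_{M_1}\!\left[\log \frac{P(X \mid M_1)}{P(X \mid M_2)}\right]}_{D_{\mathrm{KL}}(M_1 \,\|\, M_2)},$$

so with the entropy term fixed, minimizing the expected description length is exactly minimizing the KL term.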
John then speculates on how this reformulation might be useful, e.g. for selection theorems.
Opinion
This is definitely thought-provoking.
What I find interesting about this formulation is that it seems a bit like “inverse generative modeling”: usually in generative modeling in machine learning, we start out with a “true distribution” M1’ of the world and try to “match” a model distribution M2’ to it. This can be done by maximizing the average of log P(X | M2’) for X sampled from M1’, i.e., by performing maximum likelihood. So in some sense, a “utility” is maximized there as well. But in John’s post, the situation is reversed: the agent has a utility function corresponding to a distribution M2 that upweights desired world states, and the agent tries to match the real-world distribution M1 to that.
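Written side by side (my rendering of the contrast, not a formula from the post):

$$\text{generative modeling: } \min_{M_2'} D_{\mathrm{KL}}(M_1' \,\|\, M_2'), \qquad \text{utility maximization: } \min_{M_1} D_{\mathrm{KL}}(M_1 \,\|\, M_2).$$

In the first problem, the world distribution is fixed and the model moves toward it; in the second, the model is fixed and the world is moved toward it.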
If an agent is now both engaging in generative modeling (to build its world model) and in utility maximization, then it seems like the agent could also collapse both objectives into one: start out with the “wrong” prediction by already assuming the desired world state M2, and then get closer to predicting correctly by changing the real world. Noob question: is this what the predictive processing people are talking about? I’m wondering this since, when I heard people say things like “all humans do is just predict the world”, I never understood why humans wouldn’t then just sit in a dark room with nothing going on, which is a highly predictable world-state. The answer might be that they start out predicting a desirable world, and their prediction algorithm is somehow weak and only manages to predict correctly by “just” changing the world. I’m not sure I buy this.
One thing I didn’t fully understand in the post itself is why the entropy under M1 can always be assumed to be constant by a mathematical trick, though another comment explored this in more detail (a comment I didn’t read in full).
Minor: Two minus signs are missing in places, and I think the order of the distributions in the KL term is wrong.
To classify as specification gaming, there needs to be bad feedback provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI.
I assume you would agree with the following rephrasing of your last sentence:
The training data feedback is good if it rewards outputs if and only if they might be chosen by a competent, well-motivated AI.
If so, I would appreciate it if you could clarify why achieving good training data feedback is even possible: the system that gives feedback necessarily looks at the world through observations that conceal large parts of the state of the universe. For every observation that is consistent with the actions of a competent, well-motivated AI, the underlying state of the world might actually be catastrophic from the point of view of our “intentions”. E.g., observations can be faked, or the universe can be arbitrarily altered outside of the range of view of the feedback system.
If you agree with this, then you probably assume that there are some limits to the physical capabilities of the AI, such that it is possible to have a feedback mechanism that cannot be effectively gamed. Possibly when the AI becomes more powerful, the feedback mechanism would in turn need to become more powerful to ensure that its observations “track reality” in the relevant way.
Does there exist a write-up of the meaning of specification gaming and/or outer alignment that takes into account that this notion is always “relative” to the AI’s physical capabilities?
MATS mentorships are often weekly, but only for a limited time, unlike PhD programs, which offer mentorship for several years. These years are probably often necessary to develop good research taste.
I upvoted since I think discussing what should or should not be discussed is important, but I tentatively disagree:
It seems unlikely that comments on lesswrong speed up capabilities research, since the thoughts are probably just a subset of what the scaling teams know, and lesswrong is likely not their highest-signal information source anyway.
Even from a safety perspective, it seems important to know which problems in capabilities research can be alleviated, since this will give a clearer picture of timelines.
I think we should have strong reasons before discouraging topics of discussion since lesswrong is not only a place for instrumental rationality but also epistemic rationality—maybe even more so.
That said, lesswrong is de facto one of the best places to discuss AI safety, since the alignment forum is invite-only. Thus, it seems that there should be some discussion around which tradeoffs to make on LW between “figuring out what’s true” and “not spreading info hazards”.
This is very helpful, thanks! Actually, the post includes several sections, including in the appendix, that might be more interesting to many readers than the grant recommendations themselves. Maybe it would be good to change the title a bit so that people also expect other updates.
I really enjoyed reading this post! It’s quite well-written. Thanks for writing it.
The only critique is that I would have appreciated more details on how the linear regression parameters are trained and what exactly the projection is doing. John’s thread is a bit clarifying on this. One question: if you optimize the representation in the residual stream such that it corresponds to a particular chosen belief state, does the transformer then predict the next token as if in that belief state? I.e., does the transformer use the belief state for making predictions?
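To make the question concrete, here is a rough sketch of the intervention I have in mind. Everything here is a hypothetical placeholder (the model handle, the layer module, and the learned regression (W, b) from residual-stream activations to belief states), not code from the post:

```python
import torch

def predict_under_forced_belief_state(model, layer_module, W, b, belief_state, tokens):
    """Hypothetical sketch: patch the residual stream at one layer so that its
    image under the learned linear map (activation -> W @ activation + b)
    equals the chosen belief state, then read off the next-token distribution."""
    # Least-squares preimage of the belief state under x -> W x + b.
    target_activation = torch.linalg.pinv(W) @ (belief_state - b)

    def hook(module, inputs, output):
        # Assumes the layer output is a tensor of shape (batch, seq, d_model).
        patched = output.clone()
        patched[:, -1, :] = target_activation  # overwrite the last position
        return patched

    handle = layer_module.register_forward_hook(hook)
    try:
        logits = model(tokens)  # assumed shape: (batch, seq, vocab)
    finally:
        handle.remove()
    return torch.softmax(logits[:, -1, :], dim=-1)
```

If the resulting next-token distribution matched the optimal prediction for the forced belief state, that would suggest the transformer actually uses the represented belief state for its predictions.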
For what it’s worth, I think this comment seems clearly right to me, even if one thinks the post actually shows misalignment. I’m confused about the downvotes of this (5 net downvotes and 12 net disagree votes as of writing this).
Summary:
Claim 1: Goodhart’s Law is true
“Any measure which becomes the target ceases to be a good measure”
Examples:
Any math test supposed to find the best students will cease to work at the 10th iteration — people then “study to be good at the test”
Sugar was a good proxy for healthy food in the ancestral environment, but not today
Claim 2: If you want to condition yourself toward a certain behavior with some reward, that’s possible as long as the delay between behavior and reward is small enough
Claim 3: Over time, we develop “taste”: inexplicable judgments of whether some stimulus may lead to progress toward our goals or not.
A “stimulus” can be as complex as “this specific hypothesis for how to investigate a disease”
Claim 4: Our brains condition us, often without us noticing
With this, the article just means that dopamine spikes don’t exactly occur at the low-level reward, but already at points that predictably will lead to reward.
Since the dopamine hit itself can “feel rewarding”, this is a certain type of conditioning towards the behavior that preceded it.
In other words, the brain gives a dopamine hit in the same way as the dog trainer produces the “click” before the “treat”.
We often don’t “notice” this since we don’t usually explicitly think about why something feels good.
Conclusion: Your brain conditions you all the time toward proxy goals (“dopamine hits”), and Goodhart’s law means that conditioning is sometimes wrong
E.g., if you get an “anti-dopamine hit” for seeing the number on your bathroom scale, then this may condition you toward never looking at that number ever again, instead of the high-level goal of losing weight
Summary
This article thinks about what “general-purpose search” is and why to expect it in advanced machine learning systems.
In general, we expect gradient descent to find “simple solutions” with lots of freely varying parameters (since they take up a larger part of solution space) and “general solutions” that are broadly helpful (since we will put the system in diverse environments). Thus, we do expect search processes to emerge.
However, babble and prune will likely not be the resulting process: it’s not compute- and memory-efficient enough. Instead, John imagines a search process that starts with a constraint/problem and iteratively produces broad strokes of solutions, which form new constraints/subproblems, until the problem is solved. If this is roughly correct, it also means that the search process is retargetable.
This leaves open how the broad strokes of solutions to constraints are found. John expects this requires heuristics that often either output a solution directly or output a different problem whose solution is easier to generate, instead of babbling and then pruning. Some heuristics:
Relaxed problem: only consider the time constraint, or only consider immediately reducing the Euclidean distance.
The specific relaxed problems are “heuristics”. The procedure to relax the problem is a “heuristic generator”.
One could consider this a “meta-heuristic”. However, the type signature of a “heuristic” is “problem in, solution out” or “problem in, other problem out”, while the type signature of the meta-heuristic is “problem in, heuristic out”, so these are different. (See the toy sketch after this summary.)
Finally, John gestures at the observation that heuristics seem to be environment-dependent but not goal-dependent, at least for similar types of goals (e.g. for the type “reach X city” or “do X thing this week”). This makes them more generalizable.
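Here is the toy sketch referenced above: my own minimal formalization of these type signatures, not anything from John’s post. It just shows how “problem in, solution or subproblems out” heuristics plus a heuristic generator compose into a retargetable search loop.

```python
# Toy sketch (my own formalization, not John's): general-purpose search as
# recursive problem decomposition via heuristics.
from dataclasses import dataclass
from typing import Callable, List, Union

@dataclass
class Problem:
    description: str

@dataclass
class Solution:
    description: str

# Heuristic: problem in, solution out -- or problem in, other problem(s) out.
Heuristic = Callable[[Problem], Union[Solution, List[Problem]]]
# Heuristic generator ("meta-heuristic"): problem in, heuristic out.
HeuristicGenerator = Callable[[Problem], Heuristic]

def search(problem: Problem, generate: HeuristicGenerator, depth: int = 10) -> List[Solution]:
    """Reduce a problem to subproblems until heuristics produce solutions."""
    if depth == 0:
        return []  # give up on this branch
    result = generate(problem)(problem)
    if isinstance(result, Solution):
        return [result]
    solutions: List[Solution] = []
    for sub in result:  # result is a list of subproblems
        solutions += search(sub, generate, depth - 1)
    return solutions

# Retargetability: only the top-level Problem mentions the goal, so pointing
# the same machinery at a new goal is just search(Problem("new goal"), generate).
```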
Other Thoughts
Don’t chess players sometimes do babble and prune? They might look at the board and literally “babble” possible moves, evaluate them, and search further in the best of those.
An alternative to that process would be to think “I want to capture the queen, how do I do this?” and then to explicitly think about moves that achieve that “constraint”. The original constraint/problem is just “win this game”, of which “capture the queen” is already a subconstraint/problem.
Still, I do remember Magnus Carlsen saying in an interview that he actually does relatively exhaustive search in some chess situations. So it seems to be at least one search process among many that he applies. But I also remember him saying that this is effortful.
John’s description leaves open the process by which solutions to constraints are found. Doesn’t that process usually involve babbling?
In the case of finding stores, we may say “there is no babbling, the computer program just shows me the open stores that satisfy the constraint.”
But doesn’t the computer internally need to babble? I.e., to go through a database of all the options to find the ones satisfying the constraint!
In general, I would say babbling is required unless a solution to the constraint can already be retrieved in a somewhat cached form.
I’m not sure if “relax the problem” is a clear instruction. I feel like you already need to have something like a “natural abstraction of problems” in your head in order to be able to do that. This doesn’t really contradict what John is saying, but it highlights that there is some hidden complexity in this.
Summary
The goal of Goal factoring: in tradeoff situations, get all of the good with none of the bad
Parable of the orange: two people want the last orange. However, one only wanted the peel, the other the flesh — both can get what they want
Case study: preoccupied professor
Grading has lots of costs.
Grading produces a bag of benefits
The professor thought about how to reach all benefits without paying any of the costs
In the end, he found a system in which students could grade themselves
Goal Factoring algorithm:
Choose an action (that you consider doing, that has some (frequent) costs, or that you aren’t sure why you’re doing)
Prepare to accept all worlds (accept the idea of not doing the action. If hesitation comes up, this signals an implicit goal)
Factor the action out into goals
Remember the difference between wanting to do X and wanting to appear to do X
include goals like social standing, sense of self, etc.;
Check with button test (button removes action and achieves goals) whether these goals contain all the benefits of the action
Brainstorm replacement actions (first for one goal at a time. Then, combine replacement actions into a coherent plan)
Reality Check: Imagine the plan. Does System 1 protest? What does Murphyjitsu reveal?
Before starting the process: remind yourself that you are not forced to replace your action. This makes the brainstorming process less scary!
Thanks for your answer!
In the worlds where there’s not much future risk of a LWer someday posting a dangerous capabilities insight, there’s also less future benefit to LW posts, since we’re probably not generating many useful ideas in general (especially about AGI and AGI alignment).
This seems correct, though it’s still valuable to flesh out that it seems possible to have LW posts that are helpful for alignment but not for capabilities: namely, posts that summarize insights from capabilities research that are known to ~all capabilities people while known to few alignment people.
The main reason I shifted more to your viewpoint now is that capabilities insights might influence people who do not yet know a lot about capabilities to work on capabilities in the future, instead of working on alignment. Therefore, I’m also not sure if Marius’ heuristic “Has company-X-who-cares-mostly-about-capabilities likely thought about this already?” for deciding whether something is infohazardous is safe.

“‘We should require a high bar before we’re willing to not-post potentially-world-destroying information to LW, because LW has a strong commitment to epistemic rationality’ seems like an obviously terrible argument to me. People should not post stuff to the public Internet that destroys the world just because the place they’re posting is a website that cares about Bayesianism and belief accuracy.”
Yes, that seems correct (though I’m a bit unhappy about you bluntly straw-manning my position). I think after reflection I would phrase my point as follows:
”There is a conflict between LessWrong’s commitment to epistemic rationality on the one hand, and the commitment to restrict info hazards on the other hand. LessWrong’s commitment to epistemic rationality exists for good reasons and should not be given up lightly. Therefore, whenever we restrict discussion and information about certain topics, we should have thought about this with great care.”
I don’t yet have a fleshed-out view on this, but I did move a bit in Tom’s direction.
Zeta Functions in Singular Learning Theory
In this shortform, I very briefly explain my understanding of how zeta functions play a role in the derivation of the free energy in singular learning theory. This is entirely based on slide 14 of the SLT low 4 talk of the recent summit on SLT and Alignment, so feel free to ignore this shortform and simply watch the video.
The story is this: we have a prior $\varphi(w)$ over a parameter space $W$, a model $p(x \mid w)$, and there is an unknown true distribution $q(x)$. For model selection, we are interested in the evidence of our model for a data set $D_n = (x_1, \dots, x_n)$, which is given by
$$Z_n = \int_W \varphi(w) e^{-n K_n(w)} \, dw,$$
where $K_n(w) = \frac{1}{n} \sum_{i=1}^n \log \frac{q(x_i)}{p(x_i \mid w)}$ is the empirical KL divergence. In fact, we are interested in selecting the model that maximizes the average of this quantity over all data sets. The average is then given by
$$\bar{Z}_n = \int_W \varphi(w) e^{-n K(w)} \, dw,$$
where $K(w) = D_{\mathrm{KL}}\big(q(x) \,\|\, p(x \mid w)\big)$ is the Kullback-Leibler divergence.
But now we have a problem: how do we compute this integral? Computing this integral is what the free energy formula is about.
The answer: by computing different integrals. So now, I’ll explain the connections we can draw to two other integrals.
Let
$$v(t) := \int_W \delta\big(t - K(w)\big) \, \varphi(w) \, dw,$$
which is called the state density function. Here, $\delta$ is the Dirac delta function. For different $t$, it measures the density of states (= parameter vectors $w$) that have $K(w) = t$. It is thus a measure for the “size” of different level sets. This state density function is connected to two different things.
Laplace Transform to the Evidence
First of all, it is connected to the evidence above. Namely, let $Z := \mathcal{L}(v)$ be the Laplace transform of $v$. It is a function given by
$$Z(n) = \int_0^\infty v(t) e^{-nt} \, dt = \int_W \int_0^\infty \delta\big(t - K(w)\big) \varphi(w) e^{-nt} \, dt \, dw = \int_W \varphi(w) e^{-n K(w)} \, dw.$$
In the first step, we changed the order of integration, and in the second step we used the defining property of the Dirac delta. Great, so this tells us that $Z(n) = \bar{Z}_n$! So this means we essentially just need to understand $v$.
Mellin Transform to the Zeta Function
But how do we compute $v$? By using another transform. Let $\zeta := \mathcal{M}(v)$ be the Mellin transform of $v$. It is a function $\zeta : \mathbb{C} \to \mathbb{C}$ (or maybe only defined on part of $\mathbb{C}$?) given by
$$\zeta(\lambda) = \int_0^\infty v(t) \, t^{\lambda} \, dt = \int_W \int_0^\infty \delta\big(t - K(w)\big) \varphi(w) \, t^{\lambda} \, dt \, dw = \int_W K(w)^{\lambda} \varphi(w) \, dw.$$
Again, we used a change in the order of integration and then the defining property of the Dirac delta. This $\zeta$ is called a zeta function.
What’s this useful for?
The Mellin transform has an inverse. Thus, if we can compute the zeta function, we can also compute the original evidence as
$$\bar{Z}_n = Z(n) = \mathcal{L}\big(\mathcal{M}^{-1}(\zeta)\big)(n).$$
Thus, we essentially changed our problem to the problem of studying the zeta function $\zeta(\lambda) = \int_W K(w)^{\lambda} \varphi(w) \, dw$. To compute the integral of the zeta function, it is then useful to perform blowups to resolve the singularities in the set of minima of $K$, which is where algebraic geometry enters the picture. For more on all of this, I refer, again, to the excellent SLT low 4 talk of the recent summit on singular learning theory.
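As a sanity check, here is a standard toy example that is not in the shortform: take $W = [0,1]$ with uniform prior $\varphi(w) = 1$ and $K(w) = w^2$. Then

$$\zeta(\lambda) = \int_0^1 (w^2)^{\lambda} \, dw = \frac{1}{2\lambda + 1},$$

which continues meromorphically to $\mathbb{C}$ with a single pole at $\lambda = -1/2$. The pole of $\zeta$ closest to zero is what controls the leading asymptotics of $\bar{Z}_n$, and blowups are the tool for locating such poles when $K$ is more complicated.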
It’s great to see Yoshua Bengio and other eminent AI scientists like Geoffrey Hinton actively engage in the discussion around AI alignment. He evidently put a lot of thought into this. There is a lot I agree with here.
Below, I’ll discuss two points of disagreement or where I’m surprised by his takes, to highlight potential topics of discussion, e.g. if someone wants to engage directly with Bengio.
Most of the post is focused on the outer alignment problem—how do we specify a goal aligned with our intent—and seems to ignore the inner alignment problem—how do we ensure that the specified goal is optimized for.
E.g., he gives the example of us telling the AI to fix climate change, after which the AI wipes out humanity, since that fixes climate change more effectively than respecting our implicit constraints, of which the AI has no knowledge. In fact, I think language models show that there may be quite some hope that AI models will understand our implicit intent. Under that view, the problem lies at least as much in ensuring that the AI cares.
He also extensively discusses the wireheading problem of entities (e.g., humans, corporations, or AI systems) that try to maximize their reward signal. I think we have reasons to believe that wireheading isn’t as much of a concern: inner misalignment will cause the agent to have some other goal than the precise maximization of the reward function, and once the agent is situationally aware, it has incentives to keep its goals from changing by gradient descent.
He does discuss the fact that our brains reward us for pleasure and avoiding pain, which is misaligned with the evolutionary goal of genetic fitness. In the alignment community, this is most often discussed as an inner alignment issue between the “reward function” of evolution and the “trained agent” being our genomes. However, his discussion highlights that he seems to view it as an outer alignment issue between evolution and our reward signals in the brain, which shape our adult brains through in-lifetime learning. This is also the viewpoint in Brain-Like-AGI Safety, as far as I remember, and also seems related to viewpoints discussed in shard theory.
“In fact, over two decades of work in AI safety suggests that it is difficult to obtain AI alignment [wikipedia], so not obtaining it is clearly possible.”
I agree with the conclusion, but I am surprised by the argument. It is true that we have seen over two decades of alignment research, but the alignment community has been fairly small all this time. I’m wondering what a much larger community could have done.
Yoshua Bengio was on David Krueger’s PhD thesis committee, according to David’s CV.
After filling out the form, I could click on “see previous responses”, which allowed me to see the responses of all other people who have filled out the form so far.
That is probably not intended?
I disagree with this. I think the most useful definition of alignment is intent alignment. Humans are effectively intent-aligned with respect to the goal of not killing all of humanity. They may still kill all of humanity, but that is not an alignment problem; it is a capabilities problem: humans aren’t capable of knowing which AI designs will be safe.
The same holds for intent-aligned AI systems that create unaligned successors.
Summary:
Sometimes, reinforcement learning goes wrong: how can this be prevented?
Example: math education
One student simply “learns to follow along”, and the other “learns to predict what comes next”
The other student may gain the ability to solve math problems on their own, while the first plausibly won’t.
Turbocharging, general notes:
Idea: You get better at the things you practice, and it pays off to think about what, mechanistically, you want to learn.
You won’t just learn “what you intend”:
If you intend to gain the skill of disarming people but hand the weapon back to your partner during training, then handing the weapon back is what you learn.
Example of math student revisited: are they…
Actively thinking about the symbols?
Calling up related material from memory?
Generating hypotheses (instead of falling prey to hindsight bias)?
Thinking about the underlying structure of the problem?
← These questions determine what’s actually practiced.
The Turbocharging Algorithm:
Select a skill to be acquired/improved
Select a practice method (to be evaluated or to be strengthened/developed)
Evaluate the resemblance between method and skill:
Does/Do the “practice trigger(s)” resemble the real-world trigger, or at least plausibly generalize?
Does/Do the “practice action(s)” resemble real-world actions, or at least plausibly generalize?
Possibly adjust the practice method in response to the previous answers
Further Notes
Declarative and Procedural Knowledge require different types of learning
Turbocharging is for procedural learning, which is more of what applied rationality is about
The article lists many counterexamples to the theory that turbocharging is “the one and only” way to gain procedural knowledge.
There may also be social reasons to give different answers depending on whether you are a mentor or mentee. I.e., answering “the better mentees were those who were smarter” seems like an uncomfortable thing to say, even if it’s true.
(I do not want to say that this social explanation is the only reason that answers between mentors and mentees differed. But I do think one should take it into account in one’s models.)