I’m a PhD student at the University of Amsterdam. I have research experience in multivariate information theory and equivariant deep learning, and recently became very interested in AI alignment. https://langleon.github.io/
Leon Lang
We Should Prepare for a Larger Representation of Academia in AI Safety
Disentangling Shard Theory into Atomic Claims
Citability of Lesswrong and the Alignment Forum
A Short Dialogue on the Meaning of Reward Functions
Summaries: Alignment Fundamentals Curriculum
Andrew Huberman on How to Optimize Sleep
Andrew Ng wants to have a conversation about extinction risk from AI
Experiment Idea: RL Agents Evading Learned Shutdownability
Uncharitable Summary
Most likely some of the intuitions got lost in transmission when I read this text, but the mathematics itself seems pretty tautological to me (I nevertheless found it interesting, since tautologies can have interesting structure, and the proof itself was not trivial to me!).
Here is my uncharitable summary:
Assume you have a Markov chain M_0 → M_1 → M_2 → … → M_n → … of variables in the universe. Assume you know M_n and want to predict M_0. The Telephone theorem says two things:
You don’t need to keep all information about M_n to predict M_0 as well as possible. It’s enough to keep the conditional distribution P(M_0 | M_n).
Additionally, in the long run, these conditional distributions grow closer and closer together: P(M_0 | M_n) ~ P(M_0 | M_{n+1})
That’s it. The first statement is tautological, and the second states that you cannot keep losing information. At some point, your uncertainty in M_0 stabilizes.
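To make the second claim concrete, here is a minimal sketch (my own toy example, not from the post) of a chain in which every step is a binary symmetric channel; the conditional distribution P(M_0 | M_n) visibly stops changing as n grows:

```python
# Toy example (my own, not from the post): a Markov chain M_0 -> M_1 -> ... where
# each step flips the bit with probability eps. We track P(M_0 = 1 | M_n = 1).
import numpy as np

eps = 0.1                       # flip probability per channel step
p0 = np.array([0.3, 0.7])       # prior over M_0
T = np.array([[1 - eps, eps],
              [eps, 1 - eps]])  # one channel step

joint = np.diag(p0)             # joint[i, j] = P(M_0 = i, M_n = j), starting at n = 0
for n in range(1, 31):
    joint = joint @ T           # advance the "receiver end" one step along the chain
    if n % 5 == 0:
        cond = joint[1, 1] / joint[:, 1].sum()
        print(f"n={n:2d}  P(M_0=1 | M_n=1) = {cond:.4f}")

# The printed values approach the prior P(M_0=1) = 0.7: successive conditionals
# P(M_0 | M_n) and P(M_0 | M_{n+1}) become nearly identical, i.e. the information
# about M_0 carried at a distance stabilizes.
```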
Further Thoughts
I think what John actually wants to claim is that the conditional probabilities can be replaced by something else that carries information at a distance and stabilizes over time. Something like:
Average measurements
Pressure/temperature of a gas
Volume of a box
Length of a message
…
These things could then serve as “sufficient statistics” that contain everything one needs for making predictions. I have no idea how one would go about finding such conserved quantities in general systems. John also makes a related remark:
“(Side note: the previous work already suggested conditional probability distributions as the type-signature of abstractions, but that’s quite general, and therefore not very efficient to work with algorithmically. Estimates-of-deterministic-constraints are a much narrower subset of conditional probability distributions.)”
The math in the proof was made easier by assuming that information at a distance stabilizes exactly at some point. In reality, it may converge slowly without ever becoming constant; for this case, the text does not yet contain a proof or a precise formulation.
John claims: “I believe that most abstractions used by humans in practice are summaries of information relevant at a distance. The theorems in this post show that those summaries are estimates/distributions of deterministic (in the limit) constraints in the systems around us.” This confuses me. It seems to claim that we can also use summaries of very close-by objects to make predictions at arbitrary distance. However, the mathematics doesn’t show this: it only considers varying the “sender” of information, not varying the “receiver” (which in the case of the theorem is M_0!). If you want to make predictions about arbitrarily far-away things in the universe, then it’s unclear whether you can throw away any information about close-by things. (But maybe I misunderstood the text here?)
A somewhat more random comment:
I disagree with the claim that the intuitions behind information diagrams fall apart at higher degrees: if you’re “just fine” with negative information, then you can intersect arbitrarily many circles and get additivity rules for information terms. I actually wrote a paper about this, including how one can do this for other information quantities like Kolmogorov complexity and Kullback-Leibler divergence. What’s problematic about this is not the mathematics of intersecting circles, but that we largely don’t have good real-world interpretations and use cases for it.
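To illustrate what I mean by being “just fine” with negative information, here is a minimal sketch (my own standard example, not from the paper): for independent fair bits X, Y and Z = X XOR Y, the degree-3 intersection term computed by inclusion-exclusion comes out to exactly −1 bit, and nothing in the bookkeeping breaks.

```python
# Minimal sketch (my own example): the degree-3 "intersection" I(X;Y;Z) for
# independent fair bits X, Y and Z = X XOR Y, computed via inclusion-exclusion
# over entropies. The bookkeeping is consistent; the term is simply negative.
import itertools
from math import log2

# Joint distribution over (x, y, z): x, y uniform and independent, z = x ^ y.
p = {(x, y, x ^ y): 0.25 for x, y in itertools.product([0, 1], repeat=2)}

def H(idx):
    """Entropy (in bits) of the marginal over the variables at positions idx."""
    marg = {}
    for outcome, prob in p.items():
        key = tuple(outcome[i] for i in idx)
        marg[key] = marg.get(key, 0.0) + prob
    return -sum(q * log2(q) for q in marg.values() if q > 0)

# I(X;Y;Z) = H(X)+H(Y)+H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z)
co_info = (H([0]) + H([1]) + H([2])
           - H([0, 1]) - H([0, 2]) - H([1, 2])
           + H([0, 1, 2]))
print(co_info)  # -1.0: a perfectly consistent, but negative, "intersection"
```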
How, if at all, does your alignment approach deal with deceptive alignment?
Distribution Shifts and The Importance of AI Safety
Summary
I summarize this post in slightly reversed order. In AI alignment, one core question is how to think about utility maximization. What do utility-maximizing agents do? How does embeddedness play into this? What can we prove about such agents? Which types of systems become utility maximizers in the first place?
This article reformulates expected utility maximization in equivalent terms in the hopes that the new formulation makes answering such questions easier. Concretely, a utility function u is given, and the goal of a u-maximizer is to change the distribution M1 over world states X in such a way that E_M1[u(X)] is maximized. Now, assuming that the world is finite (an assumption John doesn’t mention but is discussed in the comments), one can find a>0, b such that a*u(X) + b = log P(X | M2) for some distribution/model of the world M2. Roughly, M2 assigns high probability to states X that have high utility u(X).
Then the equivalent goal of the u-maximizer becomes changing M1 such that E_M1[- log P(X | M2)] becomes minimal, which means minimizing H(X | M1) + D_KL(M1 | M2). The entropy H(X | M1) cannot be influenced in our world (due to thermodynamics) or can, by a mathematical trick, be assumed to be fixed, meaning that the problem reduces to just minimizing the KL-distance of distributions D_KL(M1 | M2). Another way of saying this is that we want to minimize the average number of bits required to describe the world state X when using the Shannon-Fano code of M2. A final tangential claim is that for powerful agents/optimization processes, the initial M1 with which the world starts shouldn’t matter so much for the achieved end result of this process.
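As a quick numerical sanity check of that decomposition (my own sketch; the utility values and the distribution M1 below are made up), one can define M2 ∝ exp(a·u) and verify that E_M1[−log P(X | M2)] = H(X | M1) + D_KL(M1 | M2):

```python
# Minimal sanity check (my own sketch; utilities and M1 are made up).
# With M2(x) proportional to exp(a * u(x)), the expected negative log-likelihood
# under M1 splits into entropy plus KL divergence:
#   E_M1[-log P(X | M2)] = H(X | M1) + D_KL(M1 | M2)
import numpy as np

u = np.array([0.0, 1.0, 3.0, 2.0])        # utility of four world states
a = 1.5                                   # any a > 0 works
M2 = np.exp(a * u)
M2 /= M2.sum()                            # M2 puts high probability on high-utility states

M1 = np.array([0.4, 0.3, 0.2, 0.1])       # current distribution over world states

cross_entropy = -(M1 * np.log(M2)).sum()  # E_M1[-log P(X | M2)]
entropy = -(M1 * np.log(M1)).sum()        # H(X) under M1
kl = (M1 * np.log(M1 / M2)).sum()         # D_KL(M1 | M2)

print(np.isclose(cross_entropy, entropy + kl))  # True
# So maximizing E_M1[u(X)] is, up to the entropy term, minimizing D_KL(M1 | M2).
```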
John then speculates on how this reformulation might be useful, e.g. for selection theorems.
Opinion
This is definitely thought-provoking.
What I find interesting about this formulation is that it seems a bit like “inverse generative modeling”: usually in generative modeling in machine learning, we start out with a “true distribution” M1’ of the world and try to “match” a model distribution M2’ to it. This can be done by maximizing the average log P(X | M2’) for X sampled from M1’, i.e., by performing maximum likelihood. So in some sense, a “utility” is maximized there as well. But in John’s post, the situation is reversed: the agent has a utility function corresponding to a distribution M2 that up-weights desired world states, and the agent tries to match the real-world distribution M1 to that.
If an agent is now both engaging in generative modeling (to build its world model) and in utility maximization, then it seems like the agent could collapse both objectives into one: start out with the “wrong” prediction by already assuming the desired world state M2, and then get closer to predicting correctly by changing the real world. Noob question: is this what the predictive processing people are talking about? I wonder because, when I’ve heard people say things like “all humans do is just predict the world”, I never understood why humans wouldn’t then just sit in a dark room with nothing going on, which is a highly predictable world state. The answer might be that they start out predicting a desirable world, and their prediction algorithm is somehow weak and only manages to predict correctly by “just” changing the world. I’m not sure if I buy this.
One thing I didn’t fully understand in the post itself is why the entropy under M1 can always be assumed to be constant by a mathematical trick, though another comment explored this in more detail (a comment I didn’t read in full).
Minor: Two minus signs are missing in places, and I think the order of the distributions in the KL term is wrong.
To classify as specification gaming, there needs to be bad feedback provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI.
I assume you would agree with the following rephrasing of your last sentence:
The training data feedback is good if it rewards outputs if and only if they might be chosen by a competent, well-motivated AI.
If so, I would appreciate it if you could clarify why achieving good training data feedback is even possible: the system that gives feedback necessarily looks at the world through observations that conceal large parts of the state of the universe. For every observation that is consistent with the actions of a competent, well-motivated AI, the underlying state of the world might actually be catastrophic from the point of view of our “intentions”. E.g., observations can be faked, or the universe can be arbitrarily altered outside of the range of view of the feedback system.
If you agree with this, then you probably assume that there are some limits to the physical capabilities of the AI, such that it is possible to have a feedback mechanism that cannot be effectively gamed. Possibly when the AI becomes more powerful, the feedback mechanism would in turn need to become more powerful to ensure that its observations “track reality” in the relevant way.
Does there exist a write-up of the meaning of specification gaming and/or outer alignment that takes into account that this notion is always “relative” to the AI’s physical capabilities?
MATS mentorships are often weekly, but only for a limited time, unlike PhD programs, which offer mentorship for several years. These years are probably often necessary to develop good research taste.
I upvoted since I think discussing what should or should not be discussed is important, but I tentatively disagree:
It seems unlikely that comments on lesswrong speed up capabilities research since the thoughts are probably just a subset of what the scaling teams know, and lesswrong is likely not their highest signal information source anyway.
Even from a safety perspective, it seems important to know which problems in capabilities research can be alleviated, since this will give a clearer picture of timelines.
I think we should have strong reasons before discouraging topics of discussion since lesswrong is not only a place for instrumental rationality but also epistemic rationality—maybe even more so.
That said, lesswrong is de facto one of the best places to discuss AI safety since the alignment forum is invite-only. Thus, it seems that there should be some discussion about which tradeoffs to make on LW between “figuring out what’s true” and “not spreading info hazards”.
This is very helpful, thanks! Actually, the post includes several sections, including in the appendix, that might be more interesting to many readers than the grant recommendations themselves. Maybe it would be good to change the title a bit so that people also expect other updates.
I really enjoyed reading this post! It’s quite well-written. Thanks for writing it.
My only critique is that I would have appreciated more details on how the linear regression parameters are trained and what exactly the projection is doing. John’s thread is a bit clarifying on this. One question: If you optimize the representation in the residual stream such that it corresponds to a particular chosen belief state, does the transformer then predict the next token as if in that belief state? I.e., does the transformer use the belief state for making predictions?
For what it’s worth, I think this comment seems clearly right to me, even if one thinks the post actually shows misalignment. I’m confused about the downvotes of this (5 net downvotes and 12 net disagree votes as of writing this).
Summary:
Claim 1: Goodhart’s Law is true
“Any measure which becomes the target ceases to be a good measure”
Examples:
A math test meant to find the best students will cease to work by the 10th iteration: people then “study to be good at the test”
Sugar was a good proxy for healthy food in the ancestral environment, but not today
Claim 2: If you want to condition yourself toward a certain behavior with some reward, that’s possible as long as the delay between behavior and reward is small enough
Claim 3: Over time, we develop “taste”: inexplicable judgments of whether some stimulus may lead to progress toward our goals or not.
A “stimulus” can be as complex as “this specific hypothesis for how to investigate a disease”
Claim 4: Our brains condition us, often without us noticing
By this, the article just means that dopamine spikes don’t occur only at the low-level reward itself, but already at points that predictably lead to reward.
Since the dopamine hit itself can “feel rewarding”, this is a certain type of conditioning towards the behavior that preceded it.
In other words, the brain gives a dopamine hit in the same way as the dog trainer produces the “click” before the “treat”.
We often don’t “notice” this since we don’t usually explicitly think about why something feels good.
Conclusion: Your brain conditions you all the time toward proxy goals (“dopamine hits”), and Goodhart’s law means that conditioning is sometimes wrong
E.g., if you get an “anti-dopamine hit” from seeing the number on your bathroom scale, this may condition you toward never looking at that number again, rather than toward the high-level goal of losing weight
There may also be social reasons to give different answers depending on whether you are a mentor or mentee. I.e., answering “the better mentees were those who were smarter” seems like an uncomfortable thing to say, even if it’s true.
(I do not want to say that this social explanation is the only reason that answers between mentors and mentees differed. But I do think that one should take it into account in one’s models.)