Principles for Alignment/Agency Projects
“John, what do you think of this idea for an alignment research project?”
I get questions like that fairly regularly. How do I go about answering? What principles guide my evaluation? Not all of my intuitions for what makes a project valuable can easily be made legible, but I think the principles in this post capture about 80% of the value.
Tackle the Hamming Problems, Don’t Avoid Them
Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things (or Clever Reasons To Ignore The Hard Things), rather than just Directly Tackling The Hard Things.
The most common pattern along these lines is to propose outsourcing the Hard Parts to some future AI, and “just” try to align that AI without understanding the Hard Parts of alignment ourselves. The next most common pattern is to argue that, since Hard Parts are Hard, we definitely don’t have enough time to solve them and should therefore pretend that we’re going to solve alignment while ignoring them. Third most common is to go into field building, in hopes of getting someone else to solve the Hard Parts. (Admittedly these are not the most charitable summaries.)
There is value in seeing how dumb ideas fail. Most of that value is figuring out what the Hard Parts of the problem are—the taut constraints which we run into over and over again, which we have no idea how to solve. (If it seems pretty solvable, it’s probably not a Hard Part.) Once you can recognize the Hard Parts well enough to try to avoid them, you’re already past the point where trying dumb ideas has much value.
On a sufficiently new problem, there is also value in checking dumb ideas just in case the problem happens to be easy. Alignment is already past that point; it’s not easy.
You can save yourself several years of time and effort by actively trying to identify the Hard Parts and focus on them, rather than avoid them. Otherwise, you’ll end up burning several years on ideas which don’t actually leave the field better off. That’s one of the big problems with trying to circumvent the Hard Parts: when the circumvention inevitably fails, we are still no closer to solving the Hard Parts. (It has been observed both that alignment researchers mostly seem to not be tackling the Hard Parts, and that alignment research mostly doesn’t seem to build on itself; I claim that the latter is a result of the former.)
Mostly, I think the hard parts are things like “understand agency in general better” and “understand what’s going on inside the magic black boxes”. If your response to such things is “sounds hard, man”, then you have successfully identified (some of) the Hard Parts.
Have An Intuitive Story Of What We’re Looking For
One project going right now is looking at how modularity in trained systems corresponds to broad peaks in parameter space. Intuitive story for that: we have two “modules”, each with lots of stuff going on inside, but only a relatively-low-dimensional interface between them. Because each module has lots of stuff going on inside, but only a low-dimensional interface, there should be many ways to change around the insides of a module while keeping the externally-visible behavior the same. Because such changes don’t change behavior, they don’t change system performance. So, we expect that modularity implies lots of degrees-of-freedom in parameter space, i.e. broad peaks.
This story is way too abstract to be able to look for immediately in a trained net. How do we operationalize “modules”, and find them? How do we operationalize “changes in a module”, especially since parameter space may not line up very neatly with functional modules? But that’s fine; the story can be pretty abstract.
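As a very rough illustration of the "many ways to change around the insides" step, here is a minimal sketch. It assumes a toy linear "module" with made-up sizes. The names `module`, `W_in`, `W_out` and `scales` are hypothetical and not taken from the project mentioned above. It rescales the hidden units inside the module and undoes the rescaling at the output, so the parameters move while the externally-visible behavior does not.

```python
import random

random.seed(0)

# Hypothetical sizes: lots going on inside (6 hidden units), 1-dim interface.
D_IN, D_HIDDEN, D_OUT = 4, 6, 1

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def module(W_in, W_out, x):
    """Toy linear 'module': input -> hidden state -> low-dimensional output."""
    return matvec(W_out, matvec(W_in, x))

W_in = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(D_HIDDEN)]
W_out = [[random.gauss(0, 1) for _ in range(D_HIDDEN)] for _ in range(D_OUT)]

# Change the insides: scale hidden unit j by s_j, and divide it back out
# in the output weights so the interface sees the same values.
scales = [random.uniform(0.5, 2.0) for _ in range(D_HIDDEN)]
W_in_new = [[s * w for w in row] for s, row in zip(scales, W_in)]
W_out_new = [[w / s for w, s in zip(row, scales)] for row in W_out]

xs = [[random.gauss(0, 1) for _ in range(D_IN)] for _ in range(50)]

# Largest change in the module's output over the sample inputs.
max_output_change = max(
    abs(a - b)
    for x in xs
    for a, b in zip(module(W_in, W_out, x), module(W_in_new, W_out_new, x))
)

# Euclidean distance moved in (part of) parameter space.
param_change = sum(
    (a - b) ** 2 for r, r_new in zip(W_in, W_in_new) for a, b in zip(r, r_new)
) ** 0.5

print("max output change:", max_output_change)  # floating-point noise only
print("parameter change: ", param_change)       # clearly nonzero
```

This only shows that behavior-preserving directions exist for one module. It does not operationalize "modules" in a trained net, and it does not by itself show that a modular network has more such directions than a non-modular one.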
The point of the intuitive story is to steer our search. Without it, we risk blind empiricism: just cataloguing patterns without building general models/theory/understanding for what’s going on. In that mode, we can easily lose track of the big picture goal and end up cataloguing lots of useless stuff. An intuitive story gives us big-picture direction, and something to aim for. Even if it turns out to be wrong!
Operationalize
It’s relatively easy to make vague/abstract intuitive arguments. Most of the value and challenge is in finding the right operationalizations of the vague concepts involved in those arguments, such that the argument is robustly correct and useful. Because it’s where most of the value and most of the challenge is, finding the right operationalization should typically be the central focus of a project.
My abstraction work is a good example here. I started with some examples of abstraction and an intuitive story about throwing away information while keeping info relevant “far away”. Then, the bulk of the work was to operationalize that idea in a way which matched all the intuitive examples, and made the intuitive stories provable.
Derive the Ontology, Don’t Assume It
In ML interpretability, some methods look at the computation graph of the net. Others look at orthogonal directions in activation space. Others look at low-rank decompositions of the weight matrices. These are all “different ontologies” for interpretation. Methods which look at one of these ontologies will typically miss structure in the others; e.g. if I run a graph clustering algorithm on the computation graph, I probably won’t pick up interpretable concepts embedded in directions in activation space.
What we’d really like is to avoid assuming an ontology, and rather discover/derive the ontology itself as part of our project. For instance, we could run an experiment where we change one human-interpretable “thing” in the environment, and then look at how that changes the trained net; that would let us discover how the concept is embedded rather than assume it from the start (credit to Chu for this suggestion). Another approach is to start out with some intuitive story for why a particular ontology is favored—e.g. if we have a graph with local connectivity, then maybe the Telephone Theorem kicks in. Such an argument should (a) allow us to rule out interactions which circumvent the favored ontology, and (b) be testable in its own right, e.g. for the Telephone Theorem we can (in principle) check the convergence of mutual information to a limit.
Open The Black Box
Don’t just run a black-box experiment on a network, or try to prove a purely behavioral theorem. We want to talk about internal structure.
Partly, opening the black box is about tackling the Hard Parts rather than avoiding them. Not opening the black box is a red flag; it’s usually a sign of avoiding the Hard Parts.
Partly, opening the black box is about getting a very rich data channel. When we just work with a black box, we get relatively sparse data about what’s going on. When we open the black box, we can in-principle directly observe every gear and directly check what’s going on.
Relative Importance of These Principles
Tackle The Hamming Problems is probably the advice which is most important to follow for marginal researchers right now, but mostly I expect people who aren’t already convinced of it will need to learn it the hard way. (I certainly had to learn it the hard way, though I did that before starting to work on alignment.) Open the Black Box follows pretty naturally once you’re leaning in to the Hard Parts.
Once you’re past that stumbling block, I think the most important principles are Derive the Ontology and Operationalize. These two are important for opposing types of people. Some people tend to stay too abstract and avoid committing to an ontology, but never operationalize and therefore miss out on the main value-add. Other people operationalize prematurely, adopting ad-hoc operationalizations, and Deriving the Ontology pretty strongly discourages that.
Have an Intuitive Story is especially helpful for people who tend to get lost in the weeds and go nowhere. Make sure you have an intuitive story, and use that story to guide everything else.
Coming back to this a few showers later.
A “cheat” is a solution to a problem that is invariant to a wide range of specifics about how the sub-problems (e.g. “hard parts”) could be solved individually. Compared to an “honest solution”, a cheat can solve a problem with less information about the problem itself.
A b-cheat (blind) is a solution that can’t react to its environment and thus doesn’t change or adapt throughout solving each of the individual sub-problems (e.g. plot armour). An a-cheat (adaptive/perceptive) can react to information it perceives about each sub-problem, and respond accordingly.
ML is an a-cheat because even if we don’t understand the particulars of the information-processing task, we can just bonk it with an ML algorithm and it spits out a solution for us.
In order to have a hope of finding an adequate cheat code, you need to have a good grasp of at least where the hard parts are even if you’re unsure of how they can be tackled individually. And constraining your expectation over what the possible sub-problems or sub-solutions should look like will expand the range of cheats you can apply, because now they need to be invariant to a smaller space of possible scenarios.
If effort spent on constraining expectation expands the search space, then it makes sense to at least confirm that there are no fully invariant solutions at the shallow layer before you iteratively deepen and search a larger range.
This relates to Wason’s 2-4-6 problem, where if the true rule is very simple like “increasing numbers”, subjects continuously try to test for models that are much more complex before they think to check the simplest models.
This is of course because they have the reasonable expectation that the human is more likely to make up such rules, but that’s kinda the point: we’re biased to think of solutions in the human range.
Limiting case analysis is when you set one or more variables of the object you’re analysing to their extreme values. This may give rise to limiting cases that are easier to analyse and could give you greater insights about the more general thing. It assumes away an entire dimension of variability, and may therefore be easier to reason about. For example, thinking about low-bandwidth oracles (e.g. ZFP oracle) with cleverly restrained outputs may lead to general insights that could help in a wider range of cases. They’re like toy problems.
“The art of doing mathematics consists in finding that special case which contains all the germs of generality.” ‒ David Hilbert
Multiplex case analysis is sorta the opposite: it’s when you make as few assumptions as possible about one or more variables/dimensions of the problem while reasoning about it. While it leaves open more possibilities, it could also make the object itself more featureless, with fewer patterns, and so easier to play with in your working memory.
One thing to realise is that it constrains the search space for cheats, because your cheat now has to be invariant to a greater space of scenarios. This might make the search easier (smaller search space), but it also requires a more powerfwl or a more perceptive/adaptive cheat. It may make it easier to explore nodes at the base of the search tree, where discoveries or eliminations could be of higher value.
This can be very usefwl for extricating yourself from a stuck perspective. When you have a specific problem, a problem with a given level of entropy, your brain tends to get stuck searching for solutions in a domain that matches the entropy of the problem. (speculative claim)
It relates to one of Tversky’s experiments (I have not vetted this), where subjects were told to iteratively bet on a binary outcome (A or B), where P(A)=0.7. They got 2 money for correct and 0 for incorrect. Subjects tended to try to bet on A with frequency that matched the frequency of the outcome. Whereas the highest EV strategy is to always bet on A.
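The gap between the two strategies is easy to work out. The sketch below assumes each bet is made independently of the outcome, with the payoffs stated above (2 for a correct bet, 0 otherwise). The function name `expected_payoff` is just an illustrative choice.

```python
def expected_payoff(bet_a_freq, p_a=0.7, reward=2.0):
    """Expected payoff per round when betting on A with frequency bet_a_freq.

    Assumes the bet is chosen independently of the outcome, so a bet is
    correct either as (bet A, outcome A) or as (bet B, outcome B).
    """
    p_correct = bet_a_freq * p_a + (1 - bet_a_freq) * (1 - p_a)
    return reward * p_correct

matching = expected_payoff(0.7)  # probability matching: 2 * (0.49 + 0.09), about 1.16
always_a = expected_payoff(1.0)  # always bet on A:      2 * 0.7,           about 1.4

print(matching, always_a)
```

Under these assumptions, matching the outcome frequency is right only 58% of the time, versus 70% for always betting on A.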
This also relates to the Inventor’s Paradox.
“The more ambitious plan may have more chances of success […] provided it is not based on a mere pretension but on some vision of the things beyond those immediately present.” ‒ Pólya
Consider the problem of adding up all the numbers from 1 to 99. You could attack this by going through 99 steps of addition like so: 1+2+3+...+97+98+99
Or you could take a step back and find a more general problem-solving technique (an a-cheat). Ask yourself, how do you solve all 1-iterative addition problems? You could rearrange it as:
(1+99)+(2+98)+...+(48+52)+(49+51)+50=100×49+50
To land on this, you likely went through the realisation that you could solve any such series with ⌊N/2⌋(S_0 + S_N) and add ⌈N/2⌉ if N is odd (where S_0 and S_N are the first and last terms).
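A quick way to check the pairing trick is to compare it against brute-force addition. This is a minimal sketch for the series 1, 2, ..., n only. The function name `pair_sum` is made up for illustration.

```python
def pair_sum(n):
    """Sum 1 + 2 + ... + n by pairing the first term with the last, and so on."""
    total = (n // 2) * (1 + n)   # floor(n/2) pairs, each summing to 1 + n
    if n % 2 == 1:
        total += (n + 1) // 2    # unpaired middle term, ceil(n/2)
    return total

print(pair_sum(99))  # 49 * 100 + 50 = 4950

# Compare against step-by-step addition for a range of n.
all_match = all(pair_sum(n) == sum(range(1, n + 1)) for n in range(1, 200))
print(all_match)
```

The general rule takes two operations where the direct approach takes n - 1 additions.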
The point being that sometimes it’s easier to solve “harder” problems. This could be seen as, among other things, an argument for worst-case alignment.
I generally agree with you on the principle Tackle the Hamming Problems, Don’t Avoid Them.
That being said, some of the Hamming problems I see that are being avoided most on this forum, and in the AI alignment community, are
1. Do something that will affect policy in a positive way
2. Pick some actual human values, and then hand-encode these values into open source software components that can go into AI reward functions
I agree with 1 (but then it is called alignment forum, not the more general AI Safety forum). But I don’t see that 2 would do much good.
All narratives I can think of where 2 plays a significant part sound like strawmen to me; perhaps you could help me?
Not sure what makes you think ‘strawmen’ at 2, but I can try to unpack this more for you.
Many warnings about unaligned AI start with the observation that it is a very bad idea to put some naively constructed reward function, like ‘maximize paper clip production’, into a sufficiently powerful AI. Nowadays on this forum, this is often called the ‘outer alignment’ problem. If you are truly worried about this problem and its impact on human survival, then it follows that you should be interested in doing the Hard Thing of helping people all over the world write less naively constructed reward functions to put into their future AIs.
John writes:
This pattern of outsourcing the Hard Part to the AI is definitely on display when it comes to 2 above. Academic AI/ML research also tends to ignore this Hard Part entirely, and implicitly outsources it to applied AI researchers, or even to the end users.
I know it’s a bit off-topic, but FWIW I don’t immediately share this intuition. If there are “many ways to change around the insides of a module while keeping the externally-visible behavior the same”, then if the whole network is just one “module” (i.e. it’s not modular at all), can’t I likewise say there are “many ways to change around the insides of [the one module which comprises the entire network] while keeping the externally-visible behavior the same”?
Yup. I’m also not entirely convinced by this argument, for the same reason.
Well, isn’t having multiple modules a precondition to something being modular? That seems like what’s happening in your example: it has only one module, so it doesn’t even make sense to apply John’s criterion.
I think Steven’s point is that if your explanation for modularity leading to broadness is that the parameters inside a module can take any configuration, conditioned on the output of the module staying the same, then you’re at least missing an additional step showing that a network consisting of two modules with n/2 parameters each has more freedom in those parameters than a network consisting of one module (just the entire network itself) with n parameters does. Otherwise you’re not actually pointing out how this favours modularity over non-modularity.
Which may seem rather non-obvious. Intuitively, you might think that the two modules scenario has more constraints on the parameters than the one module scenario, since there’s two places in the network where you’re demanding particular behaviour rather than one.
My own guiding intuition for why modularity seems to cause broadness goes the more circumspect path of “modularity seems connected to abstraction, abstraction seems connected to generality, generality seems connected to broadness in parameter space”.
I think the hand-wavy math we currently have also points more towards this connection. It seems to talk about how broadness is connected to dropping information about the input, as much as you can while still getting the right answer. Which sure looks suggestively like a statement about avoiding fine tuning. And modules are things that only give out small summaries of what goes on inside them, rather than propagating all the information they contain.
Don’t more constraints mean less freedom and therefore less broadness in parameter space?
(Sorry if that’s a stupid question, I don’t really understand the reasoning behind the whole connection yet.)
(And thanks, the last two paragraphs were helpful, though I didn’t look into the math!)
Yes, that was the point. At least at first blush, this line of argument looks like it’s showing the opposite of what it purports to, so maybe it isn’t that great of an explanation.
On a separate note, I think the math I referenced above can now be updated to say: broadness is dependent on the number of orthogonal features a network has, and how large the norm of these features is. Where both feature orthogonality and norm are defined by the L2 Hilbert space norm, which you may know from quantum mechanics.
This neatly encapsulates, extends, and quantifies the “information loss” notion in Vivek’s linked post above. It also sounds a lot like it’s formalising intuitions about broadness being connected to “generality”, “simplicity”, and lack of “fine tuning”.
It also makes me suspect that the orthogonal feature basis is the fundamentally correct way to think about computations in neural networks.
Post on this incoming once I figure out how to explain it to people who haven’t used Hilbert space before.
Thanks for the post! As always I broadly agree, but I have a bunch of nitpicks.
I agree that avoiding the Hard parts is rarely productive, but you also don’t address one relevant concern: what if the Hard part is not merely Hard, but actually Impossible? In this case your advice can also be cashed out by trying to prove it is impossible instead of avoiding it. And just like with most impossibility results in TCS, it’s possible that even if the precise formulation is impossible, it often just means that you need to reframe the problem a bit.
I expect you would also say that a crucial hard part many people are avoiding is “how to learn human values?”, right? (Not the true names, but a useful pointer)
I want to note that the failure mode of blind theory here is to accept any story, and thus make the requirement of a story completely impotent to guide research. There’s an art (and hopefully a science) to finding stories that bias towards productive mistakes.
I expect you to partially disagree, but there’s not always a “right” operationalization, and there’s a failure mode where one falls in love with their neat operationalization, making the parts of the phenomenon that it misses invisible.
I want to say that you should start with a behavioral theorem, and often the properties you want to describe might make more sense behaviorally, but I guess you’re going to answer that we have evidence that this doesn’t work in Alignment and so it is avoiding the Hard part. Am I correct?
One formal example of this is the relativization barrier in complexity theory, which tells you that you can’t prove P≠NP (and a bunch of other separations) using only techniques that treat algorithms as black boxes instead of looking at their structure.
Agreed that it’s a great pair of advice to keep in mind!
Yes, although I consider that one more debatable.
When there’s not a “right” operationalization, that usually means that the concepts involved were fundamentally confused in the first place.
Actually, I think starting from a behavioral theorem is fine. It’s just not where we’re looking to end up, and the fact that we want to open the black box should steer what starting points we look for, even when those starting points are behavioral.
Curious about the scope of the conceptual space where this belief was calibrated. It seems to me to tacitly say something like “everything that’s important is finitely characterizable”.
Maybe the “fundamentally confused” in your phrasing already includes the case of “stupidly tried to grab something that wasn’t humanly possible, even if in principle” as a confused way for a human, without making any claim of reality being conveniently compressible at all levels. (Note that this link explicitly disavows beauty at “all levels” too.)
I suppose you might also say “I didn’t make any claim of finiteness” but I do think something like “at least some humans are only a finite string away from grokking anything” is implicit if you expect there to be blogposts/textbooks that can operationalize everything relevant. It would be an even stronger claim than “finiteness”, it would be “human-typical length strings”
I believe Adam is pointing at something quite important, akin to a McNamara fallacy for formalization. To paraphrase:
In the case of something that has already been engineered (human brains with agency), we probably should grant that it is possible to operationalize everything relevant. But I want to push back on the general version and would want “why do you believe simple-formalization is possible here, in this domain?” to be allowed to be asked.
[PS. am not a native speaker]
A “cheat” is a solution to a problem that is invariant to a wide range of scenarios for how the hard parts could be solved individually.
ML itself is a cheat. Even if we don’t understand the particulars of the information-processing task, we can just bonk it with an ML algorithm and it spits out a solution for us.
But in order to have a hope of finding an adequate cheat code, you need to have a good grasp of at least where the hard parts are even if you’re unsure about how they could be tackled individually. And constraining your expectation over what the possible subsolutions should look like expands the range of cheats you could apply, because now they need to be invariant to a smaller space of possible scenarios.[1]
Insofar as you’re saying that we can’t hope to find remotely adequate cheats unless we start with a rough understanding of what we even need to cheat over, I agree. I don’t think you’re saying that we shouldn’t be looking for cheats in the first place, but it could be interpreted that way. Yes, it has the problem that it doesn’t build upon itself as well as directly challenging the hard parts, but, realistically, I think the solution has to look like some kind of cheat.
There’s this funny dynamic where if you expand the range of plausible solutions you can search through (e.g. by constraining your expectation for what they need to be invariant to), it might become harder to locate a particular area of the search space. If effort spent on constraining expectation expands the search space, then it makes sense to at least confirm that there are no fully invariant solutions at the top layer before you iterate and search a broader range.
Regarding making use of a superintelligent AGI-system in a safe way, here are some things that possibly could help with that:
1. Working on making it so that the first superintelligent AGI is robustly aligned from the beginning.
2. Working on security measures so that if we didn’t do #1 (or thought we did, but didn’t), the AGI will be unable to “hack” itself out in some digital way (e.g. exploiting some OS flaw and getting internet access)
3. Developing and preparing techniques/strategies so that if we didn’t do #1 (or thought we did, but didn’t), we can obtain help from various instances of the AGI-system that get us towards a more and more aligned AGI-system, while (1) minimizing the causal influence of the AGI-systems and the ways they might manipulate us and (2) making requests in such a way that we can verify that what we are getting is what we actually want, greatly leveraging how verifying a system is often much easier than making it.
#2 and #3 seem to me worth pursuing in addition to #1, but not instead of #1. Rather, #2 and #3 could work as additional layers of alignment-assurance.
I do think genuine failure modes are being alluded to by “Clever Ways To Avoid Doing Hard Things”, but I think there also may be failure modes having to do with encouraging “everyone” to only work on “The Hard Things” in a direct way (without people also looking for potential workarounds and additional layers of alignment-assurance).
Also, consider if someone comes up with alignment methodologies for an AGI that don’t seem robust or fully safe, but do seem like they might have a decent/good chance of working in practice. Such alignment methodologies may be bad ideas if they are used as “the solution”, but if we have a “system of systems”, where some of the sub-systems themselves are AGIs that we have attempted to align based on different alignment methodologies, then we can e.g. see if the outputs from these different sub-systems converge.
Sincerely someone who does not call himself an alignment researcher, but who does self-identify as a “hobbyist alignment theorist”, and is working on a series where much of the focus is on Clever Techniques/Strategies That Might Work Even If We Haven’t Succeeded At The Hard Things (and thus maybe could provide additional layers of alignment-assurance).
This was very insightful. It seems like a great thing to point to, for the many newish-to-alignment people ideating research agendas (like myself). Thanks for writing and posting!
I think these are very good principles. Thank you for writing this post, John.
Thoughts on actually going about Tackling the Hamming Problems in real life:
My model of humans (and of doing research) says that in order for a human to actually successfully work on the Hard Parts, they probably need to enjoy doing so on a System 1 level. (Otherwise, it’ll probably be an uphill battle against subconscious flinches away from Hard Parts, procrastinating with easier problems, etc.)
I’ve personally found these essays to be insightful and helpful for (i.a.) training my S1 to enjoy steering towards Hard Parts. I’m guessing they could be helpful to many other people too.
I agree with that statement and this statement
seems true as well. However, there was something in this section that didn’t seem quite right to me.
Say that you have identified the Hamming Problem at the lowest resolution to be getting the outcome “AI doesn’t cause extinction or worse”. However, if you zoom in a little bit you might find that there are different narratives that lead to the same goal. For example:
AGI isn’t developed due to event(s)
AGI is developed safely due to event(s)
At this level I would say that it is correct to go for the easier narrative. Going for harder problems seems to make sense only when you zoom into these narratives.
For each path you can imagine a set of events (e.g. research break-throughs) that are necessary and sufficient to solve the end-goal. Here I’m unsure but my intuition tells me that the marginal impact would often be greater working on the necessary parts that are the hardest as these are the ones that are least likely to be solved without intervention.
Of course working on something that isn’t necessary in any narrative would probably be easier in most cases but would never be a Hamming Problem.
This is exactly right, and those are the things which I would call Hamming Problems or the Hard Parts.
I suppose I would just like to see more people start at an earlier level and from that vantage point you might actually want to switch to a path with easier parts.