For a while, I’ve thought that the strategy of “split the problem into a complete set of necessary sub-goals” is incomplete. It produces problem factorizations, but it’s not sufficient to produce good problem factorizations—it usually won’t cut reality at clean joints. That was my main concern with Evan’s factorization, and it also applies to all of these, but I couldn’t quite put my finger on what the problem was.
I think I can explain it now: when I say I want a factorization of alignment to “cut reality at the joints”, I think what I mean is that each subproblem should involve only a few components of the system (ideally just one).
Inner/outer alignment discussion usually assumes that our setup follows roughly the structure of today’s ML: there’s a training process before deployment, there’s a training algorithm with training data/environment and training objective, there’s a trained architecture and initial parameters. These are the basic “components” which comprise our system. There’s a lot of variety in the details, but these components are usually each good abstractions—i.e. there’s nice well-defined APIs/Markov blankets between each component.
Ideally, alignment should be broken into subproblems which each depend only on one or a few components. For instance, outer alignment would ideally be a property of only the training objective (though that requires pulling out the pointers problem part somehow). Inner alignment would ideally be a property of only the training architecture, initial parameters, and training algorithm (though that requires pulling out the Dr Nefarious issue somehow). Etc. The main thing we don’t want here is a “factorization” where some subsystems are involved in all of the subproblems, or where there’s heavy overlap.
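As a toy illustration (not from the original discussion), here's a minimal sketch of how a standard ML training setup factors into those components, with comments noting which subproblem would ideally depend on which piece. All names and interfaces here are hypothetical, chosen only to make the component boundaries concrete:

```python
# Illustrative sketch only: a standard ML training setup factored into the
# components named above. All names here are hypothetical, not any real API.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TrainingObjective:
    """Outer alignment would ideally be a property of this component alone."""
    loss: Callable[[List[float], List[float]], float]  # (predictions, targets) -> loss


@dataclass
class ArchitectureAndInit:
    """Architecture + initial parameters: part of what inner alignment would ideally depend on."""
    forward: Callable[[List[float], List[float]], List[float]]  # (params, inputs) -> outputs
    init_params: List[float]


def train(objective: TrainingObjective,
          model: ArchitectureAndInit,
          data: List[Tuple[List[float], List[float]]],
          steps: int = 100,
          lr: float = 0.01,
          eps: float = 1e-4) -> List[float]:
    """Training algorithm: the third component (here, crude finite-difference
    gradient descent). The point is the interface boundaries, not the optimizer."""
    params = list(model.init_params)
    for _ in range(steps):
        for inputs, targets in data:
            base = objective.loss(model.forward(params, inputs), targets)
            grads = []
            for i in range(len(params)):
                bumped = params.copy()
                bumped[i] += eps
                grads.append((objective.loss(model.forward(bumped, inputs), targets) - base) / eps)
            params = [p - lr * g for p, g in zip(params, grads)]
    return params


# Example usage: a 1-parameter linear model trained against squared error.
objective = TrainingObjective(loss=lambda preds, ts: sum((p - t) ** 2 for p, t in zip(preds, ts)))
model = ArchitectureAndInit(forward=lambda ps, xs: [ps[0] * x for x in xs], init_params=[0.0])
data = [([1.0, 2.0], [2.0, 4.0])]
print(train(objective, model, data))  # converges toward params ≈ [2.0]
```

The point isn't the toy training loop itself; it's that `objective`, `model`, and `train` could each be analyzed (and swapped out) separately, which is what a clean factorization of the alignment problem would buy us.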
Why is this useful? You could imagine dividing the task of building an aligned AI between several teams. To avoid constant git conflicts, we want each team to design one or a few different subsystems, without too much overlap between them. Each of those subsystems is designed to solve a particular subproblem. Of course we probably won’t actually have different teams like that, but there are analogous benefits: when different subsystems solve different subproblems, we can tackle those subproblems relatively independently, solving one without creating problems for the others. That’s the point of a problem factorization.
I think there’s another reason why factorization can be useful here: it articulates sub-problems to try.
For example, in the process leading up to inventing logical induction, Scott came up with a bunch of smaller properties to aim for. He invented systems which achieved desirable properties individually, then growing combinations of desirable properties, and finally figured out how to get everything at once. However, logical induction doesn’t have parts corresponding to those different subproblems.
It can be very useful to individually achieve, say, objective robustness, even if your solution doesn’t fit with anyone else’s solutions to any of the other sub-problems. It shows us a way to do it, which can inspire other ways to do it.
In other words: tackling the whole alignment problem at once sounds too hard. It’s useful to split it up, even if our factorization doesn’t guarantee that we can stick pieces back together to get a whole solution.
Though, yeah, it’s obviously better if we can create a factorization of the sort you want.