I like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that’s done.
> But how? In prosaic AI, only on-distribution behavior of the loss function can influence the end result.
I can see a few possible responses here.
1. Double down on the “correct generalization” story: hope to somehow avoid the multiple plausible generalizations, perhaps by providing enough training data, or appropriate inductive biases in the system (probably both).
2. Achieve objective robustness through other means. In particular, inner alignment is supposed to imply objective robustness. In this approach, inner-alignment technology provides the extra information to generalize the base objective appropriately.
I’m not too keen on (2) since I don’t expect mesa objectives to exist in the relevant sense. For (1), I’d note that we need to get it right on the situations that actually happen, rather than all situations. We can also have systems that only need to work for the next N timesteps, after which they are retrained again given our new understanding of the world; this effectively limits how much distribution shift can happen. Then we could do some combination of the following:
- Build neural net theory. We currently have a very poor understanding of why neural nets work; if we had a better understanding, it seems plausible we could have high confidence about when a neural net will generalize correctly. (I’m imagining neural net theory going from how I imagine physics looked before Newton to how it looked after Newton.)
- Use techniques like adversarial training to “robustify” the model against moderate distribution shifts (which might be sufficient to work for the next N timesteps, after which you “robustify” again); a minimal sketch follows this list.
- Make these techniques work better through interpretability / transparency.
- Use checks and balances. For example, if multiple generalizations are possible, train an ensemble of models and only do something if they all agree on it. Or train an actor agent combined with an overseer agent that has veto power over all actions. Or an ensemble of actors, each of which oversees the other actors and has veto power over them. (Also sketched below.)
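To gesture at what the adversarial-training bullet could look like in practice, here is a minimal PGD-style sketch; the model, data, and hyperparameters are all illustrative stand-ins, not a concrete recommendation:

```python
# Minimal PGD-style adversarial training loop (illustrative only).
# The model, data, and hyperparameters below are hypothetical stand-ins.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def pgd_perturbation(x, y, eps=0.1, step=0.02, iters=5):
    """Search for a loss-maximizing perturbation of x inside an eps-ball."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss_fn(model(x + delta), y).backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()  # gradient ascent on the loss
            delta.clamp_(-eps, eps)            # stay inside the perturbation set
        delta.grad.zero_()
    return delta.detach()

for _ in range(100):
    # Hypothetical training batch; in practice this is your real data.
    x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
    delta = pgd_perturbation(x, y)
    opt.zero_grad()
    loss_fn(model(x + delta), y).backward()  # train on worst-case inputs
    opt.step()
```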
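And a toy version of the unanimity check from the last bullet; the policies, observation, and fallback action are again hypothetical:

```python
# Toy "checks and balances" gate: act only if every ensemble member agrees.
# Everything named here (policies, observation, fallback) is hypothetical.

def ensemble_act(models, observation, fallback="no-op"):
    """Return the proposed action only if all models endorse it; else veto."""
    proposals = [m(observation) for m in models]
    if all(p == proposals[0] for p in proposals):
        return proposals[0]  # unanimous: safe to act
    return fallback          # disagreement: veto and defer

# Three independently trained policies (stubbed out as lambdas):
policies = [lambda obs: "left", lambda obs: "left", lambda obs: "right"]
print(ensemble_act(policies, observation={"pos": 3}))  # -> "no-op" (vetoed)
```

The actor/overseer variants have the same shape: the overseer simply returns a veto bit instead of a matching action.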
These aren’t “clean”, in the sense that you don’t get a nice formal guarantee at the end that your AI system is going to (try to) do what you want in all situations, but I think getting an actual literal guarantee is pretty doomed anyway (among other things, it seems hard to get a definition for “all situations” that avoids the no-free-lunch theorem, though I suppose you could get a probabilistic definition based on the simplicity prior).
> I like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that’s done.
But it seems to me that there’s something missing in terms of acceptability.
The definition of “objective robustness” I used says “aligns with the base objective” (including off-distribution). But I think this isn’t an appropriate representation of your approach. Rather, “objective robustness” has to be defined something like “generalizes acceptably”. Then, ideas like adversarial training and checks and balances make sense as a part of the story.
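Loosely formalized, just to pin down the contrast (the notation here is improvised, not anyone’s official definition):

```latex
% Strict reading: even off-distribution, the policy pi still optimizes
% the base objective O.
\mathrm{robust}_{\mathrm{strict}}(\pi) \iff \forall x \in \mathcal{X}:\;
  \pi(x) \in \operatorname*{arg\,max}_{a} O(x, a)

% Acceptability reading: on inputs actually encountered, behavior merely
% stays inside an acceptable set A(x); it need not be optimal for O.
\mathrm{robust}_{\mathrm{acc}}(\pi) \iff \forall x \in \mathcal{X}_{\mathrm{encountered}}:\;
  \pi(x) \in \mathcal{A}(x)
```

Adversarial training and checks and balances are then tools for enlarging the set of inputs on which the second condition holds.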
WRT your suggestions, I think there’s a spectrum from “clean” to “not clean”, and the ideas you propose could fall at multiple points on that spectrum (depending on how they are implemented, how much theory backs them up, etc). So, yeah, I favor “cleaner” ideas than you do, but that doesn’t rule out this path for me.
> The definition of “objective robustness” I used says “aligns with the base objective” (including off-distribution). But I think this isn’t an appropriate representation of your approach. Rather, “objective robustness” has to be defined something like “generalizes acceptably”. Then, ideas like adversarial training and checks and balances make sense as a part of the story.

Yeah, strong +1.

Great! I feel like we’re making progress on these basic definitions.
> I’m not too keen on (2) since I don’t expect mesa objectives to exist in the relevant sense.
Same, but how optimistic are you that we could figure out how to shape the motivations or internal “goals” (much more loosely defined than “mesa-objective”) of our models via influencing the training objective/reward, the inductive biases of the model, the environments they’re trained in, some combination of these things, etc.?
> These aren’t “clean”, in the sense that you don’t get a nice formal guarantee at the end that your AI system is going to (try to) do what you want in all situations, but I think getting an actual literal guarantee is pretty doomed anyway (among other things, it seems hard to get a definition for “all situations” that avoids the no-free-lunch theorem, though I suppose you could get a probabilistic definition based on the simplicity prior).
Yup, if you want “clean,” I agree that you’ll have to either assume a distribution over possible inputs, or identify a perturbation set over possible test environments to avoid NFL.
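Concretely, the two options I have in mind look something like this (improvised notation; a sketch, not a precise proposal):

```latex
% (a) Assume a fixed input distribution D and bound the expected loss of f:
\mathbb{E}_{x \sim D}\big[\ell(f(x))\big] \le \epsilon

% (b) Assume a perturbation set P(D) of test environments reachable from the
%     training distribution D, and bound the worst-case expected loss over it:
\max_{D' \in \mathcal{P}(D)} \; \mathbb{E}_{x \sim D'}\big[\ell(f(x))\big] \le \epsilon
```

Neither statement quantifies over literally all situations, which is what lets you dodge NFL.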
> how optimistic are you that we could figure out how to shape the motivations or internal “goals” (much more loosely defined than “mesa-objective”) of our models via influencing the training objective/reward, the inductive biases of the model, the environments they’re trained in, some combination of these things, etc.?
That seems great, e.g. I think by far the best thing you can do is to make sure that you finetune using a reward function / labeling process that reflects what you actually want (i.e. what people typically call “outer alignment”). I probably should have mentioned that too; I was taking it as a given, but I really shouldn’t have.
For inductive biases + environments, I do think controlling those appropriately would be useful and I would view that as an example of (1) in my previous comment.