I like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that’s done.
> But how? In prosaic AI, only on-distribution behavior of the loss function can influence the end result.
I can see a few possible responses here.
1. Double down on the “correct generalization” story: hope to somehow avoid the multiple plausible generalizations, perhaps by providing enough training data, or appropriate inductive biases in the system (probably both).
2. Achieve objective robustness through other means. In particular, inner alignment is supposed to imply objective robustness. In this approach, inner-alignment technology provides the extra information to generalize the base objective appropriately.
I’m not too keen on (2) since I don’t expect mesa objectives to exist in the relevant sense. For (1), I’d note that we need to get it right on the situations that actually happen, rather than all situations. We can also have systems that only need to work for the next N timesteps, after which they are retrained again given our new understanding of the world; this effectively limits how much distribution shift can happen. Then we could do some combination of the following:
- Build neural net theory. We currently have a very poor understanding of why neural nets work; if we had a better understanding, it seems plausible we could have high confidence about when a neural net will generalize correctly. (I’m imagining neural net theory going from how I imagine physics looked before Newton to how it looked after Newton.)
- Use techniques like adversarial training to “robustify” the model against moderate distribution shifts (which might be sufficient to work for the next N timesteps, after which you “robustify” again); a minimal sketch follows this list.
- Make these techniques work better through interpretability / transparency.
- Use checks and balances. For example, if multiple generalizations are possible, train an ensemble of models and only do something if they all agree on it. Or train an actor agent combined with an overseer agent that has veto power over all actions. Or an ensemble of actors, each of which oversees the other actors and has veto power over them. (Also sketched below.)
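To gesture at what the adversarial-training bullet could look like in practice, here is a minimal PGD-style sketch; the model, data, and hyperparameters are all illustrative stand-ins, not a concrete recommendation:

```python
# Minimal PGD-style adversarial training loop (illustrative only).
# The model, data, and hyperparameters below are hypothetical stand-ins.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def pgd_perturbation(x, y, eps=0.1, step=0.02, iters=5):
    """Search for a loss-maximizing perturbation of x inside an eps-ball."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(iters):
        loss_fn(model(x + delta), y).backward()
        with torch.no_grad():
            delta += step * delta.grad.sign()  # gradient ascent on the loss
            delta.clamp_(-eps, eps)            # stay inside the perturbation set
        delta.grad.zero_()
    return delta.detach()

for _ in range(100):
    # Hypothetical training batch; in practice this is your real data.
    x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
    delta = pgd_perturbation(x, y)
    opt.zero_grad()
    loss_fn(model(x + delta), y).backward()  # train on worst-case inputs
    opt.step()
```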
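And a toy version of the unanimity check from the last bullet; the policies, observation, and fallback action are again hypothetical:

```python
# Toy "checks and balances" gate: act only if every ensemble member agrees.
# Everything named here (policies, observation, fallback) is hypothetical.

def ensemble_act(models, observation, fallback="no-op"):
    """Return the proposed action only if all models endorse it; else veto."""
    proposals = [m(observation) for m in models]
    if all(p == proposals[0] for p in proposals):
        return proposals[0]  # unanimous: safe to act
    return fallback          # disagreement: veto and defer

# Three independently trained policies (stubbed out as lambdas):
policies = [lambda obs: "left", lambda obs: "left", lambda obs: "right"]
print(ensemble_act(policies, observation={"pos": 3}))  # -> "no-op" (vetoed)
```

The actor/overseer variants have the same shape: the overseer simply returns a veto bit instead of a matching action.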
These aren’t “clean”, in the sense that you don’t get a nice formal guarantee at the end that your AI system is going to (try to) do what you want in all situations, but I think getting an actual literal guarantee is pretty doomed anyway (among other things, it seems hard to get a definition for “all situations” that avoids the no-free-lunch theorem, though I suppose you could get a probabilistic definition based on the simplicity prior).
> I like the addition of the pseudo-equivalences; the graph seems a lot more accurate as a representation of my views once that’s done.
But it seems to me that there’s something missing in terms of acceptability.
The definition of “objective robustness” I used says “aligns with the base objective” (including off-distribution). But I think this isn’t an appropriate representation of your approach. Rather, “objective robustness” has to be defined something like “generalizes acceptably”. Then, ideas like adversarial training and checks and balances make sense as a part of the story.
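Loosely formalized, just to pin down the contrast (the notation here is improvised, not anyone’s official definition):

```latex
% Strict reading: even off-distribution, the policy pi still optimizes
% the base objective O.
\mathrm{robust}_{\mathrm{strict}}(\pi) \iff \forall x \in \mathcal{X}:\;
  \pi(x) \in \operatorname*{arg\,max}_{a} O(x, a)

% Acceptability reading: on inputs actually encountered, behavior merely
% stays inside an acceptable set A(x); it need not be optimal for O.
\mathrm{robust}_{\mathrm{acc}}(\pi) \iff \forall x \in \mathcal{X}_{\mathrm{encountered}}:\;
  \pi(x) \in \mathcal{A}(x)
```

Adversarial training and checks and balances are then tools for enlarging the set of inputs on which the second condition holds.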
WRT your suggestions, I think there’s a spectrum from “clean” to “not clean”, and the ideas you propose could fall at multiple points on that spectrum (depending on how they are implemented, how much theory backs them up, etc). So, yeah, I favor “cleaner” ideas than you do, but that doesn’t rule out this path for me.
> The definition of “objective robustness” I used says “aligns with the base objective” (including off-distribution). But I think this isn’t an appropriate representation of your approach. Rather, “objective robustness” has to be defined something like “generalizes acceptably”. Then, ideas like adversarial training and checks and balances make sense as a part of the story.

Yeah, strong +1.

Great! I feel like we’re making progress on these basic definitions.
> I’m not too keen on (2) since I don’t expect mesa objectives to exist in the relevant sense.
Same, but how optimistic are you that we could figure out how to shape the motivations or internal “goals” (much more loosely defined than “mesa-objective”) of our models via influencing the training objective/reward, the inductive biases of the model, the environments they’re trained in, some combination of these things, etc.?
> These aren’t “clean”, in the sense that you don’t get a nice formal guarantee at the end that your AI system is going to (try to) do what you want in all situations, but I think getting an actual literal guarantee is pretty doomed anyway (among other things, it seems hard to get a definition for “all situations” that avoids the no-free-lunch theorem, though I suppose you could get a probabilistic definition based on the simplicity prior).
Yup, if you want “clean,” I agree that you’ll have to either assume a distribution over possible inputs, or identify a perturbation set over possible test environments to avoid NFL.
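Concretely, the two options I have in mind look something like this (improvised notation; a sketch, not a precise proposal):

```latex
% (a) Assume a fixed input distribution D and bound the expected loss of f:
\mathbb{E}_{x \sim D}\big[\ell(f(x))\big] \le \epsilon

% (b) Assume a perturbation set P(D) of test environments reachable from the
%     training distribution D, and bound the worst-case expected loss over it:
\max_{D' \in \mathcal{P}(D)} \; \mathbb{E}_{x \sim D'}\big[\ell(f(x))\big] \le \epsilon
```

Neither statement quantifies over literally all situations, which is what lets you dodge NFL.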
> how optimistic are you that we could figure out how to shape the motivations or internal “goals” (much more loosely defined than “mesa-objective”) of our models via influencing the training objective/reward, the inductive biases of the model, the environments they’re trained in, some combination of these things, etc.?
That seems great, e.g. I think by far the best thing you can do is to make sure that you finetune using a reward function / labeling process that reflects what you actually want (i.e. what people typically call “outer alignment”). I probably should have mentioned that too; I was taking it as a given, but I really shouldn’t have.
For inductive biases + environments, I do think controlling those appropriately would be useful and I would view that as an example of (1) in my previous comment.