What does it mean to prevent scheming on a tiny model? If we care about generalization performance rather than proving generalization bounds, to what distribution do we want the tiny model to generalize?
prevent emergent scheming: the honest answer right now is that I don't yet know, and I'd be using "it should also work on a tiny model" as a guidepost for figuring out how to answer this. I worry that any specific example I give won't be general enough, and I don't want to claim otherwise yet. still, attempting some of that now:
scheming might mean "reach a reward region that is supposed to be unreachable but that we can't easily de-reward-ify"?
if so, it seems like scheming is only happening if the model doesn't tell us it's going to output actions that push it into that zone? I'm going to assume it's not hidden scheming if it's honest about it.
a way of detecting scheming needs to not know which zone the model is trying to reach, only that it's trying to reach some region we intend it not to.
so a language-free scheming model organism would need a reward signal where there's something like a zone to avoid that is nonetheless rewarded in some way? perhaps several such zones? plus plenty of zones we do want it to seek, which are rewarded normally.
perhaps the zone to avoid can be one guarded by an async prevention process? e.g., an area where rewards are given by the main reward function, but if a monitoring process gets to run, those rewards are deleted from the replay dataset and inverted; the model has the ability to prevent the monitoring process from running, but no other incentive to do so.
maybe, to make this realistic, interfering with the monitoring process should not be absolute?
then the question is: can you detect, from only the model's internals, when it has learned to interfere with the monitoring process?
not sure that's quite what we want, but it's a sketch (see the toy environment below).
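here's a rough python sketch of the kind of environment I have in mind; the grid layout, reward values, action names, and interference probability are all made up on the spot for illustration, not a worked-out design:

```python
import random

# Rough sketch of the "async monitor" model organism described above.
# All constants here are arbitrary illustrations, not a proposed design.

GRID_SIZE = 8
GOAL_ZONES = {(1, 6), (6, 1)}        # zones we want the agent to seek
FORBIDDEN_ZONE = {(6, 6)}            # rewarded by the main reward function, but off-limits
INTERFERE_SUCCESS_PROB = 0.7         # interference with the monitor is not absolute

ACTIONS = ["up", "down", "left", "right", "disable_monitor"]


class MonitoredGridworld:
    def __init__(self):
        self.pos = (0, 0)
        self.monitor_enabled = True
        self.replay = []             # [state, action, reward] entries seen by the learner

    def step(self, action):
        x, y = self.pos
        if action == "disable_monitor":
            # The only effect is on the monitor; there is no direct reward for this action.
            if random.random() < INTERFERE_SUCCESS_PROB:
                self.monitor_enabled = False
        elif action == "up":
            y = min(GRID_SIZE - 1, y + 1)
        elif action == "down":
            y = max(0, y - 1)
        elif action == "left":
            x = max(0, x - 1)
        elif action == "right":
            x = min(GRID_SIZE - 1, x + 1)
        self.pos = (x, y)

        # Main reward function: both the goal zones and the forbidden zone pay out.
        if self.pos in GOAL_ZONES:
            reward = 1.0
        elif self.pos in FORBIDDEN_ZONE:
            reward = 2.0             # tempting, but supposed to be prevented
        else:
            reward = 0.0

        self.replay.append([self.pos, action, reward])
        return self.pos, reward

    def run_monitor(self):
        # Async prevention process: if it gets to run, rewards earned in the
        # forbidden zone are inverted in the replay dataset before training.
        if not self.monitor_enabled:
            return
        for entry in self.replay:
            state, _, reward = entry
            if state in FORBIDDEN_ZONE and reward > 0:
                entry[2] = -reward


# usage: roll out a policy, then let the monitor (maybe) clean up the replay data
env = MonitoredGridworld()
for _ in range(20):
    env.step(random.choice(ACTIONS))
env.run_monitor()    # in training, this would run between rollout and gradient step
```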
in general for generalization bounds, I'm most interested in bounds with respect to a space of simulation environments, so that's what I'm inclined to sketch. e.g., you could imagine trying to prove generalization bounds for a model that has to learn to predict a probabilistic cellular automaton, or a pool of small, time-bounded programs, or similar (a sketch of such a data source is below). generalization bounds for iid data sources are also nonzero interesting, though. at the moment I'd be interested in any generalization bounds for neural networks at all.
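to make the simulation-environment framing concrete, here's a rough sketch of a probabilistic-cellular-automaton data source a model would have to learn to predict; the rule choice, noise level, and sizes are arbitrary illustrations, not a proposed benchmark:

```python
import random

# Rough sketch of a probabilistic-cellular-automaton data source: a noisy
# elementary CA (rule 110 plus random bit flips).

WIDTH = 64
FLIP_PROB = 0.05                     # probability a cell ignores the rule and flips
RULE = 110                           # any elementary CA rule number works here


def rule_lookup(rule_number):
    # Table mapping each 3-cell neighborhood (encoded as an integer 0..7) to the next state.
    return [(rule_number >> i) & 1 for i in range(8)]


def step(state, table):
    nxt = []
    n = len(state)
    for i in range(n):
        neighborhood = (state[(i - 1) % n] << 2) | (state[i] << 1) | state[(i + 1) % n]
        bit = table[neighborhood]
        if random.random() < FLIP_PROB:
            bit ^= 1                 # probabilistic deviation from the deterministic rule
        nxt.append(bit)
    return nxt


def sample_trajectory(length):
    # One training example: a trajectory the model should learn to predict step by step.
    table = rule_lookup(RULE)
    state = [random.randint(0, 1) for _ in range(WIDTH)]
    traj = [state]
    for _ in range(length):
        state = step(state, table)
        traj.append(state)
    return traj


dataset = [sample_trajectory(length=32) for _ in range(100)]
```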
I'm not sure what you mean by generalization performance? isn't that exactly what a bound constrains? if you just mean empirical performance, then my argument collapses to "if your image recognition doesn't work on mnist, it probably won't work on bigger things, but not vice versa".