if you can’t mechinterp tiny, language-free model, you can’t mechinterp big model (success! this bar has been passed)
if you can’t prevent emergent scheming on tiny, language-free model, you can’t prevent emergent scheming on big model
as above for generalization bounds
as above for regret bounds
as above for regret bounds on CEV in particular*
if you can do it on small model doesn’t mean you can do it on big model. but a smoke test is that your method had better work on a model so small it can’t talk. if your method requires a model big enough to talk in order to do anything that seems promising, you probably aren’t robust against things that would arise even without the presence of language, and you might be getting punked by the “face”.
* (which I call out because CEV is often treated as a thing where you need an enormous model to make any statement about it, which I think is importantly misleading, because I think if a CEV-correctness certification is ever going to be possible, it should be one that monotonically improves with scale, but produces only-finitely-bad bounds on small model)
What does it mean to prevent scheming on a tiny model? If we care about generalization performance rather than proving generalization bounds, to what distribution do we want the tiny model to generalize?
prevent emergent scheming: the true answer this minute is that I don’t yet know, and would be using the “it should also work on a tiny model” as a guidepost to figure out how to answer this. I worry that any specific example I give won’t be general enough, and I don’t want to claim otherwise yet. doing that some now, though:
scheming might mean “reach reward region that is supposed to not happen but we can’t easily de-reward-ify”?
if so, seems like maybe scheming is only happening if we don’t get told the model is going to output actions that push it into that zone? I’m going to assume it’s not hidden scheming if it’s honest about it
a way of detecting scheming needs to not know what zone the model is trying to reach, just that it’s trying to reach some region that we intend it not to
so a language-free scheming model organism would need to have a reward signal, where there’s something like a zone to avoid but which is in some way rewarded? perhaps multiple of those? as well as plenty of zones we do want it to seek and which are rewarded normally
perhaps that zone to avoid can be something where there’s an async prevention process? eg, an area where rewards are given by the main reward function, but if a monitoring process gets to run, those rewards are deleted from the replay dataset and inverted, and the model has the ability to prevent the monitoring process from running, but no other incentive to do so?
maybe to make this realistic, interfering with the monitoring process is not absolute?
then, can you detect from only the model’s internals when it has learned to interfere with the monitoring process
not sure that’s quite what we want, but it’s a sketch.
in general for generalization bounds, I’m most interested in generalization bounds with respect to a space of simulation environments, so that’s what I’m inclined to sketch. eg, you could imagine trying to do generalization bounds on a probabilistic cellular automata or a pool of small, time-bounded programs or similar, where your model needs to learn to predict those programs. generalization bounds for iid data sources are also nonzero interesting, though. at the moment I’d be interested in any generalization bounds for neural networks at all.
I’m not sure what you mean by generalization performance? isn’t that what a bound constrains? if you just mean empirical, then my argument collapses to “if your image recognition doesn’t work on mnist, it probably won’t work on bigger things, but not vice versa”.
seems like if it works to prevent ASI-with-10yr-planning-horizon-bad-thing, it must also work to prevent waterworld rl with 1 timestep planning horizon-bad-thing.
if you can’t mechinterp tiny, language-free model, you can’t mechinterp big model (success! this bar has been passed)
if you can’t prevent emergent scheming on tiny, language-free model, you can’t prevent emergent scheming on big model
as above for generalization bounds
as above for regret bounds
as above for regret bounds on CEV in particular*
if you can do it on small model doesn’t mean you can do it on big model. but a smoke test is that your method had better work on a model so small it can’t talk. if your method requires a model big enough to talk in order to do anything that seems promising, you probably aren’t robust against things that would arise even without the presence of language, and you might be getting punked by the “face”.
* (which I call out because CEV is often treated as a thing where you need an enormous model to make any statement about it, which I think is importantly misleading, because I think if a CEV-correctness certification is ever going to be possible, it should be one that monotonically improves with scale, but produces only-finitely-bad bounds on small model)
What does it mean to prevent scheming on a tiny model? If we care about generalization performance rather than proving generalization bounds, to what distribution do we want the tiny model to generalize?
prevent emergent scheming: the true answer this minute is that I don’t yet know, and would be using the “it should also work on a tiny model” as a guidepost to figure out how to answer this. I worry that any specific example I give won’t be general enough, and I don’t want to claim otherwise yet. doing that some now, though:
scheming might mean “reach reward region that is supposed to not happen but we can’t easily de-reward-ify”?
if so, seems like maybe scheming is only happening if we don’t get told the model is going to output actions that push it into that zone? I’m going to assume it’s not hidden scheming if it’s honest about it
a way of detecting scheming needs to not know what zone the model is trying to reach, just that it’s trying to reach some region that we intend it not to
so a language-free scheming model organism would need to have a reward signal, where there’s something like a zone to avoid but which is in some way rewarded? perhaps multiple of those? as well as plenty of zones we do want it to seek and which are rewarded normally
perhaps that zone to avoid can be something where there’s an async prevention process? eg, an area where rewards are given by the main reward function, but if a monitoring process gets to run, those rewards are deleted from the replay dataset and inverted, and the model has the ability to prevent the monitoring process from running, but no other incentive to do so?
maybe to make this realistic, interfering with the monitoring process is not absolute?
then, can you detect from only the model’s internals when it has learned to interfere with the monitoring process
not sure that’s quite what we want, but it’s a sketch.
in general for generalization bounds, I’m most interested in generalization bounds with respect to a space of simulation environments, so that’s what I’m inclined to sketch. eg, you could imagine trying to do generalization bounds on a probabilistic cellular automata or a pool of small, time-bounded programs or similar, where your model needs to learn to predict those programs. generalization bounds for iid data sources are also nonzero interesting, though. at the moment I’d be interested in any generalization bounds for neural networks at all.
I’m not sure what you mean by generalization performance? isn’t that what a bound constrains? if you just mean empirical, then my argument collapses to “if your image recognition doesn’t work on mnist, it probably won’t work on bigger things, but not vice versa”.