I would expect goals to be specified in code, using variables in the AI’s worldmodel that have specifically been engineered to have good representations. For instance, if the worldmodel is a handcoded physics simulation, there will likely be a data structure that contains information about the number of grains of sand.
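As a minimal sketch of what I mean (every name here is made up for illustration, not an existing system):

```python
from dataclasses import dataclass

# Hypothetical handcoded physics worldmodel. A real one would be a full
# simulation; only the parts the goal needs are sketched here.
@dataclass
class WorldModel:
    sand_grain_positions: list[tuple[float, float, float]]  # one entry per grain
    time: float

    def num_sand_grains(self) -> int:
        # The "engineered" variable that the goal is allowed to reference.
        return len(self.sand_grain_positions)

def goal(model: WorldModel) -> float:
    # The goal is ordinary code over a worldmodel variable,
    # e.g. "make there be at least a million grains of sand".
    return min(model.num_sand_grains() / 1_000_000, 1.0)
```

The point is just that the goal is plain code reading an engineered variable, rather than something learned end-to-end.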
Of course in practice, we’d want most of the worldmodel to be learned. But that doesn’t mean we can’t make design choices that push the learned worldmodel to contain the variables of interest. (Well, sometimes; it depends on the variable. Sand seems easier than goodness.)
How would you learn a world model that had sand in it? Plausibly you could find something analogous to the sand of the original physics simulation (i.e. it has a similar transition function, etc.), but wouldn’t that run into issues if your assumptions about how sand worked were to some extent wrong (due to imperfect knowledge of physics)?
My immediate thought would be to structurally impose a Poincaré-symmetric geometry on the model. This would of course lock out the vast majority of possible architectures, but that seems like an acceptable sacrifice: locking out most models makes the remaining models more interpretable.
Given this model structure, it would be possible to isolate what stuff is at a given location in the model. It seems like this should make it relatively feasible to science out which variables in the model correspond to sand?
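To gesture at the kind of structure I have in mind, here is a rough sketch (it only bakes in translation symmetry, which is a small slice of full Poincaré symmetry, and every name in it is hypothetical):

```python
import torch
import torch.nn as nn

# Illustrative learned transition model over a 2D grid of local state vectors.
# Using only convolutions makes the model translation-equivariant by
# construction: the same local update rule applies at every location.
class LocalTransitionModel(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        self.step = nn.Sequential(
            nn.Conv2d(channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, channels, kernel_size=3, padding=1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        # state: (batch, channels, height, width); the next state of each cell
        # depends only on its local neighbourhood, with the same rule everywhere.
        return state + self.step(state)

def stuff_at(state: torch.Tensor, x: int, y: int) -> torch.Tensor:
    # Because the state is explicitly spatial, "what is at location (x, y)"
    # is just the local feature vector at that grid cell.
    return state[:, :, y, x]
```

With an explicitly spatial state like this, sciencing out the sand variables would look something like checking which local features evolve with sand-like dynamics under the learned transition rule.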
There may be numerous problems with this proposal, e.g. simulating a reductionistic world is totally computationally intractable. And for that matter, this approach hasn’t even been tried yet, so maybe there’s an unforeseen problem that would break it (I can’t test it because the capabilities aren’t there yet). I keep an eye on how things are looking on the capabilities side, but researchers keep surprising me with how little inductive bias you need to get nice abstractions, so it seems to me that this isn’t a taut constraint.
wouldn’t that run into issues if your assumptions about how sand worked were to some extent wrong (due to imperfect knowledge of physics)?
Yeah, I mean ultimately anything we do is going to be garbage in, garbage out. Our only hope is to use assumptions that are as weak as possible while still being usable, to make systems fail fast and safely, and to make them robustly corrigible in case of failure.
Oh, and as I understand John Wentworth’s research program, he is basically studying how to robustly and generally solve this problem, so that we’re less reliant on heuristics. I endorse that as a key component, which is why I mentioned John Wentworth in my original response.