If we have an accurate and interpretable model of the system we are trying to control, then I think we have a fairly good idea about how to make utility maximizers: use the model to figure out the consequences of your actions, describe utility as a function of the consequences, and then pick actions that lead to high utility.
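Concretely, the recipe might look something like this minimal sketch (the names `step`, `utility`, and `candidate_plans` are hypothetical placeholders for this example, not any particular library's API):

```python
# Minimal sketch of "model-based utility maximization", assuming we already
# have an accurate model. Everything here is an illustrative placeholder.

def rollout(model, state, plan):
    """Use the model to figure out the consequences of a sequence of actions."""
    for action in plan:
        state = model.step(state, action)  # model predicts the next state
    return state

def best_plan(model, state, utility, candidate_plans):
    """Pick the plan whose predicted consequences score highest under `utility`."""
    return max(candidate_plans,
               key=lambda plan: utility(rollout(model, state, plan)))
```

The hard parts, of course, are where `model` and `utility` come from, not this outer loop, which is what the rest of this thread is about.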
Of course this doesn’t work for advanced optimization in practice, for many reasons: difficulty in getting a model, difficulty in making it interpretable, difficulty in optimizing over a model. But it appears to me that many of the limitations to this or things-substantially-similar-to-this are getting addressed by capabilities research or John Wentworth. Presumably you disagree about this claim, but it’s not really clear what aspect of this claim you disagree with, as you don’t really go into detail about what you see as the constraints that aren’t getting solved.
You really think the difficulty of making an AGI with a fully-human-interpretable world-model “is getting addressed”? (Granted, more than zero progress is being made, but not enough to make me optimistic that it’s gonna happen in time for AGI.)
Fully human-interpretable, no, but the interpretation you particularly need for making utility maximizers is to be able to take some small set of human commonsense variables and identify or construct those variables within the AI’s world-model. I think this will plausibly take specialized work for each variable you want to add, but I think it can be done (and in particular will get easier as capabilities increase, and as we get better understandings of abstraction).
I don’t think we will be able to do it fully automatically, or that this will support all architectures, but it does seem like there are many specific approaches for making it doable. I can’t go into huge detail atm as I am on my phone, but I can say more later if you have any questions.
Oh OK, that sounds vaguely similar to the kinds of approaches to AGI safety that I’m thinking about, e.g. here (which you’ve seen—we had a chat in the comments) or upcoming Post #13 in my sequence :) BTW I’d love to call & chat at some point if you have time & interest.
Oh OK, that sounds vaguely similar to the kinds of approaches to AGI safety that I’m thinking about
Cool, yeah, I can also say that my views are partly inspired by your writings. 👍
BTW I’d love to call & chat at some point if you have time & interest.
I’d definitely be interested, can you send me a PM about your availability? I have a fairly flexible schedule, though I live in Europe, so there may be some time zone issues.
Contemporary RL agents can’t have goals like counting grains of sand unless there is some measurement specified (e.g. a sensor or a property of a simulation). Specifying goals like that in a way that works in the real world and isn’t vulnerable to reward hacking would require some sort of linguistic translation interface. But then such an interface could be used to specify goals like “count grains of sand without causing any overly harmful side effects”, or just “do what is good”. Maybe these goals are less likely to be properly translated on account of being vague or philosophical, but it’s pretty unclear at which point the difficulties would show up.
I would expect goals to be specified in code, using variables in the AI’s world-model that have specifically been engineered to have good representations. For instance, if the world-model is a hand-coded physics simulation, there will likely be a data structure that contains information about the number of grains of sand.
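To make that concrete, here is a toy illustration of what “specifying the goal in code against world-model variables” could look like; the `WorldState`/`Particle` structures and field names are invented for the example, not taken from any actual simulator:

```python
# Toy illustration, assuming a hand-coded physics simulation whose state
# explicitly tracks particles with a material type. These classes and fields
# are made up for the example.

from dataclasses import dataclass

@dataclass
class Particle:
    material: str    # e.g. "sand", "water", ...
    position: tuple  # (x, y, z)

@dataclass
class WorldState:
    particles: list  # list[Particle]

def utility(state: WorldState) -> float:
    """Goal written directly against a world-model variable:
    count the grains of sand in the simulated world."""
    return sum(1 for p in state.particles if p.material == "sand")
```

The point being that the goal variable lives in the world-model’s own ontology, rather than in a separately learned reward signal.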
Of course in practice, we’d want most of the world-model to be learned. But this doesn’t mean we can’t make various choices to make the world-model have the variables of interest. (Well, sometimes, depends on the variable; sand seems easier than goodness.)
How would you learn a world model that had sand in it? Plausibly you could find something analogous to the sand of the original physics simulation (i.e. it has a similar transition function etc), but wouldn’t that run into issues if your assumptions about how sand worked were to some extent wrong (due to imperfect knowledge of physics)?
My immediate thought would be to structurally impose a Poincaré-symmetric geometry on the model. This would of course lock out the vast majority of possible architectures, but that seems like an acceptable sacrifice: locking out most models makes the remaining models more interpretable.
Given this model structure, it would be possible to isolate what stuff is at a given location in the model. It seems like this should make it relatively feasible to science out what variables in the model correspond to sand?
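As a rough sketch of what I have in mind, with ordinary translation-equivariant convolutions standing in crudely for full Poincaré symmetry, and with the probing step being a guess at how you’d “science out” the sand variables rather than a worked-out method:

```python
# Rough sketch: impose spatial structure, then probe by location. Shapes and
# the probing procedure are illustrative guesses, not a worked-out method.

import torch
import torch.nn as nn

class LatentFieldModel(nn.Module):
    """World model whose latent state is a spatial field: latent[:, :, x, y]
    is "the stuff at location (x, y)", because every layer is convolutional
    and therefore translation-equivariant."""
    def __init__(self, channels=32):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def step(self, latent):
        # One tick of simulated dynamics over the latent field.
        return latent + self.dynamics(latent)

def probe_for_sand(latents, sand_labels):
    """Guess at "sciencing out" the sand variables: fit a per-location linear
    probe from latent channels to sand occupancy taken from a labeled
    reference simulation, and inspect which channels carry the signal."""
    # latents: (N, C, H, W) detached activations; sand_labels: (N, 1, H, W) floats in {0., 1.}
    probe = nn.Conv2d(latents.shape[1], 1, kernel_size=1)  # per-location linear map
    opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
    for _ in range(200):
        loss = nn.functional.binary_cross_entropy_with_logits(probe(latents), sand_labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe
```

The symmetry constraint is doing the work here: because every latent cell has a fixed spatial interpretation, fitting a per-location probe is even a sensible thing to do.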
There may be numerous problems with this proposal, e.g. simulating a reductionistic world is totally computationally intractable. And for that matter, this approach hasn’t even been tried yet, so maybe there’s an unforeseen problem that would break it (I can’t test it because the capabilities aren’t there yet). I keep an eye on how things are looking on the capabilities side, but they keep surprising me with how little bias you need to get nice abstractions, so it seems to me that this isn’t a taut constraint.
wouldn’t that run into issues if your assumptions about how sand worked were to some extent wrong (due to imperfect knowledge of physics)?
Yeah, I mean ultimately anything we do is going to be garbage in, garbage out. Our only hope is to use as weak assumptions as possible while still being usable, to make systems fail fast and safely, and to make them robustly corrigible in case of failure.
Oh and as I understand John Wentworth’s research program, he is basically studying how to robustly and generally solve this problem, so we’re less reliant on heuristics. I endorse that as a key component, which is why I mentioned John Wentworth in my original response.