I think a big part of the safety challenge for this type of approach is the thing I called “the 1st-person problem” here (Section 1.1.3). It seems easy enough to get a computer to learn a concept like “Alice is following human norms” by passive observation, but what we really want is the concept of “I am following human norms”, which is related but different. You could have the computer learn it actively by allowing it to do stuff and then labeling it, but then you’re really labeling “I am following human norms (as far as the humans can tell)”, which is different from what we want in an obviously problematic way. One way around that would be to solve transparency and thus correctly label deceptive actions, but I don’t have any idea how to do that reliably. Another possible approach might be actually figuring out how to fiddle with the low-level world-model variables to turn the “Alice is following human norms” concept into the “I am following human norms” concept. I’m not sure if that works either, but anyway, I’m planning to think more about this when I get a chance, starting with how it works in humans, and of course I’m open to ideas.
In terms of data structures, I usually think about this sort of thing in terms of a map with a self pointer.
Suppose our environment is a python list X. We wish to represent it using another python list, the “model” M. Two key points:
M is a data structure, it can contain pointers, but it’s not allowed to contain pointers directly to X or things in X: things in the model can only point directly to other things in the model. For instance, a pointer might literally be represented as an index to a position in M - i.e.pointer(1) would point to M1.
The model M is contained in X itself—for simplicity, we’ll say X0=M.
So: how can we make the model match the environment (i.e. M==X)?
The trick is quite similar to a typical quine. We can model the environment excluding X0 easily enough: M=[???,X1,...,Xn−1]. But then the ??? part has to match M, and we can’t point to the whole map—there is no index which contains the map. So, we have to drop another copy in there: M=[[???,X1,...,Xn−1],X1,...,Xn−1]. Now we have almost the same problem: we need to replace the ??? with something. But this time, we can point it at something in the map: we can point it at M0. So, the final representation looks like:
Seems like the keys to training something like this would be:
Make sure the model can support the appropriate kind of pointer.
Train in an environment where the system can observe its own internal map.
Actually getting the self-model pointed at the goal we want would be a whole extra step. Not sure how that would work, other than using transparency tools to explicitly locate the self-model and plug in a goal.
Why not just have a “my model of” thing in the model, so you can have both “this door” and “my model of” + “this door” = “my model of this door”? (Of course I’m assuming compositionality here, but whatever, I always assume compositionality. This is the same kind of thing as “Carol’s” + “door” = “Carol’s door”.) What am I missing? I didn’t use any quines. Seems too simple… :-P
The thing I’m interested in re “1st-person problem” is slightly different than that, I think, because your reply still assumes a passive model, I think, whereas I think we’re going to need an AGI that “does things”—even if it’s just “thinking thoughts”—for reasons discussed in section 7.2 here. So there would a bunch of 1st-person actions / decisions thrown into the mix.
The main issue with “my model of” + “this door” = “my model of this door”, taken literally, is that there’s no semantics. It’s the semantics which I expect to need something quine-like.
Adding actions is indeed a big step, and I still don’t know the best way to do that. Main strategies I’ve thought about are:
keep the model itself passive, but include an agent with actions in the model itself, and then require correctness of the model-abstraction. (In other words, put an agent in the map, then require map-territory correspondence.)
Something thermodynamic-esque but not predictive processing. This one seems most promising long-term but also I’m still most confused about how to set it up.
I think you’re saying that I’m proposing how to label everything but not describing what those things are or do. (Correct?) I guess I’d say we learn general rules to follow with the “my model of” piece-of-thought, and exceptions to those rules, and exceptions to the exceptions, etc. Like “the relation between my-model-of-X and my-model-of-Y is the same as the relation between X and Y” could be an imperfect rule with various exceptions. See my “Python code runs the same on Windows and Mac” example here.