Nice work! Whitelisting seems like a good thing to do, since it is safe by default. (Computer security has a similar principle of preferring to whitelist instead of blacklist.) I was initially worried that we’d have the problems of symbolic approaches to AI, where we’d have to enumerate far too many transitions for the whitelist in order to be able to do anything realistic, but since whitelisting could work on learned embedding spaces, and the whitelist itself can be learned from demonstrations, this could be a scalable method.
I’m worried that whitelisting presents generalization challenges—if you are distinguishing between different colors of tiles, to encode “you can paint any tile” you’d have to whitelist transitions (redTile → blueTile), (blueTile → redTile), (redTile → yellowTile), etc. Those won’t all be in the demonstrations. If you are going to generalize there, how do you _not_ generalize (redLight → greenLight) to (greenLight → redLight) for an AI that controls traffic lights? It seems like you want to generalize in some cases but not others, and it’s not clear how to specify which.
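To make the enumeration problem concrete, here is a toy sketch (the state names are hypothetical; a real whitelist would be over the agent's observation or embedding space). Even "paint any tile" blows up quadratically in the number of colors:

```python
from itertools import permutations

# Hypothetical tile states; "you can paint any tile" means every
# ordered pair of distinct colors must be whitelisted.
tile_colors = ["redTile", "blueTile", "yellowTile"]
paint_whitelist = set(permutations(tile_colors, 2))

# 3 colors already need 6 entries; n colors need n * (n - 1).
```

With anything like a realistic color palette, explicitly listing these transitions is hopeless unless they are learned or generalized from a few demonstrations.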
On another note, I personally don’t want to assume that we can point to a part of the architecture as the AI’s ontology.
On the technical side: The whitelist is only closed under transitivity if you assume that the agent is capable of taking all transitions, and you aren’t worried about cost. If you have a → b and b → c whitelisted, then the agent can only get from a to c if it can change a to c going through intermediate state b, which may be much harder than going directly from a to c.
You could just define the whitelist to be transitively closed, since it’s not hard to compute the transitive closure of a directed graph.
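A minimal sketch of that computation, assuming the whitelist is just a set of (from, to) pairs over a small discrete state vocabulary:

```python
def transitive_closure(whitelist):
    """Compute the transitive closure of a whitelist of (from, to) pairs:
    if a -> b and b -> c are allowed, a -> c becomes allowed too."""
    closure = set(whitelist)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for b2, c in list(closure):
                if b == b2 and (a, c) not in closure:
                    closure.add((a, c))
                    changed = True
    return closure

closure = transitive_closure({("a", "b"), ("b", "c")})
# ("a", "c") is now implicitly whitelisted as well.
```

This is the naive quadratic-per-pass version for clarity; Warshall's algorithm does the same thing in O(n³) over n states.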
> I’m worried that whitelisting presents generalization challenges
I think you might have to bite the paintbucket in this case. Note that in a latent space formulation, having one tile-painting-transition whitelisted might suggest that other painting applications would have lower (but not 0) cost. In general, I agree—I don’t see how we could expect reasonable generalization along those lines because of the traffic light issue. It isn’t clear how big of a problem this is, though.
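One way to cash out "lower (but not 0) cost" is to score a candidate transition by its distance to the nearest whitelisted transition in embedding space. This is a sketch under the assumption that transitions are embedded as difference vectors; the particular vectors and the exponential falloff are illustrative choices, not part of the original proposal:

```python
import numpy as np

def transition_cost(candidate, whitelisted, scale=1.0):
    """Cost approaches 0 as the candidate transition embedding nears
    some whitelisted transition embedding, and 1 as it moves far away."""
    nearest = min(np.linalg.norm(candidate - w) for w in whitelisted)
    return 1.0 - np.exp(-nearest / scale)

# Hypothetical transition embeddings (difference vectors in a toy 3-d space).
red_to_blue = np.array([1.0, -1.0, 0.0])
red_to_yellow = np.array([1.0, 0.0, -1.0])

exact_cost = transition_cost(red_to_blue, [red_to_blue])      # whitelisted: zero cost
similar_cost = transition_cost(red_to_yellow, [red_to_blue])  # nearby: small but nonzero
```

Under this scoring, whitelisting one painting transition softly discounts other painting transitions, which is the latent-space behavior described above; but nothing here distinguishes "similar and fine" from "similar and unacceptable," which is exactly the traffic light worry.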
> The whitelist is only closed under transitivity if you assume that the agent is capable of taking all transitions, and you aren’t worried about cost. If you have a → b and b → c whitelisted, then the agent can only get from a to c if it can change a to c going through intermediate state b, which may be much harder than going directly from a to c.
That’s correct—if a → c isn’t on the whitelist, it might de facto incur additional costs (whether in time or resources). I suppose I was pointing to the idea that our theoretical preferences should be closed under transitivity—if we accept a → b and b → c, we should not reject a → c happening over time.
> You could just define the whitelist to be transitively closed, since it’s not hard to compute the transitive closure of a directed graph.
Good point! Does get trickier in latent space, though.