There is a lot of tension between “this is how it would be nice for an optimal agent to be built” and “this is how actual brains work”.
I can imagine that this kind of interpretability scheme works for, say, spatial tasks: it seems easy to track the contents of a 3D world model and reward successful accomplishment of tasks like “move object from point A to point B”, and I would suspect that this system operates through the cerebellum. I don’t think such a system exists for anything more complicated, like “caring about other entities’ mental states”.