There is a lot of tension between “this is how it would be nice for an optimal agent to be built” and “this is how actual brains work”.
I can imagine that this kind of interpretability scheme works for, say, spatial tasks: it seems easy to track the contents of a 3D world model and reward successful accomplishment of tasks like “move object from point A to point B”, and I would suspect that this system operates through the cerebellum. I don’t think such a system exists for anything more complicated, like “caring about other entities’ mental states”.