So if I understand you, for (1) you’re proposing a “hard” attention over the image, rather than the “soft” differentiable attention which is typically meant by “attention” for NNs.
You might find “Recurrent Models of Visual Attention” by DeepMind (https://arxiv.org/pdf/1406.6247.pdf) interesting. They use hard attention over the image, with RL to train where to attend. There’s been subsequent work using hard attention as well (I think this is a central paper for the topic, but I could be wrong, and I’m not at all sure what the most interesting recent one is).
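In case it helps pin down the distinction, here is a toy sketch of soft vs. hard attention over a handful of image patches. This is only an illustration under my own assumptions, not the architecture from the paper: the function names, the patch features, and the scores are all made up, and it uses plain Python rather than any NN library.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_attention(features, scores):
    # Soft attention: a weighted average over ALL locations.
    # The output varies smoothly with the scores, so it is differentiable.
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def hard_attention(features, scores, rng):
    # Hard attention: SAMPLE a single location and look only there.
    # The discrete sampling step has no useful gradient, which is why
    # hard attention is usually trained with RL-style estimators.
    weights = softmax(scores)
    idx = rng.choices(range(len(features)), weights=weights, k=1)[0]
    return idx, features[idx]

# Hypothetical toy data: three "patches", each with a 2-d feature vector.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 0.0, 2.0]  # the third patch is scored highest

rng = random.Random(0)
soft = soft_attention(features, scores)
idx, hard = hard_attention(features, scores, rng)
print("soft:", soft)
print("hard: patch", idx, "->", hard)
```

The soft version always blends every patch, while the hard version commits to one patch per call, so the "where to look" decision has to be learned from a reward signal rather than by backpropagating through the choice.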
...you were an ancient being, with a mind vast and unsympathetic, concerned with all the events in the path of the light-cone, who has through some mistake been trapped in a smaller, duller mind, forgetting most of the wisdom natural to it, becoming encumbered by fleshy bounds, and who now must decide what to do with the potential it has left.
...the “you” listening to this was one of several complete agents inhabiting a body, each of which has its own plans, goals, and strategies, each of which jockeys for control over the actions of that body, and each of which can wage war or form alliances with the others to try to gain more control over that body over the course of a lifetime?