The DRL perspective on this: “Reward is Enough”. Capabilities like curiosity are simply end-to-end learned capabilities, like anything else in DL, and emerge as blessings of scale if they help increase reward. (In a more interesting sense than simply pointing out that ‘the reward of fitness must be sufficient to create all observed capabilities, because that’s how evolution created them’.)
Capabilities are contingent on particular environments/data-distributions/architectures, and have no special status; if they are useful (for maximizing reward) they will be learned, and if not, not. If an environment can be solved without exploration, then agents will not learn to explore; if an environment changes too rapidly, such that memory would not be useful, then it will not learn to use any memory capabilities; if an environment changes too slowly, then it will not learn memory either, because it can just memorize the optimal solution into its reactive policy/parameters; if the data-distribution is not long-tailed (or if it is too long-tailed), no meta-learning/in-context-learning will emerge (eg. GPT or Ada); if there are no long-term within-episode rewards, it will not care about any self-preservation or risk-aversion (because there is nothing worth surviving for); if weights can be copied from episode to episode, there is no need for ‘play’ like a wild animal or human child...
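To make the first conditional concrete, here is a toy sketch (everything in it is invented for illustration, not from any real benchmark): a purely greedy, epsilon-zero agent is already optimal in a deterministic bandit, so there is no selection pressure for exploration there, while the same agent predictably underperforms in a noisy bandit where a single unlucky sample can hide the better arm.

```python
# Toy sketch: whether exploration is worth learning depends on the environment.
import numpy as np

rng = np.random.default_rng(0)

def run_greedy(pull, n_arms=2, steps=500):
    """epsilon=0 agent: sample each arm once, then always exploit,
    updating a running mean for whichever arm it pulls."""
    est = np.array([pull(a) for a in range(n_arms)], dtype=float)
    counts = np.ones(n_arms)
    total = 0.0
    for _ in range(steps):
        a = int(np.argmax(est))        # never explores
        r = pull(a)
        counts[a] += 1
        est[a] += (r - est[a]) / counts[a]
        total += r
    return total / steps

deterministic = lambda a: [0.3, 0.7][a]               # solvable greedily
deceptive = lambda a: rng.normal([0.5, 0.6][a], 1.0)  # noise hides the best arm

print("deterministic:", np.mean([run_greedy(deterministic) for _ in range(200)]))
print("deceptive:    ", np.mean([run_greedy(deceptive) for _ in range(200)]))
```

In the first environment the greedy agent earns the optimal 0.7 every time; in the second its average falls short of the 0.6 optimum because it sometimes commits to the worse arm, and that gap is exactly the selection pressure under which an exploration capability would emerge.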
Let’s consider ‘play’ as an example. The functional explanation is that play lets a young organism explore its body and learn the complex motor control underlying fitness-critical adult behavior. Like an adorably smol kitten tripping over its own paws while trying to pounce on ‘prey’: do that as an adult, and it’ll starve to death, so it learns how to pounce and hunt as a kitten. The only reason it needs to ‘play’ is because it is impossible to scoop out a trained adult cat brain, make a copy of it, and stuff it into a kitten’s skull, or to encode everything it has learned into the cat genome so the kitten is born already a skilled hunter. The genomic+brain bottleneck between generations forces each generation to wastefully relearn ‘how to cat’ each time from a very derpy starting point. This bottleneck, however, is not any kind of deep, fundamental principle; it is a contingent fact of the limitations of biological bodies and brains, one which biology cannot fix but which does not apply to many of the alternatives in the vast space of possible minds. A catbot would have no need of this. The weights of the catbot NN are immortal, highly trained, and trivially copied into each new catbot body. All a catbot NN needs is a relatively small amount of meta-learning capability in order to adjust to the small particularities of each new catbot body, which is why domain randomization can achieve zero-shot sim2real transfer of a NN from simplistic robotic simulations to actual real robots in the rich real world, where after just a few seconds, the NN has adapted (eg. Dactyl or the DM soccer bots). These NNs learned to do so because during training they were never trained in exactly the same environment twice, so they had to learn to learn within-episode, as fast as possible, how to deal with their arms & legs wiggling a bit differently each time in order to maximize their reward overall; by the end of training, ‘reality’ looks like merely another kind of wiggling to adapt to. While the newborn kitten is still at least half a year away from being a truly competent adult cat, the catbot is up to scratch after seconds or minutes. The latter just doesn’t need many of the things that the former does: it doesn’t need oxygen, or litter boxes, or taurine… or play.
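A minimal, hypothetical sketch of that training logic (illustrative names and dynamics, not Dactyl’s or DeepMind’s actual code): each episode samples a new ‘body’, here a 1-D actuator with an unknown motor gain, and a controller which infers the gain from its first observations adapts to any body within a couple of steps, while a fixed controller does not.

```python
# Domain-randomization sketch: never the same body twice, so the only winning
# strategy is to identify the current body from experience and adapt in-episode.
import numpy as np

rng = np.random.default_rng(0)

def episode(adaptive: bool, steps: int = 50) -> float:
    gain = rng.uniform(0.5, 2.0)        # hidden body parameter, new each episode
    pos, target = 0.0, 1.0
    gain_est = 1.0                      # controller's running estimate
    err_sum = 0.0
    for _ in range(steps):
        action = (target - pos) / (gain_est if adaptive else 1.0)
        new_pos = pos + gain * action   # true dynamics use the *true* gain
        if adaptive and abs(action) > 1e-8:
            gain_est = (new_pos - pos) / action  # infer the body from experience
        pos = new_pos
        err_sum += abs(target - pos)
    return err_sum / steps              # mean tracking error for this body

print("fixed policy:   ", np.mean([episode(False) for _ in range(100)]))
print("adaptive policy:", np.mean([episode(True)  for _ in range(100)]))
```

At deployment, the real robot is just one more draw from the same distribution of wiggles, which is the whole trick of domain randomization.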
If the catbot NN were unable to meta-learn new cat bodies adequately, there would still be no need for ‘play’: it could pool raw copies of experience from the entire population of catbots (or condense them down to embeddings), grabbing snapshots of the catbot minds as they execute new actions, and keep learning towards optimality until the necessary meta-learning is induced. This is impossible for biological brains, which can ‘communicate’ only in the most laughably crude ways like ‘language’; catbots can exchange experiences and train their brains down to the individual neuron level while pooling knowledge across all catbots ever, while humans can exchange only small scraps of declarative knowledge, with hard limitations on what can be done: there is no amount of written text which a chimpanzee can read to become as capable as you, and there is no amount of written text which you can read to become as capable as John von Neumann. (You can’t reach a person’s brain through the ears or eyes, and unfortunately, you can’t reach them in any other way either.)
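A sketch of what that pooling could look like (all names & numbers invented; real fleet-learning systems are far more elaborate): many bodies log raw experience into one shared buffer, a central learner trains on the union, and every body inherits the updated weights, so no individual ever has to ‘play’ its way to competence.

```python
# Fleet-pooled learning sketch: each robot inherits the whole fleet's mileage.
import numpy as np

rng = np.random.default_rng(0)

true_w = rng.normal(size=4)             # the unknown 'skill' to be learned
shared_weights = np.zeros(4)            # one immortal model, copied to all bodies
shared_buffer: list[tuple[np.ndarray, float]] = []

def catbot_collect(n: int = 32) -> None:
    """Each robot contributes its raw (observation, outcome) experience."""
    obs = rng.normal(size=(n, 4))
    outcome = obs @ true_w + rng.normal(0, 0.1, size=n)  # noisy real world
    shared_buffer.extend(zip(obs, outcome))

def central_update(lr: float = 0.05, epochs: int = 20) -> None:
    """The server-farm learner trains on the pooled fleet experience."""
    global shared_weights
    X = np.stack([o for o, _ in shared_buffer])
    y = np.array([r for _, r in shared_buffer])
    for _ in range(epochs):
        grad = X.T @ (X @ shared_weights - y) / len(y)
        shared_weights -= lr * grad

for fleet_round in range(5):
    for _robot in range(10):            # ten bodies gathering in parallel...
        catbot_collect()
    central_update()                    # ...then every body syncs these weights
    err = np.linalg.norm(shared_weights - true_w)
    print(f"round {fleet_round}: weight error {err:.3f}")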
Similarly for ‘curiosity’. Why does anything need to be ‘curious’? Well, it’s similar to play. Curiosity is an emergent drive for particular combinations of agents and environments: you need environments which have novelty/unpredictability to a degree that fixed evolved responses simply cannot exploit (but not so much as to render the Value of Information nil); you need the ability to exploit learning for reward maximization (plants are never ‘curious’, and herbivores aren’t too ‘curious’ either); you need sufficiently long lifespans to pay back the cost of learning (the young are much more curious than the old); and you need memory mechanisms (so you can remember what you discovered at all!)… Remove any of those and curiosity is no longer useful, and ceases to emerge.
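The same point as back-of-the-envelope arithmetic (all numbers invented): curiosity pays only when the Value of Information from an exploratory act, accumulated over the remaining lifespan in which it can be exploited and discounted by however fast it is forgotten, exceeds the act’s cost.

```python
# Value-of-Information breakeven sketch: when does an exploratory act pay off?
def curiosity_pays(explore_cost, info_gain_per_step, remaining_steps,
                   retention=1.0):
    """retention < 1 models forgetting: without memory, VoI decays to nothing."""
    voi = sum(info_gain_per_step * retention**t for t in range(remaining_steps))
    return voi > explore_cost

print(curiosity_pays(5.0, 0.1, 1000))                 # young + memory: True
print(curiosity_pays(5.0, 0.1, 20))                   # old: False
print(curiosity_pays(5.0, 0.1, 1000, retention=0.5))  # no memory: False
```

Shorten the lifespan or remove the memory and the identical act flips from profitable to pointless, which is the sense in which curiosity is contingent rather than universal.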
NNs which share experience across the entire population do not necessarily need much curiosity. Even epsilon-greedy exploration (about the dumbest possible exploration) works surprisingly well for DRL agents, which can gather experience at superhuman wallclock rates with increasingly human-like sample-efficiency. They have no individual lifetimes within which learning must pay for itself. They can develop highly informative priors that individual animals can’t, because those are neither learnable within a single lifetime nor encodable into a genome. They can remove curiosity from individual agents entirely and instead implement exploration at the population level, such as by sampling agents from the overall neural posterior (posterior sampling is an optimal form of explore-vs-exploit at the population level): each agent then has nothing at all that corresponds to ‘curiosity’, and is more like a zealous closed-minded fanatic, suicidally (literally) committed to a particular model of the universe, who will serve as an instructive example to the populace when it succeeds brilliantly or fails spectacularly. The population in question need not even be limited to copies of the agent, because it can learn offline from other populations like humans (there is a huge overhang of human data from which much more can be learned, like what must be hundreds of thousands of years of video footage of people doing things like idiotic stunts; it’s possible that humans, by virtue of their errors or over-exploration, provide so much data that the NN doesn’t need to invest in exploration of its own). And they can do all their learning/exploration in silico in domain-randomized models, so agents can quickly adapt within-lifetime, having meta-learned the Bayes-optimal actions for solving the POMDP (which may superficially look like a ‘curiosity’ drive to the naive observer, but is ruthlessly optimal & efficient, places zero intrinsic value on information, and would avoid being ‘curious’ even when the observer might expect it...). So, a NN may not need ‘curiosity’ at all: the offline datasets may suffice to solve the problem, the in silico training may suffice, a large deployed fleet may encounter enough instances in absolute terms to learn from, simple randomization may provide enough coverage; and if all of that fails, the ideal exploration method for a large population of robot agents pooling experience collectively & syncing model weights may not resemble ‘curiosity’ at all, but look like its exact opposite: an unswerving commitment to acting according to a particular hypothesis, followed until success or destruction, after which the master model updates based on the episode.
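For concreteness, here is a minimal sketch of that population-level posterior sampling, in its textbook form as Thompson sampling on a bandit (the 3-strategy setup is invented): each deployed agent samples one hypothesis from the shared posterior and acts on it without a shred of doubt for its whole episode; all of the ‘exploration’ lives in the central posterior update.

```python
# Posterior sampling (Thompson sampling) sketch: fanatical agents, curious fleet.
import numpy as np

rng = np.random.default_rng(0)
true_p = np.array([0.3, 0.5, 0.7])       # unknown success rates of 3 strategies
alpha = np.ones(3)                       # shared Beta posterior: successes + 1
beta = np.ones(3)                        # shared Beta posterior: failures + 1

for agent in range(1000):
    sampled_p = rng.beta(alpha, beta)    # this agent's sampled world-model...
    arm = int(np.argmax(sampled_p))      # ...which it follows without doubt
    success = rng.random() < true_p[arm] # episode outcome: glory or destruction
    alpha[arm] += success                # the master posterior updates;
    beta[arm] += 1 - success             # no individual was ever 'curious'

print("posterior means:", alpha / (alpha + beta))  # concentrates on strategy 2
```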
Or consider dreaming: whether it is world-model robustifying, offline motor learning, or Tononi’s SHY (the synaptic homeostasis hypothesis), clearly none of these requires all robots to always shut down for 8 hours per day while twitching occasionally. In the first case, it can just be done in parallel on a server farm somewhere; in the second case, it is taken care of by a single pretraining phase followed by runtime meta-learning, and need be done only once per model ever; in the third, it’s not even a problem that has to be solved, because artificial neurons make it easy to add a global regularization or normalization which prevents weights from growing arbitrarily (if that’s a thing that would happen in the first place).
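For the third case, the ‘fix’ really is about one line, hedged here as a generic PyTorch example rather than anyone’s production training loop: L2 weight decay shrinks every weight slightly at every step, so weights never grow without bound and no offline downscaling phase is needed.

```python
# 'Synaptic downscaling' as a continuous regularizer, not a nightly sleep phase.
import torch

model = torch.nn.Linear(16, 4)
# weight_decay applies L2 shrinkage to every parameter at every update,
# so weights can never grow arbitrarily -- no 8-hour shutdown required.
opt = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

x, y = torch.randn(32, 16), torch.randn(32, 4)
for _ in range(100):
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(x), y).backward()
    opt.step()
```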
Lots and lots of possibilities here for agents which do not experience the exact combination of constraints that animals like humans do. Thinking that mouth-talking, nose-breathing, hormone-squirting, pulsing-meat drives exemplified by a jumped-up monkey are somehow universal and profound facts about how intelligences must work is truly anthropomorphizing, and in a bad way.
No wonder that DL agents like GPT-4 do so wonderfully while making zero explicit architectural provision for any of these ‘embodiment’ or ‘homeostatic’ drives. Most of them are just unnecessary, and the ones which are necessary are better learned implicitly end-to-end from optimizing rewards (like the GPT next-token prediction loss, which is ‘behavior cloning’, ie. offline reinforcement learning).
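A minimal sketch of that equivalence (toy tensors, and a linear layer standing in for the transformer): the next-token cross-entropy loss below is literally the imitation-learning objective ‘maximize the log-probability of the demonstrator’s next action given the context’.

```python
# Next-token prediction as behavior cloning: context = state, next token = action.
import torch
import torch.nn.functional as F

vocab, d = 100, 32
embed = torch.nn.Embedding(vocab, d)
head = torch.nn.Linear(d, vocab)

tokens = torch.randint(0, vocab, (8, 17))        # 'demonstrator' trajectories
states, actions = tokens[:, :-1], tokens[:, 1:]  # each context -> its next token
logits = head(embed(states))                     # stand-in for a transformer
# Cross-entropy on the expert's action at every position: behavior cloning.
loss = F.cross_entropy(logits.reshape(-1, vocab), actions.reshape(-1))
loss.backward()
print(loss.item())
```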