Some real examples of gradient hacking

Oliver Sourbut, Nov 22, 2021, 12:11 AM


Gradient Hacking AI Mesa-Optimization Adaptation Executors MATS Program Evolution

Speculation and an invitation to make further suggestions or clarifications

I’ve been wondering what might count as an analogous example of gradient hacking, if any has ever occurred, and whether it could be useful to collect such analogies (while taking care not to overfit our expectations to them). Presumably the only examples so far will be for genetic natural selection (perhaps particularly in humans), and perhaps cultural/memetic selection (leaving aside the technicality of whether natural selection has a gradient per se).

I’m looking for deliberate behaviours which are intended to affect the fitness landscape of the outer process.

In biology

Sexual selection

I think examples of sexual selection are adjacent but don’t qualify as gradient hacking. While they constitute an execution of a policy whose main legible effect is indeed on natural selection, presumably in nearly all cases there is not a deliberate process of influence on the natural selection process as such. The policies themselves are encoded by genetic natural selection for the most part, so I’d say this qualifies more as one kind of divergent training trajectory of the outer training process. There remains a plausible analogy that at least some animals and humans engaging in sexual selection are doing so ‘because’ they are inner misaligned (they are agents with preferences which are mere proxies for the ‘true goals’ of genetic natural selection and sexual selection is just one such proxy preference).

Examples of sexual selection where the preferences are socially learned seem closer, and there are examples in both animals and humans. Here the policy is adapted ‘at runtime’: the genetic policy might be something like ‘choose a mate which my society codes as attractive’, which incidentally presumably encourages other genetic policies like ‘try to conform to my society’s attractiveness code’ and maybe even ‘try to nudge my society’s attractiveness code towards me and my kin’. Even so, it is still not carried out with the deliberate runtime intention of affecting the outer process of natural selection. It is a deliberate action which does in fact have this side effect, but the intended goal is the proxy.

Eugenics

Perhaps the first and only qualifying attempts at actual gradient hacking are efforts towards eugenics practiced by humans, up to and including some instances of (attempted or successful) genocide. Probably since prehistory and certainly since antiquity we’ve had some ‘mesa’/‘runtime’ understanding of heritability, in contrast to (presumably) all other animals. Most such behaviours are deliberative and involve explicit modelling of the process of heritability, with the stated intention (at least nominally) being to affect the trajectory of the natural selection process (stated in antiquity in very crude but still essentially-correct terms).

What if (I think this is not very credible but at least plausible) a sufficient cause of all eugenics attempts is in fact a genetically-naturally-selected policy schema of ‘come up with whatever excuses you can to favour (the success/reproduction/existence of) your kin’? (And the rest of the fluff is just instrumental persuasion and implementation attempts.) Would eugenics be downgraded all the way from being gradient hacking to ‘mere’ proxy alignment?

In society

This gets more speculative.

Eumemics

What if we adapt the previous hypothesis about genetic eugenics to… eumemics? It is harder to locate the object of memetic selection, and I’m not sure if it’s right to identify it with individual organisms the way we can often sloppily get away with doing for biological genetic natural selection. Maybe it is hard to call humans ‘inner misaligned with respect to memetic natural selection’ with a straight face. But if so, do ‘eumemic’ attempts qualify as instances of gradient hacking?

If a meme or meme-complex encodes a behaviour of deliberately affecting the meme fitness landscape in ways unrelated to the particular meme(plex), does that qualify as gradient hacking? If so, could boycotts, cancellation, some reading of basically every ethical theory, and many other ideologies besides qualify?

Same question as above: what if such hackery is in fact just the side-effect of memetic natural selection acting on memeplexes which tend to encode behaviours promoting themselves (with a side-effect of also affecting the meme fitness landscape in other ways)? If so, is this simply proxy alignment rather than gradient hacking?
