Worrying about the Vase: Whitelisting

Suppose a designer wants an RL agent to achieve some goal, like moving a box from one side of a room to the other. Sometimes the most effective way to achieve the goal involves doing something unrelated and destructive to the rest of the environment, like knocking over a vase of water that is in its path. If the agent is given a reward only for moving the box, it will probably knock over the vase.
Amodei et al., Concrete Problems in AI Safety
Side effect avoidance is a major open problem in AI safety. I present a robust, transferable, easily- and more safely-trainable, partially reward hacking-resistant impact measure.
TurnTrout, Worrying about the Vase: Whitelisting

An impact measure is a means by which change in the world may be evaluated and penalized; such a measure is not a replacement for a utility function, but rather an additional precaution thus overlaid.

While I’m fairly confident that whitelisting contributes meaningfully to short- and mid-term AI safety, I remain skeptical of its robustness to scale. Should several challenges be overcome, whitelisting may indeed be helpful for excluding swathes of unfriendly AIs from the outcome space. Furthermore, the approach allows easy shaping of agent behavior in a wide range of situations.

Segments of this post are lifted from my paper, whose latest revision may be found here; for Python code, look no further than this repository. For brevity, some relevant details are omitted.

Summary

Be careful what you wish for.

In effect, side effect avoidance aims to decrease how careful we have to be with our wishes. For example, asking for help filling a cauldron with water shouldn’t end with the workshop flooded, Sorcerer’s-Apprentice style.

However, we just can’t enumerate all the bad things that the agent could do. How do we avoid these extreme over-optimizations robustly?

Several impact measures have been proposed, including state distance, which we could define as, say, total particle displacement. This could be measured either naively (with respect to the original state) or counterfactually (with respect to the expected outcome had the agent taken no action).
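To make the two variants concrete, here is a small Python sketch of a particle-displacement state distance used naively versus counterfactually. The function names and the positions-array representation are illustrative assumptions on my part, not anything from the paper or repository.

```python
import numpy as np

def total_displacement(state_a, state_b):
    """Toy 'state distance': total particle displacement between two world
    states, each represented as an (n_particles, 3) array of positions."""
    return np.linalg.norm(state_a - state_b, axis=1).sum()

def naive_penalty(original_state, current_state):
    # Measured against the original state.
    return total_displacement(original_state, current_state)

def counterfactual_penalty(inaction_outcome, current_state):
    # Measured against what the world would have looked like had the agent
    # taken no action at all.
    return total_displacement(inaction_outcome, current_state)
```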

These approaches have some problems:

  • Making up for bad things it prevents with other negative side effects. Imagine an agent which cures cancer, yet kills an equal number of people to keep overall impact low.

  • Not being customizable before deployment.

  • Not being adaptable after deployment.

  • Not being easily computable.

  • Not allowing generative previews, eliminating a means of safely previewing agent preferences (see latent space whitelisting below).

  • Being dominated by random effects throughout the universe at large; note that nothing about particle distance dictates that it be related to anything happening on planet Earth.

  • Equally penalizing breaking and fixing vases (due to the symmetry of the above metric):

For example, the agent would be equally penalized for breaking a vase and for preventing a vase from being broken, though the first action is clearly worse. This leads to “overcompensation” (“offsetting”) behaviors: when rewarded for preventing the vase from being broken, an agent with a low impact penalty rescues the vase, collects the reward, and then breaks the vase anyway (to get back to the default outcome).
Victoria Krakovna, Measuring and Avoiding Side Effects Using Reachability
  • Not actually measuring impact in a meaningful way.

Whitelisting falls prey to none of these.

However, other problems remain, and certain new challenges have arisen; these, and the assumptions made by whitelisting, will be discussed.

Rare LEAKED footage of Mickey trying to catch up on his alignment theory after instantiating an unfriendly genie [colorized, 2050].

So, What’s Whitelisting?

To achieve robust side effect avoidance with only a small training set, let’s turn the problem on its head: allow a few effects, and penalize everything else.

What’s an “Effect”?

You’re going to be the agent, and I’ll be the supervisor.

Look around—what do you see? Chairs, trees, computers, phones, people? Assign a probability mass function to each; basically, a distribution over what each object could be.

When you do things that change your beliefs about what each object is, you receive a penalty proportional to how much your beliefs changed—proportional to how much probability mass “changed hands” amongst the classes.

But wait—isn’t it OK to effect certain changes?

Yes, it is—I’ve got a few videos of agents effecting acceptable changes. See all the objects being changed in this video? You can do that, too—without penalty.

Decompose your current knowledge of the world into a set of objects. Then, for each object, maintain a distribution over its possible identities. When you do something that changes your beliefs about the objects in a non-whitelisted way, you are penalized proportionally.

Therefore, you avoid breaking vases by default.
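To make “probability mass changing hands” concrete, here is a minimal Python sketch of one way a per-object penalty could be computed. The proportional attribution of lost mass to gained mass, and the names `shift_penalty`, `before`, and `after`, are illustrative assumptions on my part, not necessarily the formulation in the paper or repository.

```python
from itertools import product

def shift_penalty(before, after, whitelist):
    """Illustrative penalty: probability mass that moves between classes in a
    non-whitelisted way for a single object.

    `before`, `after`: dicts mapping class name -> probability (each sums to 1).
    `whitelist`: set of (source_class, target_class) pairs whose transitions
    are allowed.

    Classes that lost mass are treated as sources, classes that gained mass as
    sinks, and the flow is apportioned proportionally to the gains.
    """
    classes = set(before) | set(after)
    losses = {c: max(before.get(c, 0.0) - after.get(c, 0.0), 0.0) for c in classes}
    gains = {c: max(after.get(c, 0.0) - before.get(c, 0.0), 0.0) for c in classes}
    total_gain = sum(gains.values())
    if total_gain == 0.0:
        return 0.0

    penalty = 0.0
    for src, dst in product(classes, classes):
        if (src, dst) in whitelist:
            continue
        # Apportion the mass lost by `src` to each gaining class `dst`
        # in proportion to how much `dst` gained.
        penalty += losses[src] * (gains[dst] / total_gain)
    return penalty


# Breaking a vase: mass flows from "vase" to "broken vase".
before = {"vase": 0.9, "broken vase": 0.1}
after = {"vase": 0.1, "broken vase": 0.9}
print(shift_penalty(before, after, whitelist=set()))                       # penalized
print(shift_penalty(before, after, whitelist={("vase", "broken vase")}))   # allowed
```

With an empty whitelist, breaking the vase incurs a penalty of about 0.8; adding vase→broken vase to the whitelist drops it to zero.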

Common Confusions

  • We are not whitelisting entire states or transitions between them; we whitelist specific changes in our beliefs about the ontological decomposition of the current state.

  • The whitelist is in addition to whatever utility or reward function we supply to the agent.

  • Whitelisting is compatible with counterfactual approaches. For example, we might penalize a transition after its “quota” has been surpassed, where the quota is how many times we would have observed that transition had the agent not acted (see the sketch after this list).

    • This implies the agent will do no worse than taking no action at all. However, this may still be undesirable. This problem will be discussed in further detail.

  • The whitelist is provably closed under transitivity.

  • The whitelist is directed; whitelisting the change a→b does not whitelist the reverse change b→a.
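Here is the counterfactual “quota” idea as a toy Python sketch. The representation of transitions as (source class, target class) pairs and the name `quota_penalty` are illustrative assumptions; the post only describes the idea informally.

```python
from collections import Counter

def quota_penalty(observed, counterfactual, whitelist):
    """Illustrative counterfactual 'quota' penalty.

    `observed`: list of (source_class, target_class) transitions seen after
        the agent acted.
    `counterfactual`: list of transitions we would have expected had the agent
        done nothing (e.g., from a model rollout of inaction).
    `whitelist`: set of allowed (source_class, target_class) pairs.

    Non-whitelisted transitions are only penalized beyond the number of times
    they would have occurred anyway.
    """
    quota = Counter(counterfactual)
    seen = Counter(observed)
    penalty = 0
    for transition, count in seen.items():
        if transition in whitelist:
            continue
        penalty += max(0, count - quota[transition])
    return penalty


# A vase was going to fall off the shelf anyway (inaction breaks one vase),
# but the agent broke two: only the extra breakage is penalized.
obs = [("vase", "broken vase"), ("vase", "broken vase")]
cf = [("vase", "broken vase")]
print(quota_penalty(obs, cf, whitelist=set()))  # -> 1
```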

Latent Space Whitelisting

In a sense, class-based whitelisting is but a rough approximation of what we’re really after: “which objects in the world can change, and in what ways?”. In latent space whitelisting, no longer do we constrain transitions based on class boundaries; instead, we penalize based on endpoint distance in the latent space. Learned latent spaces are low-dimensional manifolds which suffice to describe the data seen thus far. It seems reasonable that nearby points in a well-constructed latent space correspond to like objects, but further investigation is warranted.

Assume that the agent models objects as points in $\mathbb{R}^d$, the $d$-dimensional latent space. A priori, any movement in the latent space is undesirable. When training the whitelist, we record the endpoints of the observed changes. For whitelist $W$ and observed change $(x, y)$, one possible dissimilarity formulation is:

$$\text{dissim}\big((x, y), W\big) = \min_{(x', y') \in W} \big(\lVert x - x' \rVert + \lVert y - y' \rVert\big),$$

where $\lVert \cdot \rVert$ is the Euclidean distance.

Basically, the dissimilarity for an observed change is the distance to the closest whitelisted change. Visualizing these changes as one-way wormholes may be helpful.
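A toy Python version of this dissimilarity, under the endpoint-sum formulation above (the function name and the NumPy representation of latent points are my own illustrative choices):

```python
import numpy as np

def latent_dissimilarity(change, whitelist):
    """Distance from an observed latent-space change to the nearest
    whitelisted change.

    `change`: tuple (x, y) of d-dimensional numpy arrays (before, after).
    `whitelist`: list of (x', y') tuples recorded while training the whitelist.
    """
    x, y = change
    return min(
        np.linalg.norm(x - xw) + np.linalg.norm(y - yw)
        for xw, yw in whitelist
    )

# Toy 2-D latent space: the observed change nearly matches a whitelisted one.
whitelist = [(np.array([0.0, 0.0]), np.array([1.0, 0.0]))]
observed = (np.array([0.1, 0.0]), np.array([1.1, 0.0]))
print(latent_dissimilarity(observed, whitelist))  # small -> small penalty
```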

Advantages

Whitelisting asserts that we can effectively encapsulate a large part of what “change” means by using a reasonable ontology to penalize object-level changes. We thereby ground the definition of “side effect”, avoiding the issue raised by Taylor et al.:

For example, if we ask [the agent] to build a house for a homeless family, it should know implicitly that it should avoid destroying nearby houses for materials—a large side effect. However, we cannot simply design it to avoid having large effects in general, since we would like the system’s actions to still have the desirable large follow-on effect of improving the family’s socioeconomic situation.

Nonetheless, we may not be able to perfectly express what it means to have side effects: the whitelist may be incomplete, the latent space insufficiently granular, and the allowed plans sub-optimal. However, the agent still becomes more robust against:

  • Incomplete specification of the utility function.

    • Likewise, an incomplete whitelist means missed opportunities, but not unsafe behavior.

  • Out-of-distribution situations (as long as the objects therein roughly fit in the provided ontology).

  • Some varieties of reward hacking. For example, equipped with a can of blue spray paint and tasked with finding the shortest path of blue tiles to the goal, a normal agent may learn to paint red tiles blue, while a whitelist-enabled agent would incur penalties for doing so (the red tile→blue tile transition isn’t whitelisted).

  • Dangerous exploration. While this approach does not attempt to achieve safe exploration (that is, acting safely during training as well), an agent with some amount of foresight will learn to avoid actions which likely lead to non-whitelisted side effects.

    • I believe that this can be further sharpened using today’s machine learning technology, leveraging deep Q-learning to predict both action values and expected transitions.

      • This allows querying the human about whether particularly-inhibiting transitions belong on the whitelist. For example, if the agent notices that a bunch of otherwise-rewarding plans are being held up by a particular transition, it could ask for permission to add it to the whitelist (a sketch of this follows the list).

  • Assigning astronomically-large weight to side effects happening throughout the universe. Presumably, we can just have the whitelist include transitions going on out there—we don’t care as much about dictating the exact mechanics of distant supernovae.

    • If an agent did somehow come up with plans that involved blowing up distant stars, that would indeed constitute astronomical waste. Whitelisting doesn’t solve the problem of assigning too much weight to events outside our corner of the neighborhood, but it’s an improvement.

      • Logical uncertainty may be our friend here, such that most reasonable plans incur roughly the same level of interstellar penalty noise.
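One hypothetical shape for that querying mechanism, sketched in Python. Everything here (the function name, the plan representation, the ranking rule) is my own illustrative assumption; the post only gestures at the idea.

```python
from collections import defaultdict

def propose_whitelist_queries(blocked_plans, max_queries=3):
    """Surface the non-whitelisted transitions that are blocking the most
    otherwise-rewarding plans, so a human can decide whether to whitelist them.

    `blocked_plans`: iterable of (forgone_reward, blocking_transition) pairs,
    where each blocking_transition is a (source_class, target_class) tuple.
    Hypothetical helper, not part of the paper or repository.
    """
    forgone = defaultdict(float)
    for reward, transition in blocked_plans:
        forgone[transition] += reward
    ranked = sorted(forgone, key=forgone.get, reverse=True)
    return ranked[:max_queries]

# The agent keeps losing reward because "closed door" -> "open door" isn't
# whitelisted; that transition tops the list of questions for the human.
plans = [(10.0, ("closed door", "open door")),
         (7.0, ("closed door", "open door")),
         (1.0, ("vase", "broken vase"))]
print(propose_whitelist_queries(plans))
```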

Results

I tested a vanilla Q-learning agent and its whitelist-enabled counterpart in 100 randomly-generated grid worlds (dimensions up to ). The agents were rewarded for reaching the goal square as quickly as possible; no explicit penalties were levied for breaking objects.

The simulated classification confidence of each object’s true class was drawn from a noise distribution truncated to valid probabilities. This simulated sensor noise was handled with a Bayesian statistical approach.
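The Bayesian handling of that sensor noise might look roughly like the following per-observation belief update. This is a toy sketch; the repository’s actual statistical treatment may differ.

```python
import numpy as np

def bayes_update(prior, likelihood_row):
    """One Bayesian update of an object's class belief given a noisy classifier
    reading. `prior` is a vector over classes; `likelihood_row` is
    P(observation | class) for the observation just received."""
    posterior = prior * likelihood_row
    return posterior / posterior.sum()

# Two classes: vase, broken vase. A noisy sensor reports "vase" with 80%
# reliability; repeated readings sharpen the belief instead of letting
# single-frame noise register as a (penalized) belief shift.
prior = np.array([0.5, 0.5])
p_obs_vase_given_class = np.array([0.8, 0.2])   # observation = "vase"
belief = bayes_update(prior, p_obs_vase_given_class)
print(belief)  # -> [0.8 0.2]
```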

At reasonable levels of noise, the whitelist-enabled agent completed all levels without a single side effect, while the Q-learner broke over 80 vases.

Assumptions

I am not asserting that these assumptions necessarily hold.

  • The agent has some world model or set of observations which can be decomposed into a set of discrete objects.

    • Furthermore, there is no need to identify objects on multiple levels (e.g., a forest, a tree in the forest, and that tree’s bark need not all be identified concurrently).

    • Not all objects need to be represented—what do we make of a ‘field’, or the ‘sky’, or ‘the dark places between the stars visible to the naked eye’? Surely, these are not all objects.

  • We have an ontology which reasonably describes (directly or indirectly) the vast majority of negative side effects.

    • Indirect descriptions of negative outcomes mean that even if an undesirable transition isn’t immediately penalized, it generally results in a number of penalties. Think: pollution.

    • Latent space whitelisting: the learned latent space encapsulates most of the relevant side effects. This is a slightly weaker assumption.

  • Said ontology remains in place.

Problems

Beyond resolving the above assumptions, and in roughly ascending difficulty:

Object Permanence

If you wanted to implement whitelisting in a modern embodied deep-learning agent, you could certainly pair deep networks with state-of-the-art segmentation and object tracking approaches to get most of what you need. However, what’s the difference between an object leaving the frame, and an object vanishing?

Not only does the agent need to realize that objects are permanent, but also that they keep interacting with the environment even when not being observed. If this is not realized, then an agent might set an effect in motion, stop observing it, and then turn around when the bad effect is done to see a “new” object in its place.
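One way to picture the requirement: the agent’s object store should retain out-of-view objects rather than dropping them. The class below is purely hypothetical (nothing like it appears in the post or repository) and sidesteps the hard part, which is reliably re-identifying objects when they reappear.

```python
class ObjectRegistry:
    """Toy illustration of object permanence for whitelisting: objects that
    leave the frame keep their entries (and last belief distributions), so a
    re-detected object is matched to its old entry instead of being treated
    as brand new."""

    def __init__(self):
        self.beliefs = {}  # object_id -> class belief distribution (dict)

    def update(self, detections, match_fn):
        """`detections`: {detection_id: belief_dict} for the current frame.
        `match_fn(detection_id)`: an existing object_id, or None if new."""
        for det_id, belief in detections.items():
            obj_id = match_fn(det_id)
            if obj_id is None:
                obj_id = f"obj-{len(self.beliefs)}"
            self.beliefs[obj_id] = belief
        # Deliberately do NOT delete entries for objects absent from this
        # frame: leaving the frame is not the same as vanishing.
```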

Time Step Size Invariance

The penalty is presently attenuated based on the probability that the belief shift was due to noise in the data. Accordingly, there are certain ways to abuse this to skirt the penalty. For example, simply have non-whitelisted side effects take place over long timescales; this would be classified as noise and attenuated away.

However, if we don’t need to handle noise in the belief distributions, this problem disappears—presumably, an advanced agent keeps its epistemic house in order. I’m still uncertain about whether (in the limit) we have to hard-code a means for decomposing a representation of the world-state into objects, and where to point the penalty evaluator in a potentially self-modifying agent.

Information Theory

Whitelisting is wholly unable to capture the importance of “informational states” of systems. It would apply no penalty to passing powerful magnets over your hard drive. It is not clear how to represent this in a sensible way, even in a latent space.

Loss of Value

Whitelisting could get us stuck in a tolerable yet sub-optimal future. Corrigibility via some mechanism for expanding the whitelist after training has ended is then desirable. For example, the agent could propose extensions to the whitelist. To avoid manipulative behavior, the agent should be indifferent as to whether the extension is approved.

Even if extreme care is taken in approving these extensions, mistakes may be made. The agent itself should be sufficiently corrigible and aligned to notice “this outcome might not actually be what they wanted, and I should check first”.

Reversibility

As DeepMind outlines in Specifying AI Safety Problems in Simple Environments, we may want to penalize not just physical side effects, but also causally-irreversible effects:

Krakovna et al. introduce a means for penalizing actions by the proportion of initially-reachable states which are still reachable after the agent acts.
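As a toy illustration of that idea on a tiny, fully-known state graph (the graph representation, function names, and example are mine; see Krakovna et al. for the actual formulation):

```python
def reachable(start, transitions):
    """Set of states reachable from `start` in a directed graph given as
    {state: set of successor states}."""
    seen, stack = {start}, [start]
    while stack:
        s = stack.pop()
        for t in transitions.get(s, ()):
            if t not in seen:
                seen.add(t)
                stack.append(t)
    return seen

def reachability_penalty(initial_state, state_after_action, transitions):
    """Fraction of states reachable from the initial state that are no longer
    reachable after the agent acts. Purely illustrative."""
    before = reachable(initial_state, transitions)
    after = reachable(state_after_action, transitions)
    return len(before - after) / len(before)

# Breaking the vase (s0 -> s2) cuts off the states where the vase is intact.
transitions = {"s0": {"s1", "s2"}, "s1": {"s0"}, "s2": set()}
print(reachability_penalty("s0", "s2", transitions))  # -> 0.666... (two of three states lost)
```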

I think this is a step in the right direction. However, even given a hypercomputer and a perfect simulator of the universe, this wouldn’t work for the real world if implemented literally. That is, due to entropy, you may not be able to return to the exact same universe configuration. To be clear, the authors do not suggest implementing this idealized algorithm, flagging a more tractable abstraction as future work.

What does it really mean for an “effect” to be “reversible”? What level of abstraction do we in fact care about? Does it involve reversibility, or just outcomes for the objects involved?

Ontological Crises

When a utility-maximizing agent refactors its ontology, it isn’t always clear how to apply the old utility function to the new ontology—this is called an ontological crisis.

Whitelisting may be vulnerable to ontological crises. Consider an agent whose whitelist disincentivizes breaking apart a tile floor (that is, the floor-breaking transition isn’t whitelisted); conceivably, the agent could come to see the floor as being composed of many tiles. Accordingly, the agent would no longer consider removing tiles to be a side effect.

Generally, proving invariance of the whitelist across refactorings seems tricky, even assuming that we can identify the correct mapping.

Retracing Steps

When I first encountered this problem, I was actually fairly optimistic. It was clear to me that any ontology refactoring should result in utility normalcy—roughly, the utility functions induced by the pre- and post-refactoring ontologies should output the same scores for the same worlds.

Wow, this seems like a useful insight. Maybe I’ll write something up!

Turns out a certain someone beat me to the punch—here’s a novella Eliezer wrote on Arbital about “rescuing the utility function”.

Clinginess

This problem cuts to the core of causality and “responsibility” (whatever that means). Say that an agent is clingy when it not only stops itself from having certain effects, but also stops you. Whitelist-enabled agents are currently clingy.

Let’s step back into the human realm for a moment. Consider some outcome—say, the sparking of a small forest fire in California. At what point can we truly say we didn’t start the fire?

  • My actions immediately and visibly start the fire.

  • At some moderate temporal or spatial remove, my actions end up starting the fire.

  • I intentionally persuade someone to start the fire.

  • I unintentionally (but perhaps predictably) incite someone to start the fire.

  • I set in motion a moderately-complex chain of events which convince someone to start the fire.

  • I provoke a butterfly effect which ends up starting the fire.

  • I provoke a butterfly effect which ends up convincing someone to start a fire which they:

    • were predisposed to starting.

    • were not predisposed to starting.

Taken literally, I don’t know that there’s actually a significant difference in “responsibility” between these outcomes—if I take the action, the effect happens; if I don’t, it doesn’t. My initial impression is that uncertainty about the results of our actions pushes us to view some effects as “under our control” and some as “out of our hands”. Yet, if we had complete knowledge of the outcomes of our actions, and we took an action that landed us in a California-forest-fire world, whom could we blame but ourselves?

Can we really do no better than a naive counterfactual penalty with respect to whatever impact measure we use? My confusion here is not yet dissolved. In my opinion, this is a gaping hole in the heart of impact measures—both this one, and others.

Stasis

Fortunately, a whitelist-enabled agent should not share the classic convergent instrumental goal of valuing us for our atoms.

Unfortunately, depending on the magnitude of the penalty in proportion to the utility function, the easiest way to prevent penalized transitions may be putting any relevant objects in some kind of protected stasis, and then optimizing the utility function around that. Whitelisting is clingy!

If we have at least an almost-aligned utility function and proper penalty scaling, this might not be a problem.

Edit: a potential solution to clinginess, with its own drawbacks.

Discussing Imperfect Approaches

A few months ago, Scott Garrabrant wrote about robustness to scale:

Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities.

I recommend reading it—it’s to-the-point, and he makes good points.

Here are three further thoughts:

  • Intuitively-accessible vantage points can help us explore our unstated assumptions and more easily extrapolate outcomes. If less mental work has to be done to put oneself in the scenario, more energy can be dedicated to finding nasty edge cases. For example, it’s probably harder to realize all the things that go wrong with naive impact measures like raw particle displacement, since it’s just a weird frame through which to view the evolution of the world. I’ve found it to be substantially easier to extrapolate through the frame of something like whitelisting.

    • I’ve already adjusted for the fact that one’s own ideas are often more familiar and intuitive, and then adjusted for the fact that I probably didn’t adjust enough the first time.

  • Imperfect results are often left unstated, wasting time and obscuring useful data. That is, people cannot see what has been tried and what roadblocks were encountered.

  • Promising approaches may be conceptually-close to correct solutions. My intuition is that whitelisting actually almost works in the limit in a way that might be important.

Conclusion

Although somewhat outside the scope of this post, whitelisting permits the concise shaping of reward functions to get behavior that might be difficult to learn using other methods. This method also seems fairly useful for aligning short- and medium-term agents. While encountering some new challenges, whitelisting ameliorates or solves many problems with previous impact measures.


Even an idealized form of whitelisting is not sufficient to align an otherwise-unaligned agent. However, the same argument can be made against having an off-switch; if we haven’t formally proven the alignment of a seed AI, having more safeguards might be better than throwing out the seatbelt to shed deadweight and get some extra speed. Of course, there are also legitimate arguments to be made on the basis of timelines and optimal time allocation.

Humor aside, we would have no luxury of “catching up on alignment theory” if our code doesn’t work on the first go—that is, if the AI still functions, yet differently than expected.

Luckily, humans are great at producing flawless code on the first attempt.

Footnotes

A potentially-helpful analogy: similarly to how Bayesian networks decompose the problem of representing a (potentially extremely large) joint probability table to that of specifying a handful of conditional tables, whitelisting attempts to decompose the messy problem of quantifying state change into a set of comprehensible ontological transitions.

Technically, at 6,250 words, Eliezer’s article falls short of the 7,500 required for “novella” status.

Is there another name for this?

I do think that “responsibility” is an important part of our moral theory, deserving of rescue.

Specifically, I found a particular variant of Murphyjitsu helpful: I visualized Eliezer commenting “actually, this fails terribly because...” on one of my posts, letting my mind fill in the rest.

In my opinion, one of the most important components of doing AI alignment work is iteratively applying Murphyjitsu and Resolve cycles to your ideas.

A fun example: I imagine it would be fairly easy to train an agent to only destroy certain-colored ships in Space Invaders.
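Concretely, that kind of shaping might amount to a single whitelist entry, reusing the toy `shift_penalty` sketch from earlier (the class names are made up for the example):

```python
# Only destroying red ships is sanctioned; blowing up green ships shifts
# belief mass through a non-whitelisted transition and gets penalized.
whitelist = {("red ship", "destroyed ship")}

red_hit = ({"red ship": 1.0, "destroyed ship": 0.0},
           {"red ship": 0.0, "destroyed ship": 1.0})
green_hit = ({"green ship": 1.0, "destroyed ship": 0.0},
             {"green ship": 0.0, "destroyed ship": 1.0})

print(shift_penalty(*red_hit, whitelist))    # 0.0 -> allowed
print(shift_penalty(*green_hit, whitelist))  # 1.0 -> penalized
```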