On Media Synthesis: An Essay on The Next 15 Years of Creative Automation

One of my fa­vorite child­hood mem­o­ries in­volves some­thing that tech­ni­cally never hap­pened. When I was ten years old, my wak­ing life re­volved around car­toons— flashy, col­or­ful, quirky shows that I could find in con­ve­nient thirty-minute blocks on a host of ca­ble chan­nels. This love was so strong that I thought to my­self one day, “I can cre­ate a car­toon.” I’d been writ­ing lit­tle non­sense sto­ries and draw­ing (badly) for years by that point, so it was a no-brainer to my ten-year-old mind that I ought to make some­thing similar to (but bet­ter than) what I saw on tele­vi­sion.
The log­i­cal course of ac­tion, then, was to search “How to make a car­toon” on the in­ter­net. I saw noth­ing worth my time that I could eas­ily un­der­stand, so I re­al­ized the trick to play— I would have to open a text file, type in my de­scrip­tion of the car­toon, and then feed it into a Car­toon-a-Tron. Voilà! A 30-minute car­toon!
Now I must add that this was in 2005, which ought to com­mu­ni­cate how suc­cess­ful my an­i­ma­tion ca­reer was.

Two years later, I dis­cov­ered an an­i­ma­tion pro­gram at the lo­cal Wal-Mart and be­lieved that I had fi­nally found the pro­gram I had hith­erto been un­able to find. When I rode home, I felt triumphant in the knowl­edge that I was about to be­come a fa­mous car­toon­ist. My only worry was whether the disk would have all the voices I wanted preloaded.

I used the pro­gram once and have never touched it since. Around that same time, I did re­search on how car­toons were made— though I was aware some re­quired many draw­ings, I was not clear on the en­tire pro­cess un­til I read a fairly de­tailed book filled with tech­ni­cal in­dus­try jar­gon. The thought of draw­ing thou­sands of images of sin­gu­lar char­ac­ters, let alone en­tire scenes, sounded ex­cru­ci­at­ing. This did not be­gin to fully en­cap­su­late what one needed to cre­ate a com­pe­tent piece of an­i­ma­tion— from brain­storm­ing, sto­ry­board­ing, and script edit­ing all the way to vo­cal takes, mu­sic pro­duc­tion, au­di­tory stan­dards, post-pro­duc­tion edit­ing, union rules, and more, the re­al­ity shat­tered ev­ery bit of naïveté I held prior about the ‘ease’ of cre­at­ing a sin­gle 30-minute car­toon (let alone my cav­al­cade of con­cepts com­ing and go­ing with the sea­sons).

In the most bizarre of twists, my ten- and twelve-year-old selves may have been onto something; their only mistake was holding these ideas decades too soon.
In the early 2010s, progress in the field of ma­chine learn­ing be­gan to ac­cel­er­ate ex­po­nen­tially as deep learn­ing went from an ob­scure lit­tle break­through to the fore­front of data sci­ence. Neu­ral net­works— once a non­starter in the field of ar­tifi­cial in­tel­li­gence— un­der­went a “grunge mo­ment” and quickly ush­ered in a swel­ter­ing new AI sum­mer which we are still in.

In very short form, neural networks are sequences of large matrix multiplications interleaved with nonlinear functions, and machine learning is basically statistical modeling fitted by following gradients. Deep learning stacks many such layers and parses through a frankly stupid amount of data to optimize its outputs.
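To make the "matrix multiplications with nonlinear functions" description concrete, here is a minimal sketch of a two-layer network's forward pass in plain Python. The weights, the layer sizes, and the helper names (`matvec`, `relu`, `forward`) are hypothetical and hand-picked for illustration; a real network would learn its weights from data and use an optimized numerical library.

```python
def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector (a list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """The nonlinear function: negative values are clipped to zero."""
    return [max(0.0, v) for v in vector]


def forward(layers, x):
    """Pass x through each (weights, biases) layer in turn.

    ReLU is applied between layers but not after the last one.
    """
    for i, (weights, biases) in enumerate(layers):
        x = [z + b for z, b in zip(matvec(weights, x), biases)]
        if i < len(layers) - 1:
            x = relu(x)
    return x


# Hypothetical hand-picked weights: 2 inputs -> 2 hidden units -> 1 output.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, -1.0]),
    ([[2.0, 1.0]], [0.5]),
]

print(forward(layers, [3.0, 1.0]))
```

"Deep" learning, in this picture, simply means that the `layers` list is very long and the matrices are very large.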
As it turns out, deep learn­ing is very com­pe­tent at cer­tain sub-cog­ni­tive tasks— things we rec­og­nize as lan­guage mod­el­ing, con­cep­tual un­der­stand­ing, and image clas­sifi­ca­tion. In that re­gard, it was only a mat­ter of time be­fore we used this tool to gen­er­ate me­dia. Syn­the­size it, if you will.

Me­dia syn­the­sis is an um­brella term that in­cludes deep­fakes, style trans­fer, text syn­the­sis, image syn­the­sis, au­dio ma­nipu­la­tion, video gen­er­a­tion, text-to-speech, text-to-image, au­topara­phras­ing, and more.

AI has been used to gen­er­ate me­dia for quite some time— speech syn­the­sis goes back to the 1950s, Markov chains have stitched to­gether oc­ca­sion­ally-quasi-co­her­ent po­ems and short sto­ries for decades, and Pho­to­shop in­volves al­gorith­mic changes to pre­ex­ist­ing images. If you want to get very figu­ra­tive, some me­chan­i­cal au­toma­tons from cen­turies prior could write and do cal­lig­ra­phy.
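For a sense of how simple the Markov chain approach is, here is a minimal sketch of a word-level text generator. The tiny corpus and the function names (`build_chain`, `generate`) are made up for illustration; hobbyist generators typically work the same way but on far larger texts, and often condition on more than one preceding word.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Walk the chain from `start`, picking each next word at random."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word was never followed by anything
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


# Hypothetical toy corpus; any plain text would do.
corpus = "the cat sat on the mat and the dog sat on the rug"
chain = build_chain(corpus)
print(generate(chain, "the", 8, random.Random(0)))
```

Each word depends only on the one before it, which is why the output tends to be locally plausible and globally aimless.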
It wasn’t until roughly the 2010s that the nascent field of “media synthesis” truly began to grow, thanks to the creation of generative adversarial networks (introduced in 2014 by Ian Goodfellow and colleagues, though Jürgen Schmidhuber described related adversarial ideas in the 1990s). Early successes in the wider field included ‘DeepDream’, an incredibly psychedelic style of image synthesis that bears some resemblance to schizophrenic hallucinations: networks would hallucinate swirling patterns filled with disembodied eyes, doglike faces, and tentacles because they were trained on certain images.
When it came to gen­er­at­ing more re­al­is­tic images, GANs im­proved rapidly: in 2016, Google’s image gen­er­a­tion and clas­sifi­ca­tion sys­tem proved able to cre­ate a num­ber of rec­og­niz­able ob­jects rang­ing from ho­tel rooms to piz­zas. The next year, image syn­the­sis im­proved to the point that GANs could cre­ate re­al­is­tic high-defi­ni­tion images.

Neu­ral net­works weren’t figur­ing out just images— in 2016, UK-based Google Deep­Mind un­veiled WaveNet for the syn­the­sis of re­al­is­tic au­dio. Though it was meant for voices, syn­the­siz­ing au­dio waves with such high pre­ci­sion means that you can syn­the­size any sound imag­in­able, in­clud­ing mu­si­cal in­stru­ments.

And on Valentine’s Day, 2019, OpenAI shocked the sci-tech world with the unveiling of GPT-2, a text-synthesis network with 1.5 billion parameters that is so powerful it displays just a hint of some narrowly generalized intelligence: from text alone, it is capable of inferring location, distance, sequence, and more without any specialized programming. The text generated by GPT-2 ranges from typically incoherent all the way to humanlike, but the magical part is how consistently it can synthesize humanlike text (in the form of articles, poems, and short stories). GPT-2 beats the competition on the Winograd Schema Challenge by about seven points, a barely believable leap forward in the state of the art made even more impressive by the fact that GPT-2 is a single, rather simple network with no augmentation by other algorithms. If given such performance enhancements, its score may reach as high as 75%. If the number of parameters for GPT-2 were increased 1,000x over, it very well could synthesize entire coherent novels, that is, stories that are at least 50,000 words in length.

This is more my area of expertise, and I know how difficult it can be to craft a novel or even a novella (which need only be roughly 20,000 words in length). But I am not afraid of my own obsolescence. Far from it. I fashion my identity more as a media creator who merely resorts to writing; drawing, music, animation, directing, and so on are certainly learnable, but I’ve dedicated myself to writing. My dream has always been to create “content”, not necessarily “books” or any one specific form of media.
This is why I’ve been watch­ing the progress in me­dia syn­the­sis so closely ever since I had an epiphany on the tech­nol­ogy in De­cem­ber of 2017.

We speak of automation as following a fairly predictable path: computers get faster, algorithms get smarter, and we program robots to do drudgery, the difficult blue-collar jobs that no one wants to do but someone has to do for society to function. In a better world, this would free workers for more intellectual pursuits in STEM fields and entertainment, though there’s the chance that this will merely lead to widespread unemployment and necessitate the implementation of a universal basic income. As more white-collar jobs are automated, the story goes, humans take to creative jobs in greater numbers, bringing about a flourishing of the arts.

In truth, the progression of automation will likely unfold in the exact opposite pattern. Media synthesis requires no physical body. Art, objectively, requires a medium by which we can enjoy it, whether that’s a canvas, a record, a screen, a marble block, or what have you. The actual artistic labor is mental in nature; the physical labor involves transmitting that art through a medium. This can be perfectly replicated with data alone, as these forms of expression can be quantified in digital form. Thus, pure software can automate the creation of entertainment with humans needed only as peripheral agents to enjoy this art (or bring the medium to the software).

This is not the case with most other jobs. A garbageman does not use a medium of expression in order to pick up trash. Neither does an industrial worker. The results of these jobs are also not rooted in data or anything ephemeral: if there is trash to be picked up, you must use physical labor in order to do so. And while many of these jobs have indeed been automated, there is a limit to how automated they can be with current software. Automation works best when there are no variables. If something goes wrong on an assembly line, we send in a human to fix it because the machines are not able to handle errors or unexpected variables. What’s more, physical jobs like this require a physical body; they require robotics. And anyone who has worked in the field of machine learning knows that there is a massive gap between what works beautifully in a simulation and what works in real life, due to the exponentially increasing variables in reality that can’t be modeled even in present-day computers.

To put it an­other way, in or­der for blue-col­lar au­toma­tion to com­pletely up­end the la­bor mar­ket, we re­quire both gen­eral-pur­pose robots (which we tech­ni­cally have) and gen­eral AI (which we don’t). There will be in­creas­ing au­toma­tion of the in­dus­trial and ser­vice sec­tors, sure, but it won’t hap­pen quite as quickly as some claim.

Con­versely, “dis­em­bod­ied” jobs— the cre­atives and plenty of white-col­lar work— could be au­to­mated away within a decade. It makes sense that the eco­nomic elite would pro­mote the op­po­site be­lief since this sug­gests they are the first on the chop­ping block of ob­so­les­cence, but when it comes to the en­ter­tain­ment in­dus­try, there is ac­tu­ally an el­e­ment of dan­ger in how stu­pen­dously close we are to great changes and yet how ut­terly un­pre­pared we are to deal with them.

Or to put it shortly, jobs that in­volve the cre­ation of data can be au­to­mated with­out any need for ad­vance­ments in robotics. 10 years from now, many low and high-skill man­ual jobs will still be around, but plenty of white-col­lar and en­ter­tain­ment-based jobs will be ob­so­lete.

There are essentially two types of art: art for art’s sake and art as career. Art for art’s sake isn’t going away anytime soon and never has been in danger of automation. This, pure expression, will survive. Art as career, however, is doomed. What’s more, its doom is impending and imminent. If your plan in life is to make a career out of commissioned art, as a professional musician, voice actor, cover model, pop writer, video game designer, keyframe artist, or asset designer, your field has at most 15 years left. In 2017, I felt this was a liberal prediction and that art-as-career would die perhaps in the latter half of the 21st century. Now, just two years later, I’m beginning to believe I was conservative. We need not create artificial general intelligence to effectively destroy most of the modeling, movie, and music industries.

Models, es­pe­cially cover mod­els, might find a dearth of work within a year.

Yes, a year. If the in­dus­try were techno­pro­gres­sive, that is. In truth, it will take longer than that. But the tech­nol­ogy to com­pletely un­em­ploy most mod­els already ex­ists in a rudi­men­tary form. State-of-the-art image syn­the­sis can gen­er­ate pho­to­re­al­is­tic faces with ease—we’re merely wait­ing on the rest of the body at this point. Pa­ram­e­ters can be al­tered, al­low­ing for cus­tomiza­tion and style trans­fer be­tween an ex­ist­ing image and a de­sired style, fur­ther giv­ing op­tions to de­sign­ers. In the very near fu­ture, it ought to be pos­si­ble to feed an image of any cloth­ing item and make some­one in a photo “wear” those clothes.

In other words, if I wanted to put Adolf Hitler in a Ja­panese school­girl’s clothes for what­ever es­o­teric rea­son, it wouldn’t be im­pos­si­ble for me to do this.

And here is where we shift gears for a mo­ment to dis­cuss the more fun side of me­dia syn­the­sis.

With sufficiently advanced tools, which we might find next decade, it will be possible to take any song you want and remix it any way you desire. My classic example is taking TLC’s “Waterfalls” and turning it into a 1900s-style barbershop quartet. This could only be accomplished via an algorithm that understood what barbershop music sounds like and knew to keep the lyrics and melody of the original song, swap the genders, transfer the vocal style to a new one, and subtract the original instrumentation. A similar example of mine is taking Witchfinder General’s “Friends of Hell” and doing just two things: changing the singer into a woman, preferably Coven’s Jinx Dawson, and changing a few of the lyrics. No pitch change to the music, meaning everything else has to stay right where it is.
The only way to do this to­day is to ac­tu­ally cover the songs and hope you do a de­cent enough job. In the very near fu­ture, through a neu­ral ma­nipu­la­tion of the mu­sic, I could ac­com­plish the same on my com­puter with just a few tex­tual in­puts and prompts. And if I can ma­nipu­late mu­sic to such a level, surely I needn’t men­tion the po­ten­tial to gen­er­ate mu­sic through this method. Per­haps you’d love noth­ing more than to hear Foo Fighters but with Kurt Cobain as vo­cal­ist (or co-vo­cal­ist), or per­haps you’d love to hear an en­tirely new Foo Fighters album recorded in the style of the very first record.

Another ex­am­ple I like to use is the prospect of the first “com­puter-gen­er­ated comic.” Not to be con­fused with a comic us­ing CGI art, the first com­puter-gen­er­ated comic will be one cre­ated en­tirely by an al­gorithm. Or, at least, drawn by al­gorithm. The hu­man will in­put text and de­scrip­tions, and the com­puter will do the rest. It could con­ceiv­ably do so in any art style. I thought this would hap­pen be­fore the first AI-gen­er­ated an­i­ma­tion, but I was wrong— a neu­ral net­work man­aged to syn­the­size short clips of the Flint­stones in 2018. Not all of them were great, but they didn’t have to be.

Very near in the fu­ture, I ex­pect there to be “char­ac­ter cre­ator: the game” uti­liz­ing a fully cus­tomiz­able GAN-based in­ter­face. We’ll be able to gen­er­ate any sort of char­ac­ter we de­sire in any sort of situ­a­tion, any pose, any scene, in any style. From there, we’ll be able to cre­ate any art scene we de­sire. If we want Byzan­tine art ver­sions of mod­ern comic books, for ex­am­ple, it will be pos­si­ble. If you wanted your fa­vorite artist to draw a par­tic­u­lar scene they oth­er­wise never would, you could see the re­sult. And you could even over­lay style trans­fer­ring vi­su­als over aug­mented re­al­ity, turn­ing the en­tire world it­self into your own lit­tle car­toon or ab­stract paint­ing.

Ten years from now, I will be able to accomplish the very thing my ten-year-old self always wanted: I’ll be able to download an auto-animation program and create entire cartoons from scratch. And I’ll be able to synthesize the voices: any voice, whether I have a clip or not. I’ll be able to synthesize the perfect soundtrack to match it. And the cartoon could be in any art style. It doesn’t have to have choppy animation; if I wanted it to have fluidity beyond that of any Disney film, it could be done. And there won’t be regulations to follow unless I choose to publicly release that cartoon. I won’t have to pay anyone, let alone put down hundreds of thousands of dollars per episode. The worst problem I might have is if this technology isn’t open-source (most media synthesizing tools are, via GitHub) and it turns out I have to pay hundreds of thousands of dollars for such tools anyway. This would only happen if the big studios of the entertainment industry bought out every AI researcher on the planet or shut down piracy & open source sites with extreme prejudice by then.

But it could also happen willingly if AI researchers don’t trust these tools to be used wisely, as OpenAI so controversially decided with GPT-2.

Surely you’ve heard of deep­fakes. There is quite a bit of en­ter­tain­ment po­ten­tial in them, and some are be­gin­ning to cap­i­tal­ize on this— who wouldn’t want to star in a block­buster movie or see their crush on a porn star’s body? Ex­cept that last one isn’t tech­ni­cally le­gal.
And this is just where the prob­lems be­gin. Deep­fakes ex­ist as the tip of the war­head that will end our trust-based so­ciety. De­spite the ex­is­tence of image ma­nipu­la­tion soft­ware, most isn’t quite good enough to fool peo­ple— it’s eas­ier to sim­ply mis­la­bel some­thing and pre­sent it as some­thing else (e.g. a mob of Is­lamic ter­ror­ists be­ing la­beled Amer­i­can Mus­lims cel­e­brat­ing 9/​11). This will change in the com­ing years when it be­comes easy to recre­ate re­al­ity in your fa­vor.

Imag­ine a phisher us­ing style trans­fer­ring al­gorithms to “steal” your mother’s voice and then call you ask­ing for your so­cial se­cu­rity num­ber. Some­one will be the first. We have no wide­spread tele­phone en­cryp­tion sys­tem in place to pre­vent such a thing be­cause such a thing is so un­think­able to us at the pre­sent mo­ment.

Deepfakes would be best at subtly altering things, adding elements that weren’t there and that you didn’t notice at first. But it’s also possible for all aspects of media synthesis to erode trust. If you wanted to create events in history, complete with all the “evidence” necessary, there is nothing stopping you. Most probably won’t believe you, but some subset will, and that’s all you need to start wreaking havoc. At some point, you could pick and choose your own reality. If I had a son and raised him believing that the Beatles were an all-female band (with all live performances and interviews showcasing a female Beatles and all online references referring to them as women), then the inverse, that the Beatles were an all-male band, might very well become “alternate history” to him, because how can he confirm otherwise? Someone else might tell him that the Beatles were actually called Long John & the Beat Brothers because that’s the reality they chose.

This to­tal malle­abil­ity of re­al­ity is a sym­bol of our in­creas­ingly ad­vanced civ­i­liza­tion, and it’s on the verge of be­com­ing the pre­sent. Yet out­side of men­tions of deep­fakes, there has been lit­tle di­alogue on the pos­si­bil­ity in the main­stream. It’s still a given that Hol­ly­wood will re­main rel­a­tively un­changed even into the 2040s and 2050s be­sides “per­haps us­ing robots & holo­graphic ac­tors”. It’s still a given that you could get rich writ­ing shlocky ro­mance nov­els on Ama­zon, be­come a top 40 pop star, or trust most images and videos be­cause we ex­pect (and, per­haps, we want) the fu­ture to be “the pre­sent with bet­ter gad­gets” and not the ut­terly trans­for­ma­tive, cy­berdelic era of sturm und drang ahead of us.

All my ten-year-old self wants is his car­toon. I’ll be happy to give it to him when­ever I can.

If you want to see more, come visit the subreddit: https://www.reddit.com/r/MediaSynthesis/