Stats Advice on a New N-back Game

Cross-posted to my blog. I ex­pect this will be of some in­ter­est to the LessWrong com­mu­nity both be­cause of pre­vi­ous in­ter­est in N-back and be­cause of the op­por­tu­nity to ap­ply Bayesian statis­tics to a real-world prob­lem. The main rea­son I’m writ­ing this ar­ti­cle is to get feed­back on my ap­proach and to ask for help in the ar­eas where I’m stuck. For some back­ground, I’m a soft­ware de­vel­oper who’s been work­ing in games for 7+ years and re­cently left my cor­po­rate job to work on this pro­ject full-time.

As I men­tioned here and here, since early Fe­bru­ary I’ve been work­ing on an N-back-like mo­bile game. I plan to re­lease for iOS this sum­mer and for An­droid a few months later if all goes well. I have fully im­ple­mented the core game­play and most of the vi­sual styling and UI, and am cur­rently work­ing with a com­poser on the sound and mu­sic.

I am just now start­ing on the fi­nal com­po­nent of the game: an adap­tive mode that as­sesses the player’s skill and pre­sents challenges that are tuned to in­duce a state of flow.

The Problem

The game is bro­ken down into waves, each of which pre­sents an N-back-like task with cer­tain pa­ram­e­ters, such as the num­ber of at­tributes, the num­ber of var­i­ants in each at­tribute, the tempo, and so on. I would like to find a way to col­lapse these pa­ram­e­ters into a sin­gle difficulty pa­ram­e­ter that I can com­pare against a player’s skill level to pre­dict their perfor­mance on a given wave.

But I realize that some players will be better at some challenges than others (e.g. memory, matching multiple attributes, handling fast tempos, dealing with visual distractions like rotation, or recognizing letters). Skill and difficulty are multidimensional quantities, and this makes performance hard to predict. The question is whether there is a single-parameter approximation that delivers an adequate experience. Additionally, the task is not pure N-back (I’ve made it more game-like), and as a result the relationship between the game parameters and the overall difficulty is not as straightforward as it would be in a cleaner environment (e.g. difficulty might be nearly linear in tempo for some set-ups but highly non-linear for others).

I have the lux­ury of hav­ing ac­cess to fairly rich be­hav­ioral data. The game is partly a rhythm game, so not only do I know whether a match has been made cor­rectly (or a non-match cor­rectly skipped) but I also know the timing of a player’s pos­i­tive re­sponses. A player with higher skill should have smaller timing er­rors, so a well-timed match is ev­i­dence for higher skill. I am still un­sure ex­actly how I can use this in­for­ma­tion op­ti­mally.
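One way timing could enter the picture, as a rough sketch only: treat each timing error as a draw from a zero-mean Gaussian whose spread shrinks as skill exceeds difficulty. The Gaussian form, the function names, and the `sigma_min` / `sigma_max` values (in seconds) are all assumptions made for illustration, not measurements from the game.

```python
import math

def timing_sigma(skill, difficulty, sigma_min=0.02, sigma_max=0.15):
    """Assumed spread of timing errors (seconds): wide when skill << difficulty."""
    gap = max(-50.0, min(50.0, skill - difficulty))  # clamp to avoid overflow
    return sigma_min + (sigma_max - sigma_min) / (1.0 + math.exp(gap))

def timing_log_likelihood(error_s, skill, difficulty):
    """Log density of one timing error under a zero-mean Gaussian."""
    sigma = timing_sigma(skill, difficulty)
    return (-math.log(sigma)
            - 0.5 * math.log(2.0 * math.pi)
            - 0.5 * (error_s / sigma) ** 2)

# A tight hit favors the high-skill hypothesis; a sloppy hit favors low skill.
tight_hit = timing_log_likelihood(0.01, skill=10.0, difficulty=5.0)
sloppy_hit = timing_log_likelihood(0.12, skill=10.0, difficulty=5.0)
```

Under this assumption the timing term would simply be added to the log-likelihood of the hit/miss outcome for each candidate skill value, so a well-timed match pulls the estimate upward more than a barely-made one.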

I plan to display a plot of player skill over time, but this opens another set of questions. What exactly am I plotting? How do I model player skill over time: as a time-weighted average, or as a series of slopes and plateaus? How should I expect skill to change over a period of time without any play? How much variation in performance is due to fatigue, attention, caffeine, etc.? Do I show error bars or box plots? What units do I use?
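The simplest of these options, a time-weighted average, might look like the sketch below. It assumes each session yields one skill estimate with a timestamp in days, and that older sessions are discounted with an exponential half-life; the function name and the 7-day default are hypothetical choices, not tuned values.

```python
def time_weighted_skill(observations, now_days, half_life_days=7.0):
    """Average per-session skill estimates, halving the weight every half-life.

    observations: list of (timestamp_days, skill_estimate) pairs.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for timestamp_days, estimate in observations:
        age = now_days - timestamp_days
        weight = 0.5 ** (age / half_life_days)
        weighted_sum += weight * estimate
        weight_total += weight
    return weighted_sum / weight_total

# Two sessions a week apart: the recent one counts twice as much.
history = [(0.0, 2.0), (7.0, 4.0)]
current = time_weighted_skill(history, now_days=7.0)
```

This says nothing about plateaus or about skill decaying during a break; it only smooths the per-session estimates so the plotted curve is less jumpy.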

And fi­nally, how do I turn a difficulty and a skill level into a pre­dic­tion of perfor­mance? What is the model of the player play­ing the game?

Main Questions

  • Is there an ad­e­quate difficulty pa­ram­e­ter and if so how do I calcu­late it?

  • Can I use timing data to im­prove pre­dic­tions? How?

  • What model do I use for player skill chang­ing over time?

  • How do I com­mu­ni­cate perfor­mance stats to the user? Box and whiskers? Units?

  • What is the model of the player and how do I turn that into a pre­dic­tion?

My Approach

I’ve read Sivia, so I have some the­o­ret­i­cal back­ground on how to solve this kind of prob­lem, but only limited real-world ex­pe­rience. Th­ese are my thoughts so far.

Modeling gameplay performance as Bernoulli trials seems ok. That is, given a skill level S and a difficulty D, performance on a set of N matches should be well approximated by N Bernoulli trials, each with probability of success p(S, D) as follows:

  • if S ≪ D, p = 0.5

  • if S ≫ D, p is close to 1.0 (how close?)

  • if S = D, p = 0.9 feels about right

  • etc.
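One curve that satisfies the three anchor points above is a logistic function with a floor at 0.5, shown here as a minimal sketch. The logistic shape, the ceiling `p_max = 0.99`, and the `slope` parameter are assumptions for illustration; only the 0.5 floor and the 0.9 value at S = D come from the list.

```python
import math

def success_probability(skill, difficulty, p_max=0.99, p_equal=0.9, slope=1.0):
    """Floored logistic: 0.5 when skill << difficulty, p_max when skill >> difficulty."""
    # Offset chosen so that skill == difficulty gives exactly p_equal.
    frac = (p_equal - 0.5) / (p_max - 0.5)
    offset = math.log(frac / (1.0 - frac))
    x = slope * (skill - difficulty) + offset
    x = max(-50.0, min(50.0, x))  # clamp to avoid overflow in exp
    return 0.5 + (p_max - 0.5) / (1.0 + math.exp(-x))

p_even = success_probability(3.0, 3.0)     # skill matches difficulty
p_hard = success_probability(0.0, 20.0)    # wave far too hard
p_easy = success_probability(20.0, 0.0)    # wave far too easy
```

The `slope` value sets how many units of S − D separate "guessing" from "nearly perfect", which also fixes the scale on which skill and difficulty are measured.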

Then I can up­date S (and maybe D? see next para­graph) on ac­tual player perfor­mance. This will re­sult in a new prob­a­bil­ity den­sity func­tion over the “true” value of S, which will hope­fully be uni­modal and nar­row enough to re­port as a sin­gle best es­ti­mate (pos­si­bly with er­ror bars). Which re­minds me, what do I use as a prior for S? And what hap­pens if the player just stops play­ing halfway through, or hands the game to their 5-year-old?
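The update step can be sketched with a discrete grid over S, which avoids any closed-form math. Everything below is an assumption for illustration: the floored-logistic form of p(S, D) with a 0.99 ceiling, the 0 to 10 skill scale, and the flat prior. D is held fixed here rather than updated.

```python
import math

def success_probability(skill, difficulty, p_max=0.99, p_equal=0.9):
    """Assumed floored logistic: 0.5 floor, p_equal at skill == difficulty."""
    frac = (p_equal - 0.5) / (p_max - 0.5)
    x = (skill - difficulty) + math.log(frac / (1.0 - frac))
    x = max(-50.0, min(50.0, x))  # clamp to avoid overflow in exp
    return 0.5 + (p_max - 0.5) / (1.0 + math.exp(-x))

def update_skill_posterior(grid, prior, difficulty, successes, failures):
    """Multiply the prior by the Bernoulli likelihood, then renormalize."""
    unnormalized = []
    for skill, weight in zip(grid, prior):
        p = success_probability(skill, difficulty)
        unnormalized.append(weight * p ** successes * (1.0 - p) ** failures)
    total = sum(unnormalized)
    return [w / total for w in unnormalized]

grid = [i * 0.1 for i in range(101)]        # candidate skill values 0.0 .. 10.0
prior = [1.0 / len(grid)] * len(grid)       # flat prior (an assumption)
posterior = update_skill_posterior(grid, prior, difficulty=5.0,
                                   successes=18, failures=2)
best_estimate = grid[posterior.index(max(posterior))]
posterior_mean = sum(s * w for s, w in zip(grid, posterior))
```

Feeding the posterior from one wave in as the prior for the next gives a running estimate, and the width of the posterior is a natural source for error bars. An abandoned wave or a handed-off device would show up as an implausible run of failures, which this sketch does not try to detect.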

Determining difficulty is another hard problem. I currently have a complicated ad-hoc formula that I cobbled together out of logarithms, exponentials, magic numbers, and lots of trial and error. It seems to work pretty well for the limited set of levels I’ve tested with a small group of playtesters, but I’m worried that it won’t predict difficulty well outside of that domain. One possibility is to crowd-source it: after release I’d collect performance data across all users and update the difficulty ratings on the fly. This seems risky and difficult, and the initial difficulty ratings might be way off, which would lead to poor initial user experiences with the adaptive mode. I would also have to worry about maintaining a server back-end to gather the data and report on updated difficulty levels.

Re­quest For Feedback

So, any sug­ges­tions on how to tackle these prob­lems? Or the first place to start look­ing?

I’m pretty ex­cited about the po­ten­tial to col­lect real-world data on skill ac­qui­si­tion over time. If there is suffi­cient in­ter­est I’ll con­sider mak­ing the raw data pub­lic, and even in­stru­ment the code to col­lect other data of in­ter­est, by re­quest. I do have some con­cerns over data pri­vacy, so I may al­low users to opt out of send­ing their data up to the server.