Stats Advice on a New N-back Game
Cross-posted to my blog. I expect this will be of some interest to the LessWrong community both because of previous interest in N-back and because of the opportunity to apply Bayesian statistics to a real-world problem. The main reason I’m writing this article is to get feedback on my approach and to ask for help in the areas where I’m stuck. For some background, I’m a software developer who’s been working in games for 7+ years and recently left my corporate job to work on this project full-time.
As I mentioned here and here, since early February I’ve been working on an N-back-like mobile game. I plan to release for iOS this summer and for Android a few months later if all goes well. I have fully implemented the core gameplay and most of the visual styling and UI, and am currently working with a composer on the sound and music.
I am just now starting on the final component of the game: an adaptive mode that assesses the player’s skill and presents challenges that are tuned to induce a state of flow.
The game is broken down into waves, each of which presents an N-back-like task with certain parameters, such as the number of attributes, the number of variants in each attribute, the tempo, and so on. I would like to find a way to collapse these parameters into a single difficulty parameter that I can compare against a player’s skill level to predict their performance on a given wave.
But I realize that some players will be better at some challenges than others (e.g. memory, matching multiple attributes, handling fast tempos, dealing with visual distractions like rotation, or recognizing letters). Skill and difficulty are multidimensional quantities, and this makes performance hard to predict. The question is, is there a single-parameter approximation that delivers an adequate experience? Additionally, the task is not pure N-back — I’ve made it more game-like — and as a result the relationship between the game parameters and the overall difficulty is not as straightforward as it would be in a cleaner environment (e.g. difficulty might be nearly linear in tempo for some set-ups but highly non-linear for others).
I have the luxury of having access to fairly rich behavioral data. The game is partly a rhythm game, so not only do I know whether a match has been made correctly (or a non-match correctly skipped) but I also know the timing of a player’s positive responses. A player with higher skill should have smaller timing errors, so a well-timed match is evidence for higher skill. I am still unsure exactly how I can use this information optimally.
I plan to display a plot of player skill over time, but this opens another set of questions. What exactly am I plotting? How do I model player skill over time (just a time-weighted average? as a series of slopes and plateaus? how should I expect skill to change over a period of time without any play?)? How much variation in performance is due to fatigue, attention, caffeine, etc.? Do I show error bars or box plots? What units do I use?
And finally, how do I turn a difficulty and a skill level into a prediction of performance? What is the model of the player playing the game?
Is there an adequate difficulty parameter and if so how do I calculate it?
Can I use timing data to improve predictions? How?
What model do I use for player skill changing over time?
How do I communicate performance stats to the user? Box and whiskers? Units?
What is the model of the player and how do I turn that into a prediction?
I’ve read Sivia, so I have some theoretical background on how to solve this kind of problem, but only limited real-world experience. These are my thoughts so far.
Modeling gameplay performance as Bernoulli trials seems ok. That is, given a skill level S and a difficulty D, performance on a set of N matches should be closely matched by N Bernoulli trials with probability of success p(S, D) as follows:
if S ≪ D, p = 0.5
if S ≫ D, p is close to 1.0 (how close?)
if S = D, p = 0.9 feels about right
Then I can update S (and maybe D? see next paragraph) on actual player performance. This will result in a new probability density function over the “true” value of S, which will hopefully be unimodal and narrow enough to report as a single best estimate (possibly with error bars). Which reminds me, what do I use as a prior for S? And what happens if the player just stops playing halfway through, or hands the game to their 5-year-old?
Determining difficulty is another hard problem. I currently have a complicated ad-hoc formula that I cobbled together with logarithms, exponentials, and magic numbers, and lots of trial and error. It seems to work pretty well for the limited set of levels I’ve tested with a small group of playtesters, but I’m worried that it won’t predict difficulty well outside of that domain. One possibility is to croud-source it: after release I’d collect performance data across all users and update the difficulty ratings on the fly. This seems risky and difficult, and the initial difficulty ratings might be way off, which would lead to poor initial user experiences with the adaptive mode. I would also have to worry about maintaining a server back-end to gather the data and report on updated difficulty levels.
Request For Feedback
So, any suggestions on how to tackle these problems? Or the first place to start looking?
I’m pretty excited about the potential to collect real-world data on skill acquisition over time. If there is sufficient interest I’ll consider making the raw data public, and even instrument the code to collect other data of interest, by request. I do have some concerns over data privacy, so I may allow users to opt out of sending their data up to the server.