Kaj_Sotala comments on Open Thread, Apr. 27 - May 3, 2015

Kaj_Sotala 27 Apr 2015 17:07 UTC
19 points
I managed to get my Bayes RPG into such a state that, although it still isn’t that interesting as a game, it’s moderately entertaining for a brief while until you master it, and seems like it should produce some actual learning.

I had this game as my MSc thesis topic as a way to force myself to work on the game, but I’m now finally starting to get to the point where a) working on it is fun enough that I don’t need an external motivator and b) I’d like to actually graduate. So I’ll take what I have so far, run it on a bunch of test subjects, see if they learn anything, and write up the results in my thesis. Then I’ll continue working on the game in my spare time.

But I’d like to do the empirical part of the thesis properly. Since LW has a bunch of people who know a lot about statistics, I’d like to ask LW: what kinds of statistical tests would be most appropriate for measuring the results?

To elaborate on the test setup: I expect to go with the standard approach. Have some task that measures understanding of something that we want the game to teach, and split people into an intervention group and a control group. Have them complete the task first, dropping anyone who does too well in this pre-test, and then carry out the intervention (i.e. either have them play the game or do some “placebo” task, depending on their group). Then have them do a new version of the original task and see whether the intervention group has improved more than the controls have.
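
A minimal sketch of that setup, assuming the exclusion threshold is fixed before any data is looked at. The function name, the threshold of 7, and the scores are made up for illustration and are not part of the actual study.

```python
import random

def assign_groups(pretest_scores, max_pretest, seed=0):
    """Drop high pre-test scorers, then randomly split the rest into two groups.

    pretest_scores: dict mapping participant id -> pre-test score.
    max_pretest: exclusion threshold (hypothetical value, chosen in advance).
    """
    eligible = sorted(pid for pid, score in pretest_scores.items()
                      if score <= max_pretest)
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    rng.shuffle(eligible)
    half = len(eligible) // 2
    return {"game": eligible[:half], "control": eligible[half:]}

# Hypothetical pre-test scores out of 10.
pretest = {"p1": 2, "p2": 9, "p3": 4, "p4": 1, "p5": 10, "p6": 3, "p7": 5}
groups = assign_groups(pretest, max_pretest=7)
print(groups)
```

Shuffling before splitting is what makes the assignment random; sorting first only keeps the result independent of the order in which participants were entered.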

I don’t want to elaborate too much on what tasks we’ll give to the subjects, in case I’ll recruit someone reading this to be one of my test subjects. But you can expect the standard mammography/cancer thing to be there, since it’s such a classic in the literature, though it’s not the thing that I’d expect the game’s current state to be the most successful at teaching. There will also be a task on a subject I do expect the game to currently be good at teaching. Then there will be one task that I’d expect to have a bimodal distribution in whether or not the game improves it, since the game doesn’t force you to pay attention to it. I’d expect some types of players to pay attention to it with others ignoring it.
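
For reference, the mammography problem is a direct application of Bayes’ theorem. The sketch below uses the figures most commonly quoted for it (1% base rate, 80% sensitivity, 9.6% false positive rate); these are an assumption for illustration and not necessarily the numbers used in the actual test.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Commonly quoted figures for the problem (assumed here, not taken from the study).
p = posterior(prior=0.01, sensitivity=0.80, false_positive_rate=0.096)
print(f"P(cancer | positive mammogram) = {p:.3f}")
```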

Additionally I’d like to test things like:
- giving the players a relatively challenging in-game goal and seeing whether the completion of that challenge correlates with learning results
- asking all the players to play for at least X minutes but optionally allowing them to play for longer, and seeing whether the amount of time spent playing has any connection to the learning results
- after the game, having the players rate it on some Likert-like scales on questions like whether they enjoyed the game, whether it was too easy or too hard, whether they’d like to play it again, etc., and again checking whether the correlations are as expected.

So, what statistical tests to use here? I don’t actually have much experience with statistics. I guess that the naive approach would be to use some (which?) form of ANOVA to test whether the means of pre-test, control intervention, and game intervention populations are the same. And then just do Spearman’s correlation between every numerical item that I’ve collected and see whether any statistically significant items pop up. Is that fine? Neither of those tests is going to pick up on the hypothesized bimodal distribution in the improvement in one of the tasks, but I might not bother with digging too deeply into that.
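
A rough sketch of the “correlate everything” step, assuming Spearman’s rho is computed as the Pearson correlation of tie-averaged ranks. The variable names and data are hypothetical. It also prints a Bonferroni-adjusted threshold, one simple (and conservative) way to account for the number of pairs tested; the sketch does not compute p-values, which a library routine such as `scipy.stats.spearmanr` would provide.

```python
from itertools import combinations

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical per-player measurements.
data = {
    "minutes_played": [10, 12, 15, 20, 30, 45],
    "enjoyment":      [2, 3, 3, 4, 5, 5],
    "improvement":    [0, 1, 1, 2, 2, 3],
}
pairs = list(combinations(data, 2))
for a, b in pairs:
    print(f"{a} vs {b}: rho = {spearman(data[a], data[b]):.2f}")

alpha = 0.05
print(f"{len(pairs)} comparisons, Bonferroni threshold = {alpha / len(pairs):.4f}")
```

The number of pairs grows quadratically with the number of items collected, which is why the per-test threshold shrinks quickly.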

Also, how do I set the threshold for how good of a performance in the pre-test indicates that the subject already knows this too well to learn anything, and should thus be ignored in the analysis? Or should I even do that in the first place?
- [deleted] 28 Apr 2015 10:15 UTC
  9 points
  A typical analysis of the basic design you described is a mixed 2×2 factorial design: test (pre-test/post-test, within subjects) × intervention (yes/no, between subjects), with the interaction term being evidence for an effect of the intervention (a greater increase between pre- and post-test in the intervention condition). This is often analysed using ANOVA (with participants as a random effect), though nonparametric equivalents may be more appropriate.
  
  More complex models are also very appropriate, e.g., adding question type as a factor/predictor rather than treating the different questions as separate dependent variables: this would provide indications of whether improvement after intervention differs for the question types, as you’ve predicted. This doesn’t give you clues about bimodality but at least allows you to more directly test your predictions about relative degree of improvement (if the intervention works).
  
  Correlations between your different dependent measures: by all means feel free, but make sure you examine the characteristics of the distributions rather than just zooming ahead with a matrix of correlation coefficients. And be aware of the multiple comparisons problem: with that many tests, at least one Type I error is very likely.
  
  Excluding participants on the basis of overly high performance in pretest is appropriate. If possible I suggest setting this criterion before formal testing (even an educated guess is appropriate as this doesn’t harm the conclusions you can draw: it can be justified as leaving room for improvement if the intervention works) - or at the very least do this before analysing anything else of the participant’s performance to avoid biasing your decision about setting the threshold.
  
  … don’t want to elaborate too much on what tasks we’ll give to the subjects, in case I’ll recruit someone reading this to be one of my test subjects.
  
  I’m afraid you’ve said too much already—and if you’re looking for people who are naive about the principles involved, LW is probably not a great place for recruiting anyway.
  
  Please feel free to private message me if you’d like clarification of what I’ve posted; this sort of thing is very much part of my day job.
  - Kaj_Sotala 29 Apr 2015 9:01 UTC
    4 points
    Thanks a lot!
    
    I’m afraid you’ve said too much already
    
    Could you elaborate on that? Something like “so we’re going to test the impact of traditional instruction versus this prototype educational game on your ability to do these tasks” is what I’d have expected to say to the test subjects anyway, and that’s mostly the content of what I said here. (Though I do admit that the bit about expecting a bimodal distribution depending on whether or not the subjects pay attention to something was a bit of an unnecessary tipoff here.)
    
    In particular, I expect to have a tradeoff—I can tell people even less than that, and get a much smaller group of testers. Or I can tell people that I’ve gotten the game I’ve been working on to a very early prototype stage and am now looking for testers, and advertise that on e.g. LW, and get a much bigger group of test subjects.
    
    and if you’re looking for people who are naive about the principles involved, LW is probably not a great place for recruiting anyway.
    
    It’s true that LW-people are much more likely to be able to e.g. solve the mammography example already, but I’d still expect most users to be relatively unfamiliar with the technicalities of causal networks—I was too, until embarking on this project.
    - [deleted] 29 Apr 2015 11:57 UTC
      5 points
      I was thinking more about your previous posts on the subject (your development of the game and some of the ideas behind it). It’s the same general reason I’d avoid testing people from my extended lab network, who may not know any details of a current study but have a sufficiently clear impression of what I’m interested in to potentially influence the outcomes (whether intentionally, “helping me out”, or implicitly).
      
      When rolling it out for testing, you could always include a post-test which probes people’s previous experience (e.g. what they knew in advance about your work & the ideas behind it) & exclude people who report that they know “too much” about the motivations of the study. You could even prompt for some info about LW participation, which could also be used to mitigate this issue (especially if you end up with decent samples both in and outside LW).
      - Kaj_Sotala 1 May 2015 18:07 UTC
        3 points
        Ah, that’s a good point. And a good suggestion, too.
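
A minimal sketch of the mixed 2×2 idea from the reply above: the test × intervention interaction asks whether the pre-to-post gain differs between groups, so one simple way to look at it is to compare per-participant gain scores. Welch’s t statistic is used here as a stand-in for the full mixed ANOVA, and all scores are hypothetical.

```python
from statistics import mean, variance

def gain_scores(pre, post):
    """Per-participant improvement: post-test minus pre-test."""
    return [b - a for a, b in zip(pre, post)]

def welch_t(x, y):
    """Welch's t statistic for the difference in means of two independent samples."""
    se = (variance(x) / len(x) + variance(y) / len(y)) ** 0.5
    return (mean(x) - mean(y)) / se

# Hypothetical scores (number of tasks solved) for each participant.
game_pre, game_post = [1, 2, 0, 3, 2, 1], [4, 5, 2, 6, 4, 3]
control_pre, control_post = [2, 1, 1, 3, 0, 2], [3, 1, 2, 4, 1, 2]

game_gain = gain_scores(game_pre, game_post)
control_gain = gain_scores(control_pre, control_post)

# A larger mean gain in the game group is what the interaction term would pick up.
print("mean gain (game):   ", mean(game_gain))
print("mean gain (control):", mean(control_gain))
print("Welch t on gains:   ", round(welch_t(game_gain, control_gain), 2))
```

Turning the statistic into a p-value needs the t distribution with Welch’s degrees of freedom, which is left to a statistics package; a rank-based test on the same gain scores would be the nonparametric route.
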
- IlyaShpitser 29 Apr 2015 9:35 UTC
  3 points
  
  what kinds of statistical tests would be most appropriate for measuring the results?
  
  What question about your game and learning math/probability are you trying to answer?
  
  If you want “an effect” you want a comparison of two arms. But you can only have one arm have an intervention, and the other just be the baseline arm with no treatment at all (or just the ‘background treatment’ of being a college undergraduate). For example, you can take a set of undergrads, and advertise that you are testing probability aptitude or something, and then the control arm just gets the test, while the test arm gets your game and the test afterwards.
  
  I don’t know about your advisor, but I would accept a study like that.
  
  I always found it slightly puzzling that LW folks who get into practical data analysis start with F methods, and not B. Isn’t B kind of a LW “thing?”
  
  Starting to think about measuring results via ANOVA et al is, to me, starting at the wrong level of abstraction (I realize I may differ on this from a lot of statisticians). For example, ANOVA can test for the null. What does that null mean? Well, you are interested in some causal effect. Maybe this: E[test result | assigned to game] - E[test result | baseline undergrad].
  
  Or maybe you give them a questionnaire first, and learn how much math they have had (or even what particular classes). Maybe you want to actually look at an effect conditional on math preparation level. Does your game possibly have an ‘interaction’ with background math sophistication level? Then you need to model that. Then maybe once you decide on the model, you decide how to test for the null. Or maybe you don’t want the null, but the size of the effect itself. etc. etc.
  
  You think about what you want first, the stats technique afterwards.
  - Kaj_Sotala 8 May 2015 15:38 UTC
    0 points
    
    What question about your game and learning math/probability are you trying to answer?
    
    Mostly 1) do the players actually learn anything that would transfer outside the immediate game 2) how much (if at all) things like their enjoyment affect whether they learn
    
    If you want “an effect” you want a comparison of two arms. But you can only have one arm have an intervention, and the other just be the baseline arm with no treatment at all (or just the ‘background treatment’ of being a college undergraduate). For example, you can take a set of undergrads, and advertise that you are testing probability aptitude or something, and then the control arm just gets the test, while the test arm gets your game and the test afterwards.
    
    Thanks! Isn’t “undergrads with only the test vs. undergrads with the game and then the test” kinda the same as “undergrads with only test vs. undergrads after the pretest and the game”, though?
    
    I always found it slightly puzzling that LW folks who get into practical data analysis start with F methods, and not B. Isn’t B kind of a LW “thing?”
    
    F is what we’ve been taught, and what most of our supervisors understand. I’m not really familiar with B stats.
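
A sketch of the quantity discussed in this sub-thread, E[test result | assigned to game] - E[test result | baseline], estimated as a difference in sample means, first overall and then conditional on math background. The field names and records are hypothetical, and the sketch assumes arms were assigned at random.

```python
from statistics import mean

def effect(records):
    """Estimate E[score | game] - E[score | baseline] as a difference in sample means."""
    game = [r["score"] for r in records if r["arm"] == "game"]
    base = [r["score"] for r in records if r["arm"] == "baseline"]
    return mean(game) - mean(base)

# Hypothetical records: math background comes from a pre-study questionnaire.
records = [
    {"arm": "game",     "math": "low",  "score": 5},
    {"arm": "game",     "math": "low",  "score": 6},
    {"arm": "game",     "math": "high", "score": 8},
    {"arm": "game",     "math": "high", "score": 8},
    {"arm": "baseline", "math": "low",  "score": 2},
    {"arm": "baseline", "math": "low",  "score": 3},
    {"arm": "baseline", "math": "high", "score": 7},
    {"arm": "baseline", "math": "high", "score": 8},
]

print("overall effect:", effect(records))
for level in ("low", "high"):
    subset = [r for r in records if r["math"] == level]
    print(f"effect given math={level}:", effect(subset))
```

In this made-up data the effect is much larger for the low-background stratum, which is the kind of interaction the overall average would hide.
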
- Kaj_Sotala 27 Apr 2015 17:51 UTC
  1 point
  Additionally, I’m a little worried about the control group part. I expect it’s relatively easy to recruit people to play a game and have them be motivated to play it, but if I tell people that “oh, but you may be randomly assigned to the control condition where you’re given more traditional math instruction instead”, I expect that that will drop participation. And even the people who do show up regardless may not be particularly motivated to actually work on the problems if they do get assigned to the control condition, especially given that I’m hoping to also educate people who’d usually avoid maths. How insane would it be to just not have a control group?
  - ChristianKl 27 Apr 2015 21:57 UTC
    8 points
    “Traditional math instruction” isn’t the only possible control. I don’t even think that you need to prove that your game is better than “Traditional math instruction”. You could simply take any other game that includes a bit of math as control.
    
    Maybe the Credence game.
    - Kaj_Sotala 28 Apr 2015 7:37 UTC
      5 points
      Nice idea, thanks.
  - TylerJay 27 Apr 2015 23:17 UTC
    7 points
    
    How insane would it be to just not have a control group?
    
    Pretty insane in my opinion. I can’t imagine anything I would grade more harshly than not having a control except ethics violations.
    
    Besides, don’t most university psychology experiments with volunteers keep the protocol secret throughout the whole experiment and then debrief at the end? (Or sometimes even lie about the protocol to avoid skewing the results?)
    
    Alternatively, have you thought about doing a crossover-style design?
    
    Take group A and group B. Group A plays your game, and then takes the test. Group B either just takes the test or goes through some traditional education lesson (or whatever else you want for your control) and then takes the test. Next, group A does the traditional education, group B does the game, and both take part 2 of the test.
    
    That way, everyone gets to play the game at least, though it means they’re there for twice as long. Do you think you could pitch this in a way that is better than the “Maybe you play a game, maybe you don’t” option?
    
    You could potentially derive additional research value from this as well. If group A does better on Test Part 2, then your game would be shown to be a better way of acclimating people to traditional education on the subject or something like that (I’m sure you can draw a better conclusion or phrase this better).
    
    Just some thoughts. Also, make sure you write up a grading rubric ahead of time (or ideally, have someone else do it) and then have someone who knows nothing (or as little as possible) about the experiment (and especially the subjects) grade the answers to avoid researcher bias.
    - Kaj_Sotala 28 Apr 2015 6:43 UTC
      1 point
      
      Pretty insane in my opinion. I can’t imagine anything I would grade more harshly than not having a control except ethics violations.
      
      I think there might be reasonable theoretical grounds for it in this case, though? If I were testing, say, a medical treatment or self-help technique, then yes, there should absolutely be a control group, since some people might get better on their own or just do better for a while because the self-help technique gave them extra confidence.
      
      But suppose I give people a pre-test, have them play for some minimum time, and then fill out the post-test when they’re done. I don’t see much in the way for random chance to confound things here: either they know the things needed for solving the tasks, or they don’t. If they didn’t know enough to solve the problems on the first try, they’re not going to suddenly acquire that knowledge in between.
      
      Besides, don’t most university psychology experiments with volunteers keep the protocol secret throughout the whole experiment and then debrief at the end?
      
      To some extent, but usually they still give some brief description of it beforehand, to attract people.
      
      Alternatively, have you thought about doing a crossover-style design?
      
      That’s a good idea, thanks.
      - ChristianKl 28 Apr 2015 10:29 UTC
        5 points
        
        But suppose I give people a pre-test, have them play for some minimum time, and then fill out the post-test when they’re done. I don’t see much in the way for random chance to confound things here: either they know the things needed for solving the tasks, or they don’t. If they didn’t know enough to solve the problems on the first try, they’re not going to suddenly acquire that knowledge in between.
        
        If I get a problem I can’t solve, I can Google it afterwards and read about how to solve it. Even if you lock me in a dark room, there’s the possibility that I recover forgotten knowledge if you give my brain a few hours.
        
        The pretest itself also provides practice. You need a control group, but it would be possible to give the control group nothing to do.
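
The crossover-style design suggested in the sub-thread above could be scheduled roughly as follows. This is only a sketch under assumptions: two sequences of equal size, a fixed random seed, and placeholder activity names.

```python
import random

def crossover_schedule(participants, seed=0):
    """Randomly split participants into the two sequences of a two-period crossover."""
    ids = list(participants)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    schedule = {}
    for pid in ids[:half]:  # group A: game first, control activity second
        schedule[pid] = ["game", "test part 1", "control", "test part 2"]
    for pid in ids[half:]:  # group B: control activity first, game second
        schedule[pid] = ["control", "test part 1", "game", "test part 2"]
    return schedule

schedule = crossover_schedule(["p1", "p2", "p3", "p4", "p5", "p6"])
for pid in sorted(schedule):
    print(pid, "->", ", ".join(schedule[pid]))
```

Everyone plays the game at some point, and the part 1 results still give a clean game-versus-control comparison before any carry-over between periods can occur.
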
  - afeller08 28 Apr 2015 7:02 UTC
    6 points
    If I were designing the experiment, I would have the control group play a different game instead of receiving maths instruction.
    
    You generally don’t want test subjects to know whether they are in the control condition or not. So if you’re going to make it maths instruction, you probably shouldn’t tell them what the experiment is designed to test at all, until you’re debriefing at the end. If you tell the people you are recruiting that you are testing the effects of playing computer games on statistical reasoning, then the people in the control condition won’t need to realize that what you’re really testing is whether your RPG in particular helps people think about statistics. They can just play Half-Life 2 or whatever you pick for them to play for a few minutes, and then take your tests afterwards.
  - SarahNibs 27 Apr 2015 18:17 UTC
    4 points
    Do you have access to units of caring?
    
    Are you trying to gain knowledge, get a piece of paper, both, one as a side effect of another?
    
    “actually graduate” versus “see if they learn anything” might hugely inform your process. Off-the-cuff I’m guessing you want to actually graduate first with hopes of nice learning side effects, then see if they learn anything via something that takes longer.
    
    Also a consideration: 3+ arms. Instruction game, instruction non-game, and non-instruction game. Also possibly non-instruction non-game.
    - Kaj_Sotala 27 Apr 2015 18:36 UTC
      1 point
      
      Do you have access to units of caring?
      
      To some limited extent.
      
      Off-the-cuff I’m guessing you want to actually graduate first with hopes of nice learning side effects, then see if they learn anything via something that takes longer.
      
      Correct.
  - [deleted] 28 Apr 2015 9:54 UTC
    3 points
    If you didn’t have any control group, you wouldn’t be able to interpret any improvement between pretest and posttest, if you observed such a pattern: repetition or practice effects could explain any improvement. If you observed no improvement, you wouldn’t need a control group because there’s no effect to be explained.
    
    Sometimes exploratory methods start out with no-control group pilots just to see if a method is potentially promising (if no hints of effects, don’t invest a lot of resources in trying to set up a proper study).
    
    Sometimes studies like this are set up with multiple control groups to address specific concerns that may apply to individual control conditions. Here it seems like two would be the minimum: one in which participants play a different game that is expected to confer no benefit for learning; and another with some kind of more traditional instruction.
    
    In cases like this, recruitment is usually very vague—giving participants a realistic impression of the kinds of tasks they will be asked to do, and definitely no indications about who is assigned to a “control” group.
  - Lumifer 27 Apr 2015 18:01 UTC
    1 point
    
    How insane would it be to just not have a control group?
    
    So, there is this blog/forum which tries to teach people rationality! and science! and proper ways to solve problems! It even hopes to raise the sanity waterline.
    
    And then “oh, but it’s inconvenient...” X-/
    - Kaj_Sotala 27 Apr 2015 18:38 UTC
      1 point
      There’s the extent to which I’m willing to go to raise the sanity waterline, and then there’s the extent to which I’m willing to go for the sake of possibly improving my grade on a work whose final grade nobody will really ever care about.
      - ChristianKl 28 Apr 2015 10:31 UTC
        4 points
        
        There’s the extent to which I’m willing to go to raise the sanity waterline, and then there’s the extent to which I’m willing to go for the sake of possibly improving my grade on a work whose final grade nobody will really ever care about.
        
        That might not be the most productive mindset. If you show that your game works at teaching Bayes, I would expect people to refer to your thesis from time to time.
      - Lumifer 27 Apr 2015 18:47 UTC
        2 points
        In this case I don’t quite understand what you are asking.
        
        LW is unlikely to know whether your adviser / committee will consider the absence of a control group acceptable enough for this project.
        Kaj_Sotala 28 Apr 2015 7:51 UTC
        3 points
        You’re right, I wasn’t very clear on my objectives. Also, my previous comment was needlessly snarky, for which I apologize.
        
        To be honest, I’m not very sure of what I want, myself. I have reason to believe that they’ll consider it acceptable regardless of whether there’s a control group or not (this being the CS department and not the psych one), so that’s not actually an issue. And I’ve got some desire to do things “properly”, for its own sake, and also because it might be fun to do this well enough to turn it into a real publication. But I’m also swamped with a bunch of other stuff and don’t have a chance to spend too much effort on this.
        
        So, I guess I dunno what I’m asking, myself.
        ChristianKl 28 Apr 2015 10:32 UTC
        6 points
        
        To be honest, I’m not very sure of what I want, myself. I have reason to believe that they’ll consider it acceptable regardless of whether there’s a control group or not (this being the CS department and not the psych one)
        
        How about going to the office hours of a professor in the psychology department and asking them for advice on how to run your study?
        Kaj_Sotala 29 Apr 2015 9:15 UTC
        3 points
        Your question made me go d’oh, in that I suddenly remembered that there’s an obvious place right nearby to ask for help, both for designing the study and recruiting test subjects. I’ll talk with them, thanks.
        [deleted] 28 Apr 2015 10:21 UTC
        3 points
        Speaking very practically—who will be marking/grading your project?
        
        If psychologists aren’t going to be looking at it, it’s surely going to be fine to do the intervention as best you can and then discuss implications and limitations (including the need for a control group) in whatever you have to write up. It’s not going to be publishable, but you can deal with that later. Depending on your circumstances, this would probably mean re-doing the study with random assignment to conditions, starting with your project study as a pilot/proof of concept.
        Kaj_Sotala 29 Apr 2015 9:04 UTC
        3 points
        It’s going to be graded by computer scientists, so yeah, I can get away with a less rigorous protocol than what psychologists would insist on. (And then collaborate with actual psychologists with more resources later on.)