gwern comments on Open Thread, March 1-15, 2013

gwern 16 Mar 2013 2:54 UTC
8 points
I finished Coursera’s “Data Analysis” course last night. (It started back in January.)

It’s basically “applied statistics/some machine learning in R”: we get a quick tour of data cleanup and munging, basic stats, material on working with linear & logistic models, use of common visualization and clustering approaches, prediction with linear regression and trees and random forests, then uses of simulation such as bootstrapping.

There’s a lot of material to cover, and while there are plenty of worked-out examples in the lectures, I don’t see anyone learning R or statistics just from this course: you should definitely have used R to some degree before (at least running some t-tests or graphs), and you will definitely benefit from already knowing what a p-value is and how you would calculate it by hand (because, e.g., you’ll be flummoxed when the lecturer, Leek, works out a confidence interval ‘by hand’ while coding: “where does this magic value 1.96 come from?!”).
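For anyone who does wonder about the 1.96: it is the 97.5th percentile of the standard normal distribution, so ±1.96 standard errors leaves 2.5% in each tail. Here is a minimal sketch in Python (not the course's R code; `normal_ci` is a name I made up for illustration), assuming a sample large enough for the normal approximation to be reasonable. For small samples one would normally use a t quantile instead.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def normal_ci(sample, confidence=0.95):
    """Normal-approximation confidence interval for the mean of `sample`.

    Hypothetical helper for illustration only.
    """
    # Two-sided interval: put (1 - confidence) / 2 of the mass in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))  # standard error of the mean
    return m - z * se, m + z * se


# The "magic value": the 97.5th percentile of the standard normal.
z95 = NormalDist().inv_cdf(0.975)
print(round(z95, 2))  # 1.96
```

Changing `confidence` to 0.99 gives a larger multiplier (about 2.58), which is why the constant changes whenever the confidence level does.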

On the plus side, I liked all the examples and the curriculum seems useful and well-chosen. It’s a reasonable introduction to ‘data science’. I think my time wasn’t wasted doing this Coursera: I’m more comfortable with some of the more advanced/exotic techniques, and picked up many R tips, some of which have come in handy already (e.g. some of the data munging tips were useful in working with Touhou music data, and I’ve been able to replace all my homebrew Haskell multiple-correction code in various nootropics & Zeo experiments with a standard R library function, p.adjust, which I had no idea existed until the lecture on multiple comparisons introduced it to me), although as of yet I have not used bootstraps or random forests* or splines in anger. (But if anyone is thinking about doing it in the future, see my comment about the prerequisites.)
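To give a flavor of what a multiple-comparison correction does, here is a rough Python sketch of two standard methods, Bonferroni and Holm (Holm is, I believe, the default method of p.adjust). This is not R's implementation and not my old Haskell code; the function names and the p-values are made up for illustration.

```python
def bonferroni(pvalues):
    """Bonferroni: multiply every p-value by the number of tests, cap at 1."""
    n = len(pvalues)
    return [min(1.0, p * n) for p in pvalues]


def holm(pvalues):
    """Holm step-down: the k-th smallest p-value (k = 0, 1, ...) is
    multiplied by (n - k), then a running maximum keeps the adjusted
    values in the same order as the raw ones."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (n - rank) * pvalues[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Made-up p-values from four hypothetical tests.
raw = [0.01, 0.04, 0.03, 0.005]
print([round(p, 3) for p in bonferroni(raw)])  # [0.04, 0.16, 0.12, 0.02]
print([round(p, 3) for p in holm(raw)])        # [0.03, 0.06, 0.06, 0.02]
```

Holm never adjusts more harshly than Bonferroni, which is one reason to prefer it when all you need is family-wise error control.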

On the negative side: like most of the other students, I think this should’ve been a longer course than 8 weeks and that the estimate of 5hrs/wk is misleading. The pace was very unforgiving. I was relatively well-prepared for this course, but I still wound up submitting a paper for the second data analysis assignment that I think was very substandard. Why? Well, though we had two weeks or so to do it, I deliberately didn’t do much work on it in the first week because in the first assignment you couldn’t do a good job without the lectures from the week before the assignment was due, and I didn’t want to get bushwhacked again; but in the actual week before, I got completely distracted by my Touhou music project, and so I wound up just not having the time or energy to do it. Similar things happened to a lot of other students: there was no slack or recovery time.

(There were also the usual teething problems of any new course: wrong or misleading quizzes, errors in lectures, that sort of thing. The peer review grading seems particularly poor, with the required grades being based on pretty superficial aspects of the submitted analyses.)

* EDIT: I have since employed random forests or bootstrapping in http://www.gwern.net/Weather , http://www.gwern.net/hpmor , & http://www.gwern.net/Google%20shutdowns