Making Bayesian statistics easier and more accessible by coding advanced sampling algorithms for PyMC
Some background: I took statistics in high school because it seemed vaguely useful. Unfortunately, the material seemed very dry and involved mostly memorization, with few general principles. It was boring and limited. College statistics was the same thing. I did some internships, where statistics seemed very useful for figuring things out, but I didn’t know how to do very much.
Later I started reading Overcoming Bias, and Yudkowsky kept mentioning this thing called “Bayes theorem” and how it was really powerful. I read a stats book on Bayesian statistics and my mind was blown. The statistics I had been taught was a collection of formulas that gave answers but not much insight, but Bayes theorem encapsulated not just all of the statistics I had learned but the very notion of “learning from data.” I was hooked.
Later I figured out that the curse of dimensionality makes complex problems genuinely hard, even though the simple problems taught in stats classes are still easy: with, say, 100 grid points per dimension, naively evaluating a posterior over d parameters takes 100^d evaluations, which becomes infeasible almost immediately.
My project: Bayes theorem provides a simple coherent framework for learning from data. It massively clarifies how to think about data. It is something all engineers (and technical folk in general) could and should know. Not only is Bayesian stats very practical, but it turns a topic that even nerds find confusing and boring into something elegant and interesting.
I want to make fitting Bayesian models as thought-free as possible. Calculating the posterior distributions for your models is often very difficult and usually the most constraining issue. This is often true (though less so) even if you know a great deal about statistical computation.
As I have discussed here, I think the current lowest-hanging fruit is the use of gradients and higher derivatives in algorithms for sampling from the posterior distribution. Thus my project for the last year and more has been improving PyMC, a Python package for doing Bayesian inference: adding gradient information, implementing advanced general-purpose sampling algorithms from the literature, and improving PyMC’s syntax to make it simpler, more intuitive, and more powerful.
On my blog I linked to a package I built with a sampler based on Langevin dynamics (which uses gradient information), but more recently I have found that Hybrid (or Hamiltonian) Monte Carlo is practically much simpler and works much better. This is my Hamiltonian MC implementation.
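To give a flavor of what HMC looks like, here is a minimal NumPy sketch of the basic algorithm (leapfrog integration plus a Metropolis accept/reject step). This is an illustration rather than my actual PyMC implementation; the log-posterior logp and its gradient grad_logp are assumed to be supplied by the user.

```python
import numpy as np

def hmc_sample(logp, grad_logp, x0, n_samples=1000, step_size=0.1, n_steps=20):
    """Basic Hamiltonian Monte Carlo sketch. logp(x) is the log of the
    (unnormalized) target density; grad_logp(x) is its gradient."""
    rng = np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(x.shape)      # fresh momentum each iteration
        x_new, p_new = x.copy(), p.copy()
        # Leapfrog integration of the Hamiltonian dynamics
        p_new += 0.5 * step_size * grad_logp(x_new)
        for _ in range(n_steps - 1):
            x_new += step_size * p_new
            p_new += step_size * grad_logp(x_new)
        x_new += step_size * p_new
        p_new += 0.5 * step_size * grad_logp(x_new)
        # Metropolis correction for the discretization error
        h_old = -logp(x) + 0.5 * p @ p
        h_new = -logp(x_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < h_old - h_new:
            x = x_new
        samples.append(x.copy())
    return np.array(samples)

# Example: draws from a 10-dimensional standard normal
draws = hmc_sample(lambda x: -0.5 * x @ x, lambda x: -x, np.zeros(10))
```

The gradient is what lets the sampler take long, informed trajectories through the posterior instead of the blind random-walk steps of plain Metropolis, which is why it scales so much better with dimension.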
I am currently trying to improve my HMC sampler, and working on making PyMC faster and easier to maintain and extend.
If you know Bayesian stats and have some programming skills, I invite you to help me improve statistical computation! Just message me!
Why in Python instead of R? R is used much more widely among people actually doing statistics, as far as I know.
I know, but R is really, really terrible, and I hate working in it, while Python is a joy to use and develop in.
Out of curiosity, what don’t you like about R?
The biggest thing is probably its lack of good debugging tools. Its tracebacks are not very informative. Its handling of arrays is significantly inferior to NumPy’s. For example, R tends to have separate functions for applying a function along different dimensions of an array, whereas NumPy almost universally uses an argument to specify the axis along which to apply a function. Also, doing much of anything non-statistical is a royal pain.
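A small sketch of the array point, with the R equivalents noted in comments:

```python
import numpy as np

m = np.arange(6).reshape(2, 3)

# NumPy uses one function plus an axis argument:
m.sum(axis=0)   # sum down the columns -- R needs a separate function, colSums(m)
m.sum(axis=1)   # sum across the rows  -- R needs another one, rowSums(m)
m.sum()         # grand total          -- R: sum(m)
```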
What about your work in economics to develop a stronger intellectual justification for quasi-monetarist policies?
I would describe it more as a compilation/explanation of existing theory in a more accessible and non-ideological way, but yes, I am doing that too. I am a bit less optimistic about that, though, because it’s a big, complex topic and my explanation skills aren’t really that fabulous.
Which one?
It took me too long there to realize that you were asking which book, rather than which mind.
Er, you were asking about the book, right?
L.O.L.
It was Bolstad’s Introduction to Bayesian Statistics, but I now recommend Sivia’s Data Analysis: A Bayesian Tutorial, as I mentioned in the textbook thread.