jsteinhardt(Jacob Steinhardt)

Karma: 5,536
• Here’s my attempt (I haven’t read the comments above in detail, as I don’t want the answer spoiled in case I’m wrong).

For whatever reason, it is apparent that the conscious part of our brain is not fully aware of everything that our brain does. Now let’s imagine our brain executing some algorithm, and see what it looks like from the perspective of our consciousness. At any given stage in the algorithm, we might have multiple possible branches, and need to continue to execute the algorithm along one of those possible branches. To determine which branch to follow, we need to do some computation. But that computation isn’t done on a conscious level (or rather, sometimes it is, but the fastest computations are done on a subconscious level). However, the computation is done in parallel, so our consciousness “sees” all of the possible next steps, and then feels as if it is choosing one of them. In reality, that “choice” occurs when all of the subconscious processes terminate and we pick the choice with the highest score.

• This is also a problem I have thought about a bit. I plan to think about it more, organize my thoughts, and hopefully make a post about it soon, but in the meantime I’ll sketch my ideas. (It’s unfortunate that this comment appeared in a post that was so severely downvoted, as fewer people are likely to think about it now.)

There is no sense in which an absolute probability can be uncertain. Given our priors, and the data we have, Bayes’ rule can only give one answer.

However, there is a sense in which conditional probability can be uncertain. Since all probabilities in reality are conditional (at the very least, we have to condition on our thought process making any sense at all), it will be quite common in practice to feel uncertain about a probability, and to be well-justified in doing so.

Let me illustrate with the coin example. When I say that the next flip has a 50% chance of coming up heads, what I really mean is that the coin will come up heads in half of all universes that I can imagine (weighted by likelihood of occurrence) that are consistent with my observations so far.

However, we also have an estimate of another quantity, namely ‘the probability that the coin comes up heads’ (generically). I’m going to call this the weight of the coin since that is the colloquial term. When we say that we are 50% confident that the coin comes up heads (and that we have a high degree of confidence in our estimate), we really mean that we believe that the distribution over the weight of the coin is tightly concentrated about one-half. This will be the case after 10,000 flips, but not after 5 flips. (In fact, after N heads and N tails, a weight of x has probability proportional to [x(1-x)]^N.)
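As a quick sketch of this (the Beta-posterior formula is standard; the particular flip counts are just the illustrative numbers from above): with a uniform prior, N heads and N tails give a Beta(N+1, N+1) posterior over the weight, and we can check directly how concentrated it is.

```python
# Posterior over the coin's weight after N heads and N tails, starting
# from a uniform prior: Beta(N+1, N+1), with density proportional to [x(1-x)]^N.
def posterior_sd(n_heads, n_tails):
    a, b = n_heads + 1, n_tails + 1
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return variance ** 0.5

sd_after_10 = posterior_sd(5, 5)             # 5 heads, 5 tails
sd_after_20k = posterior_sd(10_000, 10_000)  # 10,000 heads, 10,000 tails
print(sd_after_10, sd_after_20k)
```

Both posteriors have mean one-half, but only the second is tightly concentrated there (standard deviation about 0.14 versus about 0.0035), which is exactly the distinction between the two cases above.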

What is important to realize is that the statement ‘the coin will come up heads with probability 50%’ means ‘I believe that in half of all conceivable universes the coin will come up heads’, whereas ‘I am 90% confident that the coin will come up heads with probability 50%’ means something more along the lines of ‘I believe that in 90% of all conceivable universes my models predict a 50% chance of heads’. But there is also the difference that in the second statement, the ’90% of all conceivable universes’ only actually specifies them up to the extent that our models need in order to take over.

I think that this is similar to what humans do when they express confidence in a probability. However, there is an important difference, as in the previous case my ‘confidence in a probability’ corresponded to some hidden parameter that dictated the results of the coin under repeated trials. The hidden parameter in most real-world situations is far less clear, and we also don’t usually get to see repeated trials (I don’t think this should matter, but unfortunately my intuition is frequentist).

• Darn, looks like I’ll be a bit late, as I have something to do in Kendall Square until 7pm. Still, I’m looking forward to meeting other LW users in Cambridge.

• I feel like this is actually quite hard for an AI. To talk about formal systems like mathematics in an intuitive sense, we need language. And language requires us to have almost every other system (except perhaps vision). In particular, we would need sophisticated systems for dealing with categories.

I personally think that the simplest AI-hard problem is coming up with a satisfactory system for category formation and inference. See for instance Goldstone and Kersten’s chapter 22 in the Handbook of Psychology for a discussion of the problem.

I actually disagree with wikipedia’s claim that vision is AI-hard. I think we could have good vision systems without a strong AI, and vice versa, although the latter would be quite crippled initially.

Interesting talk on Bayesians and frequentists

23 Oct 2010 4:10 UTC
11 points
• Bayesian approaches tend to be more powerful than other statistical techniques in situations where there is a relatively limited supply of data. This is because Bayesian approaches, being model-based, tend to have a richer structure that allows them to take advantage of more of the structure of the data; a second reason is that Bayes allows for the explicit integration of prior assumptions and is therefore usually a more aggressive form of inference than most frequentist methods.

I tried to find a good paper demonstrating this (called “learning from one example”); unfortunately I only came across this PhD thesis (http://www.cs.umass.edu/~elm/papers/thesis.pdf), although there is certainly a lot of work being done on generalizing from one, or a small number of, examples.

• Model selection is definitely one of the biggest conceptual problems in GAI right now (I would say that planning once you have a model is of comparable importance / difficulty). I think the way to solve this sort of problem is by having humans carefully pick a really good model (flexible enough to capture even unexpected situations while still structured enough to make useful predictions). Even with SVMs you are implicitly assuming some sort of structure on the data, because you usually transform your inputs into some higher-dimensional space consisting of what you see as useful features in the data.

Even though picking the model is the hard part, using Bayes by default seems like a good idea because it is the only general method I know of for combining all of my assumptions without having to make additional arbitrary choices about how everything should fit together. If there are other methods, I would be interested in learning about them.

What would the “really good model” for a GAI look like? Ideally it should capture our intuitive notions of what sorts of things go on in the world without imposing constraints that we don’t want. Examples of these intuitions: superficially similar objects tend to come from the same generative process (so if A and B are similar in ways X and Y, and C is similar to both A and B in way X, then we would expect C to be similar to A and B in way Y as well); temporal locality and spatial locality underlie many types of causality (so if we are trying to infer an input-output relationship, it should be highly correlated over inputs that are close in space/time); and, as a more concrete example, linear momentum tends to persist over short time scales. A lot of work has been done in the past decade on formalizing such intuitions, leading to nonparametric models such as Dirichlet processes and Gaussian processes. See for instance David Blei’s class on Bayesian nonparametrics (http://www.cs.princeton.edu/courses/archive/fall07/cos597C/index.html) or Michael Jordan’s tutorial on Dirichlet processes (http://www.cs.berkeley.edu/~jordan/papers/pearl-festschrift.pdf).
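To make the locality intuition concrete, here is a minimal sketch (pure Python; the unit length scale is an arbitrary assumption of mine) of the squared-exponential kernel that Gaussian processes commonly use to encode “outputs at nearby inputs are highly correlated”:

```python
import math

# Squared-exponential (RBF) kernel: the prior correlation between outputs
# at inputs x1 and x2 decays smoothly with their distance.
def rbf_kernel(x1, x2, length_scale=1.0):
    return math.exp(-((x1 - x2) ** 2) / (2 * length_scale ** 2))

print(rbf_kernel(0.0, 0.1))  # nearby inputs: correlation ≈ 0.995
print(rbf_kernel(0.0, 3.0))  # distant inputs: correlation ≈ 0.011
```

The length scale is the knob that says how far apart two inputs can be (in space or time) before their outputs become effectively independent.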

I’m beginning to think that a top-level post on how Bayes is actually used in machine learning would be helpful. Perhaps I will make one when I have a bit more time. Also, does anyone happen to know how to collapse URLs in posts (e.g. the equivalent of `<a href="...">test</a>` in HTML)?

• But even “learning to learn” is done in the context of a model, it’s just a higher-level model. There are in fact models that allow experience gained in one area to generalize to other areas (by saying that the same sorts of structures that are helpful for explaining things in one area should be considered in that other area). Talking about what an AI researcher would do is asking much more out of an AI than one would ask out of a human. If we could get an AI to even be as intelligent as a 3-year-old child then we would be more or less done. People don’t develop sophisticated problem solving skills until at least high school age, so it seems difficult to believe that such a problem is fundamental to AGI.

Another reference, this time on learning to learn, although unfortunately it is behind a pay barrier (Tenenbaum, Goodman, Kemp, “Learning to learn causal models”).

It appears that there is also a book on more general (mostly non-Bayesian) techniques for learning to learn: Sebastian Thrun’s book. I got the latter just by googling, so I have no idea what’s actually in it, other than by skimming through the chapter descriptions. It’s also not available online.

• I’m curious why this was down-voted, other than that it gives no explanation of what it links to?

• My main question is what your goal is in the post. It wasn’t clear to me upon a perusal, and I think it would be good to state it up-front regardless.

• It depends on what you mean by model selection. If you mean e.g. figuring out whether to use quadratics or cubics, then the standard solution that people cite is to use Bayesian Occam’s razor, i.e. compute

P(cubic | data) / P(quadratic | data) = [P(data | cubic) P(cubic)] / [P(data | quadratic) P(quadratic)],

where we compute the probabilities on the right-hand side by marginalizing over all cubics and quadratics. But the number you get out of this will depend strongly on how quickly the tails decay on your distribution over cubics and quadratics, so I don’t find this particularly satisfying. (I’m not alone in this, although there are people who would disagree with me or propose various methods for choosing the prior distributions appropriately.)
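As a toy numerical sketch of the Bayesian Occam’s razor (the prior grids, noise level, and data here are all arbitrary choices of mine): compare a linear model against a quadratic model on data that is actually linear, computing each marginal likelihood by averaging the likelihood over a discrete prior grid on the coefficients. The extra parameter spreads the quadratic model’s prior mass thin, so the simpler model wins automatically.

```python
import math

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # exactly linear data, y = 2x
sigma = 0.5           # assumed Gaussian noise level

def likelihood(predict):
    # Gaussian likelihood of the data given a prediction function
    # (normalization constant omitted; it cancels between the two models).
    return math.exp(-sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
                    / (2 * sigma ** 2))

grid = [i * 0.5 for i in range(-6, 7)]  # uniform prior grid over coefficients

# Marginal likelihood = average of the likelihood over the prior.
ml_linear = sum(likelihood(lambda x, a=a: a * x) for a in grid) / len(grid)
ml_quadratic = sum(likelihood(lambda x, a=a, b=b: a * x + b * x * x)
                   for a in grid for b in grid) / len(grid) ** 2

print(ml_linear > ml_quadratic)  # True: the simpler model is favored
```

This also illustrates the complaint above: widening or narrowing the coefficient grid (i.e. changing the tails of the prior) directly changes the marginal likelihoods, even though the data hasn’t changed.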

If you mean something else, like figuring out what specific model to pick out from your entire space (e.g. picking a specific function to fit your data), then you can run into problems like having to compare probability masses to probability densities, or comparing measures with different dimensionality (e.g. densities on the line versus the plane); a more fundamental issue is that picking a specific model potentially ignores other features of your posterior distribution, like how concentrated the probability mass is about that model.

I would say that the most principled way to get a single model out at the end of the day is variational inference, which basically attempts to set parameters in order to minimize the relative entropy between the distribution implied by the parameters and the actual posterior distribution. I don’t know a whole lot about this area, other than a couple papers I read, but it does seem like a good way to perform inference if you’d like to restrict yourself to considering a single model at a time.
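A minimal illustration of the variational idea (the grid, the target posterior, and the Gaussian family are all assumptions I’m making for the sake of the sketch): approximate a discretized posterior by searching a small family of discretized Gaussians for the member with the lowest relative entropy to it.

```python
import math

grid = [i / 100 for i in range(1, 100)]  # interior of (0, 1)

# Target posterior: density proportional to [x(1-x)]^2, discretized and normalized.
raw = [(x * (1 - x)) ** 2 for x in grid]
p = [r / sum(raw) for r in raw]

def kl(q, p):
    # Relative entropy KL(q || p), where q is the variational approximation.
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def gaussian_q(mu, s):
    # A Gaussian with mean mu and scale s, discretized and normalized on the grid.
    vals = [math.exp(-((x - mu) ** 2) / (2 * s ** 2)) for x in grid]
    total = sum(vals)
    return [v / total for v in vals]

# Variational step: pick the (mu, s) in the family minimizing KL(q || p).
candidates = [(mu / 20, s / 20) for mu in range(1, 20) for s in range(1, 10)]
best_mu, best_s = min(candidates, key=lambda ms: kl(gaussian_q(*ms), p))
print(best_mu)  # the target is symmetric about 0.5, so the best mean sits there
```

Real variational inference uses gradient-based optimization over continuous parameters rather than a grid search, but the objective being minimized is the same relative entropy.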

• I think I basically agree with you on that; whenever feasible the full posterior (as opposed to the maximum-likelihood model) is what you should be using. So instead of using “Bayesian model selection” to decide whether to pick cubics or quadratics, and then fitting the best cubic or the best quadratic depending on the answer, the “right” thing to do is to just look at the posterior distribution over possible functions f, and use that to get a posterior distribution over f(x) for any given x.

The problem is that this is not always reasonable for the application you have in mind, and I’m not sure if we have good general methods for coming up with the right way to get a good approximation. But certainly an average over the models is what we should be trying to approximate.

• “but that seems likely to make the house pretty unpleasant to be in afterward.”

But the house would only end up being unpleasant if it were used to prevent far more serious consequences. As a particularly effective deterrent, it still seems worth it.

• In particular, even if we use some form of approximate inference, there’s so many options out there (and probably none of them are good enough to be what humans actually use) that pseudolikelihood is not itself that likely.

Other versions of approximate inference: Markov-Chain Monte Carlo, Variational Inference, Loopy Belief Propagation.
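For concreteness, here is a minimal random-walk Metropolis sampler (the target, proposal width, and sample count are all my own illustrative choices), targeting the coin-weight posterior proportional to [x(1-x)]^N from the coin example earlier on this page:

```python
import random

random.seed(0)

N = 5
def target(x):
    # Unnormalized posterior over a coin's weight after N heads and N tails.
    return (x * (1 - x)) ** N if 0 < x < 1 else 0.0

samples = []
x = 0.5  # start in the interior, where the target is positive
for _ in range(20_000):
    proposal = x + random.gauss(0, 0.1)  # random-walk proposal
    # Metropolis acceptance: always move uphill, sometimes move downhill.
    if random.random() < target(proposal) / target(x):
        x = proposal
    samples.append(x)

print(sum(samples) / len(samples))  # close to 0.5, the posterior mean
```

Note that MCMC only ever needs the unnormalized target, which is why it is so widely applicable: the intractable normalization constant cancels in the acceptance ratio.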

Merely citing research to back up your claims doesn’t, in my opinion, make your arguments significantly stronger unless the research itself has been established fairly rigorously; for instance, the affect heuristic, while a popular theme on LessWrong, lacks experimental evidence.

• While, as noted by others, pseudolikelihood is unlikely to be what humans actually use, I think it is interesting to ask whether some cognitive biases come from some sort of approximate inference. Designing an experiment to test this conclusively would be quite challenging, but very useful.

• But, of course, the mathematics of probability theory don’t work that way. A hypothesis, such as that the apparent burglary in Filomena Romanelli’s room was staged—doesn’t get points for its ability to explain the data unless it does so better than its negation. And, in the absence of the assumption that Knox and Sollecito are guilty—if we’re presuming them to be innocent, as the law requires, or assigning a tiny prior probability to their guilt, as epistemic rationality requires—this contest is rigged. The standards for “explaining well” that the fake-burglary hypothesis has to meet in order to be taken seriously are much higher than those that its negation has to meet, because of the dependence relation that exists between the fake-burglary question and the murder question.

This isn’t quite true. If the prior probability of being a murderer is 1 in 10^6, and I can find 30 things that are explained twice as well by the murder hypothesis as the non-murder hypothesis, then the posterior probability of being a murderer is 99.9%, in the absence of mitigating factors (since 2^30/10^6 is about 1000). So, many pieces of weak evidence for an unlikely proposition can still establish that proposition.
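The arithmetic here can be checked directly (a sketch; the 1-in-10^6 prior and the 30 factor-of-two updates are the hypothetical numbers from the paragraph above, not real case statistics):

```python
prior = 1e-6                 # hypothetical prior probability of guilt
prior_odds = prior / (1 - prior)

# 30 independent pieces of evidence, each with likelihood ratio 2
posterior_odds = prior_odds * 2 ** 30

posterior = posterior_odds / (1 + posterior_odds)
print(posterior)  # ≈ 0.999: many weak updates overwhelm a tiny prior
```

Working in odds makes the accumulation transparent: each piece of evidence just multiplies the odds by its likelihood ratio.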