passive_fist comments on LINK: AI Researcher Yann LeCun on AI function

passive_fist 11 Dec 2013 21:12 UTC
0 points
0
Prediction by itself cannot solve causal decision problems (that’s why AIXI is not the same as just a Solomonoff predictor) but your example is incorrect. What you’re describing is a modelling problem, not a decision problem.
- IlyaShpitser 11 Dec 2013 21:41 UTC
  4 points
  0
  Parent
  Sorry, I am not following you. Decision problems have the form of “What do you do in situation X to maximize a defined utility function?”
  
  It is very easy to transform any causal modeling example into a decision problem. In this case: “here is an observational study where doctors give drugs to some cohort of patients. This is your data. Here’s the correct causal graph for this data. Here is a set of new patients from the same cohort. Your utility function rewards you for minimizing patient deaths. Your actions are ‘give the drug to everyone in the set’ or ‘do not give the drug to everyone in the set.’ What do you do?”
  
  Predictor algorithms, as understood by the machine learning community, cannot solve this class of problems correctly. These are not abstract problems! They happen all the time, and we need to solve them now, so you can’t just say “let’s defer solving this until we have a crazy detailed method of simulating every little detail of the way the HIV virus does its thing in these poor people, and the way this drug disrupts this, and the way side effects of the drug happen, etc. etc. etc.”
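
A minimal sketch of the drug example, to make the gap concrete. Everything here is an assumption for illustration: the numbers are made up, and the "correct causal graph" is taken to be severity S → drug A, S → death D, A → D, with S observed. Under those assumptions the conditional death rate (what a predictor reports) and the interventional death rate (what the decision needs) point in opposite directions.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration (not from any real study).
P_SEVERE = 0.5                              # P(S = 1): patient is severely ill
P_DRUG_GIVEN_SEVERITY = {0: 0.1, 1: 0.9}    # P(A = 1 | S): doctors favor treating the sick
P_DEATH = {                                 # P(D = 1 | S, A), keyed by (s, a)
    (0, 0): 0.10, (0, 1): 0.05,
    (1, 0): 0.60, (1, 1): 0.40,
}


def bernoulli(p, value):
    """Probability that a Bernoulli(p) variable equals value (0 or 1)."""
    return p if value == 1 else 1.0 - p


def joint(s, a, d):
    """Observational joint p(S=s, A=a, D=d) for the assumed graph S -> A, S -> D, A -> D."""
    return (bernoulli(P_SEVERE, s)
            * bernoulli(P_DRUG_GIVEN_SEVERITY[s], a)
            * bernoulli(P_DEATH[(s, a)], d))


def p_death_seeing(a):
    """P(D = 1 | A = a): the conditional death rate a pure predictor would report."""
    num = sum(joint(s, a, 1) for s in (0, 1))
    den = sum(joint(s, a, d) for s, d in product((0, 1), repeat=2))
    return num / den


def p_death_doing(a):
    """P(D = 1 | do(A = a)) via adjustment for S: sum_s P(D=1 | a, s) P(s)."""
    total = 0.0
    for s in (0, 1):
        p_s = sum(joint(s, a2, d) for a2, d in product((0, 1), repeat=2))
        p_d_given_as = joint(s, a, 1) / (joint(s, a, 0) + joint(s, a, 1))
        total += p_d_given_as * p_s
    return total


if __name__ == "__main__":
    for a in (0, 1):
        print(f"drug={a}: seeing {p_death_seeing(a):.3f}, doing {p_death_doing(a):.3f}")
```

With these made-up numbers, treated patients die more often than untreated ones in the observational data (about 0.365 vs 0.15), yet giving the drug to everyone lowers the death rate (0.225 vs 0.35). The adjustment step is only valid because the graph was assumed; with an unobserved confounder it would not be.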
  - V_V 12 Dec 2013 0:36 UTC
    2 points
    0
    Parent
    Bayesian network learning and Bayesian network inference can, in principle, solve that problem.
    
    Of course, if your model is wrong, and/or your dataset is degenerate, any approach will give you bad results: Garbage in, garbage out.
    - IlyaShpitser 12 Dec 2013 0:38 UTC
      3 points
      0
      Parent
      Bayesian networks are statistical, not causal models.
      - V_V 12 Dec 2013 12:53 UTC
        −1 points
        0
        Parent
        I don’t know what you mean by “causal model”, but Bayesian networks can deal with the type of problems you describe.
        IlyaShpitser 12 Dec 2013 13:42 UTC
        4 points
        0
        Parent
        A causal model to me is a set of joint distributions defined over potential outcome random variables.
        
        And no, regardless of how often you repeat it, Bayesian networks cannot solve causal problems.
        V_V 12 Dec 2013 16:01 UTC
        2 points
        0
        Parent
        I have no idea what you’re talking about.
        
        gjm asked you what a causal problem was, you didn’t provide a definition and instead gave an example of a problem which seems clearly solvable by Bayesian methods such as hidden Markov models (for prediction) or partially observable Markov decision processes (for decision).
        IlyaShpitser 12 Dec 2013 16:57 UTC
        0 points
        0
        Parent
        (a) Hidden Markov models and POMDPs are probabilistic models, not necessarily Bayesian.
        
        (b) I am using the standard definition of a causal model, first due to Neyman, popularized by Rubin. Everyone except some folks in the UK use this definition now. I am sorry if you are unfamiliar with it.
        
        (c) Statistical models cannot solve causal problems. The number of times you repeat the opposite, while adding the word “clearly” will not affect this fact.
        V_V 12 Dec 2013 18:40 UTC
        0 points
        0
        Parent
        
        (a) Hidden Markov models and POMDPs are probabilistic models, not necessarily Bayesian.
        
        According to Wikipedia:
        
        A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states. A HMM can be considered the simplest dynamic Bayesian network.
        
        .
        
        (b) I am using the standard definition of a causal model, first due to Neyman, popularized by Rubin. Everyone except some folks in the UK use this definition now. I am sorry if you are unfamiliar with it.
        
        I suppose you mean this.
        
        It seems to be a framework for the estimation of probability distributions from experimental data, under some independence assumptions.
        
        (c) Statistical models cannot solve causal problems. The number of times you repeat the opposite, while adding the word “clearly” will not affect this fact.
        
        You still didn’t define “causal problem” and what you mean by “solve” in this context.
        IlyaShpitser 12 Dec 2013 20:07 UTC
        2 points
        0
        Parent
        A “Bayesian network” is not necessarily a Bayesian model. Bayesian networks can be used with frequentist methods, and frequently are (see: the PC algorithm). I believe Pearl called the networks “Bayesian” to honor Bayes, and because of the way Bayes theorem is used when you shuffle probabilities around. The model does not necessitate Bayesian methods at all.
        
        I don’t mean to be rude, but are we operating at the level of string pattern matching, and google searches here?
        
        You still didn’t define “causal problem” and what you mean by “solve” in this context.
        
        Sociological definition : “a causal problem” is a problem that people who do causal inference study. Estimating causal effects. Learning cause-effect relationships from data. Mediation analysis. Interference analysis. Decision theory problems. To “solve” means to get the right answer and thereby avoid going to jail for malpractice.
        
        This is a bizarre conversation. Causal problems aren’t something esoteric. Imagine if you kept insisting I define what an algebra problem is. There are all sorts of things you could read on this standard topic.
        Lumifer 12 Dec 2013 20:37 UTC
        1 point
        0
        Parent
        
        This is a bizarre conversation.
        
        Looks like a perfectly normal conversation where people insist on using different terminology sets :-/
        IlyaShpitser 12 Dec 2013 20:52 UTC
        0 points
        0
        Parent
        One of these people has a good reason for preferring his terminology (e.g. it’s standard, it’s what everyone in the field actually uses, etc.) “Scott, can you define what a qubit is?”, etc.
        Lumifer 12 Dec 2013 21:16 UTC
        0 points
        0
        Parent
        
        it’s what everyone in the field actually uses
        
        Yes, but you are talking to people outside of the field.
        
        For example you tend to use the expression “prediction model” as an antonym to “causal model”. That may be standard in your field, but that’s not what it means outside of it.
        IlyaShpitser 12 Dec 2013 21:49 UTC
        0 points
        0
        Parent
        
        For example you tend to use the expression “prediction model” as an antonym to “causal model”.
        
        Not an antonym, just a different thing that should not be confused. A qubit is a very different thing from a bit, with different properties.
        
        That may be standard in your field, but that’s not what it means outside of it.
        
        “Sure, this definition of truth may be standard in your field, Prof. Tarski, but that’s not what we mean!” I guess we are done, then! Thanks for your time.
        V_V 12 Dec 2013 22:36 UTC
        0 points
        0
        Parent
        
        A “Bayesian network” is not necessarily a Bayesian model. Bayesian networks can be used with frequentist methods, and frequently are (see: the PC algorithm).
        
        You can use frequentist methods to learn Bayesian networks from data, as with any other Bayesian model.
        
        And you can also use Bayesian networks without priors to do things like maximum likelihood estimation, which isn’t Bayesian sensu stricto, but I don’t think this is relevant to this conversation, is it?
        
        I don’t mean to be rude, but are we operating at the level of string pattern matching, and google searches here?
        
        No, we are operating at the level of trying to make sense of your claims.
        
        Sociological definition : “a causal problem” is a problem that people who do causal inference study. Estimating causal effects. Learning cause-effect relationships from data. Mediation analysis. Interference analysis. Decision theory problems. To “solve” means to get the right answer and thereby avoid going to jail for malpractice.
        
        Please try to reformulate without using the word “cause/causal”.
        The term has multiple meanings. You may be using one of them assuming that everybody shares it, but that’s not obvious.
        IlyaShpitser 12 Dec 2013 23:00 UTC
        1 point
        0
        Parent
        I operate within the interventionist school of causality, whereby a causal effect has something to do with how interventions affect outcome variables. This is of course not the only formalization of causality, there are many many others. However, this particular one has been very influential, almost universally adopted among the empirical sciences, corresponds very closely to people’s causal intuitions in many important respects (and has the mathematical machinery to move far beyond when intuitions fail), and has a number of other nice advantages I don’t have the space to get into here (for example it helped to completely crack open the “what’s the dimension of a hidden variable DAG” problem).
        
        One consequence of the conceptual success of the interventionist school is that there is now a long list of properties we think a formalization of causality has to satisfy (that were first figured out within the interventionist framework). So we can now rule out bad formalizations of causality fairly easily.
        
        I think getting into the interventionist school is too long for even a top level post, let alone a response post buried many levels deep in a thread. If you are interested, you can read a book about it (Pearl’s book for example), or some papers.
        
        Prediction algorithms, as used in ML today, completely fail on interventionist causal problems, which correspond, loosely speaking, to trying to figure out the effect of a randomized trial from observational data. I am not trying to give them a hard time about it, because that’s not what the emphasis in ML is, which is perfectly fine!
        
        You can think of this problem as just another type of “prediction problem,” but this word usage simply does not conform to what people in ML mean by “prediction.” There is an entirely different theory, etc.
        Lumifer 12 Dec 2013 20:40 UTC
        0 points
        0
        Parent
        
        A causal model to me is a set of joint distributions defined over potential outcome random variables.
        
        Huh?
        
        Can you expand on this, with special attention to the difference between the model and the result of a model, and to the differences from plain-vanilla Bayesian models which will also produce joint distributions over outcomes.
        IlyaShpitser 12 Dec 2013 20:57 UTC
        2 points
        0
        Parent
        Sure. Here’s the world’s simplest causal graph: A → B.
        
        Rubin et al, who do not like graphs, will instead talk about a joint distribution:
        
        p(A, B(a=1), B(a=0))
        
        where B(a=1) means ‘random variable B under intervention do(a=1)’. Assume binary A for simplicity here.
        
        A causal model over A,B is a set of densities { p(A, B(a=1), B(a=0)) | [ some property ] }. The causal model for this graph would be:
        
        { p(A, B(a=1), B(a=0)) | B(a=1) is independent of A, and B(a=0) is independent of A }
        
        These assumptions are called ‘ignorability assumptions’ in the literature, and they correspond to the absence of confounding between A and B. Note that it took counterfactuals to define what ‘absence of confounding’ means.
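
A short derivation may help show what ignorability buys. It relies on one extra assumption not stated above, the standard consistency property (if A = a is observed, then the observed B equals B(a)). Under ignorability plus consistency, the interventional distribution reduces to an ordinary conditional one:

```latex
\begin{align*}
p\bigl(B(a) = b\bigr)
  &= p\bigl(B(a) = b \mid A = a\bigr)
     && \text{(ignorability: } B(a) \perp\!\!\!\perp A \text{)} \\
  &= p\bigl(B = b \mid A = a\bigr)
     && \text{(consistency: } A = a \Rightarrow B = B(a) \text{)}
\end{align*}
```

Without ignorability the first equality fails, which is why the two quantities can differ when A and B are confounded.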
        
        A regular Bayesian network model for this graph is just the set of densities over A and B (since this graph has no d-separation statements). That is, it is the set { p(A,B) | [no assumptions] }. This is a ‘statistical model,’ because it is a set of regular old joint densities, with no mention of counterfactuals or interventions anywhere.
        
        The same graph can correspond to very different things, you have to specify.
        
        You could also have assumptions corresponding to “missing graph edges.” For example, in the instrumental variable graph:
        
        Z → A → B, with A ← U → B, where we do not see U, we would have an assumption that states that B(a,z) = B(a,z’) for all a,z,z’.
        
        Please don’t say “Bayesian model” when you mean “Bayesian network.” People really should say “belief networks” or “statistical DAG models” to avoid confusion.
        Lumifer 12 Dec 2013 21:28 UTC
        0 points
        0
        Parent
        
        Please don’t say “Bayesian model” when you mean “Bayesian network.”
        
        I do not mean “Bayesian networks”. I mean Bayesian models of the kind e.g. described in Gelman’s Bayesian Data Analysis.
        
        p(A, B(a=1), B(a=0)) where B(a=1) means ‘random variable B under intervention do(a=1)’. Assume binary A for simplicity here.
        
        You still can express this as plain-vanilla conditional densities, can’t you? “under intervention do(a=1)” is just a different way of saying “conditional on A=1”, no?
        
        A causal model over A,B is a set of densities { p(A, B(a=1), B(a=0)) | [ some property ] }
        
        and
        
        with no mention of counterfactuals or interventions anywhere.
        
        I don’t see counterfactuals in your set of densities and how “interventions” are different from conditionality?
        IlyaShpitser 12 Dec 2013 21:43 UTC
        2 points
        0
        Parent
        
        You still can express this as plain-vanilla conditional densities, can’t you?
        
        No. If conditioning was the same as interventions I could make it rain by watering my lawn and become a world class athlete by putting on a gold medal.
        Lumifer 12 Dec 2013 21:52 UTC
        0 points
        0
        Parent
        
        If conditioning was the same as interventions I could make it rain by watering my lawn
        
        I don’t understand—can you unroll?
        IlyaShpitser 12 Dec 2013 22:45 UTC
        2 points
        0
        Parent
        Well, since p(rain | grass wet) is high, it seems making the grass wet via a garden hose will make rain more likely. Of course you might say that “making the grass wet” and “seeing the grass wet” is not the same thing, in which case I agree!
        
        The fact that these are not the same thing is why people say conditioning and interventions are not the same thing.
        
        You can of course say that you can still use the language of conditional probability to talk about “doing events” vs “seeing events.” But then you are just reinventing interventions (as will become apparent if you try to figure out axioms for your notation).
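
A minimal sketch of "seeing" versus "doing" for the lawn example. The graph (rain → wet grass) and the probabilities are assumptions picked for illustration. The intervention is modeled as graph surgery: the mechanism that normally sets the grass state from rain is replaced by a constant.

```python
# Hypothetical numbers for the assumed graph: rain -> wet grass.
P_RAIN = 0.2
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}


def observational_joint():
    """p(rain, wet) when the grass is left alone."""
    joint = {}
    for rain in (True, False):
        p_rain = P_RAIN if rain else 1.0 - P_RAIN
        for wet in (True, False):
            p_wet = P_WET_GIVEN_RAIN[rain] if wet else 1.0 - P_WET_GIVEN_RAIN[rain]
            joint[(rain, wet)] = p_rain * p_wet
    return joint


def intervened_joint(wet_value):
    """p(rain, wet) under do(wet = wet_value): the rain -> wet mechanism is cut."""
    joint = {}
    for rain in (True, False):
        p_rain = P_RAIN if rain else 1.0 - P_RAIN
        for wet in (True, False):
            joint[(rain, wet)] = p_rain * (1.0 if wet == wet_value else 0.0)
    return joint


def p_rain_given_wet(joint):
    """P(rain | wet) computed by ordinary conditioning on the supplied joint."""
    return joint[(True, True)] / (joint[(True, True)] + joint[(False, True)])


if __name__ == "__main__":
    print("seeing wet grass:", round(p_rain_given_wet(observational_joint()), 3))
    print("hosing the grass:", round(p_rain_given_wet(intervened_joint(True)), 3))
```

Seeing wet grass raises the probability of rain from the prior 0.2 to about 0.69; hosing the grass leaves it at 0.2. Both numbers come from conditioning, but on two different joint distributions, and specifying how the second is obtained from the first is what the do notation is for.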
        Lumifer 13 Dec 2013 1:18 UTC
        0 points
        0
        Parent
        
        Well, since p(rain | grass wet) is high, it seems making the grass wet via a garden hose will make rain more likely.
        
        That’s a strawman. The conditional probability we’re talking about has a clear (if explicitly unstated) temporal ordering: P(rain in the past | wet grass in the present).
        
        But then you are just reinventing interventions
        
        Talking about conditional probability was widespread long before people started talking about interventions.
        
        It seems to me that the language of interventions, etc. is just a formalism that is convenient for certain types of analysis, but I’m not seeing that it means anything new.
        pragmatist 13 Dec 2013 5:44 UTC
        4 points
        0
        Parent
        
        That’s a strawman. The conditional probability we’re talking about has a clear (if explicitly unstated) temporal ordering: P(rain in the past | wet grass in the present).
        
        You seem to be missing Ilya’s point. He was arguing that if you regard “under intervention do(A = 1)” as equivalent to “conditional on A = 1” (as you suggested in a previous comment), then you should regard P(rain | do(grass wet)) as equivalent to P(rain | grass wet). But these are not in fact equivalent, and adding temporal ordering in there doesn’t make them equivalent either. P(rain in the past | do(wet grass) in the present) = P(rain in the past), but P(rain in the past | wet grass in the present) != P(rain in the past).
        Lumifer 13 Dec 2013 16:23 UTC
        0 points
        0
        Parent
        
        He was arguing that if you regard “under intervention do(A = 1)” as equivalent to “conditional on A = 1” (as you suggested in a previous comment), then you should regard P(rain | do(grass wet)) as equivalent to P(rain | grass wet).
        
        There is obviously a difference between observational data and experiments.
        
        But these are not in fact equivalent
        
        No, because they’re modeling different reality.
        pragmatist 13 Dec 2013 19:12 UTC
        2 points
        0
        Parent
        
        There is obviously a difference between observational data and experiments.
        
        Yes! The difference is that experiments involve intervention. I thought the necessity of formalizing the notion of intervention is precisely what was under dispute here.
        Lumifer 13 Dec 2013 20:19 UTC
        0 points
        0
        Parent
        Well, kinda. I am not sure whether the final output—the joint densities of outcomes—will be different in a causal model compared to a properly specified conventional model.
        
        To continue with the same example, it suffers from the expression “wet grass” meaning two different things—either “I see wet grass” or “I made grass wet”. This is your difference between just (a=1) and do(a=1) -- but conventional non-causal modeling doesn’t have huge problems with this, it is fully aware of the difference.
        
        And I don’t know if it’s necessary to formalize intervention. I freely concede that it’s useful in certain areas but not so sure that’s true for all areas.
        Vaniver 14 Dec 2013 10:57 UTC
        0 points
        0
        Parent
        
        Well, kinda. I am not sure whether the final output—the joint densities of outcomes—will be different in a causal model compared to a properly specified conventional model.
        
        So, we could add a node to the graph for every single node, which corresponds to whether or not that node was the subject of an intervention. So you would talk about P(rain|grass is wet, ~I made it rain, ~I made the grass wet) vs. P(rain|grass is wet, ~I made it rain, I made the grass wet). But this means doubling the number of nodes in the dataset (which, since the number of probabilities is exponential in the number of nodes for a discrete dataset, is a terrible idea). You also might want to throw in a lot of consistency constraints which are not guaranteed to hold in an arbitrary graph, which makes things more awkward.
        
        It is much simpler, conceptually and practically, to just have a rule to determine how interventions differ from observations in updating the state of the graph, that is, talking about P(rain|grass is wet) vs. P(rain|do(grass is wet)).
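
A rough sketch of the "extra node" construction, under assumed numbers. A hypothetical indicator node `hose` ("I made the grass wet") is added as a second parent of the grass node; when it is on, it overrides the rain mechanism. Plain conditioning in this augmented model then reproduces the interventional answer.

```python
from itertools import product

# Hypothetical numbers; `hose` is the added intervention-indicator node.
P_RAIN = 0.2
P_HOSE = 0.3
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}


def augmented_joint():
    """p(rain, hose, wet) where hose=True forces wet=True regardless of rain."""
    joint = {}
    for rain, hose, wet in product((True, False), repeat=3):
        p_rain = P_RAIN if rain else 1.0 - P_RAIN
        p_hose = P_HOSE if hose else 1.0 - P_HOSE
        if hose:
            p_wet = 1.0 if wet else 0.0
        else:
            p_wet = P_WET_GIVEN_RAIN[rain] if wet else 1.0 - P_WET_GIVEN_RAIN[rain]
        joint[(rain, hose, wet)] = p_rain * p_hose * p_wet
    return joint


def p_rain_given(joint, hose, wet=True):
    """P(rain | hose, wet) by ordinary conditioning in the augmented model."""
    num = joint[(True, hose, wet)]
    den = num + joint[(False, hose, wet)]
    return num / den


if __name__ == "__main__":
    j = augmented_joint()
    print("P(rain | wet, ~hosed):", round(p_rain_given(j, hose=False), 3))
    print("P(rain | wet,  hosed):", round(p_rain_given(j, hose=True), 3))
```

The cost mentioned above shows up even here: two binary nodes need a 4-entry joint table, while the augmented model with one indicator already needs 8, and an indicator per node would need 16. The override rule inside `augmented_joint` is also exactly the intervention rule, just written into a conditional probability table.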
        IlyaShpitser 14 Dec 2013 14:43 UTC
        4 points
        0
        Parent
        
        So, we could add a node to the graph for every single node, which corresponds to whether or not that node was the subject of an intervention.
        
        In fact, Phil Dawid does precisely this. What he ends up with is still interventions. (Of course he (I think!) does not believe in counterfactuals, but that is a long discussion.)
        Lumifer 15 Dec 2013 0:42 UTC
        0 points
        0
        Parent
        
        So, we could add a node to the graph for every single node
        
        That assumes we’re doing graphs and networks.
        
        My problems in this subthread really started when the causal model was defined as “a set of joint distributions defined over potential outcome random variables”—notice how nothing like networks or interventions is mentioned here—and I got curious why a plain-vanilla Bayesian model which also produces a set of joint distributions doesn’t qualify.
        
        It probably just was a bad definition.
        IlyaShpitser 22 Aug 2014 5:12 UTC
        2 points
        0
        Parent
        Sorry this is a response to an old comment, but this is an easy to clarify question.
        
        A potential outcome Y(a) is a random variable under an intervention, e.g. Y under do(a). It’s just a different notation from a different branch of statistics.
        
        We may or may not choose to use graphs to represent causality (or indeed probability). Some people like graphs, others do not. Graphs do not add anything, they are just a visual representation.
        IlyaShpitser 13 Dec 2013 12:01 UTC
        2 points
        0
        Parent
        I agree with pragmatist’s explanation. But let me add a bit more detail to illustrate that a temporal ordering will not save you here. Imagine instead of two variables we have three variables : rain (R), my grass being wet (G1), and my neighbor’s grass being wet (G2). Clearly R precedes both G1 and G2, and G1 and G2 are contemporaneous. In fact, we can even consider G2 to be my neighbor’s grass 1 hour in the future (so clearly G1 precedes G2!).
        
        Also clearly, p(R = yes | G1 = wet) is high, and p(R = yes | G2 = wet) is high, also p(G1 = wet | R = yes) is high, and p(G2 = wet | R = yes) is high.
        
        So by hosing my grass I am making it more likely that my neighbor’s grass one hour from now will be wet?
        
        Or, to be more succinct : http://www.smbc-comics.com/index.php?db=comics&id=1994#comic
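
A small simulation of the three-variable example, as a sketch. The graph R → G1, R → G2 and all probabilities are assumptions for illustration. It compares how often the neighbor's grass is wet among worlds where my grass is seen to be wet with how often it is wet when I hose my grass myself.

```python
import random

# Hypothetical numbers for the assumed graph R -> G1, R -> G2.
P_RAIN = 0.2
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}


def sample_world(rng, hose_my_lawn=False):
    """Draw (R, G1, G2). With hose_my_lawn=True, G1 is set wet by intervention."""
    rain = rng.random() < P_RAIN
    g1 = True if hose_my_lawn else rng.random() < P_WET_GIVEN_RAIN[rain]
    g2 = rng.random() < P_WET_GIVEN_RAIN[rain]
    return rain, g1, g2


def freq_g2_wet_given_g1_wet(n, hose_my_lawn, seed=0):
    """Fraction of sampled worlds with G2 wet, among those where G1 is wet."""
    rng = random.Random(seed)
    hits = total = 0
    for _ in range(n):
        _, g1, g2 = sample_world(rng, hose_my_lawn)
        if g1:
            total += 1
            hits += g2
    return hits / total


if __name__ == "__main__":
    n = 200_000
    print("seeing G1 wet:", freq_g2_wet_given_g1_wet(n, hose_my_lawn=False))
    print("hosing G1 wet:", freq_g2_wet_given_g1_wet(n, hose_my_lawn=True))
```

Under these assumptions the exact values are 0.17 / 0.26 ≈ 0.654 for seeing and 0.26 for doing, and the simulated frequencies should land close to them. Since G2 can be read as "one hour later", the time ordering G1 before G2 does nothing to make the two agree.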
        Lumifer 13 Dec 2013 16:12 UTC
        −2 points
        0
        Parent
        Yeah, well, I’ve heard somewhere that correlation does not equal causation :-)
        
        I agree that causal models are useful, if only because they make explicit certain relationships which are implicit in plain-vanilla regular models and so trip up people on a regular basis. What I’m not convinced of is that you can’t re-express that joint density on the outcomes in a conventional way even if it turns out to look a bit awkward.
        IlyaShpitser 13 Dec 2013 17:22 UTC
        8 points
        0
        Parent
        Here’s how this conversation played out.
        
        Lumifer : “can we not express cause effect relationships via conditioning probabilities?”
        
        me : “No: [example].”
        
        Lumifer : “Ah, but this is silly because of time ordering information.”
        
        me : “Time ordering doesn’t matter: [slight modification of example].”
        
        Lumifer : “Yeah… causal models are useful, but it’s not clear they cannot be expressed via conditioning probabilities.”
        
        I guess you can lead a horse to water, but you can’t make him drink. I have given you everything, all you have to do is update and move on. Or not, it’s up to you.
        Lumifer 13 Dec 2013 17:26 UTC
        0 points
        0
        Parent
        Yes, I’m a picky sort of a horse :-) Thanks for the effort, though.
  - passive_fist 11 Dec 2013 21:49 UTC
    0 points
    0
    Parent
    
    Decision problems have the form of “What do you do in situation X to maximize a defined utility function?”
    
    Yes, but what you are describing is a modelling problem. “Is the drug killing them or helping them?” is not a decision problem, although “Which drug should we give them to save their lives?” is. These are two very different problems, possibly with different answers!
    
    It is very easy to transform any causal modeling example into a decision problem.
    
    Yes, but in the process it becomes a new problem. Although, you are right that modelling is in some respects an ‘easier’ problem than making decisions. That’s also the reason I wrote my top-level comment, saying that it is true that something you can identify in an AI is the ability to model the world.
    - IlyaShpitser 12 Dec 2013 10:53 UTC
      2 points
      0
      Parent
      I guess my point was that there is a trivial reduction (in the complexity theory sense of the word) here, namely that decision theory is “modeling-complete.” In other words, if we had an algorithm for solving a certain class of decision problems correctly, we automatically have an algorithm for correctly handling the corresponding model (otherwise how could we get the decision problem right?).
      
      Prediction cannot solve causal decision problems, but the reason it cannot is that it cannot solve the underlying modeling problem correctly. (If it could, there is nothing more to do, just integrate over the utility).
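
For concreteness, "integrate over the utility" can be written out as follows. The notation is an assumption for illustration: a ranges over actions, Y(a) is the outcome under intervention do(a), U is the utility function, and Y is taken to be discrete.

```latex
\[
a^{*}
  = \arg\max_{a} \, \mathbb{E}\bigl[\, U\bigl(Y(a)\bigr) \bigr]
  = \arg\max_{a} \sum_{y} U(y)\, p\bigl(Y(a) = y\bigr)
\]
```

All the difficulty sits in obtaining p(Y(a) = y); once the interventional distribution is known, the remaining step is an expectation and a comparison.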