paulfchristiano comments on Finite Factored Sets

paulfchristiano 24 May 2021 18:56 UTC
LW: 35 AF: 22
AF
Here is how I’m currently thinking about this framework and especially inference, in case it’s helpful for other folks who have similar priors to mine (or in case something is still wrong).
A description of traditional causal models:
- A causal graph with N nodes can be viewed as a model with 2N variables, one for each node of the graph and a corresponding noise variable for each. Each real variable is a deterministic function of its corresponding noise variable + its parents in the causal graph.
- When we talk about causal inference, we often consider probabilities in “generic position”: we ask what kind of graphs would give rise to the observed conditional independence relations (and no others) regardless of the probability distributions and functions.
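To make the "2N variables" picture concrete, here is a minimal sketch. It assumes a hypothetical three-node chain A → B → C over bits with XOR mechanisms; the graph, the names `PARENTS`, `MECHANISMS` and `evaluate`, and the mechanisms themselves are illustrative choices, not anything from the original comment.

```python
# Hypothetical causal graph: A -> B -> C (N = 3 nodes, so 2N = 6 variables).
PARENTS = {"A": [], "B": ["A"], "C": ["B"]}
TOPOLOGICAL_ORDER = ("A", "B", "C")

# Each real variable is a deterministic function of its own noise variable
# and the values of its parents. XOR is an arbitrary illustrative choice.
MECHANISMS = {
    "A": lambda noise, parents: noise,
    "B": lambda noise, parents: parents["A"] ^ noise,
    "C": lambda noise, parents: parents["B"] ^ noise,
}


def evaluate(noise):
    """Compute every real variable from one assignment of the noise variables."""
    values = {}
    for node in TOPOLOGICAL_ORDER:
        parent_values = {p: values[p] for p in PARENTS[node]}
        values[node] = MECHANISMS[node](noise[node], parent_values)
    return values
```

Once the noise assignment is fixed, nothing random is left: all of the randomness lives in the N noise variables, and the N real variables are computed from them.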
In the factored sets framework:
- We still posit a bunch of independent noise variables, and our observations are still deterministic functions of these noise variables.
- But we no longer make any structural assumption about an underlying graph—the deterministic functions can be arbitrary.
- It’s easy to prove that for any deterministic function there is a unique “smallest” set of variables on which it potentially depends. We call this the history.
- Now we define “Y is after X” to mean that X depends on a strict subset of the variables on which Y depends.
- (For a traditional causal model, this is equivalent to the usual definition, since the history of a variable is the set of noise variables upstream of it.)
- We can ask the same kinds of questions we asked before: what properties are true about all models that can give rise to the observed conditional independence relations (and no others) regardless of the probability distribution on the noise variables. For example, we can ask whether “Y is after X” is true in all of these models.
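The history and "after" definitions above can be sketched by brute force. This assumes a small finite product of noise variables in which every combination of values is possible; the function names and the three example variables are hypothetical, and this is the simplified "variables" presentation from this comment, not the partition-based definition from the post.

```python
from itertools import product


def history(f, domains):
    """Smallest set of noise-variable indices that f's output can depend on.

    Index i is included iff changing only coordinate i changes f at some
    point of the full product space.
    """
    relevant = set()
    for i, dom in enumerate(domains):
        for point in product(*domains):
            base = f(point)
            if any(f(point[:i] + (alt,) + point[i + 1:]) != base for alt in dom):
                relevant.add(i)
                break
    return relevant


def is_after(g, f, domains):
    """True when g is 'after' f: f's history is a strict subset of g's."""
    return history(f, domains) < history(g, domains)


# Three hypothetical independent noise bits n[0], n[1], n[2].
DOMAINS = [(0, 1)] * 3


def x(n):
    return n[0]


def y(n):
    return n[0] ^ n[1]


def w(n):
    # Mentions n[1] syntactically, but the output never depends on it.
    return (n[1] ^ n[1]) | n[2]
```

Here `is_after(y, x, DOMAINS)` holds because the history of `x` is a strict subset of the history of `y`, while `w` is neither before nor after `x`. The history tracks what a function can depend on, not which arguments its definition happens to mention.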
Some other thoughts:
- We can enlarge Pearl’s framework to allow deterministic nodes. If we do, both of these frameworks are compatible with the same set of distributions about the world (and can represent the same sets of distributions when we take the probabilities to be arbitrary, as long as we allow the deterministic nodes to be fixed).
- But this makes it hard to say much of anything in Pearl’s framework. The basic problem is that for all you know Pearl’s models now look like the factored sets models—there could be a bunch of independent random variables, and then everything we observe is a deterministic function of them. In fact this seems like a pretty natural view of the real world. In this case there are no causal relationships between anything we observe, and it just feels like many of Pearl’s definitions aren’t a good fit. (I initially thought the d-separation criterion still worked but Scott points out that it definitely doesn’t.)
- I think the definition of history is the most natural way to recover something like causal structure in these models. I can’t think of any other good options, and I don’t currently see any major downsides or intuition-violations in this approach. And then as far as I can tell most everything else follows naturally from this key assumption (I haven’t thought about alternative definitions of conditional orthogonality but it feels intuitively like there must be only one reasonable definition).
- It’s not a priori clear to me whether or not the fundamental theorem would hold. But it clearly should if this notion of history/orthogonality is a good one. Given that it holds I’m inclined to accept this as a good generalization of Pearl’s notion to models with determinism / with more events than noise variables. I’d expect to revise that position if someone turned up a case where the definitions felt wrong, or if someone was able to offer a similarly-compelling alternative.
- I’m not a big fan of the name “finite factored sets” because it feels to me like it doesn’t emphasize the most important difference between these frameworks. Maybe my perspective on the terminology would shift if I thought about it more.
- cousin_it 25 May 2021 10:40 UTC
  LW: 12 AF: 5
  AF Parent
  I think the definition of history is the most natural way to recover something like causal structure in these models.
  
  I’m not sure how much it’s about causality. Imagine there’s a bunch of envelopes with numbers inside, and one of the following happens:
  1. Alice peeks at three envelopes. Bob peeks at ten, which include Alice’s three.
  2. Alice peeks at three envelopes and tells the results to Bob, who then peeks at seven more.
  3. Bob peeks at ten envelopes, then tells Alice the contents of three of them.
  Under the FFS definition, Alice’s knowledge in each case is “strictly before” Bob’s. So it seems less of a causal relationship and more like “depends on fewer basic facts”.
  - paulfchristiano 25 May 2021 16:18 UTC
    LW: 19 AF: 10
    AF Parent
    Agree it’s not totally right to call this a causal relationship.
    That said:
    The contents of 3 envelopes do seem causally upstream of the contents of 10 envelopes.
    If Alice’s perception is imperfect (in any possible world), then “what Alice perceived” is not identical to “the contents of 3 envelopes” and so is not strictly before “what Bob perceived” (unless there is some other relationship between them).
    If Alice’s perception is perfect in every possible world, then there is no possible way to intervene on Alice’s perception without intervening on the contents of the 3 envelopes. So it seems like a lot rests on whether you are restricting your attention to possible worlds.
    Even if Alice’s perception is perfect (or if Bob is guaranteed to tell Alice the contents of the 3 envelopes) we can still imagine an intervention on Alice’s perception, and in your stories it seems like that’s what makes it feel like Alice’s perception isn’t upstream of Bob’s perception. But it feels to me like this imagination ought to track subjective possibility, even if in fact it is probably logically necessary that Alice perceives correctly / Bob reports correctly / whatever.
    So I do feel like there’s a case to be made that it captures everything we should care about with respect to causality.
    For example, it seems unlikely to me that decision theory should depend on what happens in obviously impossible worlds. If we want to depend on impossible worlds it seems like it will usually happen by introducing a more naive epistemic state from which those worlds are subjectively possible—in which case we can talk about the FFS definition with respect to that epistemic state.
    (I have no idea if this perspective is endorsed by Scott or if it would stand up to scrutiny.)
    - Scott Garrabrant 25 May 2021 16:27 UTC
      LW: 10 AF: 5
      AF Parent
      I think I (at least locally) endorse this view, and I think it is also a good pointer to what seems to me to be the largest crux between my theory of time and Pearl’s theory of time.
    - cousin_it 26 May 2021 9:14 UTC
      LW: 8 AF: 4
      AF Parent
      I feel that interpreting “strictly before” as causality is making me more confused.
      
      For example, here’s a scenario with a randomly changed message. Bob peeks at ten regular envelopes and a special envelope that gives him a random boolean. Then Bob tells Alice the contents of either the first three envelopes or the second three, depending on the boolean. Now Alice’s knowledge depends on six out of ten regular envelopes and the special one, so it’s still “strictly before” Bob’s knowledge. And since Alice’s knowledge can be computed from Bob’s knowledge but not vice versa, in FFS terms that means the “cause” can be (and in fact is) computed from the “effect”, but not vice versa. My causal intuition is just blinking at all this.
      
      Here’s another scenario. Alice gets three regular envelopes and accurately reports their contents to Bob, and a special envelope that she keeps to herself. Then Bob peeks at seven more envelopes. Now Alice’s knowledge isn’t “before” Bob’s, but if later Alice predictably forgets the contents of her special envelope, her knowledge becomes “before” Bob’s. Even though the special envelope had no effect on the information Alice gave to Bob, didn’t affect the causal arrow in any possible world. And if we insist that FFS=causality, then by forgetting the envelope, Alice travels back in time to become the cause of Bob’s knowledge in the past. That’s pretty exotic.
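Both scenarios can be checked mechanically. The sketch below assumes binary envelopes, treats each person's knowledge as a function of the eleven envelope contents, and picks an arbitrary convention for which value of the special bit selects which three envelopes. All names are hypothetical.

```python
from itertools import product


def history(f, domains):
    """Set of coordinates whose value can change f's output."""
    relevant = set()
    for i, dom in enumerate(domains):
        for point in product(*domains):
            base = f(point)
            if any(f(point[:i] + (alt,) + point[i + 1:]) != base for alt in dom):
                relevant.add(i)
                break
    return relevant


# Coordinates 0..9 are the ten regular envelopes; coordinate 10 is the special one.
DOMAINS = [(0, 1)] * 11
SPECIAL = 10


# Scenario 1: Bob sees everything, then sends envelopes 0-2 or 3-5
# depending on the special bit (which value selects which is an assumption).
def bob_sees_all(world):
    return world


def alice_random_message(world):
    return world[0:3] if world[SPECIAL] else world[3:6]


# Scenario 2: Alice holds envelopes 0-2 plus the special envelope;
# Bob ends up knowing the ten regular envelopes.
def bob_ten_regular(world):
    return world[:10]


def alice_with_secret(world):
    return world[0:3] + (world[SPECIAL],)


def alice_after_forgetting(world):
    return world[0:3]
```

In scenario 1 Alice's history is six regular envelopes plus the special one, a strict subset of Bob's. In scenario 2 Alice's history only becomes a strict subset of Bob's once the special envelope is dropped from her knowledge.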
      - Scott Garrabrant 26 May 2021 17:50 UTC
        LW: 4 AF: 3
        AF Parent
        I partially agree, which is partially why I am saying time rather than causality.
        I still feel like there is an ontological disagreement in that it feels like you are objecting to saying the physical thing that is Alice’s knowledge is (not) before the physical thing that is Bob’s knowledge.
        In my ontology:
        1) the information content of Alice’s knowledge is before the information content of Bob’s knowledge. (I am curious if this part is controversial.)
        and then,
        2) there is in some sense no more to say about the physical thing that is e.g. Alice’s knowledge beyond the information content.
        So, I am not just saying Alice is before Bob, I am also saying e.g. Alice is before Alice+Bob, and I can’t disentangle these statements because Alice+Bob=Bob.
        I am not sure what to say about the second example. I am somewhat rejecting the dynamics. “Alice travels back in time” is another way of saying that the high level FFS time disagrees with the standard physical time, which is true. The “high level” here is pointing to the fact that we are only looking at the part of Alice’s brain that is about the envelopes, and thus talking about coarser variables than e.g. Alice’s entire brain state in physical time. And if we are in the ontology where we are only looking at the information content, taking a high level version of a variable is the kind of thing that can change its temporal properties, since you get an entirely new variable.
        I suspect most of the disagreement is in the sort of “variable nonrealism” of reducing the physical thing that is Alice’s knowledge to its information content?
        cousin_it 26 May 2021 18:37 UTC
        LW: 9 AF: 4
        AF Parent
        Not sure we disagree, maybe I’m just confused. In the post you show that if X is orthogonal to X XOR Y, then X is before Y, so you can “infer a temporal relationship” that Pearl can’t. I’m trying to understand the meaning of the thing you’re inferring—“X is before Y”. In my example above, Bob tells Alice a lossy function of his knowledge, and Alice ends up with knowledge that is “before” Bob’s. So in this case the “before” relationship doesn’t agree with time, causality, or what can be computed from what. But then what conclusions can a scientist make from an inferred “before” relationship?
        Scott Garrabrant 26 May 2021 22:00 UTC
        LW: 8 AF: 5
        AF Parent
        I don’t have a great answer, which isn’t a great sign.
        I think the scientist can infer things like “Algorithms reasoning about the situation are more likely to know X but not Y than they are to know Y but not X, because reasonable processes for learning Y tend to learn enough information to determine X, but then forget some of that information.” But why should I think of that as time?
        I think the scientist can infer things like “If I were able to factor the world into variables, and draw a DAG (without determinism) that is consistent with the distribution with no spurious independencies (including in deterministic functions of the variables), and X and Y happen to be variables in that DAG, then there will be a path from X to Y.”
        The scientist can infer that if Z is orthogonal to Y, then Z is also orthogonal to X, where this is important because Z is orthogonal to Y can be thought of as saying that Z is useless for learning about Y. (and importantly a version of useless for learning that is closed under common refinement, so if you collect a bunch of different Z orthogonal to Y, you can safely combine them, and the combination will be orthogonal to Y.)
        This doesn’t seem to get at why we want to call it before. Hmm.
        Maybe I should just list a bunch of reasons why it feels like time to me (in no particular order):
        It seems like it gets a very reasonable answer in the Game of Life example
        Prior to this theory, I thought that it made sense to think of time as a closure property on orthogonality, and this definition of time is exactly that closure property on orthogonality, where X is weakly before Y if whenever Z is orthogonal to Y, Z is also orthogonal to X. (where the definition of orthogonality is justified with the fundamental theorem.)
        If Y is a refinement of X, then Y cannot be strictly before X. (I notice that I don’t have a thing to say about why this feels like time to me, and indeed it feels like it is in direct opposition to your “doesn’t agree with what can be computed from what,” but it does agree with the way I feel like I want to intuitively describe time in the stories told in the “Saving Time” post.) (I guess one thing I can say is that as an agent learns over time, we think of the agent as collecting information, so later=more information makes sense.)
        History looks a lot like a non-quantitative version of entropy, where instead of thinking of how much randomness goes into a variable, we think of which randomness goes into the variable. There are lemmas towards proving the semigraphoid axioms which look like theorems about entropy modified to replace sums/expectations with unions. Then, “after” exactly corresponds to “greater entropy” in this analogy.
        If I imagine X and Z being computed independently, and Y as being computed from X and Z, it will say that X is before Y, which feels right to me (and indeed this property is basically the definition. It seems like my time is maybe the unique thing that gets the right answer on this simple story and also treats variables with the same info content as the same.)
        We can convert a Pearlian DAG to an FFS, and under this conversion, d-separation is sent to conditional orthogonality, and paths between nodes are sent to time. (This holds on the questions Pearl knows how to ask; we also generalize the definition to all variables.)
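The closure-property reading of time mentioned in this list can be sanity-checked at the level of histories. This sketch assumes, following the FFS post and not anything spelled out in this thread, that two variables are orthogonal exactly when their histories are disjoint and that X is weakly before Y when the history of X is a subset of the history of Y. Histories are modeled as plain sets of factor indices, and every subset is treated as a possible history; all names are hypothetical.

```python
from itertools import combinations

FACTORS = (0, 1, 2, 3)


def all_histories(factors):
    """Every subset of the factors, each treated as a candidate history."""
    return [
        frozenset(subset)
        for size in range(len(factors) + 1)
        for subset in combinations(factors, size)
    ]


def orthogonal(h1, h2):
    """Assumed definition: orthogonal means disjoint histories."""
    return h1.isdisjoint(h2)


def weakly_before(hx, hy):
    """Assumed definition: X is weakly before Y when h(X) is a subset of h(Y)."""
    return hx <= hy


def closure_before(hx, hy, candidates):
    """Whenever Z is orthogonal to Y, Z is also orthogonal to X."""
    return all(orthogonal(hz, hx) for hz in candidates if orthogonal(hz, hy))


def definitions_agree(factors):
    """Check that the two notions of 'weakly before' coincide on all pairs."""
    histories = all_histories(factors)
    return all(
        weakly_before(hx, hy) == closure_before(hx, hy, histories)
        for hx in histories
        for hy in histories
    )
```

On this toy model `definitions_agree(FACTORS)` holds: if some factor is in the history of X but not of Y, that factor alone is a Z orthogonal to Y but not to X.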
        cousin_it 26 May 2021 22:40 UTC
        LW: 4 AF: 2
        AF Parent
        Thanks for the response! Part of my confusion went away, but some still remains.
        
        In the game of life example, couldn’t there be another factorization where a later step is “before” an earlier one? (Because the game is non-reversible and later steps contain less and less information.) And if we replace it with a reversible game, don’t we run into the problem that the final state is just as good a factorization as the initial?
        Scott Garrabrant 27 May 2021 0:38 UTC
        LW: 4 AF: 3
        AF Parent
        Yep, there is an obnoxious number of factorizations of a large game of life computation, and they all give different definitions of “before.”
        cousin_it 27 May 2021 10:00 UTC
        LW: 2 AF: 1
        AF Parent
        I think your argument about entropy might have the same problem. Since classical physics is reversible, if we build something like a heat engine in your model, all randomness will be already contained in the initial state. Total “entropy” will stay constant, instead of growing as it’s supposed to, and the final state will be just as good a factorization as the initial. Usually in physics you get time (and I suspect also causality) by pointing to a low probability macrostate and saying “this is the start”, but your model doesn’t talk about macrostates yet, so I’m not sure how much it can capture time or causality.
        
        That said, I really like how your model talks only about information, without postulating any magical arrows. Maybe it has a natural way to recover macrostates, and from them, time?
        Scott Garrabrant 27 May 2021 16:43 UTC
        LW: 2 AF: 2
        AF Parent
        Wait, I misunderstood, I was just thinking about the game of life combinatorially, and I think you were thinking about temporal inference from statistics. The reversible cellular automaton story is a lot nicer than you’d think.
        If you take a general reversible cellular automaton (Critters, for concreteness) and have a distribution over computations in general position in which the initial-condition cells are independent, the cells may not be independent at future time steps.
        If all of the initial probabilities are ¹⁄₂, you will stay in the uniform distribution, but if the probabilities are in general position, things will change, and time 0 will be special because of the independence between cells.
        There will be other events at later times that will be independent, but those later time events will just represent “what was the state at time 0.”
        For a concrete example consider the reversible cellular automaton that just has 2 cells, X and Y, and each time step it keeps X constant and replaces Y with X xor Y.
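The two-cell example can be worked through exactly. This is a minimal sketch, assuming independent biased bits at time 0 and using exact fractions; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def initial_joint(p_x, p_y):
    """Joint distribution of independent bits with P(X=1)=p_x and P(Y=1)=p_y."""
    px = {0: 1 - p_x, 1: p_x}
    py = {0: 1 - p_y, 1: p_y}
    return {(x, y): px[x] * py[y] for x, y in product((0, 1), repeat=2)}


def step(joint):
    """One time step: X is kept, Y is replaced with X xor Y."""
    out = {}
    for (x, y), prob in joint.items():
        key = (x, x ^ y)
        out[key] = out.get(key, 0) + prob
    return out


def cells_independent(joint):
    """True iff the joint distribution equals the product of its marginals."""
    p_x1 = sum(p for (x, _), p in joint.items() if x == 1)
    p_y1 = sum(p for (_, y), p in joint.items() if y == 1)
    marg_x = {0: 1 - p_x1, 1: p_x1}
    marg_y = {0: 1 - p_y1, 1: p_y1}
    return all(
        joint.get((x, y), 0) == marg_x[x] * marg_y[y]
        for x, y in product((0, 1), repeat=2)
    )


generic = initial_joint(Fraction(1, 3), Fraction(1, 4))
uniform = initial_joint(Fraction(1, 2), Fraction(1, 2))
```

With the generic probabilities the two cells are independent at time 0 but not at time 1. This particular automaton has period 2, so independence returns at time 2, which is just the time-0 state again. With both probabilities at ¹⁄₂ the distribution stays uniform and no time step is distinguished.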
        cousin_it 27 May 2021 18:07 UTC
        LW: 2 AF: 1
        AF Parent
        Wait, can you describe the temporal inference in more detail? Maybe that’s where I’m confused. I’m imagining something like this:
        
        Check which variables look uncorrelated
        
        Assume they are orthogonal
        
        From that orthogonality database, prove “before” relationships
        
        Which runs into the problem that if you let a thermodynamical system run for a long time, it becomes a “soup” where nothing is obviously correlated to anything else. Basically the final state would say “hey, I contain a whole lot of orthogonal variables!” and that would stop you from proving any reasonable “before” relationships. What am I missing?
        Scott Garrabrant 27 May 2021 18:44 UTC
        LW: 2 AF: 2
        AF Parent
        I think that you are pointing out that you might get a bunch of false positives in your step 1 after you let a thermodynamical system run for a long time, but they are only approximate false positives.
        Scott Garrabrant 27 May 2021 16:51 UTC
        LW: 2 AF: 2
        AF Parent
        I think my model has macro states. In game of life, if you take the entire grid at time t, that will have full history regardless of t. It is only when you look at the macro states (individual cells) that my time increases with game of life time.
        Scott Garrabrant 27 May 2021 16:49 UTC
        LW: 2 AF: 2
        AF Parent
        As for entropy, here is a cute observation (with unclear connection to my framework): whenever you take two independent coin flips (with probabilities not 0, 1, or ¹⁄₂), their xor will always have higher entropy than either of the individual coin flips.
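A quick numerical check of this observation, as a sketch. If the flips have heads-probabilities p and q, the xor is 1 with probability r = p(1 - q) + q(1 - p), and one can verify that r - 1/2 = -2(p - 1/2)(q - 1/2). So r is strictly closer to 1/2 than either p or q whenever neither is 0, 1, or 1/2, and binary entropy grows as the bias moves toward 1/2. The function names below are hypothetical.

```python
from math import log2


def binary_entropy(p):
    """Shannon entropy in bits of a coin with heads-probability p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)


def xor_probability(p, q):
    """P(A xor B = 1) for independent coins with P(A=1)=p and P(B=1)=q."""
    return p * (1 - q) + q * (1 - p)


def xor_has_higher_entropy(p, q):
    """True iff the xor has strictly more entropy than each individual flip."""
    h_xor = binary_entropy(xor_probability(p, q))
    return h_xor > binary_entropy(p) and h_xor > binary_entropy(q)


# Grid of probabilities that avoids 0, 1, and 1/2.
GRID = [i / 20 for i in range(1, 20) if i != 10]
HOLDS_ON_GRID = all(xor_has_higher_entropy(p, q) for p in GRID for q in GRID)
```

The excluded cases are exactly where the inequality stops being strict: a deterministic flip leaves the other flip's entropy unchanged, and a fair flip already has the maximum of one bit.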
- Scott Garrabrant 24 May 2021 19:19 UTC
  LW: 6 AF: 4
  AF Parent
  Thanks Paul, this seems really helpful.
  As for the name, I feel like “FFS” is a good name for the analog of “DAG”, which also doesn’t communicate that much of the intuition, but maybe doesn’t make as much sense for the name of the framework.
  - Scott Garrabrant 24 May 2021 19:30 UTC
    LW: 30 AF: 11
    AF Parent
    I was originally using the name Time Cube, but my internal PR center wouldn’t let me follow through with that :)
    - gwillen 25 May 2021 0:42 UTC
      6 points
      Parent
      That sounds like the right choice, but a part of me is incredibly disappointed that you didn’t go for it.
  - paulfchristiano 24 May 2021 21:12 UTC
    LW: 9 AF: 4
    AF Parent
    I think FFS makes sense as an analog of DAG, and it seems reasonable to think of the normal model as DAG time and this model as FFS time. I think the name made me a bit confused by calling attention to one particular diff between this model and Pearl (factored sets vs variables), whereas I actually feel like that diff was basically a red herring and it would have been fastest to understand if the presentation had gone in the opposite direction by de-emphasizing that diff (e.g. by presenting the framework with variables instead of factors).
    That said, even the DAG/FFS analogy still feels a bit off to me (with the caveat that I may still not have a clear picture / don’t have great aesthetic intuitions about the domain).
    Factorization seems analogous to describing a world as a set of variables (and to the extent it’s not analogous it seems like an aesthetic difference about whether to take the world or variables as fundamental, rather than a substantive difference in the formalism) rather than to the DAG that relates the variables.
    The structural changes seem more like (i) replacing a DAG with a bipartite graph, (ii) allowing arrows to be deterministic (I don’t know how typically this is done in causal models). And then those structural changes lead to generalizing the usual concepts about causality so that they remain meaningful in this setting.
    All that said, I’m terrible at both naming things and presenting ideas, and so don’t want to make much of a bid for changes in either department.
    - Scott Garrabrant 24 May 2021 21:49 UTC
      LW: 10 AF: 5
      AF Parent
      Makes sense. I think a bit of my naming and presentation was biased by being so surprised by the not-on-OEIS fact.
      
      I think I disagree about the bipartite graph thing. I think it only feels more natural when comparing to Pearl. The talk frames everything in comparison to Pearl, but if you are not looking at Pearl, I think graphs don’t feel like the right representation here. Comparing to Pearl is obviously super important, and maybe the first introduction should just be about the path from Pearl to FFS, but once we are working within the FFS ontology, graphs feel not useful. One crux might be about how I am excited for directions that are not temporal inference from statistical data.
      
      My guess is that if I were putting a lot of work into a very long introduction for e.g. the structure learning community, I might start the way you are emphasizing, but then eventually convert to throwing all the graphs away.
      
      (The paper draft I have basically only ever mentions Pearl/graphs for motivation at the beginning and in the applications section.)
      - paulfchristiano 24 May 2021 23:09 UTC
        LW: 4 AF: 2
        AF Parent
        I agree that bipartite graphs are only a natural way of thinking about it if you are starting from Pearl. I’m not sure anything in the framework is really properly analogous to the DAG in a causal model.
        Koen.Holtman 25 May 2021 12:39 UTC
        LW: 5 AF: 3
        AF Parent
        My thoughts on naming this finite factored sets: I agree with Paul’s observation that
        
        | Factorization seems analogous to describing a world as a set of variables
        
        By calling this ‘finite factored sets’, you are emphasizing the process of coming up with individual random variables, the variables that end up being the (names of the) nodes in a causal graph. With $s \in S$ representing the entire observable 4D history of a world (like a computation starting from a single game of life board state), a factorisation $B = \{b_{1}, b_{2}, \dots, b_{n}\}$ splits such $s$ into a tuple of separate, more basic observables $(bb_{1}, bb_{2}, \dots, bb_{n})$, where $bb_{1} \in b_{1}$, etc. In the normal narrative that explains Pearl causal graphs, this splitting of the world into smaller observables is not emphasized. Also, the splitting does not necessarily need to be a bijection. It may lose descriptive information with respect to $s$.
        
        So I see the name ‘finite factored sets’ as a way to draw attention to this splitting step: it highlights the fact that if you split things differently, you may end up with very different causal graphs. Of course, this leaves open the question of whether you really want to name your framework in a way that draws attention to this part of the process. You definitely spend a lot of time on creating an equivalent to the arrows between the nodes too.