Looks like you copied it wrong. Your B only has one 4.

# Scott Garrabrant

# Finite Factored Sets: Orthogonality and Time

# Finite Factored Sets: Introduction and Factorizations

I have not thought much about applying to things other than finite sets. (I looked at infinite sets enough to know there is nontrivial work to be done.) I do think it is good that you are thinking about it, but I don’t have any promises that it will work out.

What I meant when I think that this can be done in a categorical way is that I think I can define a nice symmetric monodical category of finite factored sets such that things like orthogonality can be given nice categorical definitions. (I see why this was a confusing thing to say.)

If I understand correctly, that definition is not the same. In particular, it would say that you can get nontrivial factorizations of a 5 element set: {{{0,1},{2,3,4}},{{0,2,4},{1,3}}}.

When I prove it, I prove and use (a slight notational variation on) these two lemmas.

If , then for all .

.

(These are also the two lemmas that I have said elsewhere in the comments look suspiciously like entropy.)

These are not trivial to prove, but they might help.

I think that you are pointing out that you might get a bunch of false positives in your step 1 after you let a thermodynamical system run for a long time, but they are are only approximate false positives.

I think my model has macro states. In game of life, if you take the entire grid at time t, that will have full history regardless of t. It is only when you look at the macro states (individual cells) that my time increases with game of life time.

As for entropy, here is a cute observation (with unclear connection to my framework): whenever you take two independent coin flips (with probabilities not 0,1, or

^{1}⁄_{2}), their xor will always be high entropy than either of the individual coin flips.

Wait, I misunderstood, I was just thinking about the game of life combinatorially, and I think you were thinking about temporal inference from statistics. The reversible cellular automaton story is a lot nicer than you’d think.

if you take a general reversible cellular automaton (critters for concreteness), and have a distribution over computations in general position in which initial conditions cells are independent, the cells may not be independent at future time steps.

If all of the initial probabilities are

^{1}⁄_{2}, you will stay in the uniform distribution, but if the probabilities are in general position, things will change, and time 0 will be special because of the independence between cells.There will be other events at later times that will be independent, but those later time events will just represent “what was the state at time 0.”

For a concrete example consider the reversible cellular automaton that just has 2 cells, X and Y, and each time step it keeps X constant and replaces Y with X xor Y.

Yep, there is an obnoxious number of factorizations of a large game of life computation, and they all give different definitions of “before.”

I don’t have a great answer, which isn’t a great sign.

I think the scientist can infer things like. “algorithms reasoning about the situation are more likely to know X but not Y than they are to know Y but not X, because reasonable processes for learning Y tend to learn learn enough information to determine X, but then forget some of that information.” But why should I think of that as time?

I think the scientist can infer things like “If I were able to factor the world into variables, and draw a DAG (without determinism) that is consistent with the distribution with no spurious independencies (including in deterministic functions of the variables), and X and Y happen to be variables in that DAG, then there will be a path from X to Y.”

The scientist can infer that if Z is orthogonal to Y, then Z is also orthogonal to X, where this is important because Z is orthogonal to Y can be thought of as saying that Z is useless for learning about Y. (and importantly a version of useless for learning that is closed under common refinement, so if you collect a bunch of different Z orthogonal to Y, you can safely combine them, and the combination will be orthogonal to Y.)

This doesn’t seem to get at why we want to call it before. Hmm.

Maybe I should just list a bunch of reasons why it feels like time to me (in no particular order):

It seems like it gets a very reasonable answer in the Game of Life example

Prior to this theory, I thought that it made sense to think of time as a closure property on orthogonality, and this definition of time is exactly that closure property on orthogonality, where X is weakly before Y if whenever Z is orthogonal to Y, Z is also orthogonal to X. (where the definition of orthogonality is justified with the fundamental theorem.)

If Y is a refinement of X, then Y cannot be strictly before X. (I notice that I don’t have a thing to say about why this feels like time to me, and indeed it feels like it is in direct opposition to your “doesn’t agree with what can be computed from what,” but it does agree with the way I feel like I want to intuitively describe time in the stories told in the “Saving Time” post.) (I guess one thing I can say is that as an agent learns over time, we think of the agent as collecting information, so later=more information makes sense.)

History looks a lot like a non-quantitative version of entropy, where instead of thinking of how much randomness goes into a variable, we think of which randomness goes into the variable. There are lemmas towards proving the semigraphoid axioms which look like theorems about entropy modified to replace sums/expectations with unions. Then, “after” exactly corresponds to “greater entropy” in this analogy.

If I imagine X and Z being computed independently, and Y as being computed from X and Z, it will say that X is before Y, which feels right to me (and indeed this property is basically the definition. It seems like my time is maybe the unique thing that gets the right answer on this simple story and also treats variables with the same info content as the same.)

We can convert a Pearlian DAG to a FFS, and under this conversion, d-seperation is sent to conditional orthogonality, and paths between nodes are sent to time. (on the questions Pearl knows how to ask. We also generalize the definition to all variables)

I partially agree, which is partially why I am saying time rather than causality.

I still feel like there is an ontological disagreement in that it feels like you are objecting to saying the physical thing that is Alice’s knowledge is (not) before the physical thing that is Bob’s knowledge.

In my ontology:

1) the information content of Alice’s knowledge is before the information content of Bob’s knowledge. (I am curios if this part is controversial.)and then,

2) there is in some sense no more to say about the physical thing that is e.g. Alice’s knowledge beyond the information content.

So, I am not just saying Alice is before Bob, I am also saying e.g. Alice is before Alice+Bob, and I can’t disentangle these statements because Alice+Bob=Bob.

I am not sure what to say about the second example. I am somewhat rejecting the dynamics. “Alice travels back in time” is another way of saying that the high level FFS time disagrees with the standard physical time, which is true. The “high level” here is pointing to the fact that we are only looking at the part of Alice’s brain that is about the envelopes, and thus talking about coarser variables than e.g. Alice’s entire brain state in physical time. And if we are in the ontology where we are only looking at the information content, taking a high level version of a variable is the kind of thing that can change its temporal properties, since you get an entirely new variable.

I suspect most of the disagreement is in the sort of “variable nonrealism” of reducing the physical thing that is Alice’s knowledge to its information content?

Now on OEIS: https://oeis.org/A338681

is the event you are conditioning on, so the thing you should expect is that , which does indeed hold.

I think I (at least locally) endorse this view, and I think it is also a good pointer to what seems to me to be the largest crux between the my theory of time and Pearl’s theory of time.

Hmm, I am not sure what to say about the fundamental theorem, because I am not really understanding the confusion. Is there something less motivating about the fundamental theorem, than the analogous theorem about d-seperation being equivalent to conditional independence in all distributions comparable with the DAG?

Maybe this helps? (probably not): I am mostly imagining interacting with only a single distributions in the class, and the claim about independence in all probability distributions comparable with the structure can be replaced with instead independence in a general position probability distribution comparable with the structure.

I am not thinking of it as related to a maximum entropy argument.

The point about SEMs having more structure that I am ignoring is correct. I think that the largest philosophical difference between my framework and Pearlian one is that I am denying realism of anything beyond the “apparently identical.” Another way to think about it is that I am denying realism about there being anything about the variables beyond their information. All of my definitions are only based on the information content of the variables, and so, for example, if you have two variables that are deterministic copies of each other, they will have all the same temporal properties, while in an SEM, they could be different. The surprising thing is that even without intervention data, this variable non-realism allows us to define and infer something that looks a lot like time.

I have a lot of uncertainty about learning algorithms. On the surface, it looks like my structure just has so much to check, and is going to have a hard time being practical, but I could see it going either way. Especially if you imagine it as a minor modification to graph learning, where maybe you don’t consider all re-factorizations, but you do consider e.g. taking a pair of nodes and replacing one if them with the XOR.

Makes sense. I think a bit of my naming and presentation was biased by being so surprised by the not on OEIS fact.

I think I disagree about the bipartite graph thing. I think it only feels more natural when comparing to Pearl. The talk frames everything in comparison to Pearl, but I think if you are not looking at Pearl, I think graphs don’t feel like the right representation here. Comparing to Pearl is obviously super important, and maybe the first introduction should just be about the path from Pearl to FFS, but once we are working within the FFS ontology, graphs feel not useful. One crux might be about how I am excited for directions that are not temporal inference from statistical data.

My guess is that if I were putting a lot of work into a very long introduction for e.g. the structure learning community, I might start the way you are emphasizing, but then eventually convert to throwing all the graphs away.

(The paper draft I have basically only ever mentions Pearl/graphs for motivation at the beginning and in the applications section.)

I was originally using the name Time Cube, but my internal PR center wouldn’t let me follow through with that :)

Sure, if you want to send me an email and propose some times, we could set up a half hour chat. (You could also wait until I post all the math over the next couple weeks.)