Petrov Day thought: there’s this narrative around Petrov where one guy basically had the choice to nuke or not, and decided not to despite all the flashing red lights. But I wonder… was this one of those situations where everyone knew what had to be done (i.e. “don’t nuke”), but whoever caused the nukes to not fly was going to get demoted, so there was a game of hot potato and the loser was the one forced to “decide” to not nuke? Some facts possibly relevant here:
Petrov’s choice wasn’t actually over whether or not to fire the nukes; it was over whether or not to pass the alert up the chain of command.
Petrov himself was responsible for the design of those warning systems.
… so it sounds like Petrov was ~ the lowest-ranking person with a de-facto veto on the nuke/don’t nuke decision.
Petrov was in fact demoted afterwards.
There was another near-miss during the Cuban missile crisis, when three people on a Soviet sub had to agree to launch. There again, it was only the lowest-ranked who vetoed the launch. (It was the second-in-command; the captain and political officer both favored a launch—at least officially.)
This was the Soviet Union; supposedly (?) this sort of hot potato happened all the time.
That is exactly correct.
One thing I am wondering is how you think about time vs. space in your analysis. If we think of all of physics as a very nonlinear dynamical system, then how do you move from that to these large causal networks you are drawing?
The equations of physics are generally local in both time and space. That actually makes causal networks a much more natural representation, in some ways, than nonlinear dynamical systems; dynamical systems don’t really have a built-in notion of spatial locality, whereas causal networks do. Indeed, one way to view causal networks is as the bare-minimum model in which we have both space-like and time-like interactions. So I don’t generally think about moving from dynamical systems to causal networks; I think about starting from causal networks.
This also fits well with how science works in a high-dimensional world like ours. Scientists don’t look at the state of the whole universe and try to figure out how that evolves into the next state. Rather, they look at spatially-localized chunks of the universe, and try to find sets of mediators which make the behavior of that chunk of the universe “reproducible”—i.e. independent of what’s going on elsewhere. These are Markov blankets.
The main piece which raw causal networks don’t capture is symmetry, e.g. the laws of physics staying the same over time. I usually picture the world in terms of causal networks with symmetry or, equivalently, causal submodels organized like programs.
Do you already have a plan of attack for the experimental testing? By this I mean using X application, or Y programming language, with Z amount of compute.
I will post that information when the time comes. Though probably not very long before the time comes; writing up that sort of code takes a lot less time than all this theory.
Recalling the Macroscopic Prediction paper by Jaynes, am I correct in interpreting this as conceptually replacing the microphenomena/macrophenomena choices with near/far abstractions?
Following in this vein, does the phase-space trick seem to generalize to the abstractions level?
I had not thought of that, but it sounds like a great idea. I’ll have to chew on it some more.
You’re asking the right questions.
The most important difference between this approach and most people thinking about abstraction is that, in this approach, most of the key ideas/results do not explicitly involve an observer. The “info-at-a-distance” is more a property of the universe than of the observer, in exactly the same way that e.g. energy conservation or the second law of thermodynamics are more properties of the universe than of the observer.
Now, it’s still true that we need an observer in order to recognize that energy is conserved or entropy increases or whatever. There’s still an implicit observer in there, writing down the equations and mapping them to physical reality. But that’s true mostly in a philosophical sense, which doesn’t really have much practical bearing on anything; even if some aliens came along with radically different ways of doing physics, we’d still expect energy conservation and entropy increase and whatnot to be embedded in their predictive processes (though possibly implicitly). We’d still expect their physics to either be equivalent to ours, or to make outright wrong predictions (other than at the very small/very big scales where ours is known to be incomplete). We’d even expect a lot of the internal structure to match, since they live in our universe and are therefore subject to similar computational constraints (specifically locality).
Abstraction, I claim, is like that.
On a meta-note, regarding this specifically:
You are using mathematics, a formalized system optimized to be used by humans. And you use math/your intuition to formalize “the perceiving”.
I think there’s a mistake people sometimes make when thinking about how-models-work (which you may or may not be making) that goes something like “well, we humans are representing this chunk-of-the-world using these particular mathematical symbols, but that’s kind of an arbitrary choice, so it doesn’t necessarily tell us anything fundamental which would generalize beyond humans”.
The mistake here is: if we’re able to accurately predict things about the system, then those predictions remain just as true even if they’re represented some other way. In fact, those predictions remain just as true even if they’re not represented at all—i.e. even if there’s no humans around to make them. For instance, energy is still conserved even in parts of the universe which humans have never seen and will never see, and that still constrains the viable architectures of agent-like systems in those parts of the universe.
These are both correct. The first is right in most applications of Markov blankets. The second is relevant mainly in e.g. science, where figuring out the causal structure is part of the problem. In science, we can experimentally test whether M2 mediates the interaction between M1 and M3 (i.e. whether M2 is a Markov blanket between M1 and M3), and then we can back out information about the causal structure from that.
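A minimal sketch of that kind of test, assuming discrete variables and a fully known joint distribution (in a real experiment the joint would have to be estimated from data). It computes the conditional mutual information I(M1; M3 | M2), which is zero exactly when M2 mediates the interaction between M1 and M3. The function names and the example probabilities are made up for illustration.

```python
import math

def chain_joint(p1, t12, t23):
    """Joint P(m1, m2, m3) for a chain M1 -> M2 -> M3 (hypothetical helper)."""
    joint = {}
    for a, pa in enumerate(p1):
        for b, pb in enumerate(t12[a]):
            for c, pc in enumerate(t23[b]):
                joint[(a, b, c)] = pa * pb * pc
    return joint

def conditional_mutual_information(joint):
    """I(M1; M3 | M2) in bits, for a joint given as {(m1, m2, m3): prob}."""
    p2, p12, p23 = {}, {}, {}
    for (a, b, c), p in joint.items():
        p2[b] = p2.get(b, 0.0) + p
        p12[(a, b)] = p12.get((a, b), 0.0) + p
        p23[(b, c)] = p23.get((b, c), 0.0) + p
    total = 0.0
    for (a, b, c), p in joint.items():
        if p > 0:
            total += p * math.log2(p * p2[b] / (p12[(a, b)] * p23[(b, c)]))
    return total

# Case 1: M2 really is a Markov blanket (all interaction flows through it).
mediated = chain_joint(
    p1=[0.5, 0.5],
    t12=[[0.9, 0.1], [0.2, 0.8]],
    t23=[[0.7, 0.3], [0.4, 0.6]],
)

# Case 2: M3 copies M1 directly while M2 is an unrelated coin flip,
# so M2 does not mediate anything.
bypassed = {(a, b, a): 0.25 for a in (0, 1) for b in (0, 1)}

cmi_mediated = conditional_mutual_information(mediated)  # ~0 up to rounding
cmi_bypassed = conditional_mutual_information(bypassed)  # 1 bit leaks around M2
```

A value near zero is consistent with M2 being a Markov blanket between M1 and M3; a clearly positive value means there is some other channel between the two sides.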
The #P-complete problem is to calculate the distribution of some variables in a Bayes net given some other variables in the Bayes net, without any particular restrictions on the net or on the variables chosen.
Formal statement of the Telephone Theorem: We have a sequence of Markov blankets forming a Markov chain M_1 → M_2 → .... Then in the limit n → ∞, f_n(M_n) mediates the interaction between M_1 and M_n (i.e. the distribution factors according to M_1 → f_n(M_n) → M_n), for some f_n satisfying
f_{n+1}(M_{n+1}) = f_n(M_n)
with probability 1 in the limit.
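A toy illustration of the theorem, not taken from the original post: assume each blanket is a pair (c, x), where the bit c is copied exactly from one blanket to the next and the bit x is flipped with some small probability at each step. Here c plays the role of the conserved summary fn(Mn). The sketch computes I(M1; Mn) exactly and shows it decaying toward the one bit carried by c. All names and the flip probability are assumptions for illustration.

```python
import math

def mutual_information(joint):
    """I(X; Y) in bits for a joint given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )

def info_at_distance(n, flip=0.1):
    """I(M1; Mn) when each M = (c, x): c is copied exactly, x flips w.p. `flip`."""
    # Probability that the noisy bit has the same value after n - 1 steps
    # of a binary symmetric channel.
    same = 0.5 * (1 + (1 - 2 * flip) ** (n - 1))
    joint = {}
    for c in (0, 1):
        for x1 in (0, 1):
            for xn in (0, 1):
                p_x = same if x1 == xn else 1 - same
                # M1 = (c, x1) is uniform over its four states.
                joint[((c, x1), (c, xn))] = 0.25 * p_x
    return mutual_information(joint)

# Information about M1 available at increasing distances along the chain.
profile = [info_at_distance(n) for n in (1, 2, 5, 10, 50, 200)]
```

In this toy chain the profile starts at 2 bits (both c and x are known) and falls toward 1 bit: far away, the only information left about M1 is the part that is conserved at every step.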
You’re basically correct. The substantive part is that, if I say “M2 is a Markov blanket separating M1 from M3”, then I’m claiming that M2 is a comprehensive list of all the “ways in which M1 and M3 are not independent”. If we have a Markov blanket, then we know exactly “which channels” the two sides can interact through; we can rule out any other interactions.
I mean, the argument does kinda rely on someone else having written it better, which does not often happen when “better” is comparing to Scott.
Good question! Rough argument: if someone else has already written it better, then do your readers a favor and promote that to them instead.
Obviously this is an imperfect argument—for instance, writing is a costly signal that you consider a topic important, and it’s also a way to clarify your own thoughts or promote your own brand. So Pareto optimality isn’t necessarily relevant to things I’m writing for my own benefit (as opposed to readers’), and it’s not relevant when the writing is mostly a costly signal of importance aimed at my social circle. Also, even if we accept the argument, then Pareto optimality is only a necessary condition for net value, not a sufficient condition; plenty of things are on some Pareto frontier but still not worth reading for anyone.
I’ve seen that graph (of what percentage of couples met in various ways) a few times now, and what I really want to know is: why do several different channels all plateau at the same levels? E.g. bar/restaurant, coworkers, and online all seem to plateau just below 20% for a while. Church, neighbors, and college all seem to hang out around 8% for a while. What’s up with that?
Yeah, I didn’t want to spend a paragraph on definitions which nobody would be able to keep straight anyway. “False positive” and “false negative” are just very easy-to-confuse terms in general. That’s why I switched to “duds” and “missed opportunities” in the sales funnel section.
Fixed, thank you.
More like: exponential family distributions are a universal property of information-at-a-distance in large complex systems. So, we can use exponential models without any loss of generality when working with information-at-a-distance in large complex systems.
That’s what I hope to show, anyway.
Yup, that’s the direction I want. If the distributions are exponential family, then that dramatically narrows down the space of distributions which need to be represented in order to represent abstractions in general. That means much simpler data structures—e.g. feature functions and Lagrange multipliers, rather than whole distributions.
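To make the data-structure point concrete, here is a minimal sketch assuming a finite state space. A distribution of the form p(x) ∝ exp(Σ λ_i f_i(x)) is fully specified by its feature functions and multipliers, so only those need to be stored; the full table can be regenerated on demand. The function name, the choice of feature, and the multiplier value are all hypothetical.

```python
import itertools
import math

def exp_family_pmf(features, multipliers, states):
    """Return {state: prob} with prob proportional to exp(sum_i lambda_i * f_i(state))."""
    weights = [
        math.exp(sum(lam * f(x) for lam, f in zip(multipliers, features)))
        for x in states
    ]
    z = sum(weights)  # normalizing constant (partition function)
    return {x: w / z for x, w in zip(states, weights)}

# Hypothetical example: 10 binary variables, one feature (the count of ones).
states = list(itertools.product((0, 1), repeat=10))
features = [sum]      # f(x) = number of ones in x
multipliers = [0.5]   # one Lagrange multiplier for that feature

# The explicit table has 2**10 entries, but it was generated from
# a single feature function and a single number.
pmf = exp_family_pmf(features, multipliers, states)
```

The explicit table grows exponentially with the number of variables, while the (features, multipliers) representation grows only with the number of features.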
Roughly speaking, the generalized KPD says that if the long-range correlations are low dimensional, then the whole distribution is exponential family (modulo a few “exceptional” variables). The theorem doesn’t rule out the possibility of high-dimensional correlations, but it narrows down the possible forms a lot if we can rule out high-dimensional correlations some other way. That’s what I’m hoping for: some simple/common conditions which limit the dimension of the long-range correlations, so that gKPD can apply.
This post says that those long range correlations have to be mediated by deterministic constraints, so if the dimension of the deterministic constraints is low, then that’s one potential route. Another potential route is some kind of information network flow approach—i.e. if lots of information is conserved along one “direction”, then that should limit information flow along “orthogonal directions”, which would mean that long-range correlations are limited between “most” local chunks of the graph.