It seems like you could do better with a logit model
p = logistic( \sum_i w_i c_i ), that is, logit(p) = log odds(p) = \sum_i w_i c_i.
Are these also called SPRs?
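For concreteness, a minimal numpy sketch of what I mean (the weights and cue values are made up, and the names are mine):

```python
import numpy as np

def logistic(x):
    # logistic is the inverse of logit: logistic(log-odds) = probability
    return 1.0 / (1.0 + np.exp(-x))

w = np.array([1.2, -0.8, 0.5])   # cue weights (illustrative)
c = np.array([1.0, 0.0, 3.0])    # cue values for one case (illustrative)

log_odds = w @ c                 # logit(p) = sum_i w_i c_i
p = logistic(log_odds)           # p = logistic(sum_i w_i c_i)
print(log_odds, p)
```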
I think you could make the first theorem in the post (the simplified fundamental theorem on two variables) easier for a novice to understand if you explicitly clarified that the conclusion diagram Λ′ → Λ → X is the same as Λ′ ← Λ → X by the chain rerooting rule, and perhaps used the latter diagram in the picture, as it more directly conveys the idea of mediation/inducing independence.
I also think this about the redund condition X_1 → X_2 → Λ′ & X_2 → X_1 → Λ′. Until realizing that these diagrams were the same as X_1 ← X_2 → Λ′ and X_2 ← X_1 → Λ′ respectively, the condition seemed mysterious to me, and because you didn't describe them using the same English words, it took me a while to realize that it makes sense if I think of it as X_2 mediating between X_1 and Λ′ (so learning about X_1 doesn't tell me anything new about Λ′) and vice versa.
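Spelled out as conditional independences (my paraphrase of what those two diagrams assert):

$$X_1 \perp \Lambda' \mid X_2 \qquad\text{and}\qquad X_2 \perp \Lambda' \mid X_1,$$

i.e. each of X_1, X_2 screens the other off from Λ′.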
I, too, have had the same objection you have to people who claim that the problem with intransitive preferences is that you can be money pumped: our real objection is just that it'd be really weird to be able to transition through stepwise-preferred states and yet end up in a state dispreferred to the start (this not being a money pump, because the agent can just choose not to do this).
Though, “it’s really weird” is a pretty good objection—it, in fact, would be extremely weird to have intransitive preferences, and so I think it is fine to assume that the “true” (Coherent Extrapolated Volition) preferences of e.g. humans are transitive.
You can use the intuition that a greedy optimizer shouldn’t ever end up worse than it started, even if it isn’t in the best place.
Does "chance relative to null is x%" mean "an observer, given my results, would assign an x% chance to me being calibrated"?
No! P(Test results | Perfect calibration) / P(Test results | Whatever the null is) ≠ P(Perfect Calibration | Test results)!
You can also lodge this as a problem with null hypothesis testing—I would've thought that perfect calibration would be the null. Perhaps the null is a model where you just randomly say a probability from 0 to 100.
I'm assuming that they really calculated a likelihood ratio P(Data|Perfect) / P(Data|Null) instead of the posterior ratio P(Perfect|Data) / P(Null|Data), which is what the words they used would mean if taken literally. But maybe they have some priors P(Perfect) / P(Null) that they used. (The thing they should do is just report the likelihood ratio, instead of their posterior.)
If you have your data and want to compute P(Data|Perfect), you can compute a total product Π_i (p_i if it happened, (1-p_i) if it didn’t)
So for example if I predicted 20%, 70%, 30% and the actual results were No, Yes, Yes, then P(Data|Perfect) = .8 * .7 * .3. If you have some other hypothesis (e.g. whatever their null is), you can compute P(Data|Other Hypothesis) by using the predictions that hypothesis makes for how your reported probabilities relate to propensities of events. A hypothesis here should be a function f(reported) = P(Event happens | reported).
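Here's a minimal Python sketch of that computation, using the example numbers above; since I don't know what null they actually used, I stand in a null where events ignore your reports and just happen with probability 1/2:

```python
import math

predictions = [0.2, 0.7, 0.3]        # the probabilities you reported
outcomes    = [False, True, True]    # whether each event actually happened

def log_likelihood(f):
    """f(reported) = P(event happens | you reported that probability)."""
    total = 0.0
    for p, happened in zip(predictions, outcomes):
        q = f(p)
        total += math.log(q if happened else 1 - q)
    return total

def perfect(p):      # perfectly calibrated: reported probability = true propensity
    return p

def null(p):         # one possible stand-in for the null: events ignore your reports
    return 0.5

# Likelihood ratio P(Data | Perfect) / P(Data | Null) -- report this, not a posterior.
print(math.exp(log_likelihood(perfect) - log_likelihood(null)))
```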
Minor error in the last part of the last image of the post: "Lambda is determined by Lambda" should be "Lambda is determined by Lambda′".
A minor request: can you link “approximate deterministic functions” to some of your posts about them? I know you’ve written on them, and elaborated on them in more detail elsewhere.
A couple questions:
Can you elaborate more on why you care? Preferably, at a level of elaboration/succinctness that’s above the little you’ve said in the post, and below “read the linked posts and my other posts”. I feel that knowing why you care will help with understanding what the theorem really means. I am probably going to read a fair amount of those posts anyways.
Edit: It is clear now why you care; however, I still believe there may be people that would benefit from some elaboration in the post. For anyone wondering, the reason to care is in the first theorem in Natural Latents: The Math. For an easy special case of the ideas, see the extremely easy to understand post on (approximately) deterministic natural latents, and note that by the "Dangly Bits Lemma" the redund condition is implied by the approximately-determined condition.
For the intuition behind what the redund condition means, note that by the rerooting rule the diagram X → Y → R is the same (as in, equivalently constrains the probability distribution) as X ← Y → R, which means that once you know Y, being told X doesn't help you predict R. The redund condition is that + the version with X,Y swapped. This makes it clear why this captures "R is redundantly encoded in X and Y".
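Spelling out the rerooting step, for anyone who wants it:

$$P(X)\,P(Y\mid X)\,P(R\mid Y) \;=\; P(X,Y)\,P(R\mid Y) \;=\; P(Y)\,P(X\mid Y)\,P(R\mid Y),$$

so X → Y → R and X ← Y → R put exactly the same constraint on the joint distribution, and in both, R ⊥ X | Y.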
It sounds like you strongly expect the theorem to be true. How hard do you expect it to be? Does it feel like the sort of thing that can be done using basically the tools you’ve used?
How hard have you tried?
Do you have examples of the theorem? That is, do you have some variables X with some approximate redund Omega and a bunch of redunds Gamma such that [statements] hold? Or, better, do you have some example where you can find a universal redund Omega?
There are some reasons I can think of for why you’d perhaps want to explicitly refuse to answer some of those questions: to prevent anchoring when a blind mind might do better, to not reveal info that might (if the info turns out one way, assuming truthtelling) discourage attempts (like saying it’s super hard and you tried really hard), and to uphold a general policy of not revealing such info (e.g. perhaps you think this theorem is super easy and you haven’t spent 5 minutes on it, but plan to put bounties on other theorems that are harder).
For example, part of my inclination to try this is my sense that, compared to most conjectures, it’s
Incredibly low background—I’ve known the requisite probability/information theory for a long time at this point
A precise, single theorem is asked for. Not as precise as could be, given the freedom of “reasonable”, but compared to stuff like “figure out a theory of how blank works”, it’s concrete and small.
Very little effort has been put into it. No other conjecture that I know of, and know enough to understand, has had what looks like only a single-digit number of people working on it, and for not very long.
It intuitively seems like the sort of theorem that could be proven by some random person on the internet who has the general mathematical ability ("maturity"/"proof ability"—the mostly tacit knowledge) but not a bunch of theoretical knowledge, and who maybe isn't even very smart in terms of raw computing power (though probably has the creativity/that spark of genius that is at the core of discovery and invention, which can be partially traded off against a lower chance of success).
I think you have an indexing typo in the section on the general re-rooting rule.
Your picture has a "linear diagram" (that's not your term, it's mine) rooted at X_i, with variables from X_1 to X_n. You then say that this expresses the factorization P(X) = \left(\prod_{i \le k} P(X_{i-1} \mid X_i)\right) P(X_k) \left(\prod_{k \le i \le n-1} P(X_{i+1} \mid X_i)\right)
This looks to me like the factorization of a linear diagram rooted at X_k, with the index i running over everything to the left and then to the right of X_k.
Secondly, right after that you say that the approximate rerooting theorem states that if the KL-divergence bound for that expression holds for any i, then it holds for all i. This has the same problem—i is a bound variable, so I think you mean that if the bound holds for any k, then it holds for all k.
You should either change the two pictures to have the root variable be labeled X_k instead, or you can change the text in that section (including the mathematical expressions) to swap i with k.
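For concreteness, here is the factorization with the root explicitly labeled X_k (the exact index bounds are my guess at what was intended):

$$P(X) \;=\; \Big(\prod_{i=2}^{k} P(X_{i-1}\mid X_i)\Big)\, P(X_k)\, \Big(\prod_{i=k}^{n-1} P(X_{i+1}\mid X_i)\Big).$$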
There's a minor error in the formula giving the cross entropy: you need a minus sign on the RHS, so that it reads E[- log P[X|M_2] | M_1].
The preceding text is “Of course, we could be wrong about the distribution—we could use a code optimized for a model M2 which is different from the “true” model M1. In this case, the average number of bits used will be”
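For reference, with the minus sign and the expectation taken under the true model M_1 (the usual convention for cross-entropy), the formula unpacks to:

$$E\big[-\log P[X\mid M_2] \;\big|\; M_1\big] \;=\; -\sum_{X} P[X\mid M_1]\,\log P[X\mid M_2].$$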
Certainly, you have pictures! Pictures are great!
I had no clue what SUVAT is and I know a relatively large amount of physics (advanced undergrad to grad level knowledge, currently an undergrad in college but with knowledge well above the curriculum). I feel a bit disgusted at the idea of someone memorizing those equations.
The first few letters are often used as parameters (e.g. p(x) = ax^2 + bx + c).
f is sometimes used for force density, e.g., in fluid mechanics (annoyingly, the wiki page on the Cauchy momentum equation uses f for the acceleration density caused by an external force).
Electrical engineers use j for the imaginary unit, because they will use i for current. I abhor this—why don’t they just capitalize and use I for current?
Fancy L’s are often used for the Lagrangian in analytical mechanics. The universe’s path is the one that is a stationary point (derivative equals 0, so minimum/maximum/saddle point) of the integral of the Lagrangian (denoted S), and analytical mechanics only gets more beautiful from there (it’s part of what got me into physics). Only mentioning this because you mentioned the Laplace transform.
m,l are often used for whole numbers when n is taken. So is k.
n with a hat is often used for the normal vector to some surface. Likewise A hat if you include the magnitude of the area.
P,Q,R,S,T are often used for points (like, in geometry). O is used for the origin/the center, though sometimes I see O just being another point.
u is often used for a velocity when v is taken.
Please don’t use s for speed.
s is often an arc length for a path.
R is often another logical proposition.
z is often used for the z-score in statistics, that is, sigmas away from the mean assuming a normal distribution. Likewise t for the t-test, which uses fancy stuff to better estimate the standard deviation from the sample (and only noticeably differs from z-tests for small samples).
k is often used as a multiplicative coefficient, e.g. Hooke's law (F=kx).
Mathematics often uses X for a space of some kind. There’s also the convention that an upper case letter is a set from which the lower case letter comes from (e.g. take an example x in the space X of possible examples). When a bunch of sets are considered, one often uses some fancy version of the letter (Suppose we have an example-set A in the family of example-sets curlyA such that every example a in A is funny.)
I could make a long list of physical quantities with letter names. I could double the list by allowing Greek letters (including ones used in mathematics). But that wouldn't really be about the connotations of variable names; it would just be a list of things with their variable names.
For a fun puzzle, look into the Monty Hall problem. The usual explanation is bad. Use Bayes Law to figure out a good one. For the answer, along with some extra problems (e.g. the Monty Fall problem, where Monty slips on a banana peel and accidentally opens one of the doors, and the Monty Crawl problem, where our poor host now has to crawl, and thus will prefer to open the lowest-numbered door as long as it doesn't contain the car), see https://probability.ca/jeff/writing/montyfall.pdf
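Here's a minimal Bayes-law sketch (my own toy code, with the door numbering as an assumption: you pick door 1 and the host opens door 3):

```python
from fractions import Fraction

def posterior(p_open3_given_car):
    """p_open3_given_car[c] = P(host opens door 3, showing a goat | car behind door c)."""
    prior = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
    joint = {c: prior[c] * p_open3_given_car[c] for c in prior}
    total = sum(joint.values())
    return {c: j / total for c, j in joint.items()}

# Standard Monty: never opens your door or the car door, picks randomly otherwise.
standard = {1: Fraction(1, 2), 2: Fraction(1, 1), 3: Fraction(0)}
# Monty Fall: the opened door is random among doors 2 and 3, and happened to show a goat.
fall = {1: Fraction(1, 2), 2: Fraction(1, 2), 3: Fraction(0)}

print(posterior(standard))  # {1: 1/3, 2: 2/3, 3: 0} -> switch
print(posterior(fall))      # {1: 1/2, 2: 1/2, 3: 0} -> indifferent
```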
I think you could’ve done better with integration by parts.
In physics, integration by parts is usually applied to a definite integral in which you can neglect the uv term. Thus, integration by parts reads: "the integral of u dv = the integral of -v du; that is, you can trade which factor you differentiate in a product, as long as the product uv has a small integral over the boundary".
Common examples are when you integrate over some big volume, as most physical quantities are very small far away from the stuff.
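In symbols, for the one-dimensional definite-integral case, with the boundary term shown explicitly:

$$\int_a^b u\,\mathrm{d}v \;=\; \big[u v\big]_a^b \;-\; \int_a^b v\,\mathrm{d}u \;\approx\; -\int_a^b v\,\mathrm{d}u \quad \text{when } uv \text{ is negligible at the boundary}.$$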
I also think it's worth including the intuition behind Bayes rule as it's usually interpreted here on LW: it provides the updating rule posterior odds = prior odds * likelihood ratio, and thereby also a formalization of how good evidence is. As for the derivation from P(A|B) defined as equal to P(A and B)/P(B), I think this is best described by saying that P(A|B) is the probability of A once you know B, so you take the mass associated with the worlds where A is true among those where B is true and compare it to your total mass, which is the mass associated with the worlds where B is true. The former is really just "mass of A and B", so you are done.
Now, P(A and B) = P(B)P(A|B), which I think of as “First, take probability B is true, then given that we are in this set of worlds, take the probability that A is true”. Essentially translating from locating sets to probabilities.
From here, Bayes theorem is the simple fact that A and B = B and A. So P(B)P(A|B) = P(A and B) = P(A)P(B|A). If you draw a square with 4 rectangles, where the first row is P(A), the second row is P(-A), the first column is P(B), and the second column is P(-B), and each rectangle represents a possibility like P(A and -B), then this equation just splits the rectangle P(A and B) into (rectangle compared to row) * row = (rectangle compared to column) * column. Divide by P(B) (that is, the column) to get Bayes law.
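To get the odds form mentioned above, write Bayes law for A and for not-A and divide:

$$\frac{P(A\mid B)}{P(\neg A\mid B)} \;=\; \frac{P(A)}{P(\neg A)} \cdot \frac{P(B\mid A)}{P(B\mid \neg A)},$$

i.e. posterior odds = prior odds * likelihood ratio.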
For the sine rule, I think it also helps to show that the fraction a/sin(A) (a side over the sine of its opposite angle) is the diameter of the circumcircle. Wikipedia has good pictures.
For an extra math fact that totally doesn't need to be in the post, it is interesting that for spherical triangles, the law of sines just needs to be modified so that you take the sine of the lengths as well. In fact you can do something similar in hyperbolic space (by using sinh), and there's a Taylor series form involving the curvature for a version of sine that makes the law of sines still true in any constant-curvature space. (You can find this on the same wiki page.)
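Concretely, the Euclidean case (with circumradius R) and the spherical case (with sides measured as arc lengths) are:

$$\frac{a}{\sin A} = \frac{b}{\sin B} = \frac{c}{\sin C} = 2R, \qquad \frac{\sin a}{\sin A} = \frac{\sin b}{\sin B} = \frac{\sin c}{\sin C},$$

and the hyperbolic version is the spherical one with sinh in place of the outer sine.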
Great explanation! I was linked here by someone after wondering why linear regression was asymmetric. While a quick Google search and ChatGPT could tell me that they are minimizing different things, the advantages of your post are:
Pictures
Explanation of why minimizing different things will get you slopes differing in this specific way (that is, far outliers are punished heavily)
A connection to PCA that is nice and simply explained.
Thanks!
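To make point 2 concrete, here's a minimal numpy sketch (my own toy data, not from the post) showing the two slopes coming apart:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.5 * x + rng.normal(size=500)       # noisy linear relationship

cov = np.cov(x, y)[0, 1]
slope_y_on_x = cov / np.var(x, ddof=1)   # regress y on x: minimizes vertical errors
slope_x_on_y = cov / np.var(y, ddof=1)   # regress x on y: minimizes horizontal errors

print(slope_y_on_x)       # close to 0.5
print(1 / slope_x_on_y)   # noticeably steeper, because the noise in y gets "blamed" on x
```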
For a treatment besides Tamiflu: https://en.wikipedia.org/wiki/2009_swine_flu_pandemic cites the WHO and CDC stating that H1N1 developed resistance to Tamiflu but not Relenza:
In December 2012, the World Health Organization (WHO) reported 314 samples of the 2009 pandemic H1N1 flu tested worldwide have shown resistance to oseltamivir (Tamiflu).[172] It is not totally unexpected as 99.6% of the seasonal H1N1 flu strains tested have developed resistance to oseltamivir.[173] No circulating flu has yet shown any resistance to zanamivir (Relenza), the other available anti-viral.[174]
The treatment plan at the time included Tamiflu, Relenza, and an experimental third drug, peramivir (FDA approved for flu treatment in adults since 2014):
If oseltamivir (Tamiflu) is unavailable or cannot be used, zanamivir (Relenza) is recommended as a substitute.[50][168] Peramivir is an experimental antiviral drug approved for hospitalised patients in cases where the other available methods of treatment are ineffective or unavailable.[169]
I think 2009 H1N1 is a good example of how things could go, as it happened in the modern day.
I often find illustrative explanations like these either obvious or useless. But this was amazing! Those Venn diagrams really are an extremely simple and intuitive and beautiful way to see Shapley values!
I think it makes sense to include the podcasts that aren’t currently updating—for example, Rationally Speaking’s old episodes. Affix needs a new link or an archived version, as the episodes are not listed at the current link, and I’m too lazy to track down the episodes.
I meant in terms of the way people use the word "SPR"—of course, if a linear model performs better than experts, then I would expect a linear model for the logit to as well, and if it doesn't, that doesn't change the point of the argument, because you can just use the linear model.