Right side of equation 2. Also the v update step in algorithm 1 should have a negative sign (the text version earlier on the same page has it right).
Thanks for sharing!
Two comments:
There seem to be a couple of sign errors in the manuscript. (Probably worth reaching out to the authors directly)
Their predictive coding algorithm holds the vhat values fixed during convergence, which actually implies a somewhat different network topology than the more traditional one shown in your figure.
Do you have some source for saying the log scoring rule should only be used when no anthropics are involved? Without that, what does it even mean to have a well-calibrated belief?
(BTW, there are other nice features of using the log-scoring rule, such as rewarding models that minimize their cross-entropy with the territory).
My argument is that the log scoring rule is not just a “given way of measuring outcomes”. A belief that maximizes E(log(p)) is the definition of a proper Bayesian belief. There’s no appeal to consequence other than “SB’s beliefs are well calibrated”.
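To spell that out: if the true frequency of heads is q and SB reports credence p, her expected log score is q*log(p) + (1-q)*log(1-p), which is maximized exactly at p = q (setting the derivative q/p - (1-q)/(1-p) to zero). So “maximizes the expected log score” and “well calibrated” pick out the same belief.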
Redissolving sleeping beauty (and maybe solving it entirely)
[epistemic status—I’m new to thinking about anthropics, but I don’t see any obvious flaws]
“If a tree falls on sleeping beauty” famously claims to have dissolved the Sleeping Beauty problem—that SB’s correct answer just depended on the reward structure for her answers, and that her actual credence didn’t matter.
Several lesswrongers seem unsatisfied with that answer—understandably, given a longstanding commitment to epistemics and Bayesianism!
I would argue that ata did some key work in answering the problem from a purely epistemic perspective.
Recall the question SB is to be asked upon waking:
Each interview consists of one question, “What is your credence now for the proposition that our coin landed heads?”
And one of the bets ata formulated:
Each interview consists of one question, “What is your credence now for the proposition that our coin landed heads?”, and the answer given will be scored according to a logarithmic scoring rule, with the aggregate result corresponding to the number of utilons (converted to dollars, let’s say) she will be penalized after the experiment.
These questions are actually equivalent! A properly calibrated belief is one that is optimal with respect to the logarithmic scoring rule.
ata goes on to show that the answer to that question is 1⁄3. This result, I think, is actually contingent on the meaning of ‘aggregate’. If ‘aggregate’ just means ‘sum over all predictions ever’, then ata’s math checks out, the thirders are right, and the problem is solved.
However, given the premise of SB—in case of tails, she forgets everything that happened on Monday—you could argue for ‘aggregate’ meaning ‘sum over all predictions she remembers making’, in which case the correct answer is one half. Or, if we include the log score for predictions that she was told she made (say, because the interviewers wrote them down and told her afterwards), then the answer becomes 1⁄3 again!
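For concreteness, here is a minimal sketch (my own, not from ata’s post) that numerically recovers both answers, assuming the standard setup of one interview on heads and two on tails, with the Monday interview forgotten in the tails case:

    # Which credence p maximizes the expected log score under each reading of 'aggregate'?
    import numpy as np

    def expected_score(p, tails_predictions):
        # tails_predictions = number of scored predictions in the tails world
        return 0.5 * np.log(p) + 0.5 * tails_predictions * np.log(1 - p)

    ps = np.linspace(0.001, 0.999, 999)
    for label, n in [("sum over all predictions ever", 2),
                     ("sum over remembered predictions", 1)]:
        best = ps[np.argmax(expected_score(ps, n))]
        print(f"{label}: optimal credence ~ {best:.3f}")
    # prints ~0.333 for the first aggregation and ~0.500 for the second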
So the SB paradox boils down to what you, as an epistemic rationalist, consider the correct way to aggregate the entropy of predictions!
The ‘sum over all predictions’ seems best to me (and thus I suppose I lean to the 1⁄3 answer), but I don’t have a definitive reason as to why.
samshap’s Shortform
Suppose that you want to move to Hawaii because it’s so beautiful, but you know (because you saw something on the internet) that upon arrival, someone will rob you. If knowing this information, you still move to Hawaii, does this mean that you are consenting to being robbed? Even if when you actually get to Hawaii, you make sure to explain to every potential robber that you really really don’t want to be robbed?
Your argument here is both circular, and committing the noncentral fallacy!
To recap:
In a debate with rohimshah over whether taxation can be consensual (and therefore not theft), your argument reads:
Taxation is analogous to robbery
Robbery (even robbery that predictably occurs when I consume a good or service) is not consensual
Therefore, taxation (even taxation that predictably occurs when I consume a good or service) is not consensual
Therefore, taxation is theft
I won’t ding your OP for assuming that taxation is nonconsensual, since you were merely responding to Scott’s arguments that had already conceded that point.
However, to argue that all taxes are always nonconsensual is clearly absurd.
Many taxes (especially local ones) are nearly identical to fees that private actors charge under similar terms (e.g. property taxes are equivalent to HOA fees and rents). Not to mention plenty of times when people explicitly consent to taxation!
If you want to strengthen your argument, limit it to: ‘nonconsensual taxation is theft’.
Level 10 is just a mix of 2 and 3.
Most of these extra simulacra levels are redundant or orthogonal to the originals. I don’t think they carve reality well.
L5 overlaps heavily with L1. Interestingness is a quality of most L1 statements that are worth communicating!
L7 is L2. “There’s a lion across the river” = I want you to buy X. It’s direct manipulation of reality.
L8 is L3 or L4 (in your example), although propaganda can also be at L2.
L10 overlaps heavily with L3. “We should raise awareness of X” = I’m part of the group that believes “X”.
The only one that is salvageable is L6 (which is similar to L9), which I might call a True Level 5:
“There’s a lion across the river.” = Listen to me! I say things worth hearing! (from the speaker’s perspective)
There’s actually a lot of communication that falls within this bucket, characterized by the content of the statement having no instrumental value for the speaker. The speaker just wants your attention.
Kaj_sotala’s book summary provided me with something I hadn’t seen before—a non-mysterious answer to the question of consciousness. And I say this as someone who took graduate-level courses in neuroscience (albeit a few years before the book was published). Briefly, the book defines consciousness as the ability to access and communicate sensory signals, and shows that this correlates highly with those signals being shared over a cortical Global Neuronal Workspace (GNW). It further correlates with access to working memory. The review also gives a great account of the epistemic status of the major claims in the book: it reviews the evidence from several experiments discussed in the book itself, and then goes beyond this by discussing the epistemic status of those experiments themselves (e.g. in light of the replication crisis in psychology).
So kudos to both the book author and the review author. A decent follow-up would be to link these findings to the larger lesswrong agenda (although I note this review is part of a larger sequence that includes additional nominations).
Hmmmm. Unfortunately I’m not sure what to say to this one except that in logical induction, there’s not generally a pre-existing z we can update on like that.
So that’s my real crux, and any examples with telephone calls and earthquakes etc are merely illustrative for me. (Like I said, I don’t know how to actually motivate any of this stuff except with actual logical uncertainty, and I’m surprised that any philosophers would have become convinced just from other sorts of examples.)
I agree that the logical induction case is different, since it’s hard to conceive of likelihoods to begin with. Basically, logical induction doesn’t even include what I would call virtual evidence. But many of the examples you gave do have such a z. I think I agree with your crux, and my main critique here is just with the examples of an overly dogmatic Bayesian who refuses to acknowledge the difference between a and z. I won’t belabor the point further.
I’ve thought of another motivating example, BTW. In wartime, your enemy deliberately sends you some verifiably true information about their force dispositions. How should you update on that? You can’t use a Bayesian update, since you don’t actually have a likelihood model available. We can’t even attempt to learn a model from the information, since we can’t be sure it’s representative.
I don’t get this at all! What do you mean?
By model M, I mean an algorithm that generates likelihood functions, so M(H,Z) = P(Z|H).
So any time we talk about a likelihood P(Z|H), it should really read P(Z|H,M). We’ll posit that P(H,M) = P(H)P(M) (i.e. that the model says nothing about our priors), but this isn’t strictly necessary.
E(P(Z|H,M)) will be higher for a well-calibrated model than for a poorly calibrated one, which means that we expect P(H,M|Z) to also be higher. When we then marginalize over the models to get a final posterior on the hypothesis P(H|Z), it will be dominated by the well-calibrated models: P(H|Z) = SUM_i P(H|M_i,Z)P(M_i|Z).
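Here’s a toy numerical sketch of what I mean (my own construction, with made-up numbers): two models, one well calibrated and one poorly calibrated, are updated jointly with a binary hypothesis, and then marginalized out.

    # Joint update over hypotheses H and likelihood-generating models M,
    # then marginalization: P(H|Z) = SUM_i P(H|M_i,Z) P(M_i|Z).
    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.random(50) < 0.8                # data: 50 flips, true heads-rate 0.8

    # Each model maps a hypothesis to a heads-rate, i.e. generates a likelihood function.
    models = {
        "well_calibrated":   {"H": 0.80, "not_H": 0.20},
        "poorly_calibrated": {"H": 0.55, "not_H": 0.45},
    }
    prior_H, prior_M = 0.5, 1.0 / len(models)   # posit P(H,M) = P(H)P(M)

    joint = {}                               # unnormalized P(H, M | Z)
    for m, rates in models.items():
        for h, r in rates.items():
            loglik = np.sum(np.where(z, np.log(r), np.log(1 - r)))
            joint[(h, m)] = np.exp(loglik) * prior_H * prior_M
    total = sum(joint.values())

    p_H    = sum(v for (h, m), v in joint.items() if h == "H") / total
    p_well = sum(v for (h, m), v in joint.items() if m == "well_calibrated") / total
    print(f"P(H|Z) = {p_H:.3f}, P(well-calibrated model|Z) = {p_well:.3f}")
    # The well-calibrated model dominates the mixture, so P(H|Z) is driven by it.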
BTW, I had a chance to read part of the ILA paper. It barely broke my brain at all! I wonder if the trick of enumerating traders and incorporating them over time could be repurposed to a more Bayesianish context, by instead enumerating models M. Like the trading firm in ILA, a meta-Bayesian algorithm could keep introducing new models M_k over time, with the intuition that the calibration of the best model in the set would improve over time, perhaps giving it all those nice anti-Dutch-book properties. Basically, this is a computable Solomonoff induction that slowly approaches completeness in the limit. (I’m pretty sure this is not an original idea. I wouldn’t be surprised if something like this contributed to the ILA itself.)
Of course, it’s pretty unclear how this would work in the logical induction case. This might all be better explained in its own post.
You’re right, you could have an event in the event space which is just “the virtual-evidence update [such-and-such]”. I’m actually going to pull out this trick in a future follow-up post.
I note that that’s not how Pearl or Jeffrey understand these updates. And it’s a peculiar thing to do—something happens to make you update a particular amount, but you’re just representing the event by the amount you update. Virtual evidence as-usually-understood at least coins a new symbol to represent the hard-to-articulate thing you’re updating on.
That’s not quite what I had in mind, but I can see how my ‘continuously valued’ comment might have thrown you off. A more concrete example might help: consider Example 2 in this paper. It posits three events:
b—my house was burgled
a—my alarm went off
z—my neighbor calls to tell me the alarm went off
Pearl’s method is to take what would be uncertain information about a (via my model of my neighbor and the fact that she called me) and transform it into virtual evidence (which includes the likelihood ratio). What I’m saying is that you can just treat z as being an event itself, and do a Bayesian update from the likelihood P(z|b) = P(z|a)P(a|b) + P(z|~a)P(~a|b), etc. This will give you the exact same posterior as Pearl. Really, the only difference between these formulations is that Pearl only needs to know the ratio P(z|a):P(z|~a), whereas a traditional Bayesian update requires actual values. Of course, any set of values consistent with the ratio will produce the right answer.
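Here’s a quick numerical check of that claim (the numbers are mine, not from Pearl’s paper), comparing a direct Bayesian update on z with a Pearl-style virtual-evidence update that only uses the ratio P(z|a):P(z|~a):

    p_b = 0.01                       # prior on burglary
    p_a_b, p_a_nb = 0.95, 0.01       # P(a|b), P(a|~b): alarm model
    p_z_a, p_z_na = 0.80, 0.10       # P(z|a), P(z|~a): neighbor-calls model

    # Direct Bayesian update on z, using P(z|b) = P(z|a)P(a|b) + P(z|~a)P(~a|b).
    p_z_b  = p_z_a * p_a_b  + p_z_na * (1 - p_a_b)
    p_z_nb = p_z_a * p_a_nb + p_z_na * (1 - p_a_nb)
    post_direct = p_z_b * p_b / (p_z_b * p_b + p_z_nb * (1 - p_b))

    # Pearl-style virtual evidence: only the ratio P(z|a):P(z|~a) = 8:1 is used.
    ratio = p_z_a / p_z_na
    p_a = p_a_b * p_b + p_a_nb * (1 - p_b)
    post_a = ratio * p_a / (ratio * p_a + (1 - p_a))   # P(a | virtual evidence)
    # ...then propagate to b via P(b|a) and P(b|~a) (Jeffrey-style mixing).
    p_b_a  = p_a_b * p_b / p_a
    p_b_na = (1 - p_a_b) * p_b / (1 - p_a)
    post_virtual = post_a * p_b_a + (1 - post_a) * p_b_na

    print(post_direct, post_virtual)   # identical up to float rounding (~0.0674)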
The slightly more complex case (and why I mentioned continuous values) is in section 5, where the message includes probability data, such as a likelihood ratio. Note that the continuous value is not the amount you update (at least not generally), because it’s not generated from your own models, but rather by the messenger. Consider event z99, where my neighbor calls to say she’s 99% sure the alarm went off. This doesn’t mean I have to treat P(z99|a):P(z99|~a) as 99:1. I might model my neighbor as being poorly calibrated (or as not being independent of other information I already have), and use some other ratio.
In what sense? What technical claim about Bayesian updates are you trying to refer to?
Definitely the second one, as optimal update policy. Responding to your specific objections:
This is only true if the only information we have coming in is a sequence of propositions which we are updating 100% on.
As you’ll hopefully agree at this point, we can always manufacture the 100% condition by turning the incoming information into virtual evidence.
This optimality property only makes sense if we believe something like grain-of-truth.
I believe I previously conceded this point—the true hypothesis (or at least a ‘good enough’ one) must have a nonzero probability, which we can’t guarantee.
But properties such as calibration and convergence also have intuitive appeal
Re: calibration—I still believe that this can be included if you are jointly estimating your model and your hypothesis.
Re: convergence—how real a problem is this? In your example you had two hypotheses that were precisely equally wrong. Does convergence still fail if the true probability is 0.500001?
(By the way, I really appreciate your in-depth engagement with my position.)
Likewise! This has certainly been educational, especially in light of this:
Sadly, the actual machinery of logical induction was beyond the scope of this post, but there are answers. I just don’t yet know a good way to present it all as a nice, practical, intuitively appealing package.
The solution is too large to fit in the margins, eh? j/k, I know there’s a real paper. Should I go break my brain trying to read it, or wait for your explanation?
Phew! Thanks for de-gaslighting me.
I definitely missed a few things on the first read through—thanks for repeating the ratio argument in your response.
I’m still confused about this statement:
Virtual evidence requires probability functions to take arguments which aren’t part of the event space.
Why can’t virtual evidence messages be part of the event space? Is it because they are continuously valued?
As to why one would want to have Bayesian updates be normative: one answer is that they maximize our predictive power, given sufficient compute. Given the name of this website, that seems a sufficient reason.
A second answer you hint at here:
The second seems more practical for the working Bayesian.
As a working Bayesian myself, having a practical update rule is quite useful! As far as I can tell, I don’t see a good alternative in what you have provided.
Then we have to ask: why not (steelmanned) classical Bayesianism? I think you have two arguments here, one of which I buy and one of which I don’t.
The practical problem with this, in contrast to a more radical-probabilism approach, is that the probability distribution then has to explicitly model all of that stuff.
This is the weaker argument. Computing P(A*|X), “the likelihood I recall seeing A given X”, is not fundamentally different from modeling P(A|X), “the likelihood that signal A happened given X”. You have to model an extra channel effect or two, but that’s just a difference of degree.
Immediately after, though, you have the better argument:
As Scott and I discussed in Embedded World-Models, classical Bayesian models require the world to be in the hypothesis space (AKA realizability AKA grain of truth) in order to have good learning guarantees; so, in a sense, they require that the world is smaller than the probability distribution. Radical probabilism does not rest on this assumption for good learning properties.
If I were to paraphrase: classical Bayesianism can fail entirely when the world state does not fit any of its nonzero-probability hypotheses, which must of necessity be limited in any realizable implementation.
I find this pretty convincing. In my experience this is a problem that crops up quite frequently, and requires meta-Bayesian methods you mentioned like calibration (to notice you are confused) and generation of novel hypotheses.
(Although Bayesianism is not completely dead here. If you reformulate your estimation problem to be over the hypothesis space and model space jointly, then Bayesian updates can get you the sort of probability shifts discussed in Pascal’s Muggle. Of course, you still run into the ‘limited compute’ problem, but in many cases it might be easier than attempting to cover the entire hypothesis space. Probably worth a whole other post by itself.)
Why is a dogmatic Bayesian not allowed to update on virtual evidence? It seems like you (and Jeffrey?) have overly constrained the types of observations that a classical Bayesian is allowed to use, to essentially sensory stimuli. It seems like you are attacking a strawman, given that by your definition Pearl isn’t a classical Bayesian.
I also want to push back on this particular bit:
Richard Jeffrey (RJ): Tell me one piece of information you’re absolutely certain of in such a situation.
DP: I’m certain I had that experience, of looking at the cloth.
RJ: Surely you aren’t 100% sure you were looking at cloth. It’s merely very probable.
DP: Fine then. The experience of looking at … what I was looking at.
I’m pretty sure we can do better. How about:
DP: Fine then. I’m certain I remember believing that I had seen that cloth.
For an artificial dogmatic probabilist, the equivalent might be:
ADP: Fine then. I’m certain of evidence A*: that my probability-inference algorithm received a message with information about an observation A.
Essentially, we update on A* instead of A. When we compute the likelihood P(A*|X), we can attempt to account for all the problems with our senses, neurons, memory, etc. that result in P(A*|~A) > 0.
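To make that concrete (my notation, not from the dialogue): if we have a model of the channel from the event A to the received message A*, the dogmatic probabilist’s likelihood is just P(A*|X) = P(A*|A)P(A|X) + P(A*|~A)P(~A|X), and the update on X is an ordinary Bayesian update, P(X|A*) proportional to P(A*|X)P(X), with P(A*|~A) > 0 carrying all of the sensory, neural, and memory noise.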
RJ still has a counterpoint here:
RJ: Again I doubt it. You’re engaging in inner-outer hocus pocus.* There is no clean dividing line before which a signal is external, and after which that signal has been “observed”. The optic nerve is a noisy channel, warping the signal. And the output of the optic nerve itself gets processed at V1, so the rest of your visual processing doesn’t get direct access to it, but rather a processed version of the information. And all this processing is noisy. Nowhere is anything certain. Everything is a guess. If, anywhere in the brain, there were a sharp 100% observation, then the nerves carrying that signal to other parts of the brain would rapidly turn it into a 99% observation, or a 90% observation...
But I don’t find this compelling. At some point there is a boundary to the machinery that’s performing the Bayesian update itself. If the message is being degraded after this point, then that means we’re no longer talking about a Bayesian updater.
Thanks for presenting your thesis. However, on closer inspection, one of your figures doesn’t support your argument. The figure that you point to as the ‘unfiltered’ data is measuring the cross-correlation between the Hanford and Livingston datasets, so we should expect it to look completely different from the datasets themselves.
I also want to push back on a particular point—there’s nothing wrong in principle with using a black-hole-shaped filter to find black holes. You just have to adjust the prior based on the complexity of your filter.
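To sketch what ‘adjust the prior’ could mean here (my gloss, not a claim about how LIGO actually weights its templates): give each candidate filter an Occam-style prior that shrinks with its description length, e.g. P(filter) proportional to 2^(-K(filter)). A highly tuned, black-hole-shaped template then needs a correspondingly larger likelihood ratio before you conclude the match is a real signal rather than an artifact of having searched over many complex filters.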
I’ve been lurking lesswrong for years, and this is the article that actually got me to create an account. I am promoting this to everyone I can who has a scrap of political influence—my bosses (I work at a major university), my local newspaper, my rabbis, my local politicians. Every state in the country should be enacting the same measures as New York and Texas.
I would urge the lesswrong community to
a: constructively critique the article as Chris recommends (use argument to make it stronger)
b: shut up and do the impossible—if your state governor hasn’t already shut down restaurants, public gatherings, and restricted all non-essential travel, get them to do it ASAP. If we figured out how to get a handler to unbox a superhuman intelligence, and how to defeat Voldemort, we at least owe this an attempt.
I think that’s premature. This is just one (digital, synchronous) implementation of one model of BNN that can be shown to converge on the same result as backprop. In a neuromorphic implementation of this circuit, the convergence would occur on the same time scale as the forward propagation.