On the contrary, I think there is no norm against board members criticizing corporate direction.
I think it is accepted that a member of the board of a for-profit corporation might publicly say that they think the corporation’s X division should be shut down, in order to concentrate investment in the Y division, since they think the future market for Y will be greater than for X, even though the rest of the board disagrees. This might be done to get shareholders on-side for this change of direction.
For a non-profit, criticism regarding whether the corporation is fulfilling its mandate is similarly acceptable. The idea that board members should resign if they think the corporation is not abiding by its mission is ridiculous—that would just lead to the corporation departing even more from its mission.
Compare with members of a legislative body. Legislators routinely say they disagree with the majority of the body, and nobody thinks the right move if they are on the losing side of a vote is to resign.
And, a member of the military who believes that they have been ordered to commit a war crime is not supposed to resign in protest (assuming that is even possible), allowing the crime to be committed. They are supposed to disobey the order.
The mathematics of a latent variable model expresses the probabilities, p(x), for observations x as marginal probabilities integrating over unobserved z. That is, p(x) = ∫ p(x,z) dz, where p(x,z) is typically written as p(z) p(x|z).
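As a concrete check (a toy example of my own, with a discrete z), the marginal can be computed directly:

```python
import numpy as np

# Hypothetical two-component model: z in {0, 1} with prior p(z),
# and p(x|z) a unit-variance Gaussian whose mean depends on z.
p_z = np.array([0.3, 0.7])
means = np.array([-2.0, 2.0])

def p_x(x):
    """Marginal p(x) = sum over z of p(z) p(x|z)."""
    p_x_given_z = np.exp(-0.5 * (x - means) ** 2) / np.sqrt(2 * np.pi)
    return np.sum(p_z * p_x_given_z)

# Sanity check: the marginal integrates to 1 over x.
xs = np.linspace(-10.0, 10.0, 2001)
print(np.trapz([p_x(x) for x in xs], xs))  # ~1.0
```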
It’s certainly correct that nothing in this formulation says anything about whether z captures the “causes” of x.
However, I think it sometimes is usefully seen that way. Your presentation would be clearer if you started with one or more examples of what you see as typical models, in which you argue that z isn’t usefully seen as causing x.
I’d take a typical vision model to be one in which z represents the position, orientation, and velocity of some object, at some time, and x is the pixel values from a video camera at some location at that time. Here, it does seem quite useful to view z as the cause of x. In particular, the physical situation is such that z at a future time is predictable from z now (assuming no forces act on the object), but x at a future time is not predictable from x now (both because x may not provide complete knowledge of position and orientation, and because x doesn’t include the velocity).
This is the opposite of what you seem to assume—that x now causes x in the future, but that this is not true for the “summary” z. But this seems to miss a crucial feature of all real applications—we don’t observe the entire state of the world. One big reason to have an unobserved z is to better represent the most important features of the world, which are not entirely inferable from x. Looking at x at several times may help infer z, and to the extent we can’t, we can represent our uncertainty about z and use this to know how uncertain our predictions are. (In contrast, we are never uncertain about x—it’s just that x isn’t the whole world.)
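Here is a toy simulation of that asymmetry (my own construction, with made-up dynamics): z = (position, velocity) evolves deterministically, while x records position only, so knowing x now is not enough to predict x later.

```python
import numpy as np

rng = np.random.default_rng(0)

def step(z):
    pos, vel = z
    return np.array([pos + vel, vel])  # no forces: z at t+1 is fully determined by z at t

def observe(z):
    return z[0] + rng.normal(scale=0.1)  # x is a noisy position; velocity is unobserved

# Two worlds with the same position but opposite velocities give
# nearly identical x now, yet very different x one step later.
za = np.array([0.0, 1.0])
zb = np.array([0.0, -1.0])
print(observe(za), observe(zb))              # similar observations at time 0
print(observe(step(za)), observe(step(zb)))  # diverged observations at time 1
```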
OK. My views now are not far from those of some time ago, expressed at https://glizen.com/radfordneal/res-bayes-ex.html
With regard to machine learning, for many problems of small to moderate size, some Bayesian methods, such as those based on neural networks or mixture models that I’ve worked on, are not just theoretically attractive, but also practically superior to the alternatives.
This is not the case for large-scale image or language models, for which any close approximation to true Bayesian inference is very difficult computationally.
However, I think Bayesian considerations have nevertheless provided more insight than frequentism in this context. My results from 30 years ago showing that infinitely-wide neural networks with appropriate priors work well without overfitting have been a better guide to what works than the rather absurd discussions by some frequentist statisticians of that time about how one should test whether a network with three hidden units is sufficient, or whether instead the data justifies adding a fourth hidden unit. Though as commented above, recent large-scale models are really more a success of empirical trial-and-error than of any statistical theory.
One can also look at Vapnik’s frequentist theory of structural risk minimization from around the same time period. This was widely seen as justifying use of support vector machines (though as far as I can tell, there is no actual formal justification), which were once quite popular for practical applications. But SVMs are not so popular now, being perhaps superseded by the mathematically-related Bayesian method of Gaussian process regression, whose use in ML was inspired by my work on infinitely-wide neural networks. (Other methods like boosted decision trees may also be more popular now.)
One reason that thinking about Bayesian methods can be fruitful is that they involve a feedback process:
1. Think about what model is appropriate for your problem, and what prior for its parameters is appropriate. These should capture your prior beliefs.
2. Figure out some computational method to get the posterior, and predictions based on it.
3. Check whether the posterior and/or predictions make sense, compared to your subjective posterior (informally combining prior and data). Perhaps also look at performance on a validation set, which is not necessary in Bayesian theory, but is a good idea in practice given human fallibility and computational limitations.
4. You can also try proving theoretical properties of the prior and/or posterior implied by step (1), or of the computational method of step (2), and see whether they are what you were hoping for.
5. If the result doesn’t seem acceptable, go back to step (1) and/or step (2). (A minimal code sketch of this loop follows below.)
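Here is a minimal sketch of the loop, using a made-up beta-binomial example where the posterior happens to be available in closed form:

```python
import numpy as np
from scipy import stats

# Step 1: model and prior. Say x ~ Binomial(n, theta), and we believe theta
# is probably near 0.5 -- captured (we hope) by a Beta(10, 10) prior.
a, b = 10, 10

# Step 2: computation. Here it is exact: theta | data ~ Beta(a+heads, b+tails).
heads, tails = 30, 10
posterior = stats.beta(a + heads, b + tails)

# Step 3: check against your subjective posterior, and perhaps a validation set.
print(posterior.mean())          # 0.667: between the prior mean (0.5) and the data frequency (0.75)
print(posterior.interval(0.95))  # does this range seem reasonable to you?
# If not, return to step 1 (rethink the prior/model) or step 2 (the computation).
```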
Prior beliefs are crucial here. There’s a tension between what works and what seems like the right prior. When these seem to conflict, you may gain better understanding of why the original prior didn’t really capture your beliefs, or you may realize that your computational methods are inadequate.
So, for instance, infinitely wide neural networks with independent finite-variance priors on the weights converge to Gaussian processes, with no correlations between different outputs. This works reasonably well, but isn’t what many people were hoping and expecting—no “hidden features” learned about the input. And non-Bayesian neural networks sometimes perform better than the corresponding Gaussian process.
Solution: Don’t use finite-variance priors. As I recommended 30 years ago. With infinite-variance priors, the infinite-width limit is a non-Gaussian stable process, in which individual units can capture significant hidden features.
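A quick simulation (my own sketch, using the conventional scalings: standard deviation 1/√H for finite-variance output weights, and 1/H scaling for Cauchy, i.e. α = 1 stable, weights) shows the difference in whether any individual hidden unit matters:

```python
import numpy as np

rng = np.random.default_rng(1)
H = 10_000   # hidden layer width
x = 1.0      # a single fixed input

def largest_unit_share(heavy_tailed):
    u = rng.normal(size=H)        # input-to-hidden weights
    h = np.tanh(u * x)            # hidden unit values
    if heavy_tailed:
        v = rng.standard_cauchy(size=H) / H   # infinite-variance prior
    else:
        v = rng.normal(size=H) / np.sqrt(H)   # finite-variance prior
    contrib = np.abs(v * h)
    return contrib.max() / contrib.sum()  # share of the biggest single unit

print(largest_unit_share(False))  # tiny: no unit matters individually (Gaussian process limit)
print(largest_unit_share(True))   # often substantial: single units can carry "features"
```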
OK. I think we may agree on the technical points. The issue may be with the use of the word “Bayesian”.
Me: But they aren’t guaranteed to eventually get a Bayesian to think the null hypothesis is likely to be false, when it is actually true.
You: Importantly, this is false! This statement is wrong if you have only one hypothesis rather than two.
I’m correct, by the usual definition of “Bayesian”, as someone who does inference by combining likelihood and prior. Bayesians always have more than one hypothesis (outside trivial situations where everything is known with certainty), with priors over them. In the example I gave, one can find a b such that the likelihood ratio with 0.5 is large, but the set of such b values will likely have low prior probability, so the Bayesian probably isn’t fooled. In contrast, a frequentist “pure significance test” does involve only one explicit hypothesis, though the choice of test statistic must in practice embody some implicit notion of what the alternative might be.
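To make that concrete (a numerical version of my own, not the original example): with data from a genuinely fair coin, the single best-fitting b always gives a likelihood ratio above 1, but the prior-averaged evidence typically does not favour the biased hypothesis.

```python
import numpy as np
from scipy.special import betaln, xlogy

rng = np.random.default_rng(2)
n = 100
k = rng.binomial(n, 0.5)   # heads in n flips of a genuinely fair coin

# Likelihood ratio for the single best-fitting bias b = k/n versus b = 0.5:
b_hat = k / n
log_lr_max = xlogy(k, b_hat) + xlogy(n - k, 1 - b_hat) - n * np.log(0.5)

# Bayes factor with b integrated over a uniform prior, versus b = 0.5:
log_bf = betaln(k + 1, n - k + 1) - n * np.log(0.5)

print(np.exp(log_lr_max))  # > 1: some b always fits better than 0.5
print(np.exp(log_bf))      # typically < 1: the averaged evidence doesn't favour bias
```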
Beyond this, I’m not really interested in debating to what extent Yudkowsky did or did not understand all nuances of this problem.
If I do an experiment, you generally don’t know the precise alternate hypothesis in advance—you want to test if the coin is fair, but you don’t know precisely what bias it will have if it’s unfair.
Yes. But as far as I can see this isn’t of any particular importance to this discussion. Why do you think it is?
If we fix the two alternate hypotheses in advance, and if I have to report all data, then I’m reduced to only hacking by choosing the experiment that maximizes the chance of luckily passing your threshold via fluke. This is unlikely, as you say, so it’s a weak form of “hacking”. But this is also what I’m reduced to in the frequentist world! Bayesianism doesn’t actually help. The key was (a) you forced me to disclose all data, and (b) we picked the alternate hypothesis in advance instead of only having a null hypothesis.
Actually, a frequentist can just keep collecting more data until they get p<0.05, then declare the null hypothesis to be rejected. No lying or suppression of data required. They can always do this, even if the null hypothesis is true: After collecting n data points, they have a 0.05 chance of seeing p<0.05. If they don’t, they then collect nK more data points, where K is big enough that whatever happened with the first n data points makes little difference to the p-value, so there’s still about a 0.05 chance that p<0.05. If that doesn’t produce a rejection, they collect nK² more data points, and so on until they manage to get p<0.05, which is guaranteed to happen eventually with probability 1.
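A small simulation of this schedule (my sketch, with a two-sided exact binomial test standing in for whatever test is being used):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def hack_until_significant(n=100, K=10, max_rounds=4):
    """Collect n, then n*K, then n*K**2, ... flips of a fair coin,
    testing after each batch, and stop as soon as p < 0.05."""
    heads = total = 0
    for r in range(max_rounds):
        batch = n * K**r
        heads += rng.binomial(batch, 0.5)
        total += batch
        if stats.binomtest(heads, total, 0.5).pvalue < 0.05:
            return True
    return False

rejections = [hack_until_significant() for _ in range(500)]
print(np.mean(rejections))  # well above 0.05, and it keeps growing with max_rounds
```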
But they aren’t guaranteed to eventually get a Bayesian to think the null hypothesis is likely to be false, when it is actually true.
I am saying that Yudkowsky is just plain wrong here, because omitting info is not the same as outright lying.
This is silly. Obviously, Yudkowsky isn’t going to go off on a tangent about all the ways people can lie indirectly, and how a Bayesian ought to account for such possibilities—that’s not the topic. In a scientific paper, it is implicit that all relevant information must be disclosed—not doing so is lying. Similarly, a scientific journal must ethically publish papers based on quality, not conclusion. They’re lying if they don’t. As for authors just not submitting papers with undesirable conclusions—well, that’s a known phenomenon, that one should account for, along with the possibility that a cosmic ray has flipped a bit in the memory of the computer that you used for data analysis, and the possibility that you misremembered something about one of the studies, and a million other possibilities that one can’t possibly discuss in every blog post.
This is never the scenario, though. It is very easy to tell that the coin is not 90% biased no matter what statistics you use.
You misunderstand. H is some hypothesis, not necessarily about coins. Your goal is to convince the Bayesian that H is true with probability greater than 0.9. This has nothing to do with whether some coin lands heads with probability greater than 0.9.
I can get a lot of mileage out of designing my experiment very carefully to target that specific threshold (though of course I can never guarantee success, so I have to try multiple colors of jelly beans until I succeed).
I don’t think so, except, as I mentioned, that you obviously will do an experiment that could conceivably give evidence meeting the threshold—I suppose that you can think about exactly which experiment is best very carefully, but that isn’t going to lead to anyone making wrong conclusions.
The person evaluating the evidence knows that you’re going to try multiple colors. A frequentist would handle this with some sort of p-value correction. A Bayesian handles this by a small prior probability of the drug working, which may partly be based on the knowledge that if drugs of this class (set of colors) had a high probability of working, there would probably already be evidence of this. But this has nothing to do with the point about the stopping rule for coin flips not affecting the likelihood ratio, and hence the Bayesian conclusion, whereas it does affect the p-value.
If you say that you are reporting all your observations, but actually report only a favourable subset of them, and the Bayesian for some reason assigns low probability to you deceiving them in this way, when actually you are deceiving them, then the Bayesian will come to the wrong conclusion. I don’t think this is surprising or controversial.
But I don’t see how the Bayesian comes to a wrong conclusion if you truthfully report all your observations, even if they are taken according to some scheme that produces a distribution of likelihood ratios that is supposedly favourable to you. The distribution doesn’t matter. Only the observed likelihood ratio matters.
For example, suppose you want to convince the Bayesian that H is true with probability greater than 0.9. Some experiments may never produce data giving a likelihood ratio extreme enough to produce such a high probability. So you don’t do such an experiment, and instead do one that could conceivably produce an extreme likelihood ratio. But it probably won’t, if H is not actually true. If it does produce strong evidence for H, the Bayesian is right to think that H is probably true, regardless of your motivations (as long as you truthfully report all the data).
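A simulation of exactly this setup (my own sketch; the 0.8 bias, the 0.1 prior, and the 0.9 threshold are arbitrary choices): an experimenter flips until the posterior crosses the threshold, reporting everything, and the Bayesian remains calibrated.

```python
import numpy as np

rng = np.random.default_rng(4)
prior_h = 0.1         # prior probability of H: the coin has heads-probability 0.8
f_h, f_0 = 0.8, 0.5   # heads probability under H and under not-H

def run(max_flips=500):
    h_true = rng.random() < prior_h
    f = f_h if h_true else f_0
    log_lr = 0.0
    for _ in range(max_flips):
        heads = rng.random() < f
        log_lr += np.log(f_h / f_0) if heads else np.log((1 - f_h) / (1 - f_0))
        odds = prior_h / (1 - prior_h) * np.exp(log_lr)
        if odds / (1 + odds) > 0.9:   # stop the instant the Bayesian is "convinced"
            return h_true, True
    return h_true, False

results = [run() for _ in range(5000)]
reached = [h for h, stopped in results if stopped]
print(len(reached) / len(results))  # how often the threshold is ever reached
print(np.mean(reached))             # among those runs, H really is true ~90% of the time
```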
I think that various “pro-fertility” people have a variety of motivations.
But “more people are better” ought to be a belief of everyone, whether pro-fertility or not. It’s an “other things being equal” statement, of course—more people at no cost or other tradeoff is good. One can believe that and still think that fewer people would be a good idea in the current situation. But if you don’t think more people are good when there’s no tradeoff, I don’t see what moral view you can have other than nihilism or some form of extreme egoism.
BTW: I’m not ruling out an expansive definition of “people”—maybe gorillas are people, maybe some alien species are, maybe some AIs would be—but I think that’s outside the scope of the current discussion.
Integrals of the likelihood function aren’t really meaningful, even if normalized so the integral is one over the whole range. This is because the result depends on the arbitrary choice of parameterization—e.g., whether you parameterize a probability by p in [0,1], or by log(p) in [−∞,0]. In Bayesian inference, one always integrates the likelihood only after multiplying by the prior, which can be seen as a specification of how the integration is to be done.
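A quick numerical illustration (my own, with an arbitrary made-up dataset): normalize the same binomial likelihood over p, and then over u = log p, and ask how much of the “area” lies below p = 0.5.

```python
import numpy as np

k, n = 7, 10   # 7 heads in 10 flips (arbitrary example data)
def lik(p):
    return p**k * (1 - p)**(n - k)

# Normalized over p in [0, 1]:
p = np.linspace(1e-6, 1 - 1e-6, 100001)
dens_p = lik(p) / np.trapz(lik(p), p)
print(np.trapz(dens_p[p < 0.5], p[p < 0.5]))            # ~0.11

# Normalized over u = log(p): same likelihood, different "area" below 0.5.
u = np.linspace(np.log(1e-6), 0.0, 100001)
dens_u = lik(np.exp(u)) / np.trapz(lik(np.exp(u)), u)
print(np.trapz(dens_u[u < np.log(0.5)], u[u < np.log(0.5)]))  # ~0.17
```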
I think you’ve got this pretty much figured out. But you may be missing an additional subtlety.
You say “Bayesian likelihood ratios really do only depend on the probability each hypothesis assigned only to the information that you received”. Which could be interpreted as saying that the “likelihood function” is the probability assigned to the information received, seen as a function of f. But the likelihood function is actually not a function at all, but rather an equivalence class of functions of f that differ only by an overall positive scale factor.
You can see how this matters when comparing a report of an observation of 6 flips, such as HHHHHT, versus a report that gives only the number of tails, which is 1 in this case. The probability of HHHHHT as a function of f is (1−f)^5 f, but the probability of 1 tail is 6 (1−f)^5 f, which is not the same function, but is in the same equivalence class, since it differs only by an overall factor of 6. Of course, this overall scale factor cancels out when looking at likelihood ratios for different values of f.
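A two-line check of this (where, as above, f is the probability of tails):

```python
from math import comb

def lik_sequence(f):   # probability of the exact sequence HHHHHT
    return (1 - f)**5 * f

def lik_count(f):      # probability of exactly 1 tail in 6 flips
    return comb(6, 1) * (1 - f)**5 * f

# Different functions of f, but every likelihood ratio agrees:
for f1, f2 in [(0.1, 0.5), (0.3, 0.9)]:
    print(lik_sequence(f1) / lik_sequence(f2), lik_count(f1) / lik_count(f2))
```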
Your factual claims here seem at least somewhat reasonable. Naively extrapolating sub-replacement fertility centuries into the future is silly. Our wealthy civilization ought to be capable of finding some way of coping with increased elder care. The current number of humans may perhaps be more than is optimal.
But your moral view is atrocious. Human extinction would be bad—very bad. Because humans are the most interesting species to arise so far. Because human striving is properly focused on the future, not just hedonistic pleasures of the moment. Because there may be a meaning to human existence that we can so far only dimly perceive.
And more humans are better, if they come at no (or small) cost to the quality of life of the existing humans. Human lives have intrinsic value, and every additional life adds value.
I do wonder if your moral views have biased your factual assessments as well.
I tested this with ChatGPT4 just now. It mostly got my specific questions right, but its commentary reveals some substantial misconceptions, such as that a flight from Ecuador to Sri Lanka would first pass over the Atlantic Ocean and then over Brazil.
Here it is:
If I flew by the most direct route from Ecuador to Sri Lanka, would I pass over Egypt?
If you were to fly the most direct route (a great circle route) from Ecuador to Sri Lanka, you would not pass over Egypt.
Here’s a general idea of what the path would look like:
Starting from Ecuador, the route would take you eastward over the Atlantic Ocean.
You’d then pass over parts of Brazil in South America.
Continuing over the South Atlantic Ocean, you’d cross over southern Africa, likely missing most of the major countries to the south.
From there, you’d head out over the Indian Ocean, aiming towards Sri Lanka.
The path would largely pass south of the equator and far to the south of Egypt. Remember, the shortest path between two points on a sphere (like Earth) isn’t necessarily intuitive when looking at typical flat maps, but becomes clearer when visualizing on a globe or using a great circle route tool.
Would I pass over Somalia?
If you fly the most direct route (a great circle route) from Ecuador to Sri Lanka, you would indeed come closer to the eastern coast of Africa, which includes Somalia. After crossing the Atlantic Ocean and part of southern Africa, the route could take you over or near the northern part of Madagascar and then across the Indian Ocean towards Sri Lanka.
Given this trajectory, it’s possible that you might pass over or near the southern part of Somalia or its adjacent waters. However, the exact trajectory can vary based on specific starting and ending points, as well as any airspace restrictions or other navigational considerations.
Would I pass over Columbia?
Let’s clarify the countries in question:
Ecuador is in the northwestern part of South America.
Colombia (with an “o”, not “u”) shares a border with Ecuador to its north.
If you’re flying a direct route (great circle route) from Ecuador to Sri Lanka, the trajectory would first take you eastward, likely across the northern part of Brazil. Given the proximity of Colombia to Ecuador, especially if your departure point in Ecuador is in the northern part of the country or near the border, it’s very plausible that your flight could pass over southern Colombia or skirt its southern border.
So, to answer your question: Yes, it’s possible that you would pass over parts of Colombia, depending on your exact departure point in Ecuador.
Would I pass over South Africa?
If you were flying the most direct route (a great circle route) from Ecuador to Sri Lanka, the trajectory would most likely not pass directly over South Africa.
After departing from Ecuador, the route would head eastward over the Atlantic Ocean, then cross over parts of South America (likely Brazil), continue across the South Atlantic, and then head towards the southern regions of the African continent. However, given the positions of Ecuador and Sri Lanka on the globe, the route would likely take you over or near the southeastern parts of Africa, such as Mozambique or the adjacent oceanic regions, rather than South Africa itself.
To summarize, while the flight might come close to South Africa or its adjacent waters, it would not necessarily pass directly over South Africa.
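For reference, the true great-circle path is easy to compute by spherical interpolation; a minimal sketch (the Quito and Colombo coordinates are approximate endpoints I chose; the transcript didn’t specify cities):

```python
import numpy as np

def to_xyz(lat, lon):
    lat, lon = np.radians(lat), np.radians(lon)
    return np.array([np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)])

def to_latlon(v):
    return np.degrees(np.arcsin(v[2])), np.degrees(np.arctan2(v[1], v[0]))

a = to_xyz(-0.2, -78.5)  # Quito, Ecuador (approximate)
b = to_xyz(6.9, 79.9)    # Colombo, Sri Lanka (approximate)

# Spherical linear interpolation along the great circle from a to b.
omega = np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))
for t in np.linspace(0, 1, 11):
    v = (np.sin((1 - t) * omega) * a + np.sin(t * omega) * b) / np.sin(omega)
    lat, lon = to_latlon(v)
    print(f"{t:.1f}  lat {lat:6.1f}  lon {lon:7.1f}")
```

The printed waypoints can then be compared against a map, which makes it easy to see which of the model’s claims hold up.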
Well, given that the text of the US constitution seems to clearly state that all powers not explicitly granted to the federal government belong to the states (or the people), I don’t see how “power is devolved to the states from the federal government”. It seems that the states don’t need to wait for the federal government to “devolve” power to them in order to do something. As indeed we saw recently with respect to covid policy.
You could argue that the federal government “lets” the states do this, in the sense that the federal government has more guns than the states, and hence could stop them if it wanted to. But this would be naive. These guns are operated by people, whose loyalty to the federal government if there were a conflict would not be automatic.
the constitution is quite clear that power is devolved to the states from the federal government, and not that the federal government is granted power at the behest of the states
The 10th amendment to the US constitution says:
The powers not delegated to the United States by the Constitution, nor prohibited by it to the States, are reserved to the States respectively, or to the people.
which sounds like the opposite of what you say. Of course, practice may be different.
Can you tell me why?
I think if we encountered aliens who were apparently not hostile, but presumably strange, and likely disgusting or disturbing in some ways, there would be three groups (likely overlapping) of people opposed to wiping them out:
Those who see wiping them out as morally wrong.
Those who see wiping them out as imprudent—we might fail, and then they wipe us out, or other aliens now see us as dangerous, and wipe us out.
Those who see wiping them out as not profitable—better to trade with them.
There would also be three groups in favour of wiping them out:
Those who see wiping them out as morally good—better if the universe doesn’t have such disgusting beings.
Those who see wiping them out as the prudent thing to do—wipe them out before they change their mind and do that to us.
Those who see wiping them out as profitable—then we can grab their resources.
I think it’s clear that people with all these views will exist, in non-negligible numbers. I think there’s at least a 5% chance that the “don’t wipe them out” people prevail.
Subgroups of our species are also actively wiping out other subgroups of our species they don’t like.
Yes, but that’s not how interactions between groups of humans always turn out.
We didn’t really wipe out the Neanderthals (assuming we even were a factor, rather than climate, disease, etc.), seeing as they are among our ancestors.
We are a species that has evolved in competition with other species. Yet I think there is at least a 5% chance that, if we encountered an intelligent alien species, we wouldn’t try to wipe them out (unless they were trying to wipe us out).
Biological evolution of us and aliens would in itself be a commonality, that might produce some common values, whereas there need be no common values with an AI created by a much different process and not successfully aligned.
Perhaps of relevance:
How to Tell the Birds from the Flowers
One problem I have with Diamond’s theory is that I doubt that there is anything for it to explain. The Americas and Eurasia/Africa were essentially isolated from each other for about 15,000 years. In 1500 AD, the Americas were roughly 3500 years less advanced than Eurasia/Africa. That seems well within the random variation one would expect between two isolated instances of human cultural development over a 15,000 year time span. If you think there is still some remaining indication that the Americas were disadvantaged, the fact that the Americas are about half the size of Eurasia/Africa seems like a sufficient explanation.
Perhaps you could give the definition you would use for the word “probability”.
I define it as one’s personal degree of belief in a proposition, at the time the judgement of probability is being made. It has meaning only in so far as it is (or may be) used to make a decision, or is part of a general world model that is itself meaningful. (For example, we might assign a probability to Jupiter having a solid core, even though that makes no difference to anything we plan to do, because that proposition is part of an overall theory of physics that is meaningful.)
Frequentist ideas about probability being related to the proportion of times that an event occurs in repetitions of a scenario are not part of this definition, so the question of what denominator to use does not arise. (Looking at frequentist concepts can sometimes be a useful sanity check on whether probability judgements make sense, but if there’s some conflict between frequentist and Bayesian results, the solution is to re-examine the Bayesian results, to see if you made a mistake, or to understand why the frequentist results don’t actually contradict the Bayesian result.)
If you make the right probability judgements, you are supposed to make the right decision, if you correctly apply decision theory. And Beauty does make the right decision in all the Sleeping Beauty scenarios if she judges that P(Heads)=1/3 when woken before Wednesday. She doesn’t make the right decision if she judges that P(Heads)=1/2. I emphasize that this is so for all the scenarios. Beauty doesn’t have to ask herself, “what denominator should I be using?”. P(Heads)=1/3 gives the right answer every time.
Another very useful property of probability judgements is that they can be used for multiple decisions, without change. Suppose, for example, that in the GWYD or GRYL scenarios, in addition to trying not to die, Beauty is also interested in muffins.
Specifically, she knows from the start that whenever she wakes up there will be a plate of freshly-baked muffins on her side table, purchased from the cafe down the road. She knows this cafe well, and in particular knows that (a) their muffins are always very delicious, and (b) on Tuesdays, but not Mondays, the person who bakes the muffins adds an ingredient that gives her a stomach ache 10 minutes after eating a muffin. Balancing these utilities, she decides to eat the muffins if the probability of it being Tuesday is less than 30%. If Beauty is a Thirder, she will judge the probability of Tuesday to be 1⁄3, and refrain from eating the muffins, but if Beauty is a Halfer, she will (I think, trying to pretend I’m a halfer) think the probability of Tuesday is 1⁄4, and eat the muffins.
The point here is not so much which decision is correct (though of course I think the Thirder decision is right), but that whatever the right decision is, it shouldn’t depend on whether Beauty is in the GWYD or GRYL scenario. She shouldn’t be considering “denominators”.
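For what it’s worth, the per-awakening frequency is easy to simulate, assuming the standard protocol (Heads: woken Monday only; Tails: woken Monday and Tuesday):

```python
import numpy as np

rng = np.random.default_rng(5)

days = []
for _ in range(100_000):
    if rng.random() < 0.5:
        days.append("Mon")          # Heads: one awakening, on Monday
    else:
        days += ["Mon", "Tue"]      # Tails: awakenings on Monday and Tuesday

p_tuesday = days.count("Tue") / len(days)
print(p_tuesday)  # ~1/3: above the 30% threshold, so the muffins go uneaten
```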
I think we actually have two quantities:
“Quobability”—The frequency of correct guesses made divided by the total number of guesses made.
“Srobability”—The frequency of trials in which the correct guess was made, divided by the number of trials.
Quobability is 1⁄3, Srobability is 1⁄2. “Probability” is (I think) an under-precise term that could mean either of the two.
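Taking the fixed guess “Heads” as a concrete case (my choice, for illustration), a simulation shows the two denominators give exactly these values:

```python
import numpy as np

rng = np.random.default_rng(6)

guesses_right = guesses_total = trials_right = trials = 0
for _ in range(100_000):
    heads = rng.random() < 0.5
    awakenings = 1 if heads else 2   # Heads: one awakening; Tails: two
    # Beauty guesses "Heads" at every awakening.
    guesses_right += awakenings if heads else 0
    guesses_total += awakenings
    trials_right += 1 if heads else 0
    trials += 1

print(guesses_right / guesses_total)  # quobability: ~1/3
print(trials_right / trials)          # srobability: ~1/2
```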
I suspect that the real problem isn’t with the word “probability”, but rather the word “guess”. In everyday usage, we use “guess” when the aim is to guess correctly. But the aim here is to not die.
Suppose we rephrase the GRYL scenario to say that Beauty at each awakening takes one of two actions—“action H” or “action T”. If the coin lands Heads, and Beauty takes action H the one time she is woken, then she lives (if she instead takes action T, she dies). If the coin lands Tails, and Beauty takes action T at least one of the two times she is woken, then she lives (if she takes action H both times, she dies).
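A direct simulation of these rules (just a check of the setup, under the strategy of taking action H or T with equal probability, independently at each awakening):

```python
import numpy as np

rng = np.random.default_rng(7)

def survives():
    heads = rng.random() < 0.5
    if heads:
        return rng.random() < 0.5   # lives iff she happens to take action H
    # Tails: two awakenings; lives iff she takes action T at least once.
    return rng.random() < 0.5 or rng.random() < 0.5

print(np.mean([survives() for _ in range(100_000)]))  # ~5/8 = (1/2)(1/2) + (1/2)(3/4)
```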
Having eliminated the word “guess”, why would one think that Beauty’s use of the strategy of randomly taking action H or action T with equal probabilities implies that she must have P(Heads)=1/2? As I’ve shown above, that strategy is actually only compatible with her belief being that P(Heads)=1/3.
Note that in general, the “action space” for a decision theory problem need not be the same as the “state space”. One might, for example, have some uncertain information about what day of the week it is (7 possibilities) and on that basis decide whether to order pepperoni, anchovy, or ham pizza (3 possibilities). (You know that different people, with different skills, usually make the pizza on different days.) So if for some reason you randomized your choice of action, it would certainly not say anything directly about your probabilities for the different days of the week.