Karma: 32

# Thoughts on the good regulator theorem

11 Aug 2022 12:08 UTC
12 points
• The number of elements in won’t change when removing every other element from it. The cardinality of is countable. And when you remove every other element, it is still countable, and indistinguishable from . If you’re unconvinced, ask yourself how many elements with every other element removed contains. The set is certainly not larger than , so it’s at most countable. But it’s certainly not finite either. Thus you’re dealing with a set of countably many 0s. As there is only one such multiset, equals with every other element removed.

That there is only one such multiset follows from the definition of a multiset, a set of pairs , where is an element and is its cardinality. It would also be true if we define multisets using sets containing all the pairs -- provided we ignore the identity of each pair. I believe this is where our disagreement lies. I ignore identities, working only with sets. I think you want to keep the identities intact. If we keep the identities, the set is not equal to , and my argument (as it stands) fails.

• I don’t understand what you mean. The upgraded individuals are better off than the non-upgraded individuals, with everything else staying the same, so it is an application of Pareto.

Now, I can understand the intuition that (a) and (b) aren’t directly comparable due to the identity of individuals. That’s what I mean by the caveat “(Unless we add an arbitrary ordering relation on the utilities or some other kind of structure.)”

• Pareto: If two worlds (w1 and w2) contain the same people, and w1 is better for an infinite number of them, and at least as good for all of them, then w1 is better than w2.

As far as I can see, the Pareto principle is not just incompatible with the agent-neutrality principle, it’s incompatible with set theory itself. (Unless we add an arbitrary ordering relation on the utilities or some other kind of structure.)

Let’s take a look at, for instance, vs , where is the multiset containing and is the disjoint union. Now consider the following scenarios:

(a) Start out with and multiply every utility by to get . Since infinitely many people are better off and no one is worse off, .

(b) Start out with and take every other of the -utilities from and change them to . Since a copy of is still left over, this operation leaves us with . Again, since infinitely many are better off and no one worse off, .

In conclusion, both and , a contradiction.

• Okay, thanks for the clarification! Let’s see if I understand your setup correctly. Suppose we have the probability measures and , where is the probability measure of the expert. Moreover, we have an outcome

In your post, you use , where is an unknown outcome known only to the expert. To use Bayes’ rule, we must make the assumption that . This assumption doesn’t sound right to me, but I suppose some strange assumption is necessary for this simple framework. In this model, I agree with your calculations.

Yes! If I am understanding this right, I think this gets to the crux of the post. The compression is lossy, and necessarily loses some information.

I’m not sure. When we’re looking directly at the probability of an event (instead of the probability of the probability of an event), things get much simpler than I thought.

Let’s see what happens to the likelihood when you aggregate from the expert’s point of view. Letting , we need to calculate the expert’s likelihoods and . In this case,

which is essentially your calculations, but from the expert’s point of view. The likelihood depends on , the prior of the expert, which is unknown to you. That shouldn’t come as a surprise, as he needs to use the prior of in order to combine the probability of the events and .

But the calculations are exactly the same from your point of view, leading to

Now, suppose we want to generally ensure that . I believe this is what you want to do, and it seems pretty natural, at least since we’re allowed to assume that for all simple events . To ensure this, we will probably have to require that your priors are the same as the expert’s. In other words, your joint distributions are equal, or .

Do you agree with this summary?

• I find the beginning of this post somewhat strange, and I’m not sure your post proves what you claim it does. You start out discussing what appears to be a combination of two forecasts, but present it as Bayesian updating. Recall that Bayes’ theorem says . To use this theorem, you need both an (your data / evidence), and a (your parameter). Using “posterior ∝ prior × likelihood” (with priors and likelihoods ), you’re talking as if your expert’s likelihood equals – but is that true in any sense? A likelihood isn’t just something you multiply with your prior, it is a conditional pmf or pdf with a different outcome than your prior.

I can see two interpretations of what you’re doing at the beginning of your post:

1. You’re combining two forecasts. That is, with being the outcome, you have your own pmf and the expert’s , then combine them using . That’s fair enough, but I suppose or maybe for some would be a better way to do it.

2. It might be possible to interpret your calculations as a proper application of Bayes’ rule, but that requires stretching it. Suppose is your subjective probability vector for the outcomes and is the subjective probability vector for the event supplied by an expert (the value of is unknown to us). To use Bayes’ rule, we will have to say that the evidence vector , the probability of observing an expert judgment of given that is true. I’m not sure we ever observe such quantities directly, and it is pretty clear from your post that you’re talking about in the sense used above, not .
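As a rough illustration of the two pooling ideas in interpretation 1, here is a minimal sketch. It assumes the product rule means a normalized elementwise product of the two pmfs, and that the suggested alternative is a weighted linear mixture; the function names, the `weight` parameter, and the example numbers are all hypothetical.

```python
def multiplicative_pool(p, q):
    """Normalized elementwise product of two pmfs (assumed form of the product rule)."""
    prod = [a * b for a, b in zip(p, q)]
    total = sum(prod)
    return [v / total for v in prod]


def linear_pool(p, q, weight=0.5):
    """Weighted average of two pmfs; weight is the share given to p (assumed form)."""
    return [weight * a + (1 - weight) * b for a, b in zip(p, q)]


# Hypothetical forecasts over three outcomes.
mine = [0.7, 0.2, 0.1]
expert = [0.5, 0.3, 0.2]

pooled_product = multiplicative_pool(mine, expert)
pooled_linear = linear_pool(mine, expert)

print("product pool:", pooled_product)
print("linear pool: ", pooled_linear)
```

With these numbers the product pool puts more mass on the first outcome than either forecaster does, while the linear pool stays between the two forecasts.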

Assuming interpretation 1, the rest of your calculations are not that interesting, as you’re using a method of knowledge pooling no one advocates.

Assuming interpretation 2, the rest of your calculations are probably incorrect. I don’t think there is a unique way to go from to, let’s say, , where is the expert’s probability vector over and your probability vector over .

• Children became grown-ups 200 years ago too. I don’t think we need to teach them anything at all, much less anything in particular.

According to this SSC post, kids can easily catch up in math even if they aren’t taught any math at all in the first five years of school.

In the Benezet experiment, a school district taught no math at all before 6th grade (around age 10-11). Then in sixth grade, they started teaching math, and by the end of the year, the students were just as good at math as traditionally-educated children with five years of preceding math education.

That would probably work for reading too, I guess. (Reading appears to require more purpose-built brain circuitry than math. At least I got that impression from reading Henrich’s WEIRD. I don’t have any references though.)

• I found this post interesting, especially the first part, but extremely difficult to understand (yeah, that hard). I believe some of the analogies might be valuable, but it’s simply too hard for me to confirm or disconfirm most of them. Here are some (but far from all!) examples:

1. About local optimizers. I didn’t understand this section at all! Are you claiming that gradient descent isn’t a local optimizer? Or are you claiming that neural networks can implement mesa-optimizers? Or something else?

2. The analogy to Bayesian reasoning feels forced and unrelated to your other points in the Bayes section. Moreover, Bayesian statistics typically doesn’t work (it’s inconsistent) when you ignore the normalizing constant. And in the case of neural networks, what is your prior? Unless you’re thinking about approximate priors using weight decay, most neural networks do not employ priors on their parameters.

3. In your linear model, you seem to interpret the maximum likelihood estimator of the parameters as a Bayesian estimator. Am I on the right track here?

4. Building on your linear toy model, it is natural to understand the weight decay parameters as priors, as that is what they are. (In an exact sense: with L2 weight decay you’re looking at ridge regression, which is a linear regression with normal priors on the parameters; L1 weight decay corresponds to Laplace priors, etc.) But you don’t do that. In what sense is it true that “the bayesian prior could be encoded purely in the initial weight distribution”? What’s more, it seems to me you’re thinking about the learning rate as your prior. I think this has something to do with your interpretation of the linear model maximum likelihood estimator as a Bayesian procedure...?
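The ridge/normal-prior correspondence in point 4 can be checked numerically. The sketch below is only an illustration under stated assumptions: a one-parameter linear model without intercept, known noise standard deviation `sigma`, a prior `w ~ N(0, tau^2)`, and made-up data. It compares the ridge estimate with penalty `lam = sigma^2 / tau^2`, the Bayesian posterior mean, and plain gradient descent on the L2-penalized loss.

```python
import random

random.seed(0)

# Hypothetical toy data for y = w * x + noise.
sigma, tau = 0.5, 2.0   # noise sd and prior sd (illustrative values)
xs = [random.gauss(0, 1) for _ in range(50)]
ys = [1.5 * x + random.gauss(0, sigma) for x in xs]

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Ridge regression: minimise sum((y - w*x)^2) + lam * w^2.
lam = sigma**2 / tau**2
w_ridge = sxy / (sxx + lam)

# Posterior mean (and mode) of w under the normal prior and normal noise.
w_posterior = (sxy / sigma**2) / (sxx / sigma**2 + 1 / tau**2)

# Plain gradient descent on the same penalized loss (L2 weight decay).
w = 0.0
lr = 0.25 / (sxx + lam)   # step size chosen so the iteration contracts
for _ in range(200):
    grad = 2 * (sxx + lam) * w - 2 * sxy
    w -= lr * grad

print(w_ridge, w_posterior, w)
```

All three numbers agree, which is the sense in which the L2 penalty acts as a normal prior in this toy setting; nothing here involves the initial weight or the learning rate beyond reaching the optimum.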

• I disagree. Sometimes your entire payoffs also change when you change your action space (in the informal description of the problem). That is the point of the last example, where precommitment changes the possible payoffs, not only restricts the action space.

• 3 Mar 2022 9:29 UTC
5 points

Paradoxical decision problems are paradoxical in the colloquial sense (such as Hilbert’s hotel or Bertrand’s paradox), not the literal sense (such as “this sentence is false”). Paradoxicality is in the eye of the beholder. Some people think Newcomb’s problem is paradoxical, some don’t. I agree with you and don’t find it paradoxical.

• Ah! Edited version: “there’s no *obvious* distribution ” (which could have been “natural distribution” or “canonical distribution”). The point is that you need more information than what should be sufficient (the effect of the action) to do evidential decision theory.

• Evidential decision theory boggles my mind.

I have some sympathy for causal decision theory, especially when the causal description matches reality. But evidential decision theory is 100% bonkers.

The most common argument against evidential decision theory is that it does not care about the consequences of your action. It cares about correlation (broadly speaking), not causality, and acts as if both were the same. This argument is sufficient to thoroughly discredit evidential decision theory, but philosophers keep giving it screen time.

Even if we lived in a world where correlation and causality were always the same (if that is possible), evidential decision theory would be wrong. Why? Because evidential decision theory requires distributions over actions and outcomes.

When you’re acting in a decision problem, your action will often, or even usually, be unique. No one has ever done that kind of action before. Consequently, there’s no obvious distribution over the action a and outcome x. But evidential decision theory requires such a distribution to function! Now you’ll have to bootstrap your way to a distribution , flexing your philosophical creativity muscles. I suppose you could make this equal to , the actual outcome when doing action a, at least when is deterministic. But why? You’ll just introduce probabilities where none are needed.

# JonasMoss’s Shortform

2 Mar 2022 14:07 UTC
1 point

# Ordinary and unordinary decision theory

2 Mar 2022 11:39 UTC
3 points

The p-values relevant for testosterone are on the lower side, with one of them at 0.049 (which screams p-hacking) and another at 0.02 (also really shitty). A reasonable back-of-the-envelope method to correct for p-hacking and publication bias involves multiplying the p-values by 20. (The reasoning is not super-involved: think about what happens to the truncated normal distribution in the case of complete publication bias.) In that case, none of the testosterone-related p-values in said paper are significant. I feel comfortable ignoring it.
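The arithmetic behind that rule of thumb is short enough to spell out. This is a minimal sketch, assuming the conventional 0.05 threshold and capping adjusted values at 1 (the cap and the names are illustrative choices of mine, not part of the rule as stated):

```python
ALPHA = 0.05    # conventional significance threshold (assumption)
FACTOR = 20     # multiplier from the back-of-the-envelope rule


def corrected(p, factor=FACTOR):
    """Multiply a p-value by the factor, capped at 1 since it is a probability."""
    return min(1.0, p * factor)


reported = [0.049, 0.02]                  # the two p-values quoted above
adjusted = [corrected(p) for p in reported]
still_significant = [p < ALPHA for p in adjusted]

print(adjusted)
print(still_significant)
```

Both adjusted values (roughly 0.98 and 0.4) are far above 0.05. Under this rule a reported p-value would need to be below 0.0025 to stay significant.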

• It’s a game, just a trivial one. Snakes and Ladders is also a game, and its payoff matrix is similar to this one, just with a little bit of randomness involved.

My intuition says that this game not only has maximal alignment, but is the only game (up to equivalence) with maximal alignment for any set of strategies . No matter what player 1 and player 2 do, the world is as good as it could be.

The case can be compared to the R² when the variance of the dependent variable is 0. How much of the variance in the dependent variable does the independent variable explain in this case? I’d say it’s all of it.

• This reminds me of the propensity of social scientists to drop inference when studying the entire population, claiming that confidence intervals do not make any sense when we have every single existing data point. But confidence intervals do make sense even then, as the entire observed population isn’t equal to the theoretical population. The observed population does not give us exact knowledge about any properties of the data generating mechanism, except in edge cases.

(Not that confidence intervals are very useful when looking at linear regressions with millions of data points anyway, but make sure to have your justification right.)

• I believe the upper right-hand corner of shouldn’t be 1; even if both players are acting in each other’s best interest, they are not acting in their own best interest. And alignment is about having both at the same time. The configuration of Prisoner’s dilemma makes it impossible to make both players maximally satisfied at the same time, so I believe it cannot have maximal alignment for any strategy.

Anyhow, your concept of alignment might involve altruism only, which is fair enough. In that case, Vanessa Kosoy has a similar proposal to mine, but not working with sums, which probably does exactly what you are looking for.

Getting alignment in the upper right-hand corner in the Prisoner’s dilemma matrix to be 1 may be possible if we redefine to , the best attainable payoff sum. But then zero-sum games will have maximal instead of minimal alignment! (This is one reason why I defined .)

(Btw, the coefficient isn’t symmetric; it’s only symmetric for symmetric games. No alignment coefficient depending on the strategies can be symmetric, as the vectors can have different lengths.)