# AlexMennen (Alex Mennen)

Karma: 4,414
• Something that I find unsatisfying about this is that the rationals aren’t privileged as a countable dense subset of the reals; they just happen to be a convenient one. The completions of the dyadic rationals, the rationals, and the algebraic real numbers are all the same. But if you require that an element of the completion, if equal to an element of the countable set being completed, must eventually certify this equality, then the completions of the dyadic rationals, rationals, and algebraic reals are all constructively inequivalent.

• This means that, in particular, if your real happens to be rational, you can produce the fact that it is equal to some particular rational number. Neither Cauchy reals nor Dedekind reals have this property.

• perhaps these are equivalent.

They are. To get enumerations of rationals above and below out of an effective Cauchy sequence, once the Cauchy sequence outputs a rational $q$ such that everything afterwards can only differ by at most $\varepsilon$, you start enumerating rationals below $q - \varepsilon$ as below the real and rationals above $q + \varepsilon$ as above the real. If the Cauchy sequence converges to $x$, and you have a rational $p < x$, then once the Cauchy sequence gets to the point where everything after is guaranteed to differ by less than $\frac{x - p}{2}$, you can enumerate $p$ as less than $x$.

• My take-away from this:

An effective Cauchy sequence converging to a real $x$ induces recursive enumerators for $\{q \in \mathbb{Q} : q < x\}$ and $\{q \in \mathbb{Q} : q > x\}$, because if $q < x$, then $q + \varepsilon < x$ for some rational $\varepsilon > 0$, so you eventually learn this.

The constructive meaning of a set is that membership should be decidable, not just semi-decidable.

If $x$ is irrational, then $\{q \in \mathbb{Q} : q < x\}$ and $\{q \in \mathbb{Q} : q > x\}$ are complements, and each semi-decidable, so they are decidable. If $x$ is rational, then the complement of $\{q \in \mathbb{Q} : q < x\}$ is $\{q \in \mathbb{Q} : q > x\} \cup \{x\}$, which is semi-decidable, so again these sets are decidable. So, from the point of view of classical logic, it’s not only true that Cauchy sequences and Dedekind cuts are equivalent, but also effective Cauchy sequences and effective Dedekind cuts are equivalent.

However, it is not decidable whether a given Cauchy-sequence real is rational or not, and if so, which rational it is. So this doesn’t give a way to construct decision algorithms for the sets $\{q \in \mathbb{Q} : q < x\}$ and $\{q \in \mathbb{Q} : q > x\}$ from recursive enumerators of them.
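The enumeration argument can be sketched in code. This is a minimal illustration, not anything from the original thread: `sqrt2_approx` is a hypothetical effective Cauchy sequence (bisection for the square root of 2) that returns an approximation together with an error bound, and `certify` is a made-up name for the semi-decision procedure. It only ever answers when some stage proves a strict inequality, which is why it never terminates with an answer on a rational that equals the limit.

```python
from fractions import Fraction

def sqrt2_approx(n):
    """Hypothetical effective Cauchy sequence for sqrt(2).

    Returns (q_n, eps_n) such that the limit lies within eps_n of q_n.
    """
    lo, hi = Fraction(1), Fraction(2)
    for _ in range(n):
        mid = (lo + hi) / 2
        if mid * mid < 2:
            lo = mid
        else:
            hi = mid
    return lo, hi - lo

def certify(q, approx, max_steps=64):
    """Semi-decide on which side of the limit the rational q lies.

    Returns "below" or "above" once some stage proves it. Returns None if
    no stage up to max_steps does, which is always the case when q equals
    the limit.
    """
    for n in range(max_steps):
        center, eps = approx(n)
        if q < center - eps:
            return "below"
        if q > center + eps:
            return "above"
    return None

print(certify(Fraction(7, 5), sqrt2_approx))
print(certify(Fraction(3, 2), sqrt2_approx))
```

The `max_steps` cutoff exists only so the sketch halts; a true enumerator would keep searching forever, which is exactly the gap between semi-decidable and decidable.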

• 23 Nov 2023 23:19 UTC
14 points

If board members have an obligation not to criticize their organization in an academic paper, then they should also have an obligation not to discuss anything related to their organization in an academic paper. The ability to be honest is important, and if a researcher can’t say anything critical about an organization, then non-critical things they say about it lose credibility.

• Yeah, I wasn’t trying to claim that the Kelly bet size optimizes a nonlogarithmic utility function exactly, just that, when the number of rounds of betting left is very large, the Kelly bet size sacrifices a very small amount of utility relative to optimal betting under some reasonable assumptions about the utility function. I don’t know of any precise mathematical statement that we seem to disagree on.

Well, we’ve established the utility-maximizing bet gives different expected utility from the Kelly bet, right? So it must give higher expected utility or it wouldn’t be utility-maximizing.

Right, sorry. I can’t read, apparently, because I thought you had said the utility-maximizing bet size would be higher than the Kelly bet size, even though you did not.

• Yeah, I was still being sloppy about what I meant by near-optimal, sorry. I mean the optimal bet size will converge to the Kelly bet size, not that the expected utility from Kelly betting and the expected utility from optimal betting converge to each other. You could argue that the latter is more important, since getting high expected utility in the end is the whole point. But on the other hand, when trying to decide on a bet size in practice, there’s a limit to the precision with which it is possible to measure your edge, so the difference between optimal bet and Kelly bet could be small compared to errors in your ability to determine the Kelly bet size, in which case thinking about how optimal betting differs from Kelly betting might not be useful compared to trying to better estimate the Kelly bet.

Even in the limit as the number of rounds goes to infinity, by the time you get to the last round of betting (or last few rounds), you’ve left the limit, since you have some amount of wealth and some small number of rounds of betting ahead of you, and it doesn’t matter how you got there, so the arguments for Kelly betting don’t apply. So I suspect that Kelly betting until near the end, when you start slightly adjusting away from Kelly betting based on some crude heuristics, and then doing an explicit expected value calculation for the last couple rounds, might be a good strategy to get close to optimal expected utility.

Incidentally, I think it’s also possible to take a limit where Kelly betting gets you optimal utility in the end by making the favorability of the bets go to zero simultaneously with the number of rounds going to infinity, so that improving your strategy on a single bet no longer makes a difference.

I think that for all finite $n$, the expected utility at timestep $n$ from utility-maximizing bets is higher than that from Kelly bets. I think this is the case even if the difference converges to 0, which I’m not sure it does.

Why specifically higher? You must be making some assumptions on the utility function that you haven’t mentioned.

• I do want to note though that this is different from “actually optimal”

By “near-optimal”, I meant converges to optimal as the number of rounds of betting approaches infinity, provided initial conditions are adjusted in the limit such that whatever conditions I mentioned remain true in the limit. (e.g. if you want Kelly betting to get you a typical outcome of $x$ in the end, then when taking the limit as the number of bets $n$ goes to infinity, you better have starting money $x / r^n$, where $r$ is the geometric growth rate you get from bets, rather than having a fixed starting money while taking the limit $n \to \infty$). This is different from actually optimal because in practice, you get some finite amount of betting opportunities, but I do mean something more precise than just that Kelly betting tends to get decent outcomes.

• The reason I brought this up, which may have seemed nitpicky, is that I think this undercuts your argument for sub-Kelly betting. When people say that variance is bad, they mean that because of diminishing marginal returns, lower variance is better when the mean stays the same. Geometric mean is already the expectation of a function that gets diminishing marginal returns, and when it’s geometric mean that stays fixed, lower variance is better if your marginal returns diminish even more than that. Do they? Perhaps, but it’s not obvious. And if your marginal returns diminish but less than for log, then higher variance is better. I don’t think any of median, mode, or looking at which thing more often gets a higher value are the sorts of things that it makes sense to talk about trading off against lowering variance either. You really want mean for that.

• Correct. This utility function grows fast enough that it is possible for the expected utility after many bets to be dominated by negligible-probability favorable tail events, so you’d want to bet super-Kelly.

If you expect to end up with lots of money at the end, then you’re right; marginal utility of money becomes negligible, so expected utility is greatly affected by negligible-probability unfavorable tail events, and you’d want to bet sub-Kelly. But if you start out with very little money, so that at the end of whatever large number of rounds of betting, you only expect to end up with money in most cases if you bet Kelly, then I think the Kelly criterion should be close to optimal.

(The thing you actually wrote is the same as log utility, so I substituted what you may have meant). The Kelly criterion should optimize this, and more generally $\mathbb{E}[(\log x)^k]$ for any $k$, if the number of bets is large. At least if $k$ is an integer, then, if $\log x$ is normally distributed with mean $\mu$ and standard deviation $\sigma$, then $\mathbb{E}[(\log x)^k]$ is some polynomial in $\mu$ and $\sigma$ that’s homogeneous of degree $k$. After a large number $n$ of bets, $\mu$ scales proportionally to $n$ and $\sigma$ scales proportionally to $\sqrt{n}$, so the value of this polynomial approaches its $\mu^k$ term, and maximizing it becomes equivalent to maximizing $\mu$, which the Kelly criterion does. I’m pretty sure you get something similar when $k$ is noninteger.
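As a worked illustration of the homogeneity claim, here are the first few raw moments of a normal variable. The symbols are assumptions introduced for this sketch: $X$ is the normally distributed quantity with mean $\mu$ and standard deviation $\sigma$, and $m$ and $s$ stand for a per-bet mean and standard deviation, so that $\mu = nm$ and $\sigma = \sqrt{n}\,s$ after $n$ bets.

```latex
\begin{align*}
\mathbb{E}[X^2] &= \mu^2 + \sigma^2 = n^2 m^2 + n s^2 \\
\mathbb{E}[X^3] &= \mu^3 + 3\mu\sigma^2 = n^3 m^3 + 3 n^2 m s^2 \\
\mathbb{E}[X^4] &= \mu^4 + 6\mu^2\sigma^2 + 3\sigma^4 = n^4 m^4 + 6 n^3 m^2 s^2 + 3 n^2 s^4
\end{align*}
```

In each case every term has total degree $k$ in $(\mu, \sigma)$, and the $\mu^k$ term carries the highest power of $n$, so for large $n$ it dominates the others.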

It depends how much money you could end up with compared to . If Kelly betting usually gets you more than at the end, then you’ll bet sub-Kelly to reduce tail risk. If it’s literally impossible to exceed even if you go all-in every time and always win, then this is linear, and you’ll bet super-Kelly. But if Kelly betting will usually get you less than but not by too many orders of magnitude at the end after a large number of rounds of betting, then I think it should be near-optimal.

If there’s many rounds of betting, and Kelly betting will get you as a typical outcome, then I think Kelly betting is near-optimal. But you might be right if .

• If you bet more than Kelly, you’ll experience lower average returns and higher variance.

No. As they discovered in the dialog, average returns are maximized by going all-in on every bet with positive EV. It is typical returns that will be lower if you don’t bet Kelly.
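The distinction between average and typical returns can be checked numerically. The sketch below assumes a repeated even-money bet with win probability 0.6 (so the Kelly fraction is 0.2); these numbers and the function names are illustrative choices, not taken from the post. "Typical" wealth is modeled as the exponential of the expected log-growth.

```python
import math

def expected_wealth(f, p, n):
    """Mean final wealth (starting from 1) after n even-money bets at stake fraction f."""
    return (1 + (2 * p - 1) * f) ** n

def log_growth_rate(f, p):
    """Expected log-growth per round; -inf when going all-in with p < 1."""
    if f >= 1.0:
        return -math.inf if p < 1.0 else math.log(2.0)
    return p * math.log(1 + f) + (1 - p) * math.log(1 - f)

def typical_wealth(f, p, n):
    """Wealth implied by the expected log-growth, a proxy for the typical outcome."""
    return math.exp(n * log_growth_rate(f, p))

p, n = 0.6, 100
for f in (0.1, 0.2, 0.5, 1.0):
    print(f, expected_wealth(f, p, n), typical_wealth(f, p, n))
```

Under these assumptions the mean keeps increasing all the way to the all-in fraction, while the typical outcome peaks at the Kelly fraction and collapses to zero when going all-in.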

• For two, your specific claims about the likely confusion that Eliezer’s presentation could induce in “laymen” are empirically falsified to some degree by the comments on the original post: in at least one case, a reader noticed the issue and managed to correct for it when they made up their own toy example, and the first comment to explicitly mention the missing unitarity constraint was left over 10 years ago.

Some readers figuring out what’s going on is consistent with many of them being unnecessarily confused.

• I don’t think this one works. In order for the channel capacity to be finite, there must be some maximum number of bits N you can send. Even if you don’t observe the type of the channel, you can communicate a number n from 0 to N by sending n 1s and N-n 0s. But then even if you do observe the type of the channel (say, it strips the 0s), the receiver will still just see some number of 1s that is from 0 to N, so you have actually gained zero channel capacity. There’s no bonus for not making full use of the channel; in johnswentworth’s formulation of the problem, there’s no such thing as some messages being cheaper to transmit through the channel than others.
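The counting argument can be made concrete with a toy model. This is only a sketch under assumed details: `strip_zeros` is a hypothetical stand-in for one channel type, and the function names are invented here. It shows that the unary-style encoding survives the zero-stripping channel, and that the receiver can distinguish exactly N + 1 messages either way.

```python
def encode(n, N):
    """Send n as n ones followed by N - n zeros (requires 0 <= n <= N)."""
    return [1] * n + [0] * (N - n)

def strip_zeros(bits):
    """Hypothetical channel type that drops every 0."""
    return [b for b in bits if b == 1]

def decode(received):
    """The receiver just counts the ones that arrive."""
    return sum(received)

N = 5
for n in range(N + 1):
    assert decode(encode(n, N)) == n               # untouched channel
    assert decode(strip_zeros(encode(n, N))) == n  # zero-stripping channel

# Distinct outputs the receiver can see through the stripping channel.
print(len({tuple(strip_zeros(encode(n, N))) for n in range(N + 1)}))
```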

• We “just” need to update the three geometric averages on this background knowledge. Plausibly how this should be done in this case is to normalize them such that they add to one.

(In the case of the arithmetic mean, updating on the background information plausibly wouldn’t change anything here, but that’s not the case for other possible background information.)

Any linear constraints (which are the things you get from knowing that certain Boolean combinations of questions are contradictions or tautologies) that are satisfied by each predictor will also be satisfied by their arithmetic mean.
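A small numerical check of this point, using made-up forecasts: two hypothetical predictors each give coherent probabilities for mutually exclusive events A and B, so each satisfies the linear constraint P(A or B) = P(A) + P(B). The sketch pools them question by question with the arithmetic mean of probabilities and with the geometric mean of odds, then measures how far each pooled forecast is from satisfying the constraint.

```python
import math

def arithmetic_mean(ps):
    return sum(ps) / len(ps)

def geometric_mean_of_odds(ps):
    """Average the log-odds, then convert back to a probability."""
    mean_log_odds = sum(math.log(p / (1 - p)) for p in ps) / len(ps)
    odds = math.exp(mean_log_odds)
    return odds / (1 + odds)

# Hypothetical forecasts; A and B are mutually exclusive.
predictors = [
    {"A": 0.1, "B": 0.2, "A or B": 0.3},
    {"A": 0.5, "B": 0.3, "A or B": 0.8},
]

def pool(method):
    """Pool the predictors separately on each question."""
    return {q: method([p[q] for p in predictors]) for q in predictors[0]}

def constraint_gap(pooled):
    """Distance from satisfying P(A or B) = P(A) + P(B)."""
    return pooled["A or B"] - (pooled["A"] + pooled["B"])

print(constraint_gap(pool(arithmetic_mean)))
print(constraint_gap(pool(geometric_mean_of_odds)))
```

With these particular numbers the arithmetic pool satisfies the constraint up to floating-point error, while the geometric pool of odds misses it by several percentage points and would need a separate renormalization step.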

But it is anyway a more general question (than the question of whether the geometric mean of the odds is better or the arithmetic mean of the probabilities): how should we “average” two or more probability distributions (rather than just two probabilities), assuming they come from equally reliable sources?

That’s part of my point. Arithmetic mean of probabilities gives you a way of averaging probability distributions, as well as individual probabilities. Geometric mean of odds does not.

If we assume that the prior was indeed important here then this makes sense, but if we assume that the prior was irrelevant (that they would have arrived at 25% even if their prior was e.g. 10% rather than 50%), then this doesn’t make sense. (Maybe they first assumed the probability of drawing a black ball from an urn was 50%, then they each independently created a large sample, and ~25% of the balls came out black. In this case the prior was mostly irrelevant.) We would need a more general description under which circumstances the prior is indeed important in your sense and justifies the multiplicative evidence aggregation you proposed.

In this example, the sources of evidence they’re using are not independent; they can expect ahead of time that each of them will observe the same relative frequency of black balls from the urn, even while not knowing in advance what that relative frequency will be. The circumstances under which the multiplicative evidence aggregation method is appropriate are exactly the circumstances in which the evidence actually is independent.

But in the second case I don’t see how a noisy process for a probability estimate would lead to being “forced to set odds that you’d have to take bets on either side of, even someone who knows nothing about the subject could exploit you on average”.

They make their bet direction and size functions of the odds you offer them in such a way that they bet more when you offer better odds. If you give the correct odds, then the bet ends up resolving neutrally on average, but if you give incorrect odds, then which direction you are off in correlates with how big a bet they make in such a way that you lose on average either way.
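Here is one hypothetical way to realize such a strategy, as a sketch rather than the only construction. Assume you quote a price `q` for a contract paying 1 if the event happens, and your quote equals the true probability plus zero-mean noise. The bettor picks an arbitrary `reference` value (no knowledge of the true probability is needed) and buys `scale * (reference - q)` contracts, where a negative quantity means selling. The function computes the bettor's exact expected profit over a discrete noise distribution.

```python
def expected_bettor_profit(p_true, noise_outcomes, reference, scale=1.0):
    """Exact expected profit for the bettor.

    noise_outcomes: list of (eps, probability) pairs for the error in the quote.
    Each contract bought at price q has expected value p_true - q.
    """
    total = 0.0
    for eps, prob in noise_outcomes:
        q = p_true + eps                  # the noisy quote
        stake = scale * (reference - q)   # buys more when the price is lower
        total += prob * stake * (p_true - q)
    return total

noise = [(-0.1, 0.5), (0.1, 0.5)]  # zero-mean noise with variance 0.01
print(expected_bettor_profit(0.3, noise, reference=0.5))
print(expected_bettor_profit(0.3, noise, reference=0.9))
print(expected_bettor_profit(0.3, [(0.0, 1.0)], reference=0.5))
```

In this setup the expected profit works out to `scale` times the variance of the noise, whatever `reference` is: the term that depends on `reference` is multiplied by the mean of the noise, which is zero. With no noise the bettor breaks even on average.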

• I think the way I would rule out my counterexample is by strengthening A3 to if and then there is

• Q2: No. Counterexample: Suppose there’s one outcome $x$ such that all lotteries are equally good, except for the lottery that puts probability 1 on $x$, which is worse than the others.

• I’m not sure why you don’t like calling this “redundancy”. A meaning of redundant is “able to be omitted without loss of meaning or function” (Lexico). So ablation redundancy is the normal kind of redundancy, where you can remove something without losing the meaning. Here it’s not redundant, you can remove a single direction and lose all the (linear) “meaning”.

Suppose your datapoints are (where the coordinates and are independent from the standard normal distribution), and the feature you’re trying to measure is . A rank-1 linear probe will retain some information about the feature. Say your linear probe finds the coordinate. This gives you information about ; your expected value for this feature is now , an improvement over its a priori expected value of . If you ablate along this direction, all you’re left with is the coordinate, which tells you exactly as much about the feature as the coordinate does, so this rank-1 ablation causes no loss in performance. But information is still lost when you lose the coordinate, namely the contribution of from the feature. The thing that you can still find after ablating away the direction is not redundant with the rank-1 linear probe in the direction you started with, but just contributes the same amount towards the feature you’re measuring.

The point is, the reason why CCS fails to remove linearly available information is not because the data “is too hard”. Rather, it’s because the feature is non-linear in a regular way, which makes CCS and Logistic Regression suck at finding the direction which contains all linearly available data (which exists in the context of “truth”, just as it is in the context of gender and all the datasets on which RLACE has been tried).

Disagree. The reason CCS doesn’t remove information is neither of those, but instead just that that’s not what it’s trained to do. It doesn’t fail, but rather never makes any attempt. If you’re trying to train a function such that and , then will achieve optimal loss just like will.

• What you’re calling ablation redundancy is a measure of nonlinearity of the feature being measured, not any form of redundancy, and the view you quote doesn’t make sense as stated, as nonlinearity, rather than redundancy, would be necessary for its conclusion. If you’re trying to recover some feature $f$, and there’s any vector $v$ and scalar $b$ such that $f(x) = \langle v, x \rangle + b$ for all data $x$ (regardless of whether there are multiple such $v, b$, which would happen if the data is contained in a proper affine subspace), then there is a direction such that projection along it makes it impossible for a linear probe to get any information about the value of $f(x)$. That direction is $\Sigma v$, where $\Sigma$ is the covariance matrix of the data. This works because if $\langle u, \Sigma v \rangle = 0$, then the random variables $\langle u, x \rangle$ and $\langle v, x \rangle$ are uncorrelated (since $\mathrm{Cov}(\langle u, x \rangle, \langle v, x \rangle) = u^\top \Sigma v = 0$), and thus $\langle u, x \rangle$ is uncorrelated with $f(x)$.

If the data is normally distributed, then we can make this stronger. If there’s a vector $v$ and a function $g$ such that $f(x) = g(\langle v, x \rangle)$ (for example, if you’re using a linear probe to get a binary classifier, where it classifies things based on whether the value of a linear function is above some threshold), then projecting along $\Sigma v$ removes all information about $f(x)$. This is because uncorrelated linear features of a multivariate normal distribution are independent, so if $\langle u, \Sigma v \rangle = 0$, then $\langle u, x \rangle$ is independent of $\langle v, x \rangle$, and thus also of $f(x)$. So the reason what you’re calling high ablation redundancy is rare is that low ablation redundancy is a consequence of the existence of any linear probe that gets good performance and the data not being too wildly non-Gaussian.
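The covariance argument can be checked on synthetic data. The sketch below is an illustration under assumed details: it uses correlated 2-D Gaussian data, a hypothetical linear feature with probe vector `v`, and takes the ablated direction to be the covariance matrix applied to `v` (called `sigma_v` in the code). In two dimensions, projecting out a direction leaves a single coordinate, so it is enough to check that coordinate's covariance with the feature. For contrast, it also ablates along `v` itself.

```python
import random

def covariance(xs, ys):
    """Sample covariance of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def ablation_demo(seed=0, n=2000):
    rng = random.Random(seed)
    # Correlated 2-D Gaussian data: the second coordinate shares noise with the first.
    data = []
    for _ in range(n):
        a, b = rng.gauss(0, 1), rng.gauss(0, 1)
        data.append((a, 0.8 * a + 0.6 * b))
    x1 = [point[0] for point in data]
    x2 = [point[1] for point in data]

    v, offset = (1.0, 2.0), 3.0  # hypothetical linear feature: <v, x> + offset
    feature = [v[0] * a + v[1] * b + offset for a, b in data]

    # sigma_v = (sample covariance matrix of the data) applied to v
    s11, s12, s22 = covariance(x1, x1), covariance(x1, x2), covariance(x2, x2)
    sigma_v = (s11 * v[0] + s12 * v[1], s12 * v[0] + s22 * v[1])

    def leftover(direction):
        """The one coordinate that survives projecting out `direction` in 2-D."""
        u = (-direction[1], direction[0])  # orthogonal to `direction`
        return [u[0] * a + u[1] * b for a, b in data]

    cov_after_sigma_v = covariance(leftover(sigma_v), feature)
    cov_after_v = covariance(leftover(v), feature)
    return cov_after_sigma_v, cov_after_v

print(ablation_demo())
```

Because the sketch uses the sample covariance matrix, the first number is zero up to floating-point error, so no linear probe on the remaining coordinate can track the feature. The second number is clearly nonzero: ablating along `v` itself leaves linearly available information when the coordinates are correlated.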