AlexMennen(Alex Mennen)

Karma: 3,907
• This sort of thing seems to suggest that EY’s claims in this post about the scale of the relative intelligence differences between chimps, a village idiot, and Einstein are incorrect. The difference in intelligence between a village idiot and Einstein may be comparable to the difference in intelligence between some nonhuman animals and a human village idiot. This is a priori surprising, given that human brains are very structurally similar to each other in comparison to nonhuman animal brains.

• I think the assumption that multiple actions have nonzero probability in the context of a deterministic decision theory is a pretty big problem. If you come up with a model for where these nonzero probabilities are coming from, I don’t think your argument is going to work.

For instance, your argument fails if these nonzero probabilities come from epsilon exploration. If the agent is forced to take every action with probability epsilon, and merely chooses which action to assign the remaining probability to, then the agent will indeed purchase the contract for some sufficiently small price if , even if is not the optimal action (let’s say is the optimal action). When the time comes to take an action, the agent’s best bet is (prime meaning sell the contract for price ). The way I described the set-up, the agent doesn’t choose between and , because actions other than the top choice all happen with probability epsilon. The fact that the agent sells the contract back in its top choice isn’t a Dutch book, because the case where the agent’s top choice goes through is the case in which the contract is worthless, and the contract’s value is derived from other cases.

We could modify the epsilon exploration assumption so that the agent also chooses between and even while its top choice is . That is, there’s a lower bound on the probability with which the agent takes an action in , but even if that bound is achieved, the agent still has some flexibility in distributing probability between and . In this case, contrary to your argument, the agent will prefer rather than , i.e., it will not get Dutch booked. This is because the agent is still choosing as the only action with high probability, and refers to the expected consequence of the agent choosing as its intended action, so the agent cannot use when calculating which of or is better to pick as its next choice if its attempt to implement intended action fails.

Another source of uncertainty that the agent could have about its actions is if it believes it could gain information in the future, but before it has to make a decision, and this information could be relevant to which decision it makes. Say that and are the agent’s expectations at time of the utility that taking action would cause it to get, and the utility it would get conditional on taking action , respectively. Suppose the bookie offers the deal at time , and the agent must act at time . If the possibility of gaining future knowledge is the only source of the agent’s uncertainty about its own decisions, then at time , it knows what action it is taking, and is undefined on actions not taken. and should both be well-defined, but they could be different. The problem description should disambiguate between them. Suppose that every time you say and in the description of the contract, this means and , respectively. The agent purchases the contract, and then, when it comes time to act, it evaluates consequences by , not , so the argument for why the agent will inevitably resell the contract fails. If the appearing in the description of the contract instead means (since the agent doesn’t know what that is yet, this means the contract references what the agent will believe in the future, rather than stating numerical payoffs), then the agent won’t purchase it in the first place because it will know that the contract will only have value if seems to be suboptimal at time and it takes action anyway, which it knows won’t happen, and hence the contract is worthless.

• The Nirvana trick seems like a cheap hack, and I’m curious if there’s a way to see it as good reasoning.

One response to this was that predicting Nirvana in some circumstance is equivalent to predicting that there are no possible futures in that circumstance, which is a sensible thing to say as a prediction that that circumstance is impossible.

• That’s exactly what I was trying to say, not a disagreement with it. The only step where I claimed all reasonable ways of measuring spreadout-ness agree was on the result you get after summing up a large number of iid random variables, not the random variables that were being summed up.

• The “or any other measure of spreadout-ness” can be dropped here

What I meant is that, if you restrict attention to normal distributions with a fixed mean, then any reasonable measure of how spread out it is (including any of the E[|x-mean|^p]) will be a sufficient statistic, because any such measure, in order to be reasonable, must increase as variance increases (for normal distributions), so this function can be inverted to recover the variance. In other words, any other such measure will indeed be isomorphic to variance when restricted to normal distributions.

The value of m minimizing E[|X-m|] should change if I decrease the minimum X-value a lot, while leaving everything else constant

This does not change the minimizer of E[|X-m|] because it increases E[|X-m|] by the same amount for every m>min(X).

In general, you can’t decrease E[|X-m|] by moving m from median to median-d for d>0 because, for x≥median (half the distribution), you increase |X-m| by d, and for the other half, you decrease |X-m| by at most d.
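A small numerical check of the two claims above, as a sketch rather than a proof: the sample values and the grid of candidate m values are made up for illustration. It looks for the m that minimizes the mean absolute deviation, then repeats the search after pushing the minimum of the sample far down.

```python
def mean_abs_dev(xs, m):
    """Average of |x - m| over an equally weighted sample xs."""
    return sum(abs(x - m) for x in xs) / len(xs)

def best_m(xs, candidates):
    """Candidate m with the smallest mean absolute deviation."""
    return min(candidates, key=lambda m: mean_abs_dev(xs, m))

# Illustrative data (an assumption, not from the discussion): the median is 3.
sample = [1, 2, 3, 10, 20]
# Same sample with the minimum value moved far down; the median is still 3.
shifted = [-1000, 2, 3, 10, 20]
# Grid of candidate m values from -5.0 to 24.5 in steps of 0.5.
candidates = [c / 2 for c in range(-10, 50)]

print(best_m(sample, candidates))   # 3.0
print(best_m(shifted, candidates))  # 3.0
```

For every m at or above the original minimum, the two objective functions differ by the same constant, which is why the minimizer does not move.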

• Variance has more motivation than just that it’s a measure of how spread out the distribution is. Variance has the property that if two random variables are independent, then the variance of their sum is the sum of their variances. By the central limit theorem, if you add up a sufficiently large number of independent and identically distributed random variables, the distribution you get is well-approximated by a distribution that depends only on mean and variance (or any other measure of spreadout-ness). Since it is the variance of the distributions you were adding together that determines this, variance is exactly the thing you care about if you want to know the degree of spreadout-ness of a sum of a large number of independent variables from the distribution. If you take any measure of how spread out a distribution is that doesn’t carry the same information as the variance, then it will fail to predict how spread out the sum of a large number of independent copies of the distribution is, by any measure.

Edit: On the subject of other possible measures of features of probability distributions, one could also make the same complaint about mean as a measure of the middle of a distribution, when there are possible alternatives like median. Again, a similar sort of argument can be used to identify mean as the best one in some circumstances. But if I were to define a measure of how spread out a distribution is as E[|X-m|] for some m, I would use m=median rather than m=mean. This is because m=median minimizes this expected absolute value (in fact, median can be defined this way), so this measures the minimal average distance every point in the distribution has to travel in order for them to all meet at one point (the median is the most efficient point for them to meet).
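The additivity property of variance mentioned in this comment can be checked exactly on a toy example. This is a minimal sketch: the fair-coin distribution and the helper names are assumptions chosen for illustration, and the mean absolute deviation is taken around the mean here just to have a second spread measure to compare against.

```python
from fractions import Fraction
from itertools import product

def mean(dist):
    """Expected value of a finite distribution given as {value: probability}."""
    return sum(p * x for x, p in dist.items())

def variance(dist):
    mu = mean(dist)
    return sum(p * (x - mu) ** 2 for x, p in dist.items())

def mean_abs_dev(dist):
    """E[|X - mean|], used here as an alternative measure of spread."""
    mu = mean(dist)
    return sum(p * abs(x - mu) for x, p in dist.items())

def sum_dist(d1, d2):
    """Distribution of X + Y for independent X ~ d1 and Y ~ d2."""
    out = {}
    for (x, p), (y, q) in product(d1.items(), d2.items()):
        out[x + y] = out.get(x + y, 0) + p * q
    return out

coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}   # fair coin, illustrative choice
two = sum_dist(coin, coin)                      # sum of two independent flips

print(variance(coin), variance(two))            # 1/4 1/2
print(mean_abs_dev(coin), mean_abs_dev(two))    # 1/2 1/2
```

The variance doubles when two independent copies are added, while the mean absolute deviation does not, so it is not additive in the same way.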

• I was just thinking back to this, and it occurred to me that one possible reason to be unsatisfied with the arguments I presented here is that I started off with this notion of a crossing-over point as p continuously increases. But then when you asked “ok, but why is the crossing-over point 2?”, I was like “uh, consider that it might be an integer, and then do a bunch of very discrete-looking arguments that end up showing there’s something special about 2”, which doesn’t connect very well with the “crossover point when p continuously varies” picture. If indeed this seemed unsatisfying to you, then perhaps you’ll like this more:

If we have a norm $\|\cdot\|$ on a vector space $V$, then it induces a norm on its dual space $V^*$, given by $\|\varphi\| := \sup_{\|v\| = 1} |\varphi(v)|$. If a linear map preserves a norm, then its adjoint preserves the induced norm on the dual space.

Claim: The Lp norm on column vectors induces, as its dual, the Lq norm on row vectors, where p and q satisfy $\frac{1}{p} + \frac{1}{q} = 1$.

Thus if a matrix preserves Lp norm, then its adjoint preserves Lq norm. When p=2, we get that its adjoint preserves the same norm. This sort of gives you a natural way of seeing 2 as halfway between 1 and infinity, and giving, for every p, a corresponding q that is equally far away from the middle in the other direction, in the appropriate sense.

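Proof of claim: Given p and q such that $\frac{1}{p} + \frac{1}{q} = 1$, and a row vector $\varphi$ with Lq norm 1, let $x_i := |\varphi_i|^q$, so that $\sum_i x_i = 1$. Then let $v_i := \pm x_i^{1/p}$ (with the same sign as $\varphi_i$). The column vector $v$ has Lp norm 1. $\varphi v = \sum_i |\varphi_i| |v_i| = \sum_i x_i^{1/q} x_i^{1/p} = \sum_i x_i = 1$. This shows that the dual-Lp norm of $\varphi$ is at least 1. Standard constrained optimization techniques will verify that this $v$ maximizes $\varphi v$ subject to the constraint that $v$ has Lp norm 1, and thus that the dual-Lp norm of $\varphi$ is exactly 1.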
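The construction in the proof can be tried numerically. The sketch below assumes p = 3 and an arbitrary made-up row vector; the function names are hypothetical. It normalizes the row vector to Lq norm 1, builds the column vector from the proof, and checks that it has Lp norm 1 and pairs with the row vector to give 1.

```python
def lp_norm(v, p):
    """Lp norm of a finite vector for finite p >= 1."""
    return sum(abs(x) ** p for x in v) ** (1 / p)

def dual_maximizer(phi, p):
    """Given phi with Lq norm 1 (1/p + 1/q = 1), build v with Lp norm 1 and phi.v = 1."""
    q = p / (p - 1)
    xs = [abs(a) ** q for a in phi]   # x_i = |phi_i|^q, which sums to 1
    # v_i = x_i^(1/p), carrying the same sign as phi_i
    return [(1 if a >= 0 else -1) * x ** (1 / p) for a, x in zip(phi, xs)]

p = 3.0
q = p / (p - 1)                       # conjugate exponent, 1.5
raw = [2.0, -1.0, 0.5]                # arbitrary illustrative row vector
scale = lp_norm(raw, q)
phi = [a / scale for a in raw]        # rescaled to Lq norm 1

v = dual_maximizer(phi, p)
pairing = sum(a * b for a, b in zip(phi, v))

print(lp_norm(v, p))   # approximately 1
print(pairing)         # approximately 1
```

This only exhibits a vector attaining the value 1; that no unit vector does better is Hölder's inequality, which the code does not verify.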

Corollary: If a matrix preserves Lp norm for any p≠2, then it is a permutation matrix (up to flipping the signs of some of its entries).

Proof: Let q be such that $\frac{1}{p} + \frac{1}{q} = 1$. The columns of the matrix each have Lp norm 1, so the whole matrix has Lp norm $n^{1/p}$ (since the entries from each of the n columns contribute 1 to the sum). By the same reasoning about its adjoint, the matrix has Lq norm $n^{1/q}$. Assume wlog p<q. Lq norm is ≤ Lp norm for q>p, with equality only on scalar multiples of basis vectors. So if any column of the matrix isn’t a basis vector (up to sign), then its Lq norm is less than 1; meanwhile, all the columns have Lq norm at most 1, so this would mean that the Lq norm of the whole matrix is strictly less than $n^{1/q}$, contradicting the argument about its adjoint.
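To see the corollary in a concrete case, here is a rough sketch comparing a 45-degree rotation with a signed permutation matrix in two dimensions. The matrices, test vectors, and helper names are all illustrative assumptions, and checking a handful of vectors can only refute norm preservation, not prove it.

```python
import math

def lp_norm(v, p):
    """Lp norm, with p = math.inf giving the max norm."""
    if p == math.inf:
        return max(abs(x) for x in v)
    return sum(abs(x) ** p for x in v) ** (1 / p)

def apply(matrix, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in matrix]

def preserves(matrix, p, test_vectors, tol=1e-9):
    """True if the matrix keeps the Lp norm of every test vector (necessary, not sufficient)."""
    return all(
        abs(lp_norm(apply(matrix, v), p) - lp_norm(v, p)) < tol
        for v in test_vectors
    )

theta = math.pi / 4
rotation = [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]
signed_perm = [[0, -1],
               [1, 0]]
vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 2.0], [-3.0, 0.5]]

for p in (1, 2, 4, math.inf):
    print(p, preserves(rotation, p, vectors), preserves(signed_perm, p, vectors))
```

The rotation passes only for p = 2, while the signed permutation passes for every p tried.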

• Also, I’m curious what you think the connection is between the “L2 is connected to bilinear forms” and “L2 is the only Lp metric invariant under nontrivial change of basis”, if it’s easy to state.

This was what I was trying to vaguely gesture towards with the derivation of the “transpose = inverse” characterization of L2-preserving matrices; the idea was that the argument was a natural sort of thing to try, so if it works to get us a characterization of the Lp-preserving matrices for exactly one value of p, then that’s probably the one that has a different space of Lp-preserving matrices than the rest. But perhaps this is too sketchy and mysterian. Let’s try a dimension-counting argument.

Linear transformations and bilinear forms can both be represented with matrices. Linear transformations act on the space of bilinear forms by applying the linear transformation to both inputs before plugging them into the bilinear form. If the matrix $A$ represents a linear transformation and the matrix $B$ represents a bilinear form, then the matrix representing the bilinear form you get from this action is $A^T B A$. But whatever, the point is, so far we have an $n^2$-dimensional group acting on an $n^2$-dimensional space. But quadratic forms (like the square of the L2 norm) can be represented by symmetric matrices, the space of which is $\binom{n+1}{2}$-dimensional, and if $B$ is symmetric, then so is $A^T B A$. So now we have an $n^2$-dimensional group acting on a $\binom{n+1}{2}$-dimensional space, so the stabilizer of any given element must be at least $n^2 - \binom{n+1}{2} = \binom{n}{2}$ dimensional. As it turns out, this is exactly the dimensionality of the space of orthogonal matrices, but the important thing is that this is nonzero, which explains why the space of orthogonal matrices must not be discrete.

Now let’s see what happens if we try to adapt this argument to Lp and p-linear forms for some p≠2.

With p=1, a linear transformation preserving a linear functional corresponds to a matrix $A$ preserving a row vector $\varphi$ in the sense that $\varphi A = \varphi$. You can do a dimension-counting argument and find that there are tons of these matrices for any given row vector, but it doesn’t do you any good because 1 isn’t even so preserving the linear functional doesn’t mean you preserve L1 norm.

Let’s try p=4, then. A 4-linear form can be represented by an $n \times n \times n \times n$ hypermatrix, the space of which is $n^4$-dimensional. Again, we can restrict attention to the symmetric ones, which are preserved by the action of linear maps. But the space of symmetric hypermatrices is $\binom{n+3}{4}$-dimensional, still much more than $n^2$. This means that our linear maps can use up all of their degrees of freedom moving a symmetric 4-linear form around to different 4-linear forms without even getting close to filling up the whole space, and never get forced to use their surplus degrees of freedom on linear maps that stabilize a 4-linear form, so it doesn’t give us linear maps stabilizing L4 norm.
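The dimension counts in this argument are easy to tabulate. The sketch below uses the standard fact that symmetric k-linear forms on an n-dimensional space form a space of dimension C(n+k-1, k); the function names are hypothetical. It compares the n² dimensions of the group of linear maps against the space being acted on, for k = 2 and k = 4.

```python
from math import comb

def symmetric_form_dim(n, k):
    """Dimension of the space of symmetric k-linear forms on an n-dimensional space."""
    return comb(n + k - 1, k)

def stabilizer_dim_lower_bound(n, k):
    """Dimension-counting bound: dim(group) - dim(space acted on), floored at 0."""
    return max(0, n * n - symmetric_form_dim(n, k))

# Columns: n, bound for bilinear forms, n(n-1)/2, bound for 4-linear forms.
for n in range(2, 6):
    print(n,
          stabilizer_dim_lower_bound(n, 2),
          n * (n - 1) // 2,
          stabilizer_dim_lower_bound(n, 4))
# 2 1 1 0
# 3 3 3 0
# 4 6 6 0
# 5 10 10 0
```

For k = 2 the bound matches n(n-1)/2, the dimension of the orthogonal group; for k = 4 the bound is 0 for every n ≥ 2, so dimension counting alone forces no continuous family of stabilizers.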

• A related thing that’s special about the L2 norm is that there’s a bilinear form $\langle \cdot, \cdot \rangle$ such that |v| carries the same information as $\langle v, v \rangle$.

“Ok, so what? Can’t you do the same thing with any integer n, with an n-linear form?” you might reasonably ask. First of all, not quite, it only works for the even integers, because otherwise you need to use absolute value*, which isn’t linear.

But the bilinear forms really are the special ones, roughly speaking because they are a similar type of object to linear transformations. By currying, a bilinear form on V is a linear map $V \to V^*$, where $V^*$ is the space of linear maps $V \to \mathbb{R}$. Now the condition of a linear transformation preserving a bilinear form can just be written in terms of chaining linear maps together. A linear map $f: V \to V$ has an adjoint $f^*: V^* \to V^*$ given by $f^*(\varphi) := \varphi \circ f$ for $\varphi \in V^*$, and a linear map $f$ preserves a bilinear form $B: V \to V^*$ iff $f^* \circ B \circ f = B$. When using coordinates in an orthonormal basis, the bilinear form is represented by the identity matrix, so if $f$ is represented by the matrix $A$, this becomes $A^T A = I$, which is where the usual definition of an orthogonal matrix comes from. For quadrilinear forms etc, you can’t really do anything like this. So it’s L2 for which you get a way of characterizing “norm-preserving” in a nice clean linear-algebraic-in-character way, so it makes sense that that would be the one to have a different space of norm-preserving maps than the others.

I also subtly brushed past something that makes L2 a particularly special norm, although I guess it’s not clear if it helps. A nondegenerate bilinear form is the same thing as an isomorphism between $V$ and $V^*$. If $\langle v, v \rangle$ is always positive, then taking its square root gives you a norm, and that norm is L2 (though it may be disguised if you weren’t using an orthonormal basis); and if it isn’t always positive, then you don’t get a norm out of it at all. So L2 is unique among all possible norms in that it induces and comes from an identification between your vector space and its dual.

*This assumes your vector space is over $\mathbb{R}$ for simplicity. If it’s over $\mathbb{C}$, then you can’t get multilinearity no matter what you do, and the way this argument has to go is that you can get close enough by taking the complex conjugate of exactly half of the inputs, and then you get multilinearity from there. Speaking of $\mathbb{C}$, this reminds me that I was inappropriately assuming your vector space was over $\mathbb{R}$ in my previous comment. Over $\mathbb{C}$, you can multiply basis vectors by any scalar of absolute value 1, not just +1 and −1. This is broader than the norm-preserving changes of basis you can do over $\mathbb{R}$ to exactly the extent explicable by the fact that you’re sneaking in a little bit of L2 via the definition of the absolute value of a complex number.

• is the L2 norm preferred b/​c it’s the only norm that’s invariant under orthonormal change of basis, or is the whole idea of orthonormality somehow baking in the fact that we’re going to square and sqrt everything in sight (and if so how)

The L2 norm is the only Lp norm that can be preserved by any non-trivial change of basis (the trivial ones: permuting basis elements and multiplying some of them by −1). This follows from the fact that, for p≠2, the basis elements and their negatives can be identified just from the Lp norm and the addition and scalar multiplication operations of the vector space. To intuitively gesture at why this is so, let’s look at L1 and L∞.

In L1, the norm of the sum of two vectors is the sum of their norms iff for each coordinate, both vectors have components of the same sign; otherwise, they cancel in some coordinate, and the norm of the sum is smaller than the sum of the norms. 0 counts as the same sign as everything, so the more zeros a vector has in its coordinates, the more other vectors it will have the maximum possible norm of sum with. The basis vectors and their negations are thus distinguished as those unit vectors u for which the set {v : |u+v| = |u|+|v|} is maximal. Since the alternative to |u+v| = |u|+|v| is |u+v| < |u|+|v|, the basis vectors can be thought of as having maximal tendency for their sums with other vectors to have large norm.
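The L1 characterization above can be checked by brute force on a small grid. In this sketch the grid {-1, 0, 1}³ and the non-basis unit vector (1/2, 1/2, 0) are illustrative assumptions; it counts, for each unit vector u, how many grid vectors v satisfy |u+v| = |u|+|v| in the L1 norm.

```python
from fractions import Fraction
from itertools import product

def l1(v):
    """L1 norm: sum of absolute values of the coordinates."""
    return sum(abs(x) for x in v)

def additive_partners(u, grid):
    """Vectors v in grid with |u+v| = |u| + |v| in the L1 norm."""
    return [v for v in grid
            if l1([a + b for a, b in zip(u, v)]) == l1(u) + l1(v)]

grid = list(product([-1, 0, 1], repeat=3))        # 27 candidate vectors v
basis = (1, 0, 0)                                 # a basis vector
spread = (Fraction(1, 2), Fraction(1, 2), 0)      # another L1 unit vector

print(len(additive_partners(basis, grid)))    # 18
print(len(additive_partners(spread, grid)))   # 12
```

On this grid, every partner of the spread-out vector is also a partner of the basis vector, consistent with basis vectors having the maximal such set.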

In L∞, on the other hand, as long as you’re keeping the largest coordinate fixed, changing the other coordinates costs nothing in terms of the norm of the vector, but making those other coordinates larger still creates more opportunities to change the norm of other vectors when you add them together. So if you’re looking for a unit vector u that minimizes {v : |u+v| ≠ |v|}, u is a basis vector or the negation of one. The basis vectors have minimal tendency for their sums with other vectors to have large norm.

As p increases, the tendency for basis vectors to have large sums with other vectors decreases (as compared to the tendency for arbitrary vectors to have large sums with other vectors). There must be a cross-over point where whether or not a vector is a basis vector ceases to be predictive of the norm of its sum with an arbitrary other vector, and we lose the ability to figure out which vectors are basis vectors only at that point, which is p=2.

So if you’re trying to guess what sort of norm some vector space naturally carries (let’s say you’re given, as a hint, that it’s an Lp norm for some p), L2 should start out as a pretty salient option, along with, and arguably ahead of, L1 and L∞. As soon as you hear anything about there being multiple different bases that seem to have equal footing (as is saliently the case in QM), that settles it: L2 is the only option.

• I disagree that using the latter to generate a sensory stream from a quantum state yields reasonable predictions—eg, taken literally I think you’re still zeroing out all but a measure-zero subset of the position basis

The observation you got from your sample is information. Information is entropy, and entropy is locally finite. So I don’t think it’s possible for the states consistent with the observation you got from your sample to have measure zero.

• I don’t see the connection to the Jeffrey-Bolker rotation? There, to get the shouldness coordinate, you need to start with the epistemic probability measure, and multiply it by utility; here, utility is interpreted as a probability distribution without reference to a probability distribution used for beliefs.

• All that is indeed possible, but not guaranteed. The reason I was speculating that better brain imaging wouldn’t be especially useful for machine learning in the absence of better neuron models is that I’d assume that the optimization pressure that went into the architecture of brains was fairly heavily tailored to the specific behavior of the neurons that those brains are made of, and wouldn’t be especially useful relative to other neural network design techniques that humans come up with when used with artificial neurons that behave quite differently. But sure, I shouldn’t be too confident of this. In particular, the idea of training ML systems to imitate brain activation patterns, rather than copying brain architecture directly, is a possible way around this that I hadn’t considered.

• No. Scanning everything and then waiting until we have a good enough neuron model might work fine; it’s just that the scan wouldn’t give you a brain emulation until your neuron model is good enough.

Mapping Out Alignment

15 Aug 2020 1:02 UTC
42 points
• For individual ML models, sure, but not for classes of similar models. E.g. GPT-3 presumably was more expensive to train than GPT-2 as part of the cost to getting better results. For each of the proposals in the OP, training costs constrain how complex a model you can train, which in turn would affect performance.

• I’m confused about the motivation for in terms of time dilation in general relativity. I was under the impression that general relativity doesn’t even have a notion of gravitational potential, so I’m not sure what this would mean. And in Newtonian physics, potential energy is only defined up to an added constant. For to represent any sort of ratio (including proper time/​coordinate time), V would have to be well-defined, not just up to an arbitrary added constant.

I also had trouble figuring out the relationship between the Euler-Lagrange equation and extremizing S. The Euler-Lagrange equation looks to me like just a kind of funny way of stating Newton’s second law of motion, and I don’t see why it should be equivalent to extremizing action. Perhaps this would be obvious if I knew some calculus of variations?

• I’m concerned about Goodhart’s law on the acceptability predicate causing severe problems when the acceptability predicate is used in training. Suppose we take some training procedure that would otherwise result in an unaligned AI, and modify the training procedure by also including the acceptability predicate in the loss function during training. This results in an end product that has been trained to appear to satisfy the intended version of the acceptability predicate. One way that could happen is if it actually does satisfy what was intended by the acceptability predicate, which is great. But otherwise, we have made the bad behavior of the final product more difficult to detect, essentially by training the AI to be deceptively aligned.

• Is there a difference between training competitiveness and performance competitiveness? My impression is that, for all of these proposals, however much resources you’ve already put into training, putting more resources into training will continue to improve performance. If this is the case, then whether a factor influencing competitiveness is framed as affecting the cost of training or as affecting the performance of the final product, either way it’s just affecting the efficiency with which putting resources towards training leads to good performance. Separating competitiveness into training and performance competitiveness would make sense if there’s a fixed amount of training that must be done to achieve any reasonable performance at all, but past that, more training is not effective at producing better performance. My impression is that this isn’t usually what happens.

• Let α be the least countable ordinal such that there is no polynomial-time computable recursive well-ordering of length α.