kaarelh AT gmail DOT com

# Kaarel

Oh ok yea that’s a nice setup and I think I know how to prove that claim — the convex optimization argument I mentioned should give that. I still endorse the branch of my previous comment that comes after considering roughly that option though:

That said, if we conceive of the decision rule as picking out a single action to perform, then because the decision rule at least takes Pareto improvements, I think a convex optimization argument says that the single action it picks is indeed the maximal EV one according to some distribution ~~(though not necessarily one in your set)~~. However, if we conceive of the decision rule as giving preferences between actions or if we try to use it in some sequential setup, then I’m >95% sure there is no way to see it as EV max (except in some silly way, like forgetting you had preferences in the first place).

Sorry, I feel like the point I wanted to make with my original bullet point is somewhat vaguer/different than what you’re responding to. Let me try to clarify what I wanted to do with that argument with a caricatured version of the present argument-branch from my point of view:

your original question (caricatured): “The Sun prayer decision rule is as follows: you pray to the Sun; this makes a certain set of actions seem auspicious to you. Why not endorse the Sun prayer decision rule?”

my bullet point: “Bayesian expected utility maximization has this big red arrow pointing toward it, but the Sun prayer decision rule has no big red arrow pointing toward it.”

your response: “Maybe a few specific Sun prayer decision rules are also pointed to by that red arrow?”

my response: “The arrow does not point toward most Sun prayer decision rules. In fact, it only points toward the ones that are secretly bayesian expected utility maximization. Anyway, I feel like this does very little to address my original point that there is this big red arrow pointing toward bayesian expected utility maximization and no big red arrow pointing toward Sun prayer decision rules.”

(See the appendix to my previous comment for more on this.)

That said, I admit I haven’t said super clearly how the arrow ends up pointing to structuring your psychology in a particular way (as opposed to just pointing at a class of ways to behave). I think I won’t do a better job at this atm than what I said in the second paragraph of my previous comment.

The minimax regret rule (sec 5.4.2 of Bradley (2012)) is equivalent to EV max w.r.t. the distribution in your representor that induces maximum regret.

I’m (inside view) 99.9% sure this will be false/nonsense in a sequential setting. I’m (inside view) 99% sure this is false/nonsense even in the one-shot case. I guess the issue is that different actions get assigned their max regret by different distributions, so I’m not sure what you mean when you talk about **the** distribution that induces maximum regret. And indeed, it is easy to come up with a case where the action that gets chosen is not best according to any distribution in your set of distributions: let there be one action which is uniformly fine and also for each distribution in the set, let there be an action which is great according to that distribution and disastrous according to every other distribution; the uniformly fine action gets selected, but this isn’t EV max for any distribution in your representor. That said, if we conceive of the decision rule as picking out a single action to perform, then because the decision rule at least takes Pareto improvements, I think a convex optimization argument says that the single action it picks is indeed the maximal EV one according to some distribution (though not necessarily one in your set). However, if we conceive of the decision rule as giving preferences between actions or if we try to use it in some sequential setup, then I’m >95% sure there is no way to see it as EV max (except in some silly way, like forgetting you had preferences in the first place).

The maximin rule (sec 5.4.1) is equivalent to EV max w.r.t. the most pessimistic distribution.

I didn’t think about this as carefully, but >90% that the paragraph above also applies with minor changes.
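To make the one-shot counterexample above concrete, here is a minimal Python sketch. The three states, the representor, and all utilities are made-up numbers (assumptions for illustration, not anything from Bradley (2012)). It checks that minimax regret and maximin both pick the uniformly fine action, that this action is not EV max for any distribution in the representor, and that it is EV max for the uniform distribution, which is not one of the three distributions in the set.

```python
# Hypothetical toy problem: three states, three distributions in the representor.
STATES = range(3)

# Distribution k puts 0.9 on state k and 0.05 on each other state.
REPRESENTOR = [[0.9 if s == k else 0.05 for s in STATES] for k in STATES]

# Utilities per state. "fine" is uniformly fine; "bet_k" is great in
# state k and disastrous in the other states.
ACTIONS = {"fine": [5.0, 5.0, 5.0]}
for k in STATES:
    ACTIONS[f"bet_{k}"] = [10.0 if s == k else -10.0 for s in STATES]


def ev(action, dist):
    """Expected utility of an action under one distribution."""
    return sum(p * u for p, u in zip(dist, ACTIONS[action]))


def ev_max_actions(dist):
    """Set of actions that maximize expected utility under dist."""
    best = max(ev(a, dist) for a in ACTIONS)
    return {a for a in ACTIONS if abs(ev(a, dist) - best) < 1e-12}


def max_regret(action):
    """Largest shortfall from the best achievable EV, over the representor."""
    return max(
        max(ev(b, dist) for b in ACTIONS) - ev(action, dist)
        for dist in REPRESENTOR
    )


def worst_case_ev(action):
    """Lowest EV of the action over the representor."""
    return min(ev(action, dist) for dist in REPRESENTOR)


minimax_regret_choice = min(ACTIONS, key=max_regret)
maximin_choice = max(ACTIONS, key=worst_case_ev)
ev_max_within_representor = set().union(*(ev_max_actions(d) for d in REPRESENTOR))
uniform = [1 / 3] * 3

print(minimax_regret_choice, maximin_choice)
print(sorted(ev_max_within_representor))
print(ev_max_actions(uniform))
```

Note that each bet gets its maximum regret (and its worst-case EV) from a different distribution than the uniformly fine action does, so there is no single "regret-maximizing" or "most pessimistic" distribution to do EV max against.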

You might say “Then why not just do precise EV max w.r.t. those distributions?” But the whole problem you face as a decision-maker is, how do you decide which distribution? Different distributions recommend different policies. If you endorse precise beliefs, it seems you’ll commit to one distribution that you think best represents your epistemic state. Whereas someone with imprecise beliefs will say: “My epistemic state is not represented by just one distribution. I’ll evaluate the imprecise decision rules based on which decision-theoretic desiderata they satisfy, then apply the most appealing decision rule (or some way of aggregating them) w.r.t. my imprecise beliefs.” If the decision procedure you follow is psychologically equivalent to my previous sentence, then I have no objection to your procedure — I just think it would be misleading to say you endorse precise beliefs in that case.

I think I agree in some very weak sense. For example, when I’m trying to diagnose a health issue, I do want to think about which priors and likelihoods to use — it’s not like these things are immediately given to me or something. In this sense, I’m at some point contemplating many possible distributions to use. But I guess we do have some meaningful disagreement left — I guess I take the most appealing decision rule to be more like pure aggregation than you do; I take imprecise probabilities with maximality to be a major step toward madness from doing something that stays closer to expected utility maximization.

But the CCT only says that if you satisfy [blah], your policy is consistent with precise EV maximization. This doesn’t imply your policy is inconsistent with Maximality, nor (as far as I know) does it tell you the distribution with respect to which you should maximize precise EV in order to satisfy [blah] (or even that such a distribution is unique). So I don’t see a positive case here for precise EV maximization [ETA: as a procedure to guide your decisions, that is]. (This is also my response to your remark below about “equivalent to “act consistently with being an expected utility maximizer”.”)

I agree that any precise EV maximization (which imo = any good policy) is consistent with some corresponding maximality rule — in particular, with the maximality rule with the very same single precise probability distribution and the same utility function (at least modulo some reasonable assumptions about what ‘permissibility’ means). Any good policy is also consistent with any maximality rule that includes its probability distribution as one distribution in the set (because this guarantees that the best-according-to-the-precise-EV-maximization action is always permitted), as well as with any maximality rule that makes anything permissible. But I don’t see how any of this connects much to whether there is a positive case for precise EV maximization? If you buy the CCT’s assumptions, then you literally do have an argument that anything other than precise EV maximization is bad, right, which does sound like a positive case for precise EV maximization (though not directly in the psychological sense)?
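Here is a small Python sketch of the consistency claims in the paragraph above, under one reading of the maximality rule: an action is impermissible iff some single rival has strictly higher EV under every distribution in the set. That reading, the helper names, and the two-state example are assumptions made for illustration.

```python
def expected_value(utilities, dist):
    """Expected utility of a per-state utility vector under dist."""
    return sum(p * u for p, u in zip(dist, utilities))


def permissible(actions, representor):
    """Actions not strictly beaten by one rival under every distribution."""
    def dominated(a):
        return any(
            all(
                expected_value(actions[b], d) > expected_value(actions[a], d)
                for d in representor
            )
            for b in actions
            if b != a
        )

    return {a for a in actions if not dominated(a)}


# Hypothetical two-state decision problem (utilities per state).
actions = {
    "A": [1.0, 0.0],
    "B": [0.0, 1.0],
    "C": [0.4, 0.4],
    "D": [-1.0, -1.0],
}

precise = [[0.7, 0.3]]                # a single distribution
imprecise = [[0.7, 0.3], [0.2, 0.8]]  # a set containing that distribution

print(permissible(actions, precise))
print(permissible(actions, imprecise))
```

With the single distribution, the permissible set is just the EV-max action "A". With the larger set, "A" stays permissible because its distribution is still in the set, but "B" and "C" become permissible too. In this example "C" is not EV max under any distribution whatsoever (one of "A" or "B" always has EV at least 0.5), which is one way to see that the larger rule is not behaviorally the same thing as precise EV max.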

ETA: as a procedure to guide your decisions, that is

Ok, maybe you’re saying that the CCT doesn’t obviously provide an argument for it being good to restructure your thinking into literally maintaining some huge probability distribution on ‘outcomes’ and explicitly maintaining some function from outcomes to the reals and explicitly picking actions such that the utility conditional on these actions having been taken by you is high (or whatever)? I agree that trying to do this very literally is a bad idea, eg because you can’t fit all possible worlds (or even just one world) in your head, eg because you don’t know likelihoods given hypotheses as you’re not logically omniscient, eg because there are difficulties with finding yourself in the world, etc — when taken super literally, the whole shebang isn’t compatible with the kinds of good reasoning we actually can do and do do and want to do. I should say that I didn’t really track the distinction between the psychological and behavioral question carefully in my original response, and had I recognized you to be asking only about the psychological aspect, I’d perhaps have focused on that more carefully in my original answer. Still, I do think the CCT has something to say about the psychological aspect as well — it provides some pro tanto reason to reorganize aspects of one’s reasoning to go some way toward assigning coherent numbers to propositions and thinking of decisions as having some kinds of outcomes and having a schema for assigning a number to each outcome and picking actions that lead to high expectations of this number. This connection is messy, but let me try to say something about what it might look like (I’m not that happy with the paragraph I’m about to give and I feel like one could write a paper at this point instead). The CCT says that if you ‘were wise’ — something like ‘if you were to be ultimately content with what you did when you look back at your life’ — your actions would need to be a particular way (from the outside). 
Now, you’re pretty interested in being content with your actions (maybe just instrumentally, because maybe you think that has to do with doing more good or being better). In some sense, you know you can’t be fully content with them (because of the reasons above). But it makes sense to try to move toward being more content with your actions. One very reasonable way to achieve this is to incorporate some structure into your thinking that makes your behavior come closer to having these desired properties. This can just look like the usual: doing a bayesian calculation to diagnose a health problem, doing an EV calculation to decide which research project to work on, etc.

(There’s a chance you take there to be another sense in which we can ask about the reasonableness of expected utility maximization that’s distinct from the question that broadly has to do with characterizing behavior and also distinct from the question that has to do with which psychology one ought to choose for oneself — maybe something like what’s fundamentally principled or what one ought to do here in some other sense — and you’re interested in that thing. If so, I hope what I’ve said can be translated into claims about how the CCT would relate to that third thing.)

Anyway, if the above did not provide a decent response to what you said, then it might be worthwhile to also look at the appendix (which I ended up deprecating after understanding that you might only be interested in the psychological aspect of decision-making). In that appendix, I provide some more discussion of the CCT saying that [maximality rules which aren’t behaviorally equivalent to expected utility maximization are dominated]. I also provide some discussion recentering the broader point I wanted to make with that bullet point: that CCT-type stuff is a big red arrow pointing toward expected utility maximization, whereas no remotely-as-big red arrow is known for [imprecise probabilities + maximality].

e.g. if one takes the cost of thinking into account in the calculation, or thinks of oneself as choosing a policy

Could you expand on this with an example? I don’t follow.

For example, preferential gaps are sometimes justified by appeals to cases like: “you’re moving to another country. you can take with you your Fabergé egg xor your wedding album. you feel like each is very cool, and in a different way, and you feel like you are struggling to compare the two. given this, it feels fine for you to flip a coin to decide which one (or to pick the one on the left, or to ‘just pick one’) instead of continuing to think about it. now you remember you have 10 dollars inside the egg. it still seems fine to flip a coin to decide which one to take (or to pick the one on the left, or to ‘just pick one’).”. And then one might say one needs preferential gaps to capture this. But someone sorta trying to maximize expected utility might think about this as: “i’ll pick a randomization policy for cases where i’m finding two things hard to compare. i think this has good EV if one takes deliberation costs into account, with randomization maybe being especially nice given that my utility is concave in the quantities of various things.”.

Maximality and imprecision don’t make any reference to “default actions,”

I mostly mentioned defaultness because it appears in some attempts to precisely specify alternatives to bayesian expected utility maximization. One concrete relation is that one reasonable attempt at specifying what it is that you’ll do when multiple actions are permissible is that you choose the one that’s most ‘default’ (more precisely, if you have a prior on actions, you could choose the one with the highest prior). But if a notion of defaultness isn’t relevant for getting from your (afaict) informal decision rule to a policy, then nvm this!

I also don’t understand what’s unnatural/unprincipled/confused about permissibility or preferential gaps. They seem quite principled to me: I have a strict preference for taking action A over B (/ B is impermissible) only if I’m justified in beliefs according to which I expect A to do better than B.

I’m not sure I understand. Am I right in understanding that permissibility is defined via a notion of strict preferences, and the rest is intended as an informal restatement of the decision rule? In that case, I still feel like I don’t know what having a strict preference or permissibility means — is there some way to translate these things to actions? If the rest is intended as an independent definition of having a strict preference, then I still don’t know how anything relates to action either. (I also have some other issues in that case: I anticipate disliking the distinction between justified and unjustified beliefs being made (in particular, I anticipate thinking that a good belief-haver should just be thinking and acting according to their beliefs); it’s unclear to me what you mean by being justified in some beliefs (eg is this a non-probabilistic notion); are individual beliefs giving you expectations here or are all your beliefs jointly giving you expectations or is some subset of beliefs together giving you expectations; should I think of this expectation that A does better than B as coming from another internal conditional expected utility calculation). I guess maybe I’d like to understand how an action gets chosen from the permissible ones. If we do not in fact feel that all the actions are equal here (if we’d pay something to switch from one to another, say), then it starts to seem unnatural to make a distinction between two kinds of preference in the first place. (This is in contrast to: I feel like I can relate ‘preferences’ kinda concretely to actions in the usual vNM case, at least if I’m allowed to talk about money to resolve the ambiguity between choosing one of two things I’m indifferent between vs having a strict preference.)

Anyway, I think there’s a chance I’d be fine with sometimes thinking that various options are sort of fine in a situation, and I’m maybe even fine with this notion of fineness eg having certain properties under sweetenings of options, but I quite strongly dislike trying to make this notion of fineness correspond to this thing with a universal quantifier over your probability distributions, because it seems to me that (1) it is unhelpful because it (at least if implemented naively) doesn’t solve any of the computational issues (boundedness issues) that are a large part of why I’d entertain such a notion of fineness in the first place, (2) it is completely unprincipled (there’s no reason for this in particular, and the split of uncertainties is unsatisfying), and (3) it plausibly gives disastrous behavior if taken seriously. But idk maybe I can’t really even get behind that notion of fineness, and I’m just confusing it with the somewhat distinct notion of fineness that I use when I buy two different meals to distribute among myself and a friend and tell them that I’m fine with them having either one, which I think is well-reduced to probably having a smaller preference than my friend. Anyway, obviously whether such a notion of fineness is desirable depends on how you want it to relate to other things (in particular, actions), and I’m presently sufficiently unsure about how you want it to relate to these other things to be unsure about whether a suitable such notion exists.

basically everything becomes permissible, which seems highly undesirable

This is a much longer conversation, but briefly: I think it’s ad hoc / putting the cart before the horse to shape our epistemology to fit our intuitions about what decision guidance we should have.

It seems to me like you were like: “why not regiment one’s thinking xyz-ly?” (in your original question), to which I was like “if one regiments one’s thinking xyz-ly, then it’s an utter disaster” (in that bullet point), and now you’re like “even if it’s an utter disaster, I don’t care”. And I guess my response is that you should care about it being an utter disaster, but I guess I’m confused enough about why you wouldn’t care that it doesn’t make a lot of sense for me to try to write a library of responses.

# Appendix with some things about CCT and expected utility maximization and [imprecise probabilities] + maximality that got cut

Precise EV maximization is a special case of [imprecise probabilities] + maximality (namely, the special case where your imprecise probabilities are in fact precise, at least modulo some reasonable assumptions about what things mean), so unless your class of decision rules turns out to be precisely equivalent to the class of decision rules which do precise EV maximization, the CCT does in fact say it contains some bad rules. (And if it did turn out to be equivalent, then I’d be somewhat confused about why we’re talking about it your way, because it’d seem to me like it’d then just be a less nice way to describe the same thing.) And at least on the surface, the class of decision rules does not appear to be equivalent, so the CCT indeed does speak against some rules in this class (and in fact, all rules in this class which cannot be described as precise EV maximization).

If you filled in the details of your maximality-type rule enough to tell me what your policy is — in particular, hypothetically, maybe you’d want to specify sth like the following: what it means for some options to be ‘permissible’ or how an option gets chosen from the ‘permissible options’, potentially something about how current choices relate to past choices, and maybe just what kind of POMDP, causal graph, decision tree, or whatever game setup we’re assuming in the first place — such that your behavior then looks like bayesian expected utility maximization (with some particular probability distribution and some particular utility function), then I guess I’ll no longer be objecting to you using that rule (to be precise: I would no longer be objecting to it for being dominated per the CCT or some such theorem, but I might still object to the psychological implementation of your policy on other grounds).

That said, I think the most straightforward ways [to start from your statement of the maximality rule and to specify some sequential setup and to make the rule precise and to then derive a policy for the sequential setup from the rule] do give you a policy which you would yourself consider dominated though. I can imagine a way to make your rule precise that doesn’t give you a dominated policy that ends up just being ‘anything is permissible as long as you make sure you looked like a bayesian expected utility maximizer at the end of the day’ (I think the rule of Thornley and Petersen is this), but at that point I’m feeling like we’re stressing some purely psychological distinction whose relevance to matters of interest I’m failing to see.

But maybe more importantly, at this point, I’d feel like we’ve lost the plot somewhat. What I intended to say with my original bullet point was more like: we’ve constructed this giant red arrow (i.e., coherence theorems; ok, it’s maybe not that giant in some absolute sense, but imo it is as big as presently existing arrows get for things this precise in a domain this messy) pointing at one kind of structure (i.e., bayesian expected utility maximization) to have ‘your beliefs and actions ultimately correspond to’, and then you’re like “why not this other kind of structure (imprecise probabilities, maximality rules) though?” and then my response was “well, for one, there is the giant red arrow pointing at this other structure, and I don’t know of any arrow pointing at your structure”, and I don’t really know how to see your response as a response to this.

Here are some brief reasons why I dislike things like imprecise probabilities and maximality rules (somewhat strongly stated, medium-strongly held because I’ve thought a significant amount about this kind of thing, but unfortunately quite sloppily justified in this comment; also, sorry if some things below approach being insufficiently on-topic):

I like the canonical arguments for bayesian expected utility maximization ( https://www.alignmentforum.org/posts/sZuw6SGfmZHvcAAEP/complete-class-consequentialist-foundations ; also https://web.stanford.edu/~hammond/conseqFounds.pdf seems cool (though I haven’t read it properly)). I’ve never seen anything remotely close for any of this other stuff — in particular, no arguments that pin down any other kind of rule compellingly. (I associate with this the vibe here (in particular, the paragraph starting with “To the extent that the outer optimizer” and the paragraph after it), though I guess maybe that’s not a super helpful thing to say.)

The arguments I’ve come across for these other rules look like pointing at some intuitive desiderata and saying these other rules sorta meet these desiderata whereas canonical bayesian expected utility maximization doesn’t, but I usually don’t really buy the desiderata and/or find that bayesian expected utility maximization also sorta has those desired properties, e.g. if one takes the cost of thinking into account in the calculation, or thinks of oneself as choosing a policy.

When specifying alternative rules, people often talk about things like default actions, permissibility, and preferential gaps, and these concepts seem bad to me. More precisely, they seem unnatural/unprincipled/confused/[I have a hard time imagining what they could concretely cache out to that would make the rule seem non-silly/useful]. For some rules, I think that while they might be psychologically different than ‘thinking like an expected utility maximizer’, they give behavior from the same distribution — e.g., I’m pretty sure the rule suggested here (the paragraph starting with “More generally”) and here (and probably elsewhere) is equivalent to “act consistently with being an expected utility maximizer”, which seems quite unhelpful if we’re concerned with getting a differently-behaving agent. (In fact, it seems likely to me that a rule which gives behavior consistent with expected utility maximization basically had to be provided in this setup given https://web.stanford.edu/~hammond/conseqFounds.pdf or some other canonical such argument, maybe with some adaptations, but I haven’t thought this through super carefully.) (A bunch of other people (Charlie Steiner, Lucius Bushnaq, probably others) make this point in the comments on https://www.lesswrong.com/posts/yCuzmCsE86BTu9PfA/there-are-no-coherence-theorems; I’m aware there are counterarguments there by Elliott Thornley and others; I recall not finding them compelling on an earlier pass through these comments; anyway, I won’t do this discussion justice in this comment.)

I think that if you try to get any meaningful mileage out of the maximality rule (in the sense that you want to “get away with knowing meaningfully less about the probability distribution”), basically everything becomes permissible, which seems highly undesirable. This is analogous to: as soon as you try to get any meaningful mileage out of a maximin (infrabayesian) decision rule, every action looks really bad — your decision comes down to picking the least catastrophic option out of options that all look completely catastrophic to you — which seems undesirable. It is also analogous to trying to find an action that does something or that has a low probability of causing harm ‘regardless of what the world is like’ being imo completely impossible (leading to complete paralysis) as soon as one tries to get any mileage out of ‘regardless of what the world is like’ (I think this kind of thing is sometimes e.g. used in davidad’s and Bengio’s plans https://www.lesswrong.com/posts/pKSmEkSQJsCSTK6nH/an-open-agency-architecture-for-safe-transformative-ai?commentId=ZuWsoXApJqD4PwfXr , https://www.youtube.com/watch?v=31eO_KfkjRQ&t=1946s ). In summary, my inside view says this kind of knightian thing is a complete non-starter. But outside-view, I’d guess that at least some people that like infrabayesianism have some response to this which would make me view it at least slightly more favorably. (Well, I’ve only stated the claim and not really provided the argument I have in mind, but that would take a few paragraphs I guess, and I won’t provide it in this comment.)
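A rough way to see the direction of the first claim (not its magnitude) is to watch the permissible set as the set of distributions grows. The sketch below assumes the same reading of maximality as "impermissible iff one rival is strictly better under every distribution", and uses a randomly generated decision problem. The sizes, the seed, and the way distributions are sampled are all arbitrary choices for illustration.

```python
import random


def expected_value(utilities, dist):
    """Expected utility of a per-state utility vector under dist."""
    return sum(p * u for p, u in zip(dist, utilities))


def permissible(actions, representor):
    """Indices of actions no single rival strictly beats under every distribution."""
    evs = [[expected_value(a, d) for d in representor] for a in actions]

    def dominated(i):
        return any(
            all(eb > ea for ea, eb in zip(evs[i], evs[j]))
            for j in range(len(actions))
            if j != i
        )

    return {i for i in range(len(actions)) if not dominated(i)}


rng = random.Random(0)  # fixed seed so the run is repeatable
N_STATES, N_ACTIONS = 4, 30

# Hypothetical random decision problem: utilities in [-1, 1] per state.
actions = [[rng.uniform(-1, 1) for _ in range(N_STATES)] for _ in range(N_ACTIONS)]


def random_distribution():
    """A random probability vector over the states."""
    weights = [rng.random() for _ in range(N_STATES)]
    total = sum(weights)
    return [w / total for w in weights]


distributions = [random_distribution() for _ in range(16)]

# Nested representors: the first 1, 2, 4, 8, 16 distributions.
permitted_sets = [permissible(actions, distributions[:k]) for k in (1, 2, 4, 8, 16)]
print([len(s) for s in permitted_sets])
```

Because the representors are nested, the permissible sets are nested too: adding distributions can only make it harder for one action to beat another under all of them. How quickly the permissible set fills up depends on the problem, so this only illustrates the trend, not the stronger "basically everything" claim.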

To add: it seems basically confused to talk about **the** probability distribution on probabilities or probability distributions, as opposed to some joint distribution on two variables or **a** probability distribution on probability distributions or something. It seems similarly ‘philosophically problematic’ to talk about **the** set of probability distributions, to decide in a way that depends a lot on how uncertainty gets ‘partitioned’ into the set vs the distributions. (I wrote about this kind of thing a bit more here: https://forum.effectivealtruism.org/posts/Z7r83zrSXcis6ymKo/dissolving-ai-risk-parameter-uncertainty-in-ai-future#vJg6BPpsG93iyd7zo .)

I think it’s plausible there’s some (as-of-yet-undeveloped) good version of probabilistic thinking+decision-making for less-than-ideal agents that departs from canonical bayesian expected utility maximization; I like approaches to finding such a thing that take aspects of existing messy real-life (probabilistic) thinking seriously but also aim to define a precise formal setup in which some optimality result could be proved. I have some very preliminary thoughts on this and a feeling that it won’t look at all like the stuff I’ve discussed disliking above. Logical induction ( https://arxiv.org/abs/1609.03543 ) seems cool; a heuristic estimator ( https://arxiv.org/pdf/2211.06738 ) would be cool. That said, I also assign significant probability to nothing very nice being possible here (this vaguely relates to the claim: “while there’s a single ideal rationality, there are many meaningfully distinct bounded rationalities” (I’m forgetting whom I should attribute this to)).

I think most of the quantitative claims in the current version of the above comment are false/nonsense/[using terms non-standardly]. (Caveat: I only skimmed the original post.)

“if your first vector has cosine similarity 0.6 with d, then to be orthogonal to the first vector but still high cosine similarity with d, it’s easier if you have a larger magnitude”

If by ‘cosine similarity’ you mean what’s usually meant, which I take to be the cosine of the angle between two vectors, then the cosine only depends on the directions of vectors, not their magnitudes. (Some parts of your comment look like you meant to say ‘dot product’/‘projection’ when you said ‘cosine similarity’, but I don’t think making this substitution everywhere makes things make sense overall either.)
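As a quick sanity check of the scale-invariance point, here is a minimal Python sketch. The vectors are made up for illustration, chosen so that the cosine with d is 0.6.

```python
import math


def dot(u, v):
    """Euclidean dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine of the angle between u and v."""
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


d = [1.0, 0.0, 0.0]                 # unit vector
theta = [0.6, 0.8, 0.0]             # cosine 0.6 with d
scaled = [100.0 * x for x in theta]  # same direction, 100x the magnitude

print(cosine(theta, d), cosine(scaled, d))  # the cosine is unchanged by scaling
print(dot(theta, d), dot(scaled, d))        # the dot product scales with magnitude
```

Rescaling a vector changes its dot product (and projection) with d, but not its cosine with d.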

“then your method finds things which have cosine similarity ~0.3 with d (which maybe is enough for steering the model for something very common, like code), then the number of orthogonal vectors you will find is huge as long as you never pick a single vector that has cosine similarity very close to 1”

For 0.3 in particular, the number of orthogonal vectors with at least that cosine with a given vector d is actually small. Assuming I calculated correctly, the number of e.g. pairwise-dot-prod-less-than-0.01 unit vectors with that cosine with a given vector is at most 12 (the ambient dimension does not show up in this upper bound). I provide the calculation later in my comment.

“More formally, if theta0 = alpha0 d + (1 - alpha0) noise0, where d is a unit vector, and alpha0 = cosine(theta0, d), then for theta1 to have alpha1 cosine similarity while being orthogonal, you need alpha0 alpha1 + <noise0, noise1>(1-alpha0)(1-alpha1) = 0, which is very easy to achieve if alpha0 = 0.6 and alpha1 = 0.3, especially if noise1 has a big magnitude.”

This doesn’t make sense. For alpha1 to be cos(theta1, d), you can’t freely choose the magnitude of noise1.

## How many nearly-orthogonal vectors can you fit in a spherical cap?

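Proposition. Let $d$ be a unit vector and let $v_1, \ldots, v_n$ also be unit vectors such that they all sorta point in the $d$ direction, i.e., $v_i \cdot d \geq \delta$ for a constant $\delta$ (I take you to have taken $\delta = 0.3$), and such that the $v_i$ are nearly orthogonal, i.e., $v_i \cdot v_j \leq \epsilon$ for all $i \neq j$, for another constant $\epsilon$. Assume also that $\epsilon < \delta^2$. Then $n \leq \frac{1 - \epsilon}{\delta^2 - \epsilon}$.

Proof. We can decompose $v_i = \alpha_i d + \sqrt{1 - \alpha_i^2}\, u_i$, with $u_i$ a unit vector orthogonal to $d$; then $\alpha_i = v_i \cdot d \geq \delta$. Given $\epsilon < \delta^2$, it’s a 3d geometry exercise to show that pushing all vectors to the boundary of the spherical cap around $d$ can only decrease each pairwise dot product; doing this gives a new collection of unit vectors $v_i' = \delta d + \sqrt{1 - \delta^2}\, u_i$, still with $v_i' \cdot v_j' \leq \epsilon$. This implies that $u_i \cdot u_j \leq \frac{\epsilon - \delta^2}{1 - \delta^2}$. Note that since $\epsilon < \delta^2$, the RHS is some negative constant. Consider $\left\| \sum_{i=1}^n u_i \right\|^2$. On the one hand, it has to be nonnegative. On the other hand, expanding it, we get that it’s at most $n + n(n-1) \frac{\epsilon - \delta^2}{1 - \delta^2}$. From this, $(n - 1)(\delta^2 - \epsilon) \leq 1 - \delta^2$, whence $n \leq 1 + \frac{1 - \delta^2}{\delta^2 - \epsilon} = \frac{1 - \epsilon}{\delta^2 - \epsilon}$.

(acknowledgements: I learned this from some combination of Dmitry Vaintrob and https://mathoverflow.net/questions/24864/almost-orthogonal-vectors/24887#24887 )

For example, for $\delta = 0.3$ and $\epsilon = 0.01$, this gives $n \leq 12$.

(I believe this upper bound for the number of almost-orthogonal vectors is actually basically exactly met in sufficiently high dimensions; I can probably provide a proof (sketch) if anyone expresses interest.)

Remark. If $\epsilon > \delta^2$, then one starts to get exponentially many vectors in the dimension again, as one can see by picking a bunch of random vectors on the boundary of the spherical cap.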
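Here is a small numerical sanity check in Python, writing `delta` for the cosine threshold with d and `eps` for the allowed pairwise dot product. It assumes the bound takes the form (1 - eps) / (delta^2 - eps), which is what the norm-of-the-sum argument above yields. The regular-simplex construction and all function names are illustrative choices, not something taken from the linked MathOverflow answer.

```python
import math


def cap_bound(delta, eps):
    """Upper bound (1 - eps) / (delta^2 - eps); only valid when eps < delta^2."""
    if not eps < delta ** 2:
        raise ValueError("bound only applies when eps < delta**2")
    return (1 - eps) / (delta ** 2 - eps)


def simplex_cap_vectors(n, delta):
    """n unit vectors in R^(n+1), each with cosine delta to d = e_(n+1).

    Their components orthogonal to d form a regular simplex, so every
    pairwise dot product equals delta^2 - (1 - delta^2) / (n - 1).
    """
    scale = math.sqrt(1 - delta ** 2)
    vectors = []
    for i in range(n):
        # Basis vector e_i minus the centroid, then normalized.
        u = [(1.0 if j == i else 0.0) - 1.0 / n for j in range(n)]
        norm = math.sqrt(sum(x * x for x in u))
        vectors.append([scale * x / norm for x in u] + [delta])
    return vectors


def max_pairwise_dot(vectors):
    """Largest dot product between two distinct vectors."""
    return max(
        sum(a * b for a, b in zip(vectors[i], vectors[j]))
        for i in range(len(vectors))
        for j in range(i)
    )


delta, eps = 0.3, 0.01
n_max = math.floor(cap_bound(delta, eps))
print(n_max)
print(max_pairwise_dot(simplex_cap_vectors(n_max, delta)) <= eps)
print(max_pairwise_dot(simplex_cap_vectors(n_max + 1, delta)) <= eps)
```

For these two constants, the simplex arrangement fits `n_max` vectors within the tolerance but not `n_max + 1`, which is at least consistent with the bound being tight once the dimension is large enough to hold the simplex. The failure at `n_max + 1` only concerns this particular arrangement; that no arrangement works is what the proposition says.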

## What about the philosophical point? (low-quality section)

Ok, the math seems to have issues, but does the philosophical point stand up to scrutiny? Idk, maybe; I haven’t really read the post to check relevant numbers or to extract all the pertinent bits to answer this well. It’s possible it goes through with a significantly smaller cosine threshold $\delta$ or if the vectors weren’t really that orthogonal or something. (To give a better answer, the first thing I’d try to understand is whether this behavior is basically first-order: more precisely, is there some reasonable loss function on perturbations on the relevant activation space which captures perturbations being coding perturbations, and are all of these vectors first-order perturbations toward coding in this sense? If the answer is yes, then there just has to be such a vector: it’d just be the gradient of this loss.)

how many times did the explanation just “work out” for no apparent reason

From the examples later in your post, it seems like it might be clearer to say something more like “how many things need to hold about the circuit for the explanation to describe the circuit”? More precisely, I’m objecting to your “how many times” because it could plausibly mean “on how many inputs” which I don’t think is what you mean, and I’m objecting to your “for no apparent reason” because I don’t see what it would mean for an explanation to hold for a reason in this case.

# Finding the estimate of the value of a state in RL agents

# Interpretability: Integrated Gradients is a decent attribution method

# The Local Interaction Basis: Identifying Computationally-Relevant and Sparsely Interacting Features in Neural Networks

# The Deep Neural Feature Ansatz

@misc{radhakrishnan2023mechanism, title={Mechanism of feature learning in deep fully connected networks and kernel machines that recursively learn features}, author={Adityanarayanan Radhakrishnan and Daniel Beaglehole and Parthe Pandit and Mikhail Belkin}, year={2023}, url = { https://arxiv.org/pdf/2212.13881.pdf } }

## The ansatz from the paper

Let $h_i(x)$ denote the activation vector in layer $i$ on input $x$, with the input layer being at index $1$, so $h_1(x) = x$. Let $W_i$ be the weight matrix after activation layer $i$. Let $f_i$ be the function that maps from the $i$th activation layer to the output. Then their *Deep Neural Feature Ansatz* says that

$$W_i^T W_i \propto \frac{1}{n} \sum_{p=1}^{n} \nabla f_i(h_i(x_p)) \, \nabla f_i(h_i(x_p))^T,$$

where $x_1, \dots, x_n$ are the training inputs. (I’m somewhat confused here about them not mentioning the loss function at all: are they claiming this is reasonable for any reasonable loss function? Maybe just MSE? MSE seems to be the only loss function mentioned in the paper; I think they leave the loss unspecified in a bunch of places though.)

## A singular vector version of the ansatz

Letting $W_i = U \Sigma V^T$ be a SVD of $W_i$, we note that this is equivalent to

$$V \Sigma^T \Sigma V^T \propto \frac{1}{n} \sum_{p=1}^{n} \nabla f_i(h_i(x_p)) \, \nabla f_i(h_i(x_p))^T,$$

i.e., that the eigenvectors of the matrix $M$ on the RHS are the right singular vectors. By the variational characterization of eigenvectors and eigenvalues (Courant-Fischer or whatever), this is the same as saying that right singular vectors of $W_i$ are the highest orthonormal $v^T M v$ directions for the matrix $M$ on the RHS. Plugging in the definition of $M$, this is equivalent to saying that the right singular vectors are the sequence of highest-variance directions of the data set of gradients $\nabla f_i(h_i(x_p))$.
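A quick numerical sanity check of this equivalence, as a sketch under stated assumptions: the "gradients" below are synthetic random vectors rather than gradients of a trained network, the sizes `n` and `k` are made up, and `W` is constructed by hand so that it satisfies the ansatz exactly. It only illustrates the linear algebra (right singular vectors of `W` line up with the highest uncentered-variance directions of the gradient data set); it says nothing about whether the ansatz holds in real networks.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 500, 4  # hypothetical sample count and layer width

# Synthetic stand-in for the gradients grad f_i(h_i(x_p)): anisotropic Gaussian vectors.
scales = np.array([4.0, 2.0, 1.0, 0.5])
rotation, _ = np.linalg.qr(rng.normal(size=(k, k)))
grads = (rng.normal(size=(n, k)) * scales) @ rotation.T

# Average gradient outer product M (the RHS of the ansatz).
M = grads.T @ grads / n

# Construct a W with W^T W = M exactly, using arbitrary left singular vectors.
eigvals, eigvecs = np.linalg.eigh(M)
left, _ = np.linalg.qr(rng.normal(size=(k, k)))
W = left @ np.diag(np.sqrt(eigvals)) @ eigvecs.T

# Right singular vectors of W are the rows of Vt, ordered by decreasing singular value.
_, svals, Vt = np.linalg.svd(W)

# Uncentered variance of the gradient data set along each right singular vector.
variances = ((grads @ Vt.T) ** 2).mean(axis=0)

# Variance along random unit directions, for comparison with the top singular direction.
probes = rng.normal(size=(1000, k))
probes /= np.linalg.norm(probes, axis=1, keepdims=True)
probe_variances = ((grads @ probes.T) ** 2).mean(axis=0)

print("variance along right singular vectors:", variances)
print("squared singular values of W:         ", svals ** 2)
print("largest variance among random probes: ", probe_variances.max())
```

The variance along the $j$th right singular vector should match the $j$th squared singular value, and no random probe direction should beat the top one.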

(I have assumed here that the linearity is precise, whereas really it is approximate. It’s probably true though that with some assumptions, the approximate initial statement implies an approximate conclusion too? Getting approx the same vecs out probably requires some assumption about gaps in singular values being big enough, because the vecs are unstable around equality. But if we’re happy getting a sequence of orthogonal vectors that gets variances which are nearly optimal, we should also be fine without this kind of assumption. (This is guessing atm.))

## Getting rid of the dependence on the RHS?

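Assuming there isn’t an off-by-one error in the paper, we can pull some term out of the RHS maybe? This is because applying the chain rule to the Jacobians of the transitions $h_i \mapsto h_{i+1} \mapsto \text{output}$ gives $J f_i(h_i(x)) = J f_{i+1}(h_{i+1}(x)) \, W_i$, so

$$\frac{1}{n} \sum_{p=1}^{n} \nabla f_i(h_i(x_p)) \, \nabla f_i(h_i(x_p))^T = W_i^T \left( \frac{1}{n} \sum_{p=1}^{n} \nabla f_{i+1}(h_{i+1}(x_p)) \, \nabla f_{i+1}(h_{i+1}(x_p))^T \right) W_i.$$

Wait, so the claim is just

$$W_i^T W_i \propto W_i^T \left( \frac{1}{n} \sum_{p=1}^{n} \nabla f_{i+1}(h_{i+1}(x_p)) \, \nabla f_{i+1}(h_{i+1}(x_p))^T \right) W_i,$$

which, assuming $W_i$ is invertible, should be the same as $\frac{1}{n} \sum_{p} \nabla f_{i+1}(h_{i+1}(x_p)) \, \nabla f_{i+1}(h_{i+1}(x_p))^T \propto I$. But also, they claim that it is $\propto W_{i+1}^T W_{i+1}$? Are they secretly approximating everything with identity matrices?? This doesn’t seem to be the case from their Figure 2 though.

Oh oops I guess I forgot about activation functions here! There should be extra diagonal terms for jacobians of preactivations->activations in $J f_i$, i.e., it should really say

$$J f_i(h_i(x)) = J f_{i+1}(h_{i+1}(x)) \, D_i(x) \, W_i,$$

where $D_i(x)$ is the diagonal Jacobian of the activation function at the preactivations $W_i h_i(x)$ (a 0/1 diagonal matrix for ReLU). We now instead get

$$W_i^T W_i \propto W_i^T \left( \frac{1}{n} \sum_{p=1}^{n} D_i(x_p) \, \nabla f_{i+1}(h_{i+1}(x_p)) \, \nabla f_{i+1}(h_{i+1}(x_p))^T D_i(x_p) \right) W_i.$$

This should be the same as

$$\frac{1}{n} \sum_{p=1}^{n} D_i(x_p) \, \nabla f_{i+1}(h_{i+1}(x_p)) \, \nabla f_{i+1}(h_{i+1}(x_p))^T D_i(x_p) \propto I,$$

which, with $z_{i+1}(x) = W_i h_i(x)$ denoting preactivations in layer $i+1$ and $g_{i+1}$ denoting the function from these preactivations to the output, is the same as

$$\frac{1}{n} \sum_{p=1}^{n} \nabla g_{i+1}(z_{i+1}(x_p)) \, \nabla g_{i+1}(z_{i+1}(x_p))^T \propto I.$$

This last thing also totally works with activation functions other than ReLU; one can get this directly from the Jacobian calculation. I made the ReLU assumption earlier because I thought for a bit that one can get something further in that case; I no longer think this, but I won’t go back and clean up the presentation atm.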
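To double-check the chain-rule bookkeeping, here is a small numerical sketch on a made-up two-hidden-layer network with random weights. Everything in it is an assumption for illustration: the widths, the random data, and the choice of `tanh` (picked so that the finite-difference check has no kinks; the same identity holds for ReLU away from its kinks). It verifies that the average gradient outer product at the input equals $W_1^T (\cdot) W_1$ applied to the second-moment matrix of preactivation gradients, and reports how far those preactivation gradients are from isotropic position in this random, untrained network.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k2, k3, n = 5, 6, 7, 200  # hypothetical input dim, hidden widths, sample count

W1 = rng.normal(size=(k2, d)) / np.sqrt(d)
W2 = rng.normal(size=(k3, k2)) / np.sqrt(k2)
a = rng.normal(size=k3)
X = rng.normal(size=(n, d))

def act(z):
    return np.tanh(z)

def act_grad(z):
    return 1.0 - np.tanh(z) ** 2

def f(x):
    """Toy network: h1 = x, h2 = act(W1 h1), h3 = act(W2 h2), output = a . h3."""
    return a @ act(W2 @ act(W1 @ x))

def grad_wrt_preact(x):
    """Hand-written backprop: gradient of the output wrt z2 = W1 x."""
    z2 = W1 @ x
    z3 = W2 @ act(z2)
    grad_h2 = W2.T @ (act_grad(z3) * a)   # gradient wrt the activations h2
    return act_grad(z2) * grad_h2         # apply the diagonal activation Jacobian

def grad_wrt_input_fd(x, eps=1e-5):
    """Central finite-difference gradient of the output wrt the input h1 = x."""
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

G_pre = np.stack([grad_wrt_preact(x) for x in X])    # shape (n, k2)
G_in = np.stack([grad_wrt_input_fd(x) for x in X])   # shape (n, d)

agop_in = G_in.T @ G_in / n       # RHS of the ansatz at layer 1
agop_pre = G_pre.T @ G_pre / n    # second moment of the preactivation gradients
identity_gap = np.abs(agop_in - W1.T @ agop_pre @ W1).max()

# Ratio of extreme eigenvalues; 1.0 would mean exactly isotropic position.
eig = np.linalg.eigvalsh(agop_pre)
anisotropy = eig.max() / eig.min()

print("max abs gap in the chain-rule identity:", identity_gap)
print("anisotropy of preactivation gradients: ", anisotropy)
```

For a random network there is no reason to expect the anisotropy to be near 1; the interesting experiment would be to track this quantity in trained models.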

Anyway, a takeaway is that the Deep Neural Feature Ansatz is equivalent to the (imo cleaner) ansatz that the set of gradients of the output wrt the pre-activations of any layer is close to being a tight frame (in other words, the gradients are in isotropic position; in other words still, the data matrix of the gradients is a constant times a semi-orthogonal matrix). (Note that the closeness one immediately gets isn’t in to a tight frame, it’s just in the quantity defining the tightness of a frame, but I’d guess that if it matters, one can also conclude some kind of closeness in from this (related).) This seems like a nicer fundamental condition because (1) we’ve intuitively canceled terms and (2) it now looks like a generic-ish condition, looks less mysterious, though idk how to argue for this beyond some handwaving about genericness, about other stuff being independent, sth like that.

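proof of the tight frame claim from the previous condition: Note that $\frac{1}{n} \sum_{p} \nabla g_{i+1}(z_{i+1}(x_p)) \, \nabla g_{i+1}(z_{i+1}(x_p))^T \propto I$ (the condition on the preactivation gradients) clearly implies that the mass in any direction is the same, but also the mass being the same in any direction implies the above (because then, letting the SVD of the matrix with these gradients in its columns be $G = U \Sigma V^T$, the above is $\frac{1}{n} G G^T = \frac{1}{n} U \Sigma V^T V \Sigma^T U^T = \frac{1}{n} U \Sigma \Sigma^T U^T \propto U U^T = I$, where we used the fact that $\Sigma \Sigma^T \propto I$, since equal mass in every direction forces all the singular values to be equal).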
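The same argument can be checked numerically. In this sketch the dimension `k`, the number of vectors `n`, and the frame constant `c` are arbitrary illustrative choices, and the matrix `G` is built by hand to have equal singular values; it is not derived from any network.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n, c = 3, 10, 2.5  # hypothetical dimension, number of vectors, frame constant

# Build a k x n matrix G whose singular values all equal sqrt(c).
U, _ = np.linalg.qr(rng.normal(size=(k, k)))
V, _ = np.linalg.qr(rng.normal(size=(n, n)))
G = np.sqrt(c) * U @ V[:, :k].T

# Sum of outer products of the columns of G (the frame operator).
frame_operator = G @ G.T

# Mass of the columns along random unit directions v: sum_j (v . g_j)^2.
dirs = rng.normal(size=(1000, k))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
masses = ((dirs @ G) ** 2).sum(axis=1)

print("frame operator:\n", frame_operator)
print("spread of masses over directions:", masses.max() - masses.min())
```

Equal singular values give a frame operator equal to `c` times the identity, and the mass is then the same constant `c` along every direction.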

## Some questions

Can one come up with some similar ansatz identity for the left singular vectors of $W_i$? One point of tension/interest here is that an ansatz identity for $W_i W_i^T$ would constrain the left singular vectors of $W_i$ together with its singular values, but the singular values are constrained already by the deep neural feature ansatz. So if there were another identity for $W_i W_i^T$ in terms of some gradients, we’d get a derived identity from equality between the singular values defined in terms of those gradients and the singular values defined in terms of the Deep Neural Feature Ansatz. Or actually, there probably won’t be an interesting identity here since given the cancellation above, it now feels like nothing about $W_i$ is really pinned down by ‘gradients independent of $W_i$’ by the DNFA? Of course, some $W_i$-dependence remains even in the gradients because the preactivations at which further gradients get evaluated are somewhat $W_i$-dependent, so I guess it’s not ruled out that the DNFA constrains something interesting about $W_i$? But anyway, all this seems to undermine the interestingness of the DNFA, as well as the chance of there being an interesting similar ansatz for the left singular vectors of $W_i$.

Can one heuristically motivate that the preactivation gradients above should indeed be close to being in isotropic position? Can one use this reduction to provide simpler proofs of some of the propositions in the paper which say that the DNFA is exactly true in certain very toy cases?

The authors claim that the DNFA is supposed to somehow elucidate feature learning (indeed, they claim it is a mechanism of feature learning?). I take ‘feature learning’ to mean something like which neuronal functions (from the input) are created or which functions are computed in a layer in some broader sense (maybe which things are made linearly readable?) or which directions in an activation space to amplify or maybe less precisely just the process of some internal functions (from the input to internal activations) being learned, or something like that, which happens in finite networks apparently in contrast to infinitely wide networks or NTK models or something like that which I haven’t yet understood? I understand that their heuristic identity on the surface connects something about a weight matrix to something about gradients, but assuming I’ve not made some index-off-by-one error or something, it seems to probably not really be about that at all, since the weight matrix sorta cancels out: if it’s true for one $W_i$, it would maybe also be true with any other $W_i'$ replacing it, so it doesn’t really pin down $W_i$? (This might turn out to be false if the isotropy of preactivation gradients is only true for a very particular choice of $W_i$.) But like, ignoring that counter, I guess their point is that the directions which get stretched most by the weight matrix in a layer are the directions along which it would be the best to move locally in that activation space to affect the output? (They don’t explain it this way though; maybe I’m ignorant of some other meaning having been attributed to $W_i^T W_i$ in previous literature or something.) But they say “Informally, this mechanism corresponds to the approach of progressively re-weighting features in proportion to the influence they have on the predictions.”
I guess maybe this is an appropriate description of the math if they are talking about reweighting in the purely linear sense, and they take features in the input layer to be scaleless objects or something? (Like, if we take features in the input activation space to each have some associated scale, then the right singular vector identity no longer says that most influential features get stretched the most.) I wish they were much more precise here, or if there isn’t a precise interesting philosophical thing to be deduced from their math, much more honest about that, much less PR-y.

So, in brief, instead of “informally, this mechanism corresponds to the approach of progressively re-weighting features in proportion to the influence they have on the predictions,” it seems to me that what the math warrants would be sth more like “The weight matrix reweights stuff; after reweighting, the activation space is roughly isotropic wrt affecting the prediction (ansatz); so, the stuff that got the highest weight has most effect on the prediction now.” I’m not that happy with this last statement either, but atm it seems much more appropriate than their claim.

I guess if I’m not confused about something major here (plausibly I am), one could probably add 1000 experiments (e.g. checking that the isotropic version of the ansatz indeed equally holds in a bunch of models) and write a paper responding to them. If you’re reading this and this seems interesting to you, feel free to do that — I’m also probably happy to talk to you about the paper.

## typos in the paper

indexing error in the first displaymath in Sec 2: it probably should say ″, not ″

# A thread into which I’ll occasionally post notes on some ML(?) papers I’m reading

I think the world would probably be much better if everyone made a bunch more of their notes public. I intend to occasionally copy some personal notes on ML(?) papers into this thread. While I hope that the notes which I’ll end up selecting for being posted here will be of interest to some people, and that people will sometimes comment with their thoughts on the same paper and on my thoughts (please do tell me how I’m wrong, etc.), I expect that the notes here will not be significantly more polished than typical notes I write for myself and my reasoning will be suboptimal; also, I expect most of these notes won’t really make sense unless you’re also familiar with the paper — the notes will typically be companions to the paper, not substitutes.

I expect I’ll sometimes be meaner than some norm somewhere in these notes (in fact, I expect I’ll sometimes be simultaneously mean and wrong/confused — exciting!), but I should just say to clarify that I think almost all ML papers/posts/notes are trash, so me being mean to a particular paper might not be evidence that I think it’s worse than some average. If anything, the papers I post notes about had something worth thinking/writing about at all, which seems like a good thing! In particular, they probably contained at least one interesting idea!

So, anyway: I’m warning you that the notes in this thread will be messy and not self-contained, and telling you that reading them might not be a good use of your time :)

# A starting point for making sense of task structure (in machine learning)

I’d be very interested in a concrete construction of a (mathematical) universe in which, in some reasonable sense that remains to be made precise, two ‘orthogonal pattern-universes’ (preferably each containing ‘agents’ or ‘sophisticated computational systems’) live on ‘the same fundamental substrate’. One of the many reasons I’m struggling to make this precise is that I want there to be some condition which meaningfully rules out trivial constructions in which the low-level specification of such a universe can be decomposed into a pair $(x, y)$ such that $x$ and $y$ are ‘independent’, everything in the first pattern-universe is a function only of $x$, and everything in the second pattern-universe is a function only of $y$. (Of course, I’d also be happy with an explanation why this is a bad question :).)

I find [the use of square brackets to show the merge structure of [a linguistic entity that might otherwise be confusing to parse]] delightful :)

# Toward A Mathematical Framework for Computation in Superposition

I’d be quite interested in elaboration on getting faster alignment researchers not being alignment-hard — it currently seems likely to me that a research community of unupgraded alignment researchers with a hundred years is capable of solving alignment (conditional on alignment being solvable). (And having faster general researchers, a goal that seems roughly equivalent, is surely alignment-hard (again, conditional on alignment being solvable), because we can then get the researchers to quickly do whatever it is that we could do — e.g., upgrading?)

# Grokking, memorization, and generalization — a discussion

I was just claiming that your description of pivotal acts / of people that support pivotal acts was incorrect in a way that people that think pivotal acts are worth considering would consider very significant and in a way that significantly reduces the power of your argument as applying to what people mean by pivotal acts — I don’t see anything in your comment as a response to that claim. I would like it to be a separate discussion whether pivotal acts are a good idea with this in mind.

Now, in this separate discussion: I agree that executing a pivotal act with just a narrow, safe, superintelligence is a difficult problem. That said, all paths to a state of safety from AGI that I can think of seem to contain difficult steps, so I think a more fine-grained analysis of the difficulty of various steps would be needed. I broadly agree with your description of the political character of pivotal acts, but I disagree with what you claim about associated race dynamics — it seems plausible to me that if pivotal acts became the main paradigm, then we’d have a world in which a majority of relevant people are willing to cooperate / do not want to race that much against others in the majority, and it’d mostly be a race between this group and e/acc types. I would also add, though, that the kinds of governance solutions/mechanisms I can think of that are sufficient to (for instance) make it impossible to perform distributed training runs on consumer devices also seem quite authoritarian.

In this comment, I will be assuming that you intended to talk of “pivotal acts” in the standard (distribution of) sense(s) people use the term — if your comment is better described as using a different definition of “pivotal act”, including when “pivotal act” is used by the people in the dialogue you present, then my present comment applies less.

I think that this is a significant mischaracterization of what most (? or definitely at least a substantial fraction of) pivotal activists mean by “pivotal act” (in particular, I think this is a significant mischaracterization of what Yudkowsky has in mind). (I think the original post also uses the term “pivotal act” in a somewhat non-standard way in a similar direction, but to a much lesser degree.) Specifically, I think it is false that the primary kinds of plans this fraction of people have in mind when talking about pivotal acts involve creating a superintelligent nigh-omnipotent infallible FOOMed properly aligned ASI. Instead, the kind of person I have in mind is very interested in coming up with pivotal acts that do not use a general superintelligence, often looking for pivotal acts that use a narrow superintelligence (for instance, a narrow nanoengineer) (though this is also often considered very difficult by such people (which is one of the reasons they’re often so doomy)). See, for instance, the discussion of pivotal acts in https://www.lesswrong.com/posts/7im8at9PmhbT4JHsW/ngo-and-yudkowsky-on-alignment-difficulty.

To clarify, I think in this context I’ve only said that the claim “The minimax regret rule (sec 5.4.2 of Bradley (2012)) is equivalent to EV max w.r.t. the distribution in your representor that induces maximum regret” (and maybe the claim after it) was “false/nonsense” — in particular, because it doesn’t make sense to talk about a distribution that induces maximum regret (without reference to a particular action) — which I’m guessing you agree with.

I wanted to say that I endorse the following:

Neither of the two decision rules you mentioned is (in general) consistent with any EV max if we conceive of it as giving your preferences (not just picking out a best option), nor if we conceive of it as telling you what to do on each step of a sequential decision-making setup.

I think basically any setup is an example for either of these claims. Here’s a canonical counterexample for the version with preferences and the max_{actions} min_{probability distributions} EV (i.e., infrabayes) decision rule, i.e. with our preferences corresponding to the min_{probability distributions} EV ranking:

Let a and c be actions and let b be flipping a fair coin and then doing a or c depending on the outcome. It is easy to construct a case where the max-min rule strictly prefers b to a and also strictly prefers b to c, and indeed where this preference is strong enough that the rule still strictly prefers b to a small enough sweetening of a and also still prefers b to a small enough sweetening of c (in fact, a generic setup will have such a triple). Call these sweetenings a+ and c+ (think of these as a-but-you-also-get-one-cent or a-but-you-also-get-one-extra-moment-of-happiness or whatever; the important thing is that all utility functions under consideration should consider this one cent or one extra moment of happiness or whatever a positive). However, every EV max rule (that cares about the one cent) will strictly disprefer b to at least one of a+ or c+, because if that weren’t the case, the EV max rule would need to weakly prefer b over a coinflip between a+ and c+, but this is just saying that the EV max rule weakly prefers b to b+, which contradicts its caring about sweetening. So these min preferences are incompatible with maximizing any EV.
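A concrete instance of such a triple, as a sketch: the two-state world, the payoffs, the sweetening `eps`, and the representor (probability of the first state anywhere in $[0.25, 0.75]$) are all assumptions chosen for illustration. The check against EV maximizers below only varies the probability distribution while keeping this one utility function fixed; the argument in the text covers any utility function that values the sweetening.

```python
import numpy as np

# Two states; a pays 1 in state 0, c pays 1 in state 1 (hypothetical payoffs).
u_a = np.array([1.0, 0.0])
u_c = np.array([0.0, 1.0])
u_b = 0.5 * u_a + 0.5 * u_c          # fair coin flip between a and c
eps = 0.1                            # the sweetening
u_a_plus, u_c_plus = u_a + eps, u_c + eps

# Assumed representor: P(state 0) ranges over [0.25, 0.75].
representor = [np.array([p, 1.0 - p]) for p in np.linspace(0.25, 0.75, 51)]

def min_ev(u):
    """Worst-case expected utility over the representor (the max-min ranking)."""
    return min(float(p @ u) for p in representor)

# The max-min rule strictly prefers b to both sweetened actions.
maxmin_prefers_b = min_ev(u_b) > min_ev(u_a_plus) and min_ev(u_b) > min_ev(u_c_plus)

# Every single distribution over the two states ranks a+ or c+ strictly above b.
all_priors = [np.array([p, 1.0 - p]) for p in np.linspace(0.0, 1.0, 1001)]
ev_max_disagrees = all(
    max(float(p @ u_a_plus), float(p @ u_c_plus)) > float(p @ u_b) for p in all_priors
)

print("max-min values (b, a+, c+):", min_ev(u_b), min_ev(u_a_plus), min_ev(u_c_plus))
print("max-min strictly prefers b to a+ and c+:", maxmin_prefers_b)
print("every prior puts a+ or c+ above b:", ev_max_disagrees)
```

Both flags coming out true is exactly the incompatibility: no single prior reproduces the max-min ranking of b above both a+ and c+.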

There is a canonical way in which a counterexample in preference-land can be turned into a counterexample in sequential-decision-making-land: just make the “sequential” setup really just be a two-step game where you first randomly pick a pair of actions to give the agent a choice between, and then the agent makes some choice. The game forces the max min agent to “reveal its preferences” sufficiently for its policy to be revealed to be inconsistent with EV maxing. (This is easiest to see if the agent is forced to just make a binary choice. But it’s still true even if you avoid the strictly binary choice being forced upon the agent by saying that the agent still has access to (internal) randomization.)

Regarding the Thornley paper you link: I’ve said some stuff about it in my earlier comments; my best guess for what to do next would be to prove some theorem about behavior that doesn’t make explicit use of a completeness assumption, but also it seems likely that this would fail to relate sufficiently to our central disagreements to be worthwhile. I guess I’m generally feeling like I might bow out of this written conversation soon/now, sorry! But I’d be happy to talk more about this synchronously — if you’d like to schedule a meeting, feel free to message me on the LW messenger.