Independent AI Safety Researcher. Previously SERI MATS scholar and FHI Senior Research Scholar. Before that, pure math in academia at Cambridge, UW, MIT.
Spencer Becker-Kahn
It could also work here. But I do feel like pointing out that the bounty format has other drawbacks. Maybe it works better when you want a variety of bite-size contributions, like various different proposals? I probably wouldn’t do work like Abram proposes—quite a long and difficult project, I expect—for the chance of winning a prize, particularly if the winner(s) were decided by someone’s subjective judgement.
I agree, i.e. I also (fairly weakly) disagree with the value of thinking of ‘distilling’ as a separate thing. Part of me wants to conjecture that it comes from thinking of alignment work predominantly as mathematics or a hard science, in which the standard ‘unit’ is an original theorem or result which might be poorly written up but can’t really be argued against much. But if we think of the area (I’m thinking predominantly about more conceptual/theoretical alignment) as a ‘softer’, messier, ongoing discourse, full of different arguments from different viewpoints and under different assumptions, with counter-arguments, rejoinders, clarifications, retractions etc., that takes place across blogs, papers, talks, theorems, experiments etc., and that all somehow slowly works to produce progress, then it starts to be less clear what this special activity called ‘distilling’ really is.
Another relevant point, but one which I won’t bother trying to expand on much here, is that a research community assimilating—and then eventually building on—complex ideas can take a really long time.
[At risk of extending into a rant, I also just think the term is a bit off-putting. Sure, I can get the sense of what it means from the word and the way it is used—it’s not completely opaque or anything—but I’d not heard it used regularly in this way until I started looking at the alignment forum. What’s really so special about alignment that we need to use this word? Do we think we have figured out some new secret activity that is useful for intellectual progress that other fields haven’t figured out? Can we not get by using words like “writing” and “teaching” and “explaining”?]
I broadly agree with Richard’s main point, but I also do agree with this comment in the sense that I am not confident that the example of Turing compared with e.g. Einstein is completely fair/accurate.
One thing I would say in response to your comment, Adam, is that I don’t see the message of your linked post as being incompatible with Richard’s main point. I think one usually does have, or does need, productive mistakes that don’t necessarily or obviously look like robust partial progress. But still, when there actually is a breakthrough, I think it can be important to look for this “intuitively compelling” explanation. So one thing I have in mind is that it’s usually good to be skeptical if a claimed breakthrough seems to just ‘fall out’ of a bunch of partial work without there being a compelling explanation after the fact.
Yes I think you understood me correctly. In which case I think we more or less agree in the sense that I also think it may not be productive to use Richard’s heuristic as a criterion for which research directions to actually pursue.
Hi Ryan, do you still plan for results to come out by May 27? And for those who are successful, does the next stage start June 6th, etc.? (That’s what it says in the FAQ on the website still.)
Hi there,
Given that you’ve described various ‘primarily conceptual’ projects on the Alignment Team, and given the distinction between Scientists and Engineers, one aspect that I’m unsure about is roughly: would you expect a Research Scientist on the Alignment Team to necessarily have a minimum level of practical ML knowledge? Are you able to say any more about that? E.g. would they have to pass a general DeepMind coding test or something like that?
Thanks Rohin!
Thanks for the comments and pointers!
This was pretty interesting and I like the general direction that the analysis goes in. I feel it ought to be pointed out that what is referred to here as the key result is a standard fact in differential geometry called (something like) the submersion theorem, which in turn is essentially an application of the implicit function theorem.
I think that your setup is essentially that there is an $n$-dimensional parameter space, let’s call it $\Theta$ say, and then for each element $x_i$ of the training set, we can consider the function $f_i : \Theta \to \mathbb{R}$ which takes in a set of parameters $\theta$ (i.e. a model) and outputs whatever the model does on training data point $x_i$. We are thinking of both $\Theta$ and $\mathbb{R}$ as smooth (or at least sufficiently differentiable) spaces (I take it).
A contour plane is a level set of one of the $f_i$, i.e. a set of the form

$$f_i^{-1}(c) = \{\theta \in \Theta : f_i(\theta) = c\}$$

for some $c \in \mathbb{R}$ and some $i$. A behavior manifold is a set of the form

$$\bigcap_{i=1}^{k} f_i^{-1}(c_i)$$

for some $(c_1, \dots, c_k) \in \mathbb{R}^k$.
A more concise way of viewing this is to define a single function $F = (f_1, \dots, f_k) : \Theta \to \mathbb{R}^k$; then a behavior manifold is simply a level set of this function. The map $F$ is a submersion at $\theta$ if the Jacobian matrix at $\theta$ is a surjective linear map. The Jacobian matrix is what you call $G$, I think (because the Jacobian is formed with each row equal to a gradient vector with respect to one of the output coordinates). It doesn’t matter much, because what matters for checking surjectivity is the rank. Then the standard result implies that given $c \in \mathbb{R}^k$, if $F$ is a submersion in a neighbourhood of a point $\theta \in F^{-1}(c)$, then $F^{-1}(c)$ is a smooth $(n-k)$-dimensional submanifold in a neighbourhood of $\theta$.
Essentially, in a neighbourhood of a point at which the Jacobian of $F$ has full rank, the level set through that point is an $(n-k)$-dimensional smooth submanifold.
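As a concrete toy illustration of the statement (my own example, not from the post), take $n = 2$ parameters and $k = 1$ output:

```latex
% F : \mathbb{R}^2 \to \mathbb{R} with
F(\theta_1, \theta_2) = \theta_1^2 + \theta_2^2, \qquad
DF(\theta) = \begin{pmatrix} 2\theta_1 & 2\theta_2 \end{pmatrix}.
% DF has full rank 1 everywhere except the origin, so for c > 0 the level set
F^{-1}(c) = \{\, \theta : \theta_1^2 + \theta_2^2 = c \,\}
% is a smooth (n - k) = 1-dimensional submanifold (a circle), whereas at c = 0
% the submersion hypothesis fails and F^{-1}(0) degenerates to a single point.
```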
Then, yes, you could get onto studying in more detail the degeneracy when the Jacobian does not have full rank. But I think you would need to be careful when you get to claim 3. The connection between loss and behavior is not spelled out in enough detail: behaviour can change while the loss remains constant, right? And more generally, in exactly which directions do the implications go? Depending on exactly what you are trying to establish, this could actually be a bit of a ‘tip of the iceberg’ situation. (The study of this sort of thing goes rather deep; Vladimir Arnold et al. wrote in their 1998 book: “The theory of singularities of smooth maps is an apparatus for the study of abrupt, jump-like phenomena—bifurcations, perestroikas (restructurings), catastrophes, metamorphoses—which occur in systems depending on parameters when the parameters vary in a smooth manner”.)
Similarly, when you say things like “Low rank $G$ indicates information loss”, I think some care is needed, because the paragraphs that follow seem to be getting at something more like: if there is a certain kind of information loss in the early layers of the network, then this leads to low rank $G$. It doesn’t seem clear that low rank $G$ is necessarily indicative of information loss?
Thanks for the substantive reply.
First some more specific/detailed comments: Regarding the relationship with the loss and with the Hessian of the loss, my concern sort of stems from the fact that the domains/codomains are different, and so I think it deserves to be spelled out. The loss of a model with parameters $\theta$ can be described by introducing the actual function $\ell$ that maps the behavior to the real numbers, right? I.e. given some actual function $\ell : \mathbb{R}^k \to \mathbb{R}$ we have:

$$L(\theta) = \ell(F(\theta)),$$

i.e. it’s $\ell$ that might be something like MSE, but the function $L = \ell \circ F$ is of course more mysterious because it includes the way that parameters are actually mapped to a working model. Anyway, to perform some computations with this, we are looking at an expression like

$$\ell(F(\theta)).$$

We want to differentiate this twice with respect to $\theta$, essentially. Firstly, we have

$$\nabla L(\theta) = DF(\theta)^{\top}\, \nabla \ell(F(\theta)),$$

where—just to keep track of this—we’ve got:

$$\nabla L(\theta) \in \mathbb{R}^n, \qquad DF(\theta)^{\top} \in \mathbb{R}^{n \times k}, \qquad \nabla \ell(F(\theta)) \in \mathbb{R}^k.$$

Or, using ‘coordinates’ to make it explicit:

$$\frac{\partial L}{\partial \theta_j}(\theta) = \sum_{i=1}^{k} \frac{\partial \ell}{\partial y_i}(F(\theta))\, \frac{\partial F_i}{\partial \theta_j}(\theta)$$

for $j = 1, \dots, n$. Then for the Hessian we differentiate again:

$$\frac{\partial^2 L}{\partial \theta_l \partial \theta_j}(\theta) = \sum_{i,m=1}^{k} \frac{\partial^2 \ell}{\partial y_m \partial y_i}(F(\theta))\, \frac{\partial F_m}{\partial \theta_l}(\theta)\, \frac{\partial F_i}{\partial \theta_j}(\theta) + \sum_{i=1}^{k} \frac{\partial \ell}{\partial y_i}(F(\theta))\, \frac{\partial^2 F_i}{\partial \theta_l \partial \theta_j}(\theta).$$

Or,

$$\operatorname{Hess}(L)(\theta) = DF(\theta)^{\top}\, \operatorname{Hess}(\ell)(F(\theta))\, DF(\theta) + \sum_{i=1}^{k} \frac{\partial \ell}{\partial y_i}(F(\theta))\, \operatorname{Hess}(F_i)(\theta).$$

This is now at the level of matrices. Avoiding getting into any depth about tensors and indices, the term $\operatorname{Hess}(F)$ is basically a tensor-type object (a stack of $k$ matrices, each $n \times n$), and it’s paired with $\nabla \ell(F(\theta))$, which is a vector in $\mathbb{R}^k$, to give something that is $n \times n$.
So what I think you are saying now is that if we are at a local minimum of $\ell$, then the second term on the right-hand side vanishes (because that term includes the first derivatives of $\ell$, which are zero at a minimum). You can see however that if the Hessian of $\ell$ is not a multiple of the identity (like it would be for MSE), then the claimed relationship does not hold, i.e. it is not the case that in general, at a minimum of the loss, the Hessian of the loss is equal to a constant times $G^{\top}G$. So maybe you really do want to explicitly assume something like MSE.
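For what it’s worth, here’s a quick numerical sanity check of the matrix identity above (a sketch assuming JAX is available; the particular choices of $F$ and $\ell$ are arbitrary toy functions of mine, not from your post):

```python
import jax
import jax.numpy as jnp

def F(theta):
    # Toy behavior map Theta -> R^k (here n = 3 parameters, k = 2 outputs).
    return jnp.array([jnp.sin(theta[0]) * theta[1], theta[2] ** 2 + theta[0]])

def ell(y):
    # Toy loss on behavior space; deliberately *not* MSE, so Hess(ell) != c*I.
    return jnp.exp(y[0]) + y[0] * y[1] ** 2

L = lambda theta: ell(F(theta))

theta = jnp.array([0.3, -1.2, 0.7])

J = jax.jacobian(F)(theta)        # k x n Jacobian of F, i.e. "G"
Hl = jax.hessian(ell)(F(theta))   # k x k Hessian of ell at the behavior
gl = jax.grad(ell)(F(theta))      # first derivatives of ell (a k-vector)
HF = jax.hessian(F)(theta)        # k x n x n stack of Hessians of the F_i

lhs = jax.hessian(L)(theta)
rhs = J.T @ Hl @ J + jnp.einsum("i,ijk->jk", gl, HF)
print(jnp.allclose(lhs, rhs, atol=1e-4))  # expect: True
```

Dropping the second term is exactly what requires the first derivatives of $\ell$ to vanish.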
I agree that assuming MSE, and looking at a local minimum, you have $\operatorname{Hess}(L) = c\,G^{\top}G$ for some constant $c > 0$. (In case it’s of interest to anyone, googling turned up this recent paper https://openreview.net/forum?id=otDgw7LM7Nn which studies pretty much exactly the problem of bounding the rank of the Hessian of the loss. They say: “Flatness: A growing number of works [59–61] correlate the choice of regularizers, optimizers, or hyperparameters, with the additional flatness brought about by them at the minimum. However, the significant rank degeneracy of the Hessian, which we have provably established, also points to another source of flatness — that exists as a virtue of the compositional model structure —from the initialization itself. Thus, a prospective avenue of future work would be to compare different architectures based on this inherent kind of flatness.”)
Some broader remarks: I think these are nice observations, but unfortunately I’m generally a bit unclear about what else you might get out of going along these lines. I don’t want to sound harsh, but just to be plain: this is mostly because, as we can see, the mathematical part of what you have said consists of very simple, well-established facts about smooth functions, and so it would be surprising (to me at least) if some non-trivial observation about deep learning came out of it.

In a similar vein, regarding the “cause” of low-rank $G$: I do think that one could try to bring in a notion of “information loss” in neural networks, but for it to be substantive one needs to be careful that it’s not simply a rephrasing of what it means for the Jacobian to have less-than-full rank. Being a bit loose/informal now, to illustrate: just imagine for a moment a real-valued function on an interval. I could say it ‘loses information’ where its values cannot distinguish between a subset of points. But this is almost the same as just saying it is constant on some subset... which is of course very close to just saying the derivative vanishes on some subset. Here, if you describe the phenomenon of information loss as concretely as being the situation where some inputs can’t be distinguished, then (particularly given that you have to assume these spaces are actually some kind of smooth/differentiable spaces to do the theoretical analysis), you’ve more or less built into your description of information loss something that looks a lot like the function being constant along some directions, which means there is a vector in the kernel of the Jacobian. I don’t think it’s somehow incorrect to point to this, but it becomes more like just saying ‘perhaps one useful definition of information loss is low-rank $G$’ as opposed to linking one phenomenon to the other.
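Just to make that last step explicit (a one-line derivation, nothing deep):

```latex
% If f is constant along a direction v, then f(\theta + t v) = f(\theta) for all t,
% and differentiating in t at t = 0 gives, by the chain rule,
\left.\frac{d}{dt}\right|_{t=0} f(\theta + t v) = Df(\theta)\, v = 0,
% i.e. v lies in the kernel of the Jacobian.
```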
Sorry for the very long remarks. Of course this is actually because I found it well worth engaging with. And I have a longer-standing personal interest in zero sets of smooth functions!
Thanks again for the reply.
In my notation, things like $\nabla \ell$ or $\operatorname{Hess}(\ell)$ are functions in and of themselves. The function $\nabla \ell$ evaluates to zero at local minima of $\ell$.
In my notation, there isn’t any such thing as .
But look, I think that this is perhaps getting a little too bogged down for me to want to try to neatly resolve in the comment section, and I expect to be away from work for the next few days so may not check back for a while. Personally, I would just recommend going back and slowly going through the mathematical details again, checking every step at the lowest level of detail that you can and using the notation that makes most sense to you.
I’m a little sheepish about trying to make a useful contribution to this discussion without spending a lot of time thinking things through but I’ll give it a go anyway. There’s a fair amount that I agree with here, including that there is by now a lot of introductory resources. But regarding the following:
(I do think it’s possible to create a much better intro resource than any that exist today, but ‘we can do much better’ is compatible with ‘it’s shocking that the existing material hasn’t already finished the job’.)I feel like I want to ask: Do you really find it “shocking”? My experience with explaining things to more general audiences leaves me very much of the opinion that it is by default an incredibly slow and difficult process to get unusual, philosophical, mathematical, or especially technical ideas to permeate. I include ‘average ML engineer’ as something like a “more general audience” member relative to MIRI style AGI Alignment theory. I guess I haven’t thought it about it much but presumably there exist ideas/arguments that are way more mainstream, also very important, and with way more written about them that people still somehow, broadly speaking, don’t engage with or understand?
I also don’t really understand how the point that is being made in the quote from Inadequate Equilibria is supposed to work. Perhaps in the book more evidence is provided for when “the silence broke”, but the Hawking article was before the release of Superintelligencea and then the Musk tweet was after it and was reacting to it(!) .. So I guess I’m sticking up for AGI x-risk respectability politics a bit here because surely I might also use essentially this same anecdote to support the idea that boring old long-form academic writing that clearly lays things out in as rigorous a way as possible is actually more the root cause that moved the needle here? Even if it ultimately took the engagement of Musk’s off the cuff tweets, Gates, or journalists etc., they wouldn’t have had something respectable enough to bounce off had Bostrom not given them the book.
Thanks for the nice reply.
I do buy the explanations I listed in the OP (and other, complementary explanations, like the ones in Inadequate Equilibria), and I think they’re sufficient to ~fully make sense of what’s going on. So I don’t feel confused about the situation anymore. By “shocking” I meant something more like “calls for an explanation”, not “calls for an explanation, and I don’t have an explanation that feels adequate”. (With added overtones of “horrifying”.)
Yeah, OK, I think that helps clarify things for me.
As someone who was working at MIRI in 2014 and watched events unfolding, I think the Hawking article had a negligible impact and the Musk stuff had a huge impact. Eliezer might be wrong about why Hawking had so little impact, but I do think it didn’t do much.

Maybe we’re misunderstanding each other here. I don’t really doubt what you’re saying there, i.e. I am fully willing to believe that the Hawking thing had negligible impact and the Musk tweet had a lot. I’m more pointing to why Musk had a lot rather than why Hawking had little: trying to point out that since Musk was reacting to Superintelligence, one might ask whether he could have had a similar impact without Superintelligence. And so maybe the anecdote could be used as evidence that Superintelligence was really the thing that helped ‘break the silence’. However, Superintelligence feels way less like “being blunt” and “throwing a brick”, and—at least from the outside—looks way more like the “scripts, customs, and established protocols” of “normal science” (i.e. an Oxford philosophy professor writes a book with somewhat tricky ideas in it, published by OUP, reviewed by the NYT, etc.), and it clearly is an attempt to make unusual ideas sound “sober and serious”. So I’m kind of saying that maybe the story doesn’t necessarily argue against the possibility of doing further work like that—i.e. writing books that manage to stay respectable and manage to “speak accurately and concretely about the future of AI without sounding like a sci-fi weirdo”(?)
Thanks very much Geoffrey; glad you liked the post. And thanks for the interesting extra remarks.
I wrote out the Hessian computation in a comment on one of Vivek’s posts. I actually had a few concerns with his version, and I could be wrong, but I also think that there are some issues here. (My notation is slightly different because for me the sum over the dataset was included in the function I called “$\ell$”, but it doesn’t affect my main point.)
I think the most concrete thing is that the function $f$—i.e. the ‘input-output’ function of a neural network—should in general have a vector output, but you write expressions involving it without any further explanation or indices. In your main computation it seems like it’s being treated as a scalar.
Since we never change the labels or the dataset, one can drop the explicit dependence on them from our notation for $f$. Then if the network has $m$ neurons in the final layer, the codomain of the function $f$ is $\mathbb{R}^m$ (unless I’ve misunderstood what you are doing?). So to my mind we have:

$$f : \Theta \to \mathbb{R}^m.$$

Going through the computation in full using the chain rule (at a local minimum $\theta^*$ of the loss, where the term involving the first derivatives of $\ell$ drops out), one would get something like:

$$\operatorname{Hess}(L)(\theta^*) = Df(\theta^*)^{\top}\, \operatorname{Hess}(\ell)(f(\theta^*))\, Df(\theta^*).$$
Vivek wanted to suppose that $\operatorname{Hess}(\ell)$ were equal to the identity matrix, or a multiple thereof, which is the case for mean squared loss. But without such an assumption, I don’t think that the term

$$Df(\theta^*)^{\top}\, Df(\theta^*)$$

appears (this is the matrix you describe as containing “the inner products of the features over the training data set”).
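To make that concrete, here is a minimal illustration with toy numbers of my own (not from the post): take $m = 2$ and a loss whose Hessian is diagonal but not a multiple of the identity.

```latex
% Toy loss on behavior space R^2:
\ell(y_1, y_2) = y_1^2 + 2 y_2^2, \qquad
\operatorname{Hess}(\ell) = \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix}.
% At a local minimum \theta^* the factorization is then
\operatorname{Hess}(L)(\theta^*) =
  Df(\theta^*)^{\top} \begin{pmatrix} 2 & 0 \\ 0 & 4 \end{pmatrix} Df(\theta^*),
% which is not a constant multiple of Df^{\top} Df for generic Df.
```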
Another (probably more important but higher-level) issue is basically: what is your definition of ‘feature’? I could say: have you not essentially just defined ‘feature’ to be something like ‘an entry of $f$’? Is the example not too contrived in the sense that it clearly supposes that $f$ has a very special form (in particular, it is linear in the variables $\theta$, so that the derivatives are not functions of $\theta$)?
I’m not at liberty to share it directly but I am aware that Anthropic have a draft of small toy models with hand-coded synthetic data showing superposition very cleanly. They go as far as saying that searching for an interpretable basis may essentially be mistaken.
It’s an example computation for a network with scalar outputs, yes. The math should stay the same for multi-dimensional outputs though. You should just get higher dimensional tensors instead of matrices.
I’m sorry but the fact that it is scalar output isn’t explained and a network with a single neuron in the final layer is not the norm. More importantly, I am trying to explain that I think the math does not stay the same in the case where the network output is a vector (which is the usual situation in deep learning) and the loss is some unspecified function. If the network has vector output, then right after where you say “The Hessian matrix for this network would be...”, you don’t get a factorization like that; you can’t pull out the Hessian of the loss as a scalar, it instead acts in the way that I have written—like a bilinear form for the multiplication between the rows and columns of .
A feature to me is the same kind of thing it is to e.g. Chris Olah. It’s the function mapping network input to the activations of some neurons, or linear combination of neurons, in the network.
I’m not assuming that the function is linear in \Theta. If it was, this whole thing wouldn’t just be an approximation within second order Taylor expansion distance, it’d hold everywhere.
OK maybe I’ll try to avoid a debate about exactly what ‘feature’ means or means to different people, but in the example, you are clearly using . This is a linear function of the variables. (I said “Is the example not too contrived....in particular it is linear in ”—I’m not sure how we have misunderstood each other, perhaps you didn’t realise I meant this example as opposed to the whole post in general). But what it means is that in the next line when you write down the derivative with respect to , it is an unusually clean expression because it now doesn’t depend on So again, in the crucial equation right after you say “The Hessian matrix for this network would be...”, you in general get variables appearing in the matrix. It is just not as clean as this expression suggests in general.
You’re right about the loss thing; it isn’t as important as I first thought it might be.
I agree that this space may well miss important concepts and perspectives. As I say, it is not my suggestion to look at it, but rather just something that was implicitly being done in another post. The space you mention may well be a more natural one. (It’s of course the space of functions from the input space to the output space, and so a space in which ‘model space’ naturally sits in some sense.)
This post caught my eye, as my background is in mathematics and I was, in the not-too-distant past, excited about the idea of rigorous mathematical AI alignment work. My mind is still open to such work, but I’ll be honest: I’ve since become a bit less excited than I was. In particular, I definitely “bounced off” the existing write-ups on Infra-Bayesianism, and without already knowing what it’s all about, it’s not clear it’s worth one’s time. So, at the risk of making a basic or even cynical point: the remuneration of the proposed job could be important for getting attention and incentivising people who are on the fence.