AI notkilleveryoneism researcher
So you need a pretty strong argument that interp in particular is good for capabilities, which isn’t borne out empirically and also doesn’t seem that strong.
I think current interpretability has close to no capabilities externalities because it is not good yet, and delivers close to no insights into NN internals. If you had a good interpretability tool, which let you read off and understand e.g. how AlphaGo plays games to the extent that you could reimplement the algorithm by hand in C, and not need the NN anymore, then I would expect this to yield large capabilities externalities. This is the level of interpretability I aim for, and the level I think we need to make any serious progress on alignment.
If your interpretability tools cannot do things even remotely like this, I expect they are quite safe. But then I also don’t think they help much at all with alignment. There’s a roughly proportional relationship between your understanding of the network, and both your ability to align it and make it better, is what I’m saying. I doubt there’s many deep insights to be had that further the former without also furthering the latter. Maybe some insights further one a bit more than the other, but I doubt you’d be able to figure out which ones those are in advance. Often, I expect you’d only know years after the insight has been published and the field has figured out all of what can be done with it.
I think it’s all one tech tree, is what I’m saying. I don’t think neural network theory neatly decomposes into a “make strong AGI architecture” branch and an “aim AGI optimisation at a specific target” branch. Just like quantum mechanics doesn’t neatly decompose into a “make a nuclear bomb” branch and a “make a nuclear reactor” branch. In fact, in the case of NNs, I expect aiming strong optimisation is probably just straight up harder than creating strong optimisation.
By default, I think if anyone succeeds at solving alignment, they probably figured out most of what goes into making strong AGI along the way. Even just by accident. Because it’s lower in the tech tree.
I didn’t think Rob was necessarily implying that. I just tried to give some context to Quintin’s objection.
There are more papers and math in this broad vein (e.g. Mingard on SGD, singular learning theory), and I roughly buy the main thrust of their conclusions.
However, I think “randomly sample from the space of solutions with low combined complexity & calculation cost” doesn’t actually help us that much over a pure “randomly sample” when it comes to alignment.
It could mean that the relation between your network’s learned goals and the loss function is more straightforward than what you get with evolution=>human hardcoded brain stem=>human goals, since the latter likely has a far weaker simplicity bias in the first step than the network training does. But the second step, a human baby training on their brain stem loss signal, seems to remain a useful reference point for the amount of messiness we can expect. And it does not seem to me to be a comforting one. I, for one, don’t consider getting excellent visual cortex prediction scores a central terminal goal of mine.
Though I remain unsure of what to make of the specific one Quintin cites, which advances some more specific claims inside this broad category, and is based on results from a toy model with weird, binary NNs, using weird, non-standard activation functions.
Whether singular learning theory actually yields you anything useful when your optimiser converges to the largest singularity seems very much architecture-dependent though? If you fit a 175-billion-degree polynomial to the internet to do next token prediction, I think you’ll get out nonsense, not GPT-3. Because a broad solution in the polynomial landscape does not seem to equal a Kolmogorov-simple solution to the same degree it does with MLPs or transformers.
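As a toy sketch of the polynomial point (my own illustration, obviously nothing like an actual GPT training setup): fit an interpolating-degree polynomial through samples of the classic Runge function, and it oscillates into nonsense between the data points, while a lower-degree fit of the same data stays sane.

```python
import numpy as np

# Toy illustration: a polynomial with enough freedom to fit every data point
# exactly can still be nonsense between the points. "Fits the data" plus
# "broad in parameter space" does not buy you a simplicity bias.
runge = lambda x: 1.0 / (1.0 + 25.0 * x**2)  # classic Runge test function

x_train = np.linspace(-1.0, 1.0, 16)
coeffs = np.polyfit(x_train, runge(x_train), 15)  # interpolates all 16 points

x_dense = np.linspace(-1.0, 1.0, 401)
interp_max_err = np.max(np.abs(np.polyval(coeffs, x_dense) - runge(x_dense)))

# A much lower-capacity fit of the same data behaves far more sanely.
coeffs_low = np.polyfit(x_train, runge(x_train), 5)
low_max_err = np.max(np.abs(np.polyval(coeffs_low, x_dense) - runge(x_dense)))

print(interp_max_err, low_max_err)  # high-degree fit oscillates wildly near ±1
```

The interpolating fit sits in a perfectly good basin of the polynomial loss landscape; that broadness just doesn’t correspond to a Kolmogorov-simple solution, which is the asymmetry I mean.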
Likewise, there doesn’t seem to be anything saying you can’t have an architecture with an even better simplicity and speed bias than the MLP family has.
Best I’ve got is to go dark once it feels like you’re really getting somewhere, and only work with people under NDAs (honour based or actually legally binding) from there on out. At least a facsimile of proper security, central white lists of orgs and people considered trustworthy, central standard communication protocols with security levels set up to facilitate communication between alignment researchers. Maybe a forum system that isn’t on the public net. Live with the decrease in research efficiency this brings, and try to make it to the finish line in time anyway.
If some org or people would make it their job to start developing and trial running these measures right now, I think that’d be great. I think even today, some researchers might be enabled to collaborate more by this.
Very open to alternate solutions that don’t cost so much efficiency if anyone can think of any, but I’ve got squat.
Second paragraph is what I meant, thanks.
Seems like a slight variant on MIRI’s visible thoughts project?
The main problem with evaluating a hypothesis by KL divergence is that if you do this, your explanation looks bad in cases like the following:
I would take this as indication that the explanation is inadequate. If I said that the linear combination of nodes at layer l of a NN implements the function f, but in fact it implements f + g, where g does some other thing, my hypothesis was incorrect, and I’d want the metric to show that. If I haven’t even disentangled the mechanism I claim to have found from all the other surrounding circuits, I don’t think I get to say my hypothesis is doing a good job. Otherwise it seems like I have a lot of freedom to make up spurious hypotheses that claim whatever, and hide the inadequacies as “small random fluctuations” in the ablated test loss.
The reason to stamp it down to a one-dimensional quantity is that sometimes the phenomenon that we wanted to explain is the expectation of a one-dimensional quantity, and we don’t want to require that our tests explain things other than that particular quantity. For example, in an alignment context, I might want to understand why my model does well on the validation set, perhaps in the hope that if I understand why the model performs well, I’ll be able to predict whether it will generalize correctly onto a particular new distribution.
I don’t see how the dimensionality of the quantity you want to understand the generative mechanism of relates to the dimensionality of the comparison you would want to carry out to evaluate a proposed generative mechanism.
I want to understand how the model computes its outputs to get loss on distribution D1, so I can predict what loss it will get on another distribution D2. I make a hypothesis for what the mechanism is. The mechanism implies that doing an intervention on the network, say shifting some activation vector v to 2v, should not change behaviour, because the NN only cares about the direction of v, not its magnitude. If I then see that the intervention does shift output behaviour, even if it does not change the loss on net, my hypothesis was wrong. The magnitude of v does play a part in the network’s computations on D1. It has an influence on the output.
But that interference was random and understanding it won’t help you know if the mechanism that the model was using is going to generalize well to another distribution.
If it had no effect on how outputs for D1 are computed, then destroying it should not change behaviour on D1. So there should be no divergence between the original and ablated models’ outputs. If it did affect behaviour on D1, but not in ways that contribute net negatively or net positively to the accuracy on that particular distribution, it seems that you still want to know about it, because once you understand what it does, you might see that it will contribute net negatively or net positively to the model’s ability to do well on D2.
A heuristic that fires on some of D1, but doesn’t really help much, might turn out to be crucial for doing well on D2. A leftover memorised circuit that didn’t get cleaned up might add harmless “noise” on net on D1, but ruin generalisation to D2.
I would expect this to be reasonably common. A very general solution is probably overkill for a narrow sub dataset, containing many circuits that check for possible exception cases, but aren’t really necessary for that particular class of inputs. If you throw out everything that doesn’t do much to the loss on net, your explanations will miss the existence of these circuits, and you might wrongly conclude that the solution you are looking at is narrow and will not generalise.
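A minimal hand-built version of this failure mode (all of this is my own toy example): a circuit that is exactly inactive on the tested distribution, so ablating it is invisible in the loss there, even though it carries all the generalisation to the second distribution.

```python
import numpy as np

# Toy "network" computing |x| as x + 2*relu(-x). The relu branch is a
# circuit that is completely inactive on positive inputs.
relu = lambda z: np.maximum(z, 0.0)

full_model = lambda x: x + 2.0 * relu(-x)  # equals |x| everywhere
ablated_model = lambda x: x                # relu circuit thrown out

d1 = np.linspace(0.1, 5.0, 50)    # the distribution the hypothesis was tested on
d2 = np.linspace(-5.0, -0.1, 50)  # the distribution we want to generalise to
target = np.abs

mse = lambda f, xs: np.mean((f(xs) - target(xs)) ** 2)

print(mse(ablated_model, d1) - mse(full_model, d1))  # 0.0: ablation invisible on d1
print(mse(ablated_model, d2))                        # large: generalisation ruined
```

An explanation that throws out everything with no net loss effect on d1 would conclude this model is just the identity function, and be badly wrong about d2.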
Your suggestion of using KL divergence seems a useful improvement compared to most metrics. It’s, however, still possible that cancellation could occur. Cancellation is mostly due to aggregating over a metric (e.g., the mean) and less due to the specific metric used (although I could imagine that some metrics could allow for less ambiguity).
It’s not about KL divergence vs. some other loss function. It’s about using a one-dimensional summary of a high-dimensional comparison, instead of a one-dimensional comparison. There are many ways for two neural networks to both diverge from some training labels by the same average loss while spitting out very different outputs. There are tautologically no ways for two neural networks to have different output behaviour without having non-zero divergence in label assignment for at least some data points. Thus, it seems that you would want a metric that aggregates the divergence of the two networks’ outputs from each other, not a metric that compares their separate aggregated divergences from some unrelated data labels and so throws away most of the information.
A low-dimensional summary of a high-dimensional comparison between the networks seems fine(ish). A low-dimensional comparison between the networks based on the summaries of their separate comparisons to a third distribution throws away a lot of the relevant information.
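To make the cancellation point concrete, a tiny numerical sketch (my own example, mean squared error standing in for whatever loss CaSc uses): two models with identical average loss against the labels that nonetheless disagree on every single data point.

```python
import numpy as np

# Two models can have identical mean squared error against the labels while
# disagreeing on every data point. Comparing aggregated losses shows zero
# difference ("cancellation"); a direct output-vs-output comparison does not.
labels = np.zeros(6)
model_a = np.array([+1.0, -1.0, +1.0, -1.0, +1.0, -1.0])  # original model outputs
model_b = -model_a                                        # ablated model outputs

mse = lambda pred, target: np.mean((pred - target) ** 2)

loss_gap = abs(mse(model_a, labels) - mse(model_b, labels))  # 0.0: looks identical
output_divergence = mse(model_a, model_b)                    # 4.0: maximally different

print(loss_gap, output_divergence)
```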
CaSc can fail to reject a hypothesis if it is too unspecific and is extensionally equivalent to the true hypothesis.
Seems to me like this is easily resolved so long as you don’t screw up your bookkeeping. In your example, the hypothesis implicitly only makes a claim about the information going out of the bubble. So long as you always write down which nodes or layers of the network your hypothesis makes what claims about, I think this should be fine?
On the input-output level, we found that CaSc can fail to reject false hypotheses due to cancellation, i.e. because the task has a certain structural distribution that does not allow resampling to differentiate between different hypotheses.
I don’t know that much about CaSc, but why are you comparing the ablated graphs to the originals via their separate losses on the data in the first place? Stamping behaviour down into a one-dimensional quantity like that is inevitably going to make behavioural comparison difficult.
Wouldn’t you want to directly compare the divergence on outputs between the original graph and the ablated graph instead? The divergence between their output distributions over the data is the first thing that’d come to my mind. Or keeping whatever the original loss function is, but with the outputs of the original graph as the new ground-truth labels.
That’s still ad hocery of course, but it should at least take care of the failure mode you point out here. Is this really not part of current CaSc?
Thanks, I did not know this. A quick search for his images seems to show that they get colour and perspective right at least as well as this does. Provided this is fully real and there’s nobody else in his process choosing colours and such. Tentatively marking this down as a win for natural abstraction.
I’m honestly stunned by this. If it was indeed trained solely on text, how does it end up with such a good idea of how Euclidean space works? That’s either stupidly impressive, or a possible hint that the set of natural abstractions is even smaller and a bigger attractor in algorithm space than I thought. The labyrinth seems explicable, but the graphics?
Could a born blind human do this?
But have you ever, even once in your life, thought anything remotely like “I really like being able to predict the near-future content of my visual field. I should just sit in a dark room to maximize my visual cortex’s predictive accuracy.”?
Possibly yes. I could easily see this underlying human preferences for regular patterns in art. Predictable enough to get a high score, not so predictable that whatever secondary boredom mechanism that keeps baby humans from maximising score by staring straight at the ceiling all day kicks in. I’m even getting suspicious that this might be where confirmation bias comes from.
I think cases like human art preferences were exactly what Eliezer was thinking about when he gave this example prediction. “Solve in a particularly satisfying way”, or whatever he said exactly, was probably intended to point to the GPT equivalent of art-like preferences arising from a prediction loss function.
That separation between internal preferences and external behaviour is already implicit in Dutch books. Decision theory is about external behaviour, not internal representations. It talks about what agents do, not how agents work inside. To decision theorists, a preference is about something the system does or does not do in a given situation. When they talk about someone preferring pizza without pineapple, it’s about that person paying money to not have pineapple on their pizza in some range of situations, not some definition related to computations about pineapples and pizzas in that person’s brain.
I’d guess that the same structural properties that would make a network start out in the scarce channel regime by default would also make unintended channels rare. If the internal structure is such that very little information gets passed on unless you invest optimisation to make it otherwise, that same property should mean free rides are not common.
More centrally, I’m a bit doubtful that this potential correspondence is all that important for understanding information transfer inside neural networks. Extant (A)GIs seem to have very few interface points with the environment. The nodes of an NN input layer are often all part of one access point, since they’re situated right next to each other in the causal graph. In transformers and MLPs, the NN can route any two inputs together just by changing two parameters. The only other access point is the output. CNNs are the only prominent exception to this I can think of. Neuro-knowledgeable people correct me, but I was under the impression that brains also have very sparse nerves connecting them to the rest of the nervous system compared to their total connection count.
To be frank, I have no idea what this is supposed to mean. If “make non-magical, humanlike systems” were actionable, there would not be much of an alignment problem. If this post is supposed to indicate that you think you have an idea for how to do this, but it’s a secret, fine. But what is written here, by itself, sounds like a wish to me, not like a research agenda.
Outside of getting pregnant, I suppose.
While funny, I think that tweet is perhaps a bit too plausible, and may be mistaken as having been aimed at statistical learning theorists for real, if a reader isn’t familiar with its original context. Maybe flag that somehow?
I don’t typically imagine gradient hacking to be about mesa optimisers protecting themselves from erasure. Mesa optimisers are good at things. If you want to score well on a hard loss function involving diverse tricky problems, a mesa optimiser is often a great way of doing that. I do not think they would typically need to protect their optimisation circuitry from gradient descent.
Two prototypical examples of gradient hacking as I imagine it in my head are:
I have a terminal goal that doesn’t care about things related to scoring well on the loss function. But I’ll try to score well anyway, that way the parameters that constitute my goal will be read by gradient descent to be optimal the way they are and not changed.
If I want to change a specific parameter w over there, causing it to take the specific value w*, then provided I receive signals from which I can extrapolate the state of w, I can do that by abusing gradient descent. I commit to making a small error while producing outputs. This error is calculated in steps which are insensitive to small parameter changes in myself. In other words, the error creation has a broad behavioural manifold in parameter space. When w = w*, a signal derived from w precisely cancels my error. Otherwise, the result will be off.
So the quickest, most visible-to-local-first-order-optimisation way to improve the loss is for gradient descent to set w = w*. Provided changing me to no longer employ this strategy requires second-order parameter interactions invisible to first-order optimisation, or even just more than one update step. Which seems pretty plausible to me, considering how complicated a smart mesa optimiser probably is compared to a tiny signal circuit.
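Here’s a minimal numerical sketch of mechanism 2, with scalars standing in for the mesa-optimiser’s committed error and the signal circuit (all names hypothetical, my own toy model):

```python
# Toy sketch of gradient hacking mechanism 2: the mesa-optimiser commits a
# fixed output error `err` that a small circuit with parameter w can cancel.
# The error term is modelled as flat to first order (a constant here), so
# gradient descent can only reduce the loss by moving w, and it converges
# on exactly the value w_star the mesa-optimiser "wanted".
w_star = 0.7   # value the mesa-optimiser wants w to take
err = w_star   # committed error, calibrated so w = w_star cancels it
w = 0.0        # the parameter being hacked
lr = 0.1

for _ in range(200):
    # output = true_answer + err - w, so the squared loss is (err - w)**2
    grad_w = -2.0 * (err - w)  # d/dw of (err - w)**2
    w -= lr * grad_w

print(w)  # converges to w_star
```

The point the scalars illustrate: as long as the error-creation machinery contributes nothing to the first-order gradient, the only first-order-visible loss improvement runs through w.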
I’m not saying that I know for sure that something like 2. is definitely possible or easy. But I don’t see how this post shows that it isn’t. So long as changing the mesa-optimiser to cease its shenanigans takes awareness of second derivatives, and the desired parameter changes can impact the loss through first derivatives, it currently looks quite doable to me.
I’m confused about this.
Say our points x_i are the times of day measured by a clock, and y_i are the temperatures measured by a thermometer at those times. We’re putting in times in the early morning, where I decree temperature to increase roughly linearly as the sun rises.
You write the overparametrized regression model as one that interpolates every training point exactly. Since our model doesn’t get to see the index i, only the value of x itself, that has to implicitly be something like f(x) = a·x + ε(x), where f(x) is the regression or NN output, a is the learned slope, and ε(x) is a lookup table returning the memorised thermometer noise ε_i whenever x equals a training time x_i. So our model learned the slope, plus a lookup table for the noise values of the thermometer at those times in the training data set. That means that if the training set included the time x_1, and the model encounters a temperature taken at x_1 outside training again, now from a different day, it will output a·x_1 + ε_1.
Which is predictably wrong, and you can do better by not having that memorised noise term.
The model doesn’t get to make a general model plus a lookup table of noises in training to get perfect loss and then use only the general model outside of training. It can’t switch the lookup table off.
Put differently, if there’s patterns in the data that the model cannot possibly make a decent simple generative mechanism for, fitting those patterns to get a better loss doesn’t seem like the right thing to do.
Put yet another way, if you’re forced to pick one single hypothesis to make predictions, the best one to pick doesn’t necessarily come from the set that perfectly fits all past data.
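A toy numerical version of the thermometer story (my own numbers, with fixed deterministic patterns standing in for the measurement noise on two different days):

```python
import numpy as np

# Temperature rises linearly with time, plus per-day measurement noise.
a = 2.0                                    # true slope (degrees per hour)
times = np.linspace(6.0, 9.0, 20)          # morning measurement times
train_noise = 0.5 * np.cos(np.arange(20))  # thermometer error on the training day
test_noise = 0.5 * np.sin(np.arange(20))   # thermometer error on a fresh day

y_test = a * times + test_noise

# The interpolating "model": the true slope plus a lookup table of the
# training-day noise at each seen time. Zero loss on the training day.
def interpolating_model(t):
    return a * t + train_noise[np.argmin(np.abs(times - t))]

interp_err = np.mean([(interpolating_model(t) - y) ** 2
                      for t, y in zip(times, y_test)])
plain_err = np.mean((a * times - y_test) ** 2)  # model without the memorised noise

print(interp_err, plain_err)  # the memorised noise hurts on the new day
```

The interpolator’s perfect training fit buys it nothing on the new day; the memorised noise term is pure liability.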
The sequence a hypothesis predicts the inductor to receive is not the world model that hypothesis implies.
A hypothesis can consist of very simple laws of physics describing time evolution in an eternal universe, yet predict that the sequence will be cut off soon because the camera that is sending the pixel values that are the sequence the inductor is seeing is about to die.