Strong disagree. This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs (e.g. imagine trying to describe the useful applications of electricity before anyone knew what it was or how it worked; or imagine Galileo trying to justify the practical use of studying astronomy).
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works; this by itself is sufficient to recommend it. (Algon makes a similar point in another comment.) Though I do agree that, based on the numbers you gave for how many junior researchers’ projects are focusing on interpretability, people are probably overweighting it.
I think this post is an example of a fairly common phenomenon where alignment people are too focused on backchaining from desired end states, and not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better. (By contrast, most ML researchers are too focused on the latter.)
Perhaps the main problem I have with interp is that it implicitly reinforces the narrative that we must build powerful, dangerous AIs, and then align them. For X-risks, prevention is better than cure. Let’s not build powerful and dangerous AIs. We aspire for them to be safe, by design.
I particularly disagree with this part. The way you get safety by design is understanding what’s going on inside the neural networks. More generally, I’m strongly against arguments of the form “we shouldn’t do useful work, because then it will encourage other people to do bad things”. In theory arguments like these can sometimes be correct, but in practice perfect is often the enemy of the good.
This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs.
What type of reasoning do you think would be most appropriate?
This proves too much. The only way to determine whether a research direction is promising or not is through object-level arguments. I don’t see how we can proceed without scrutinizing the agendas and listing the main difficulties.
this by itself is sufficient to recommend it.
I don’t think it’s that simple. We have to weigh the good against the bad, and I’d like to see some object-level explanations for why the bad doesn’t outweigh the good, and why the problem is sufficiently tractable.
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works;
Maybe. I would still argue that other research avenues are neglected in the community.
not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better
I provided plenty of technical research directions in the “preventive measures” section; these should also qualify as forward-chaining. And interp is certainly not the only way to understand the world better. Besides, I didn’t say we should stop interp research altogether, just that we should consider other avenues.
More generally, I’m strongly against arguments of the form “we shouldn’t do useful work, because then it will encourage other people to do bad things”. In theory arguments like these can sometimes be correct, but in practice perfect is often the enemy of the good.
I think I agree, but this is only one of the many points in my post.
What type of reasoning do you think would be most appropriate?
See the discussion between me and interstice upthread for a type of argument that feels more productive.
I would still argue that other research avenues are neglected in the community.
I agree (and mentioned so in my original comment). This post would have been far more productive if it had focused on exploring them.
We have to weigh the good against the bad, and I’d like to see some object-level explanations for why the bad doesn’t outweigh the good, and why the problem is sufficiently tractable.
The things you should be looking for, when it comes to fundamental breakthroughs, are deep problems demonstrating fascinating phenomena, and especially cases where you can get rapid feedback from reality. That’s what we’ve got here. If that’s not object-level enough then your criterion would have ruled out almost all great science in the past.
I think I agree, but this is only one of the many points in my post.
I wouldn’t have criticized it so strongly if you hadn’t listed it as “Perhaps the main problem I have with interp”.
This post would have been far more productive if it had focused on exploring them.
So the sections “Counteracting deception with only interp is not the only approach” and “Preventive measures against deception”, “Cognitive Emulations” and “Technical Agendas with better ToI” don’t feel productive? It seems to me that it’s already a good list of neglected research agendas. So I don’t understand.
if you hadn’t listed it as “Perhaps the main problem I have with interp”
In the above comment, I only agree with “we shouldn’t do useful work, because then it will encourage other people to do bad things”, and I don’t agree with your critique of “Perhaps the main problem I have with interp...” which I think is not justified enough.
So the sections “Counteracting deception with only interp is not the only approach” and “Preventive measures against deception”, “Cognitive Emulations” and “Technical Agendas with better ToI” don’t feel productive? It seems to me that it’s already a good list of neglected research agendas. So I don’t understand.
You’ve listed them, but you haven’t really argued that they’re valuable, you’re mostly just asserting stuff like Rob Miles having a bigger impact than most interpretability researchers, or the best strategy being copying Dan Hendrycks. But since I disagree with the assertions, these sections aren’t very useful; they don’t actually zoom in on the positive case for these research directions.
(The main positive case I’m seeing seems to be “anything which helps with coordination is really valuable”. And sure, coordination is great. But most coordination-related research is shallow: it helps us do things now, but doesn’t help us figure out how to do things better in the long term. So I think you’re overstating the case for it in general.)
I agree that I haven’t argued the positive case for more governance/coordination work (and that’s why I hope to do a next post on that).
We do need alignment work, but I think the current allocation is too focused on alignment, whereas AI X-Risks could arrive in the near future. I’ll be happy to reinvest in alignment work once we’re sure we can avoid X-Risks from misuses and grossly negligent accidents.
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works
If our goal is developing a principled understanding of deep learning, directly trying to do that is likely to be more effective than doing interpretability in the hope that we will develop a principled understanding as a side effect. For this reason I think most alignment researchers have too little awareness of various attempts in academia to develop “grand theories” of deep learning such as the neural tangent kernel. I think the ideal use for interpretability in this quest is as a way of investigating how the existing theories break down—e.g. if we can explain 80% of a given model’s behavior with the NTK, what are the causes of the remaining 20%? I think of interpretability as basically collecting many interesting data points; this type of collection is essential, but it can be much more effective when it’s guided by a provisional theory which tells you what points are expected and what are interesting anomalies which call for a revision of the theory, which in turn guides further exploration, etc.
I agree that work like NTK is worth thinking about. But I disagree that it’s a more “direct” approach to a principled understanding of deep learning. To find a “grand theory” of deep learning, we’re going to need to connect our understanding of neural networks to our understanding of the real world, and I don’t think NTKs or other related things can help very much with that step—for roughly the same reasons that statistical learning theory wasn’t very helpful (and was in fact anti-helpful) in predicting the success of deep neural networks.
Btw, this isn’t a general-purpose critique of theoretical work—e.g. it doesn’t apply to this paper by Lin, Tegmark and Rolnick, which actually ties neural network success to properties of the real world like symmetry, locality, and compositionality. This is the sort of thing which I can much more easily imagine leading to alignment breakthroughs.
I think of interpretability as basically collecting many interesting data points
I’d agree if interpretability were just about “here’s a circuit for recognizing X” (although even then, the concept of circuits itself was nontrivial to develop), but in fact a lot of the most promising work has been on more important and fundamental phenomena like superposition and induction heads.
we’re going to need to connect our understanding of neural networks to our understanding of the real world
The NTK and related theories aim to go from “SGD finds a giant blob of parameters that performs well on the data for some reason” to “SGD finds a solution with such-and-such clean mathematical characterization”. To fully explain the success of deep learning you do then have to relate the clean mathematical characterization to the real world, but I think this can be done separately to some extent and is less of a bottleneck on progress. My #2 use case for interpretability would be doing stuff like this—basically conceptual/experimental investigation of the types of solutions favored by a given mathematical theory, with the goal of obtaining a high-level story about “why it works in the real world”. Plus attempts to carry out alignment/interpretability/ELK tasks in the simplified setting.
This is the sort of thing which I can much more easily imagine leading to alignment breakthroughs
Hmm, it’s been a while since I looked at this paper but if I recall it doesn’t really try to make any specific predictions about the inductive bias of neural nets in practice, it’s more like a series of suggestive analogies. That’s fine, but I think that sort of thing is more likely to be productive if guided by a more detailed theory.
I can’t speak for Richard, but I think I have a similar issue with NTK and adjacent theory as it currently stands (beyond the usual issues). I’m significantly more confident in a theory of deep learning if it cleanly and consistently explains (or better yet, predicts) unexpected empirical phenomena. The one that sticks out most prominently in my mind, that we see constantly in interpretability, is this strange correspondence between the algorithmic “structure” we find in trained models (both ML and biological!) and “structure” in the data generating process.
That training on Othello move sequences gets you an algorithmic model of the game itself is surprising from most current theoretical perspectives! So in that sense I might be suspicious of a theory of deep learning that fails to “connect our understanding of neural networks to our understanding of the real world”. This is the single most striking thing to come out of interpretability, in my opinion, and I’m worried about a “deep learning theory of everything” if it doesn’t address this head on.
That said, NTK doesn’t promise to be a theory of everything, so I don’t mean to hold it to an unreasonable standard. It does what it says on the tin! I just don’t think it’s explained a lot of the remaining questions I have. I don’t think we’re in a situation where “we can explain 80% of a given model’s behavior with the NTK” or similar. And this is relevant for e.g. studying inductive biases, as you mentioned.
But I strong upvoted your comment, because I do think deep learning theory can fill this gap—I’m personally trying to work in this area. There are some tractable-looking directions here, and people shouldn’t neglect them!
So although I don’t think the NTK can be a final answer, I do like the idea of studying it in more depth—it provides a feature-learning-free baseline against which we can compare actual neural networks and other potential ‘grand theories’. Exactly which phenomena can we not explain with the NTK, and which theory best predicts them?
Strong upvote to Zach’s comment, it basically encapsulates my view (except that I don’t know what the “tractable-looking directions” he mentions are—Zach, can you elaborate?)
Exactly which phenomena can we not explain with the NTK
I’d turn that around: is there any explanation of why LLMs can do real-world task X and not real-world task Y that appeals to NTKs? (Not a rhetorical question: there may well be, I just haven’t seen one.)
Yeah, I can expand on that. This is obviously going to be fairly opinionated, but there are a few things I’m excited about in this direction.
The first thing that comes to mind here is singular learning theory. I think all of my thoughts on DL theory are fairly strongly influenced by it at this point. It definitely doesn’t have all the answers at the moment, but it’s the single largest theory I’ve found that makes deep learning phenomena substantially “less surprising” (bonus points for these ideas preceding deep learning). For instance, one of the first things that SLT tells you is that the effective parameter count (RLCT) of your model can vary depending on the training distribution, allowing it to basically do internal model selection—the absence of bias-variance tradeoff, and the success of overparameterized models, aren’t surprising when you internalize this. The “connection to real world structure” aspect hasn’t been fully developed here, but it seems heavily suggested by the framework, in multiple ways—for instance, hierarchical statistical models are naturally singular statistical models, and the hierarchical structure is reflected in the singularities. (See also Tom Waring’s thesis).
Outside of SLT, there’s a few other areas I’m excited about—I’ll highlight just one. You mentioned Lin, Tegmark, and Rolnick—the broader literature on depth separations and the curse of dimensionality seems quite important. The approximation abilities of NNs are usually glossed over with universal approximation arguments, but this can’t be enough—for generic Lipschitz functions, universal approximation takes exponentially many parameters in the input dimension (this is a provable lower bound). So there has to be something special about the functions we care about in the real world. See this section of my post for more information. I’d highlight Poggio et al. here, which is the paper in the literature closest to my current view on this.
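To give a feel for the exponential scaling mentioned above, here is a minimal sketch. It is not the lower-bound proof; it only counts the cells of the naive construction (a piecewise-constant fit on a uniform grid over [0, 1]^dim), whose size grows in the same exponential way. The function name `grid_cells_needed`, the choice of the sup-norm for the Lipschitz constant, and the numbers used are assumptions made for illustration.

```python
import math

def grid_cells_needed(dim, lipschitz, eps):
    """Cells in a uniform grid on [0, 1]^dim fine enough that a piecewise-constant
    fit is within eps of a function with sup-norm Lipschitz constant `lipschitz`."""
    # A cell of side s keeps every point within s/2 of its centre in each coordinate,
    # so the error is at most lipschitz * s / 2; we need s <= 2 * eps / lipschitz.
    per_axis = max(1, math.ceil(lipschitz / (2.0 * eps)))
    return per_axis ** dim

for dim in (1, 2, 10, 100):
    print(dim, grid_cells_needed(dim, lipschitz=1.0, eps=0.125))
```

With these illustrative numbers each axis needs 4 cells, so the count is 4^dim: harmless in one or two dimensions, hopeless in a hundred.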
This isn’t a complete list, even of theoretical areas that I think could specifically help address the “real world structure” connection, but these are the two I’d feel bad not mentioning. This doesn’t include some of the more empirical findings in science of DL that I think are relevant, like simplicity bias, mode connectivity, grokking, etc. Or work outside DL that could be helpful to draw on, like Boolean circuit complexity, algorithmic information theory, natural abstractions, etc.
Agreed—that alone isn’t particularly much, just one of the easier things to express succinctly. (Though the fact that this predates deep learning does seem significant to me. And the fact that SLT can delineate precisely where statistical learning theory went wrong here seems important too.)
Another is that it can explain phenomena like phase transitions, as observed in e.g. toy models of superposition, at a quantitative level. There’s also been a substantial chunk of non-SLT ML literature that has independently rediscovered small pieces of SLT, like failures of information geometry, importance of parameter degeneracies, etc. More speculatively (but this is what excites me most), empirical phenomena like grokking, mode connectivity, and circuits seem to intuitively fit into SLT nicely, though this hasn’t been demonstrated rigorously yet.
any explanation of why LLMs can do real-world task X and not real-world task Y that appeals to NTKs?
I don’t think there are any. Of course much the same could be said of other deep learning theories and most (all?) interpretability work. The difference, as far as I can tell, is that there is a clear pathway to getting such explanations from the NTK: you’d want to do a spectral analysis of the sorts of functions learnable by transformer-NTKs. It’s just that nobody has bothered to do this! That’s why I think this line of research is neglected relative to interpretability or developing a new theoretical analysis of deep learning. Another obvious thing to try: NTKs often empirically perform comparably well to finite networks, but are usually a few percentage points worse in accuracy. Can we say anything about the examples where the NTK fails? Do they particularly depend on ‘feature learning’? I think NTKs are a good complement to mechinterp in this regard, since they treat the weights at each neuron as independent of all others, so they provide a good indicator of exactly which examples may require interacting ‘circuits’ to be correctly classified.
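As a rough illustration of what a “spectral analysis” of an NTK involves, here is a minimal sketch on a toy stand-in. It uses a two-layer ReLU network rather than a transformer, random Gaussian inputs rather than real data, and only the top eigenvalue (via power iteration) rather than a full spectrum. All names (`empirical_ntk`, `top_eigenvalue`) and sizes are assumptions for the example, not anything from the literature discussed here.

```python
import math
import random

def empirical_ntk(xs, W, a):
    """Gram matrix K[i][k] = <grad f(x_i), grad f(x_k)> for the toy net
    f(x) = (1/sqrt(m)) * sum_j a[j] * relu(W[j] . x)."""
    m, n = len(W), len(xs)
    pre = [[sum(w_c * x_c for w_c, x_c in zip(w, x)) for w in W] for x in xs]
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i, n):
            x_dot = sum(u * v for u, v in zip(xs[i], xs[k]))
            total = 0.0
            for j in range(m):
                zi, zk = pre[i][j], pre[k][j]
                if zi > 0.0 and zk > 0.0:  # both ReLUs active
                    # gradient w.r.t. a[j] contributes zi * zk;
                    # gradient w.r.t. W[j] contributes a[j]^2 * (x_i . x_k)
                    total += zi * zk + a[j] ** 2 * x_dot
            K[i][k] = K[k][i] = total / m
    return K

def top_eigenvalue(K, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix, estimated by power iteration."""
    n = len(K)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        Kv = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in Kv))
        if norm == 0.0:
            return 0.0
        v = [c / norm for c in Kv]
    Kv = [sum(K[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v_i * kv_i for v_i, kv_i in zip(v, Kv))  # Rayleigh quotient

rng = random.Random(0)
d, m, n = 3, 64, 6  # input dim, hidden width, number of inputs (all illustrative)
W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
a = [rng.choice((-1.0, 1.0)) for _ in range(m)]
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
K = empirical_ntk(xs, W, a)
lam_max = top_eigenvalue(K)
trace = sum(K[i][i] for i in range(n))
print(f"top eigenvalue {lam_max:.3f} carries {lam_max / trace:.0%} of the trace")
```

In the linearized regime, directions with large kernel eigenvalues are the ones gradient descent fits fastest, so the spectrum is a natural place to look for which functions a given architecture's NTK finds easy or hard.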
What is the work that finds the algorithmic model of the game itself for Othello? I’m aware of (but not familiar with) some interpretability work on Othello-GPT (Neel Nanda’s and Kenneth Li’s), but thought it was just about board state representations.
Yeah, that was what I was referring to. Maybe “algorithmic model” isn’t the most precise—what we know is that the NN has an internal model of the board state that’s causal (i.e. the NN actually uses it to make predictions, as verified by interventions). Theoretically it could just be forming this internal model via a big lookup table / function approximation, rather than via a more sophisticated algorithm. Though we’ve seen from modular addition work, transformer induction heads, etc that at least some of the time NNs learn genuine algorithms.
I think the core surprising thing is the fact that the model learns a representation of the board state. The causal / linear probe parts are there to ensure that you’ve defined “learns a representation of the board state” correctly—otherwise the probe could just be computing the board state itself, without that knowledge being used in the original model.
This is surprising to some older theories like statistical learning, because the model is usually treated as effectively a black box function approximator. It’s also surprising to theories like NTK, mean-field, and tensor programs, because they view model activations as IID samples from a single-neuron probability distribution—but you can’t reconstruct the board state via a permutation-invariant linear probe. The question of “which neuron is which” actually matters, so this form of feature learning is beyond them. (Though there may be e.g. perturbative modifications to these theories to allow this in a limited way).
they view model activations as IID samples from a single-neuron probability distribution—but you can’t reconstruct the board state via a permutation-invariant linear probe
Permutation-invariance isn’t the reason that this should be surprising. Yes, the NTK views neurons as being drawn from an IID distribution, but once they have been so drawn, you can linearly probe them as independent units. As an example, imagine that our input space consisted of five pixels, and at initialization neurons were randomly sensitive to one of the pixels. You would easily be able to construct linear probes sensitive to individual pixels even though the distribution over neurons is invariant over all the pixels.
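The five-pixel example can be sketched directly. This is a minimal illustration under added assumptions: each hypothetical neuron simply copies the one pixel it is sensitive to, and the probe for a pixel averages the neurons that happened to pick it. The names `sample_layer`, `probe_for_pixel`, and so on are made up for the example.

```python
import random

def sample_layer(n_neurons, n_pixels, rng):
    # Each neuron is sensitive to one pixel chosen uniformly at random,
    # so the distribution over neurons treats all pixels identically.
    return [rng.randrange(n_pixels) for _ in range(n_neurons)]

def activations(layer, x):
    # Hypothetical neuron: it copies the pixel it is sensitive to.
    return [x[p] for p in layer]

def probe_for_pixel(layer, k):
    # Linear probe: average the sampled neurons that read pixel k.
    hits = sum(1 for p in layer if p == k)
    if hits == 0:
        raise ValueError(f"no sampled neuron reads pixel {k}")
    return [1.0 / hits if p == k else 0.0 for p in layer]

def apply_probe(weights, h):
    return sum(w * act for w, act in zip(weights, h))

rng = random.Random(0)
layer = sample_layer(n_neurons=200, n_pixels=5, rng=rng)
x = [0.1, 0.7, -0.3, 0.5, 0.9]
h = activations(layer, x)
recovered = [apply_probe(probe_for_pixel(layer, k), h) for k in range(5)]
print(recovered)
```

The sampling distribution is symmetric across pixels, but once a particular layer has been drawn, each probe knows which neurons to read and recovers its pixel.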
The reason the Othello result is surprising to the NTK is that neurons implementing an “Othello board state detector” would be vanishingly rare in the initial distribution, and the NTK thinks that the neuron function distribution does not change during training.
The reason the Othello result is surprising to the NTK is that neurons implementing an “Othello board state detector” would be vanishingly rare in the initial distribution, and the NTK thinks that the neuron function distribution does not change during training.
Yeah, that’s probably the best way to explain why this is surprising from the NTK perspective. I was trying to include mean-field and tensor programs as well (where that explanation doesn’t work anymore).
As an example, imagine that our input space consisted of five pixels, and at initialization neurons were randomly sensitive to one of the pixels. You would easily be able to construct linear probes sensitive to individual pixels even though the distribution over neurons is invariant over all the pixels.
Yeah, this is a good point. What I meant to specify wasn’t that you can’t recover any permutation-sensitive data at all (trivially, you can recover data about the input), but that any learned structures must be invariant to neuron permutation. (Though I’m feeling sketchy about the details of this claim). For the case of NTK, this is sort of trivial, since (as you pointed out) it doesn’t really learn features anyway.
By the way, there are actually two separate problems that come from the IID assumption: the “independent” part, and the “identically-distributed” part. For space I only really mentioned the second one. But even if you deal with the identically distributed assumption, the independence assumption still causes problems. This prevents a lot of structure from being representable: for example, a layer where “at most two neurons are activated on any input from some set” can’t be represented with independently distributed neurons. More generally a lot of circuit-style constructions require this joint structure. IMO this is actually the more fundamental limitation, though it takes longer to dig into.
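One way to see the “at most two neurons active” point is with a deliberately crude model. This is an assumption for illustration, not the formal setup of any of these theories: on a given input, each of n neurons fires independently with the same probability q. The constraint would need the probability below to be exactly 1, which independent neurons cannot achieve once n > 2 and q > 0.

```python
from math import comb

def prob_at_most_k_active(n_neurons, q, k=2):
    # P(at most k of n independent neurons fire), each firing with probability q.
    return sum(
        comb(n_neurons, i) * q**i * (1 - q) ** (n_neurons - i)
        for i in range(k + 1)
    )

for n in (4, 16, 64, 256):
    print(n, prob_at_most_k_active(n, q=0.1))
```

The probability falls toward zero as the layer widens, so a hard sparsity constraint of this kind has to come from correlations between neurons.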
I was trying to include mean-field and tensor programs as well
but that any learned structures must be invariant to neuron permutation. (Though I’m feeling sketchy about the details of this claim)
The same argument applies—if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated, it will be possible to construct a linear probe detecting this, regardless of the permutation-invariance of the distribution.
the independence assumption still causes problems
This is a more reasonable objection (although actually, I’m not sure if independence does hold in the tensor programs framework; probably?)
if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated
Yeah, this “if” was the part I was claiming permutation invariance causes problems for—that identically distributed neurons probably couldn’t express something as complicated as a board-state-detector. As soon as that’s true (plus assuming the board-state-detector is implemented linearly), agreed, you can recover it with a linear probe regardless of permutation-invariance.
This is a more reasonable objection (although actually, I’m not sure if independence does hold in the tensor programs framework; probably?)
I probably should’ve just gone with that one, since the independence barrier is the one I usually think about, and harder to get around (related to non-free-field theories, perturbation theory, etc).
My impression from reading through one of the tensor program papers a while back was that it still makes the IID assumption, but there could be some subtlety about that I missed.
I get the impression of a certain amount of motte and bailey in this comment and similar arguments. At a high level, better understanding what neural networks are doing would be great. The problem, though, is that most SOTA interpretability research does not seem to be doing a good job of this in a way that seems useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
I wish the original post had been more careful about its claims, so that I could respond to them more clearly. Instead there’s a mishmash of sensible arguments, totally unjustified assertions, and weird strawmen (like “I don’t understand how “Looking at random bits of the model and identify circuits/features” will help with deception”). And in general a lot of this is of the form “I don’t see how X”, which is the format I’m objecting to, because of course you won’t see how X until someone invents a technique to X.
This is exacerbated by the meta-level problem that people have very different standards for what’s useful (e.g. to Eliezer, none of this is useful), and also standards for what types of evidence and argument they accept (e.g. to many ML researchers, approximately all arguments about long-term theories of impact are too speculative to be worth engaging in depth).
I still think that so many people are working on interpretability mainly because they don’t see alternatives that are as promising; in general I’d welcome writing that clearly lays out solid explanations and intuitions about why those other research directions are worth working on, and think that this would be the best way to recalibrate the field.
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to unsolid ones.
Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of the biggest things that I think is a concern though is that people seem to have been making similar takes with little change for 7+ years. But I just don’t think there have been a number of wins from this research that are commensurate with the effort put into it. And I assume this is expected under your views, so probably not a crux.
EDIT: Nuance of course being impossible, this no doubt comes off as rude—and is in turn a reaction to an internet-distorted version of what you actually wrote. Oh well, grain of salt and all that.
The way you get safety by design is understanding what’s going on inside the neural networks.
This is equivocation. There are some properties of what’s going on inside a NN that are crucial to reasoning about its safety properties, and many, many more that are irrelevant.
I’m actually strongly reminded of a recent comment about LK-99, where someone remarked that a good way to ramp up production of superconductors would be to understand how superconductors work, because then we could design one that’s easier to mass-produce.
Except:
What we normally think of as “understanding how superconductors work” is not a sure thing, it’s hard and sometimes we don’t find satisfactory models.
Even if we understand how superconductors work, designing new ones with economically useful properties is an independent problem that’s also hard and possible to fail at for decades.
There are many other ways to make progress in discovering superconductors and ramping up their production. These ways are sometimes purely phenomenological, or sometimes rely on building some understanding of the superconductor that’s a model of a different type than what we typically mean by “understanding how superconductors work.”
It might sound good to say “we’ll understand how NNs work, and then use that to design safe ones,” but I think the problems are analogous. What we normally think of as “understand how NNs work,” especially in the context of mech interp, is a very specific genre of understanding—it’s not omniscience, it’s the ability to give certain sorts of mechanistic explanations for canonical explananda. And then using that understanding to design safe AI is an independent problem not solved just by solving the first one. Meanwhile, there are other ways to reason about the safety of AI (e.g. statistical arguments about the plausibility of gradient hacking) that use “understanding,” but not of the mech interp sort.
Yes, blue sky research is good. But we can simultaneously use our brains about what sorts of explanations we think are promising to find. Understanding doesn’t just go into a big bucket labeled “Understanding” from which we draw to make things happen. If I’m in charge of scaling up superconductor production, and I say we should do less micro-level explanation and more phenomenology, telling me about the value of blue sky research is the “wrong type of reasoning.”
In theory arguments like these can sometimes be correct, but in practice perfect is often the enemy of the good.
The tricky part being that in the AGI alignment discourse, if you believe in self-improvement runaway feedback loops, there is no good. There is only perfect, or extinction. This might be a bit extreme but we don’t really know that for sure either.
Note that a wrench current paradigms throw in this is that self-improvement processes would not look uniquely recursive, since all training algorithms sort of look like “recursive self improvement”. Instead, RSI is effectively just “oh no, the training curve was curved differently on this training run”, which is something most likely to happen in open world RL. But I agree, open world RL has the ability to be suddenly surprising in capability growth, and there wouldn’t be much of an opportunity to notice the problem unless we’ve already solved how to intentionally bound capabilities in RL.
There has been some interesting work on bounding capability growth in safe RL already, though. I haven’t looked closely at it, I wonder if any of it is particularly good.
edit: note that I am in fact claiming that after miri deconfuses us, it’ll turn out to apply to ordinary gradient updates
This isn’t about “perfect futures” though, but about perfect AGIs specifically. Consider a future that goes like this:
the AI’s presence and influence over us evolves exponentially according to a law $\frac{d\,\mathrm{AI}}{dt} = \gamma\,\mathrm{AI}$,
the exponent γ expresses the amount of misalignment; if the AI is aligned and fully under our control, γ=0, otherwise γ>0,
then in that future, anything less than perfect alignment ends with us overwhelmed by the AI, sooner or later. This is super simplistic, but the essence is that if you keep around something really powerful that might just decide to kill you, you probably want to be damn sure it won’t. That’s what “perfect” here means; it’s not fine if it just wants to kill you a little bit. So if your logic is correct (and indeed, I do agree with you on general matters of ethics), then perhaps we just shouldn’t build AGI at all, because we can’t get it perfect, and if it’s not perfect it’ll probably be in too precarious a balance with us for it to persist for long.
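To make the toy law concrete, here is the worked solution of the equation as stated, under the added assumption that γ is constant in time:

$$
\frac{d\,\mathrm{AI}}{dt} = \gamma\,\mathrm{AI}
\quad\Longrightarrow\quad
\mathrm{AI}(t) = \mathrm{AI}(0)\,e^{\gamma t}.
$$

For γ = 0 the influence stays at AI(0) forever. For any γ > 0 it doubles every ln 2 / γ units of time, so it eventually exceeds any fixed bound no matter how small γ is, which is the sense in which “anything less than perfect” loses in this simplified picture.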
Ah, I see more of what you mean. I agree an AI’s influence being small is unstable. And this means that the chance of death by AI being small is also unstable.
But I think the risk is one-time, not compounding over time. A high-influence AI might kill you, but if it doesn’t, you’ll probably live a long and healthy life (because of arguments like stability of value being a convergent instrumental goal). It’s not that once an AI becomes high-influence, there’s an exponential decay of humans, as every day it makes a new random mutation to its motivations.
I don’t think that’s necessarily true. There’s two ways in which I think it can compound:
if the AGI will self-upgrade, or design more advanced AGI, the problem repeats, and the AGI can make mistakes, same as us, though probably less obvious mistakes
it is possible to imagine an AGI that stays generally aligned but has a certain probability of being triggered on some runaway loop in which it loses its alignment. Like it will come up with pretty aligned solutions most of the time but there is something, some kind of problem or situation, that is so out-of-domain it sends it down the path of insanity, and it’s unrecoverable, and we don’t know how or when that might occur.
Also, it might simply be probabilistic: any AGI that isn’t fully deterministic probably wouldn’t literally have no access to non-aligned strategies, but would merely assign them very low logits. So in theory there’s still a small but non-zero probability that it goes into some kind of “kill all humans” strategy path. And even if you interpret this as one-shot (did you align it right or not on creation?), the effects might not be visible right away.
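To make the one-time versus compounding distinction concrete, here is a minimal sketch. It assumes (purely for illustration) that each decision is an independent trial with a fixed small chance of an unrecoverable failure; the function name and the numbers are hypothetical, not estimates of anything.

```python
def cumulative_failure_probability(per_step_probability: float, steps: int) -> float:
    """Chance of at least one failure across `steps` independent trials."""
    return 1.0 - (1.0 - per_step_probability) ** steps


# One-time framing: a single roll of the dice at creation.
one_shot = cumulative_failure_probability(0.01, 1)

# Compounding framing: a much smaller per-decision chance, repeated many times.
compounding = [
    cumulative_failure_probability(1e-4, steps) for steps in (1, 1_000, 100_000)
]

if __name__ == "__main__":
    print(f"one-shot risk: {one_shot:.4f}")
    for steps, risk in zip((1, 1_000, 100_000), compounding):
        print(f"after {steps} decisions: {risk:.4f}")
```

Under the independence assumption, even a tiny per-decision probability approaches certainty given enough decisions, which is the worry in the compounding case. If the risk really is one-time, the first number is the whole story.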
In theory arguments like these can sometimes be correct, but in practice perfect is often the enemy of the good.
Now that I think about it, this is the main problem a lot of LW thinking and posting has: it implicitly assumes that only a perfect, watertight solution to alignment is sufficient to guarantee human survival, despite the fact that most solutions to problems don’t have to be perfect to work. Even in the cases where we do face off against an adversary, imperfect but fast solutions win out over perfect, very slow solutions. In particular, it ignores that multiple solutions to alignment can fundamentally stack.
In general, I feel like the biggest flaw of LW is its perfectionism, and the big reason why Michael Nielsen pointed out that alignment is extremely accelerationist in practice is that OpenAI implements a truth that LWers like Nate Soares and Eliezer Yudkowsky, as well as the broader community, don’t: alignment approaches don’t need to be perfect to work, and having an imperfect safety and alignment plan is much better than no plan at all.
I basically just disagree with this entirely, unless you don’t count stuff like RLHF or DPO as alignment.
More generally, if we grant that we don’t need perfection, or arbitrarily good alignment, at least early on, then I think this implies that alignment should be really easy, and the p(Doom) numbers are almost certainly way too high, primarily because it’s often doable to solve problems if you don’t need perfect or arbitrarily good solutions.
More generally, if we grant that we don’t need perfection, or arbitrarily good alignment, at least early on, then I think this implies that alignment should be really easy, and the p(Doom) numbers are almost certainly way too high, primarily because it’s often doable to solve problems if you don’t need perfect or arbitrarily good solutions.
It seems really easy to spell out worldviews where “we don’t need perfection, or arbitrarily good alignment” but where it is still not the case that “alignment should be really easy”. To give a somewhat silly example based on the OP, I could buy Enumerative Safety in principle; so if we can check all the features for safety, we can 100% guarantee the safety of the model. It then follows that if we can check 95% of the features (sampled randomly) then we get something like a 95% safety guarantee (depending on priors).
But I might also think that properly “checking” even one feature is really, really hard.
So I don’t buy the claimed implication: “we don’t need perfection” does not imply “alignment should be really easy”. Indeed, I think the implication quite badly fails.
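As a rough illustration of the “95% of the features” arithmetic, here is a small sketch. It assumes a hypothetical model with a fixed number of features, some known number of which are unsafe, and a check that always catches an unsafe feature when it looks at one. All names and numbers are made up for the example.

```python
from math import comb


def miss_probability(total_features: int, unsafe_features: int, checked_features: int) -> float:
    """Probability that a uniformly random sample of `checked_features`
    contains none of the `unsafe_features` (a hypergeometric tail)."""
    unchecked = total_features - checked_features
    # Every unsafe feature must land among the unchecked ones;
    # comb() returns 0 when that is impossible.
    return comb(unchecked, unsafe_features) / comb(total_features, unsafe_features)


if __name__ == "__main__":
    # Exactly one unsafe feature, 95% of features checked.
    print(miss_probability(1000, 1, 950))
    # Three unsafe features: all three must be missed at once.
    print(miss_probability(1000, 3, 950))
```

With exactly one unsafe feature, checking 95% of features leaves a 5% chance of missing it, which is where the “something like a 95% guarantee” comes from. How many unsafe features to expect is the “depending on priors” part, and none of this says anything about how hard a single check is.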
I’ll admit I overstated it here, but my claim is that once you remove the requirement for arbitrarily good/perfect solutions, it becomes easier to solve the problem. Sometimes, it’s still impossible to solve the problem, but it’s usually solvable once you drop a perfectness/arbitrarily good requirement, primarily because it loosens a lot of constraints.
Indeed, I think the implication quite badly fails.
I agree it isn’t a logical implication, but I suspect your example is very misleading, and that more realistic imperfect solutions won’t have this failure mode, so I’m still quite comfortable with using it as an implication that isn’t 100% accurate, but more like 90-95+% accurate.
I’ll admit I overstated it here, but my claim is that once you remove the requirement for arbitrarily good/perfect solutions, it becomes easier to solve the problem. Sometimes, it’s still impossible to solve the problem, but it’s usually solvable once you drop a perfectness/arbitrarily good requirement, primarily because it loosens a lot of constraints.
I mean, yeah, I agree with all of this as generic statements if we ignore the subject at hand.
I agree it isn’t a logical implication, but I suspect your example is very misleading, and that more realistic imperfect solutions won’t have this failure mode, so I’m still quite comfortable with using it as an implication that isn’t 100% accurate, but more like 90-95+% accurate.
I agree the example sucks and only serves to prove that it is not a logical implication.
A better example would be, like, the Goodhart model of AI risk, where any loss function that we optimize hard enough to get into superintelligence would probably result in a large divergence between what we get and what we actually want, because optimization amplifies. Note that this still does not assume that we need to prove 100% safety; rather, it argues, with reasons and from assumptions, that it will be hard to get any safety at all from loss functions which merely coincide somewhat well with what we want.
I still think the list of lethalities is a pretty good reply to your overall line of reasoning, i.e. it clearly flags that the problem is not achieving perfection, but rather, achieving any significant probability of safety, and it gives a bunch of concrete reasons why this is hard, i.e. it provides arguments rather than some kind of blind assumption like you seem to be indicating.
You are doing a reasonable thing by trying to provide some sort of argument for why these conclusions seem wrong, but “things tend to be easy when you lift the requirement of perfection” is just an extremely weak argument which seems to fall apart the moment we contemplate the specific case of AI alignment at all.
The problem with RLHF/DPO is not that they don’t work, period; the problem is that we don’t know if they work. I can imagine that we can just scale to superintelligence, apply RLHF and get aligned ASI, but this would imply a bunch of things about reality like “even at a high level of capability reasonable RLHF-data contains overwhelmingly mostly good value-shaped thought-patterns” and I just don’t think that we know enough about reality to make such statements.
I think this might be a crux, actually. I think it’s surprisingly common in history for things to work out well empirically, but that we either don’t understand how they work, or it took a long time to understand how it works.
AI development is the most central example, but I’d argue the invention of steel is another good example.
To put it another way, I’m relying on the fact that there have been empirically successful interventions where we either simply don’t know why it works, or it takes a long time to get a useful theory out of the empirically successful intervention.
not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback
Are you mostly looking for where there is useful empirical feedback? That sounds like a shot in the dark.
Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs
A concern I have: I cannot conceptually distinguish these continued empirical investigations of methods to build maybe-aligned AGI from how medieval researchers tried to build perpetual motion machines. It took sound theory to finally disprove, once and for all, the possibility of perpetual motion machines.
I agree with Charbel-Raphaël that the push for mechanistic interpretability is in effect promoting the notion that there must be possibilities available here to control potentially very dangerous AIs to stay safe in deployment. It is much easier to spread the perception of safety, than to actually make such systems safe.
This is while there is no sound theoretical basis for claiming that scaling mechanistic interpretability could form the basis of such a control method, nor for claiming that any control method could keep “AGI” safe.
Rather, mechint is fundamentally limited in the extent it could be used to safely control AGI. See posts:
Besides theoretical limits, there are plenty of practical arguments (as listed in Charbel-Raphaël’s post) for why scaling the utilisation of mechint would be net harmful.
So there is no rigorous basis for claiming that the use of mechint would “open up possibilities” for long-term safety. And there are plenty of possibilities for corporate marketers to chime in on mechint’s hypothetical big breakthroughs.
In practice, we may help AI labs again – accidentally – to safety-wash their AI products.
It does seem like a large proportion of disagreements in this space can be explained by how hard people think alignment will be. It seems like your view is actually more pessimistic about the difficulty of alignment than Eliezer’s, because he at least thinks it’s possible for mechinterp to help in principle.
I think that being confident in this level of pessimism is wildly miscalibrated, and such a big disagreement that it’s probably not worth discussing much further. Though I reply indirectly to your point here.
I personally think pessimistic vs. optimistic misframes it, because it frames a question about the world in terms of personal predispositions.
I would like to see reasoning.
Your reasoning in the comment thread you linked to is:
“history is full of cases where people dramatically underestimated the growth of scientific knowledge, and its ability to solve big problems”
That’s a broad reference-class analogy to use. I think it holds little to no weight as to whether there would be sufficient progress on the specific problem of “AGI” staying safe over the long-term.
I wrote why that specifically would not be a solvable problem.
Strong disagree. This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs (e.g. imagine trying to describe the useful applications of electricity before anyone knew what it was or how it worked; or imagine Galileo trying to justify the practical use of studying astronomy).
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works; this by itself is sufficient to recommend it. (Algon makes a similar point in another comment.) Though I do agree that, based on the numbers you gave for how many junior researchers’ projects are focusing on interpretability, people are probably overweighting it.
I think this post is an example of a fairly common phenomenon where alignment people are too focused on backchaining from desired end states, and not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better. (By contrast, most ML researchers are too focused on the latter.)
I particularly disagree with this part. The way you get safety by design is understanding what’s going on inside the neural networks. More generally, I’m strongly against arguments of the form “we shouldn’t do useful work, because then it will encourage other people to do bad things”. In theory arguments like these can sometimes be correct, but in practice perfect is often the enemy of the good.
What type of reasoning do you think would be most appropriate?
This proves too much. The only way to determine whether a research direction is promising or not is through object-level arguments. I don’t see how we can proceed without scrutinizing the agendas and listing the main difficulties.
I don’t think it’s that simple. We have to weigh the good against the bad, and I’d like to see some object-level explanations for why the bad doesn’t outweigh the good, and why the problem is sufficiently tractable.
Maybe. I would still argue that other research avenues are neglected in the community.
I provided plenty of technical research directions in the “preventive measures” section; this should also qualify as forward-chaining. And interp is certainly not the only way to understand the world better. Besides, I didn’t say we should stop Interp research altogether, just consider other avenues.
I think I agree, but this is only one of the many points in my post.
See the discussion between me and interstice upthread for a type of argument that feels more productive.
I agree (and mentioned so in my original comment). This post would have been far more productive if it had focused on exploring them.
The things you should be looking for, when it comes to fundamental breakthroughs, are deep problems demonstrating fascinating phenomena, and especially cases where you can get rapid feedback from reality. That’s what we’ve got here. If that’s not object-level enough then your criterion would have ruled out almost all great science in the past.
I wouldn’t have criticized it so strongly if you hadn’t listed it as “Perhaps the main problem I have with interp”.
So the sections “Counteracting deception with only interp is not the only approach” and “Preventive measures against deception”, “Cognitive Emulations” and “Technical Agendas with better ToI” don’t feel productive? It seems to me that it’s already a good list of neglected research agendas. So I don’t understand.
In the above comment, I only agree with “we shouldn’t do useful work, because then it will encourage other people to do bad things”, and I don’t agree with your critique of “Perhaps the main problem I have with interp...” which I think is not justified enough.
You’ve listed them, but you haven’t really argued that they’re valuable, you’re mostly just asserting stuff like Rob Miles having a bigger impact than most interpretability researchers, or the best strategy being copying Dan Hendrycks. But since I disagree with the assertions, these sections aren’t very useful; they don’t actually zoom in on the positive case for these research directions.
(The main positive case I’m seeing seems to be “anything which helps with coordination is really valuable”. And sure, coordination is great. But most coordination-related research is shallow: it helps us do things now, but doesn’t help us figure out how to do things better in the long term. So I think you’re overstating the case for it in general.)
I agree that I haven’t argued the positive case for more governance/coordination work (and that’s why I hope to do a next post on that).
We do need alignment work, but I think the current allocation is too focused on alignment, whereas AI X-Risks could arrive in the near future. I’ll be happy to reinvest in alignment work once we’re sure we can avoid X-Risks from misuses and grossly negligent accidents.
If our goal is developing a principled understanding of deep learning, directly trying to do that is likely to be more effective than doing interpretability in the hope that we will develop a principled understanding as a side effect. For this reason I think most alignment researchers have too little awareness of various attempts in academia to develop “grand theories” of deep learning such as the neural tangent kernel. I think the ideal use for interpretability in this quest is as a way of investigating how the existing theories break down—e.g. if we can explain 80% of a given model’s behavior with the NTK, what are the causes of the remaining 20%? I think of interpretability as basically collecting many interesting data points; this type of collection is essential, but it can be much more effective when it’s guided by a provisional theory which tells you what points are expected and what are interesting anomalies which call for a revision of the theory, which in turn guides further exploration, etc.
I agree that work like NTK is worth thinking about. But I disagree that it’s a more “direct” approach to a principled understanding of deep learning. To find a “grand theory” of deep learning, we’re going to need to connect our understanding of neural networks to our understanding of the real world, and I don’t think NTKs or other related things can help very much with that step—for roughly the same reasons that statistical learning theory wasn’t very helpful (and was in fact anti-helpful) in predicting the success of deep neural networks.
Btw, this isn’t a general-purpose critique of theoretical work—e.g. it doesn’t apply to this paper by Lin, Tegmark and Rolnick, which actually ties neural network success to properties of the real world like symmetry, locality, and compositionality. This is the sort of thing which I can much more easily imagine leading to alignment breakthroughs.
I’d agree if interpretability were just about “here’s a circuit for recognizing X” (although even then, the concept of circuits itself was nontrivial to develop), but in fact a lot of the most promising work has been on more important and fundamental phenomena like superposition and induction heads.
The NTK and related theories aim to go from “SGD finds a giant blob of parameters that performs well on the data for some reason” to “SGD finds a solution with such-and-such clean mathematical characterization”. To fully explain the success of deep learning you do then have to relate the clean mathematical characterization to the real world, but I think this can be done separately to some extent and is less of a bottleneck on progress. My #2 use case for interpretability would be doing stuff like this—basically conceptual/experimental investigation of the types of solutions favored by a given mathematical theory, with the goal of obtaining a high-level story about “why it works in the real world”. Plus attempts to carry out alignment/interpretability/ELK tasks in the simplified setting.
Hmm, it’s been a while since I looked at this paper but if I recall it doesn’t really try to make any specific predictions about the inductive bias of neural nets in practice, it’s more like a series of suggestive analogies. That’s fine, but I think that sort of thing is more likely to be productive if guided by a more detailed theory.
I can’t speak for Richard, but I think I have a similar issue with NTK and adjacent theory as it currently stands (beyond the usual issues). I’m significantly more confident in a theory of deep learning if it cleanly and consistently explains (or better yet, predicts) unexpected empirical phenomena. The one that sticks out most prominently in my mind, that we see constantly in interpretability, is this strange correspondence between the algorithmic “structure” we find in trained models (both ML and biological!) and “structure” in the data generating process.
That training on Othello move sequences gets you an algorithmic model of the game itself is surprising from most current theoretical perspectives! So in that sense I might be suspicious of a theory of deep learning that fails to “connect our understanding of neural networks to our understanding of the real world”. This is the single most striking thing to come out of interpretability, in my opinion, and I’m worried about a “deep learning theory of everything” if it doesn’t address this head on.
That said, NTK doesn’t promise to be a theory of everything, so I don’t mean to hold it to an unreasonable standard. It does what it says on the tin! I just don’t think it’s explained a lot of the remaining questions I have. I don’t think we’re in a situation where “we can explain 80% of a given model’s behavior with the NTK” or similar. And this is relevant for e.g. studying inductive biases, as you mentioned.
But I strong upvoted your comment, because I do think deep learning theory can fill this gap—I’m personally trying to work in this area. There are some tractable-looking directions here, and people shouldn’t neglect them!
I intended my comment to apply to “theories of deep learning” in general, the NTK was only meant as an example. I agree that the NTK has problems such that it can at best be a ‘provisional’ grand theory. The big question is how to think about feature learning. At this point, though, there are a lot of contenders for “feature learning theories”—the Maximal Update Parameterization, Depth Corrections to the NTK, Perturbation Theory, Singular Learning Theory, Stochastic Collapse, SGD-Induced Sparsity....
So although I don’t think the NTK can be a final answer, I do like the idea of studying it in more depth—it provides a feature-learning-free baseline against which we can compare actual neural networks and other potential ‘grand theories’. Exactly which phenomena can we not explain with the NTK, and which theory best predicts them?
Strong upvote to Zach’s comment, it basically encapsulates my view (except that I don’t know what the “tractable-looking directions” he mentions are—Zach, can you elaborate?)
I’d turn that around: is there any explanation of why LLMs can do real-world task X and not real-world task Y that appeals to NTKs? (Not a rhetorical question: there may well be, I just haven’t seen one.)
Yeah, I can expand on that. This is obviously going to be fairly opinionated, but there are a few things I’m excited about in this direction.
The first thing that comes to mind here is singular learning theory. I think all of my thoughts on DL theory are fairly strongly influenced by it at this point. It definitely doesn’t have all the answers at the moment, but it’s the single largest theory I’ve found that makes deep learning phenomena substantially “less surprising” (bonus points for these ideas preceding deep learning). For instance, one of the first things that SLT tells you is that the effective parameter count (RLCT) of your model can vary depending on the training distribution, allowing it to basically do internal model selection—the absence of bias-variance tradeoff, and the success of overparameterized models, aren’t surprising when you internalize this. The “connection to real world structure” aspect hasn’t been fully developed here, but it seems heavily suggested by the framework, in multiple ways—for instance, hierarchical statistical models are naturally singular statistical models, and the hierarchical structure is reflected in the singularities. (See also Tom Waring’s thesis).
Outside of SLT, there’s a few other areas I’m excited about—I’ll highlight just one. You mentioned Lin, Tegmark, and Rolnick—the broader literature on depth separations and the curse of dimensionality seems quite important. The approximation abilities of NNs are usually glossed over with universal approximation arguments, but this can’t be enough—for generic Lipschitz functions, universal approximation takes exponentially many parameters in the input dimension (this is a provable lower bound). So there has to be something special about the functions we care about in the real world. See this section of my post for more information. I’d highlight Poggio et al. here, which is the paper in the literature closest to my current view on this.
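To give a feel for the exponential dependence on input dimension, here is a back-of-the-envelope sketch. It is not the lower-bound proof mentioned above; it only counts the cells in a naive piecewise-constant grid approximation of a Lipschitz function on the unit cube (with distances measured in the max norm), which is an assumption made for illustration. The function name is hypothetical.

```python
import math


def grid_cells_needed(lipschitz_constant: float, tolerance: float, input_dimension: int) -> int:
    """Cells in a uniform grid on [0, 1]^d fine enough that a function with the
    given Lipschitz constant varies by at most `tolerance` between a cell's
    centre and any other point in that cell."""
    # A cell of side h has points at most h/2 from its centre (max norm),
    # so we need lipschitz_constant * h / 2 <= tolerance.
    cells_per_axis = math.ceil(lipschitz_constant / (2 * tolerance))
    return cells_per_axis ** input_dimension


if __name__ == "__main__":
    for dimension in (1, 10, 100):
        print(dimension, grid_cells_needed(1.0, 0.25, dimension))
```

Even at a very loose tolerance the count doubles with every added input dimension, so any story about why networks work on high-dimensional real-world inputs has to lean on extra structure in the target functions.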
This isn’t a complete list, even of theoretical areas that I think could specifically help address the “real world structure” connection, but these are the two I’d feel bad not mentioning. This doesn’t include some of the more empirical findings in science of DL that I think are relevant, like simplicity bias, mode connectivity, grokking, etc. Or work outside DL that could be helpful to draw on, like Boolean circuit complexity, algorithmic information theory, natural abstractions, etc.
FWIW most potential theories of deep learning are able to explain these, I don’t think this distinguishes SLT particularly much.
Agreed—that alone isn’t particularly much, just one of the easier things to express succinctly. (Though the fact that this predates deep learning does seem significant to me. And the fact that SLT can delineate precisely where statistical learning theory went wrong here seems important too.)
Another is that it can explain phenomena like phase transitions, as observed in e.g. toy models of superposition, at a quantitative level. There’s also been a substantial chunk of non-SLT ML literature that has independently rediscovered small pieces of SLT, like failures of information geometry, importance of parameter degeneracies, etc. More speculatively, but what excites me most, is that empirical phenomena like grokking, mode connectivity, and circuits seem to intuitively fit in SLT nicely, though this hasn’t been demonstrated rigorously yet.
I don’t think there are any. Of course much the same could be said of other deep learning theories and most (all?) interpretability work. The difference, as far as I can tell, is that there is a clear pathway to getting such explanations from the NTK: you’d want to do a spectral analysis of the sorts of functions learnable by transformer-NTKs. It’s just that nobody has bothered to do this! That’s why I think this line of research is neglected relative to interpretability or developing a new theoretical analysis of deep learning. Another obvious thing to try: NTKs often empirically perform comparably well to finite networks, but are usually a few percentage points worse in accuracy. Can we say anything about the examples where the NTK fails? Do they particularly depend on ‘feature learning’? I think NTKs are a good complement to mechinterp in this regard, since they treat the weights at each neuron as independent of all others, so they provide a good indicator of exactly which examples may require interacting ‘circuits’ to be correctly classified.
A note is that as it turns out, OthelloGPT learned a bag of heuristics, and there was no clean algorithm:
https://www.lesswrong.com/posts/gcpNuEZnxAPayaKBY/othellogpt-learned-a-bag-of-heuristics-1
What is the work that finds the algorithmic model of the game itself for Othello? I’m aware of (but not familiar with) some interpretability work on Othello-GPT (Neel Nanda’s and Kenneth Li), but thought it was just about board state representations.
Yeah, that was what I was referring to. Maybe “algorithmic model” isn’t the most precise—what we know is that the NN has an internal model of the board state that’s causal (i.e. the NN actually uses it to make predictions, as verified by interventions). Theoretically it could just be forming this internal model via a big lookup table / function approximation, rather than via a more sophisticated algorithm. Though we’ve seen from modular addition work, transformer induction heads, etc that at least some of the time NNs learn genuine algorithms.
I think that means one of the following should be surprising from theoretical perspectives:
That the model learns a representation of the board state
Or that a linear probe can recover it
That the board state is used causally
Does that seem right to you? If so, which is the surprising claim?
(I am not that informed on theoretical perspectives)
I think the core surprising thing is the fact that the model learns a representation of the board state. The causal / linear probe parts are there to ensure that you’ve defined “learns a representation of the board state” correctly—otherwise the probe could just be computing the board state itself, without that knowledge being used in the original model.
This is surprising to some older theories like statistical learning, because the model is usually treated as effectively a black box function approximator. It’s also surprising to theories like NTK, mean-field, and tensor programs, because they view model activations as IID samples from a single-neuron probability distribution—but you can’t reconstruct the board state via a permutation-invariant linear probe. The question of “which neuron is which” actually matters, so this form of feature learning is beyond them. (Though there may be e.g. perturbative modifications to these theories to allow this in a limited way).
Permutation-invariance isn’t the reason that this should be surprising. Yes, the NTK views neurons as being drawn from an IID distribution, but once they have been so drawn, you can linearly probe them as independent units. As an example, imagine that our input space consisted of five pixels, and at initialization neurons were randomly sensitive to one of the pixels. You would easily be able to construct linear probes sensitive to individual pixels even though the distribution over neurons is invariant over all the pixels.
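The five-pixel thought experiment can be sketched in a few lines. This is a toy under stated assumptions: each neuron reads one uniformly random pixel with a random positive gain, and the probe is built by hand from the drawn neurons rather than fitted. All names here are hypothetical.

```python
import random

random.seed(0)
NUM_PIXELS = 5
NUM_NEURONS = 200

# Neurons are drawn IID: the distribution treats every pixel identically.
neurons = [
    (random.randrange(NUM_PIXELS), random.uniform(0.5, 2.0))
    for _ in range(NUM_NEURONS)
]


def activations(pixels):
    """Each neuron outputs its gain times the single pixel it is sensitive to."""
    return [gain * pixels[pixel_index] for pixel_index, gain in neurons]


def build_probe(target_pixel):
    """Linear readout weights that recover `target_pixel` from the activations."""
    matching = [i for i, (pixel_index, _) in enumerate(neurons) if pixel_index == target_pixel]
    weights = [0.0] * NUM_NEURONS
    for i in matching:
        # Undo the neuron's gain and average over all neurons reading this pixel.
        weights[i] = 1.0 / (neurons[i][1] * len(matching))
    return weights


def apply_probe(weights, acts):
    return sum(w * a for w, a in zip(weights, acts))


if __name__ == "__main__":
    pixels = [0.1, -0.7, 0.3, 0.9, -0.2]
    acts = activations(pixels)
    for target in range(NUM_PIXELS):
        print(target, apply_probe(build_probe(target), acts))
```

The distribution over neurons is symmetric across pixels, but once a particular set of neurons has been drawn, a probe that knows which neuron is which reads out any individual pixel exactly (up to floating-point error).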
The reason the Othello result is surprising to the NTK is that neurons implementing an “Othello board state detector” would be vanishingly rare in the initial distribution, and the NTK thinks that the neuron function distribution does not change during training.
Yeah, that’s probably the best way to explain why this is surprising from the NTK perspective. I was trying to include mean-field and tensor programs as well (where that explanation doesn’t work anymore).
Yeah, this is a good point. What I meant to specify wasn’t that you can’t recover any permutation-sensitive data at all (trivially, you can recover data about the input), but that any learned structures must be invariant to neuron permutation. (Though I’m feeling sketchy about the details of this claim). For the case of NTK, this is sort of trivial, since (as you pointed out) it doesn’t really learn features anyway.
By the way, there are actually two separate problems that come from the IID assumption: the “independent” part, and the “identically-distributed” part. For space I only really mentioned the second one. But even if you deal with the identically distributed assumption, the independence assumption still causes problems. This prevents a lot of structure from being representable: for example, a layer where “at most two neurons are activated on any input from some set” can’t be represented with independently distributed neurons. More generally a lot of circuit-style constructions require this joint structure. IMO this is actually the more fundamental limitation, though it takes longer to dig into.
The same argument applies—if the distribution of intermediate neurons shifts so that Othello-board-state-detectors have a reasonably high probability of being instantiated, it will be possible to construct a linear probe detecting this, regardless of the permutation-invariance of the distribution.
This is a more reasonable objection (although actually, I’m not sure if independence does hold in the tensor programs framework; probably?)
Yeah, this “if” was the part I was claiming permutation invariance causes problems for—that identically distributed neurons probably couldn’t express something as complicated as a board-state-detector. As soon as that’s true (plus assuming the board-state-detector is implemented linearly), agreed, you can recover it with a linear probe regardless of permutation-invariance.
I probably should’ve just gone with that one, since the independence barrier is the one I usually think about, and harder to get around (related to non-free-field theories, perturbation theory, etc).
My impression from reading through one of the tensor program papers a while back was that it still makes the IID assumption, but there could be some subtlety about that I missed.
Thanks! The permutation-invariance of a bunch of theories is a helpful concept
I get the impression of a certain kind of motte and bailey in this comment and similar arguments. At a high level, the notion of better understanding what neural networks are doing would be great. The problem, though, seems to be that most SOTA interpretability research does not seem to be doing a good job of this in a way that seems useful for safety anytime soon. In that sense, I think this comment talks past the points that this post is trying to make.
I wish the original post had been more careful about its claims, so that I could respond to them more clearly. Instead there’s a mishmash of sensible arguments, totally unjustified assertions, and weird strawmen (like “I don’t understand how “Looking at random bits of the model and identify circuits/features” will help with deception”). And in general a lot of this is of the form “I don’t see how X”, which is the format I’m objecting to, because of course you won’t see how X until someone invents a technique to X.
This is exacerbated by the meta-level problem that people have very different standards for what’s useful (e.g. to Eliezer, none of this is useful), and also standards for what types of evidence and argument they accept (e.g. to many ML researchers, approximately all arguments about long-term theories of impact are too speculative to be worth engaging in depth).
I still think that so many people are working on interpretability mainly because they don’t see alternatives that are as promising; in general I’d welcome writing that clearly lays out solid explanations and intuitions about why those other research directions are worth working on, and think that this would be the best way to recalibrate the field.
Thanks for the reply. This sounds reasonable to me. On the last point, I tried my best to do that here, and I think there is a relatively high ratio of solid explanations to unsolid ones.
Overall, I think that the hopes you have for interpretability research are good, and I hope it works out. One of my biggest concerns, though, is that people seem to have been offering similar takes with little change for 7+ years, and I just don’t think the wins from this research have been commensurate with the effort put into it. I assume this is expected under your views, so it is probably not a crux.
EDIT: Nuance of course being impossible, this no doubt comes off as rude—and is in turn a reaction to an internet-distorted version of what you actually wrote. Oh well, grain of salt and all that.
This is equivocation. There are some properties of what’s going on inside a NN that are crucial to reasoning about its safety properties, and many, many more that are irrelevant.
I’m actually strongly reminded of a recent comment about LK-99, where someone remarked that a good way to ramp up production of superconductors would be to understand how superconductors work, because then we could design one that’s easier to mass-produce.
Except:
What we normally think of as “understanding how superconductors work” is not a sure thing, it’s hard and sometimes we don’t find satisfactory models.
Even if we understand how superconductors work, designing new ones with economically useful properties is an independent problem that’s also hard and possible to fail at for decades.
There are many other ways to make progress in discovering superconductors and ramping up their production. These ways are sometimes purely phenomenological, or sometimes rely on building some understanding of the superconductor that’s a model of a different type than what we typically mean by “understanding how superconductors work.”
It might sound good to say “we’ll understand how NNs work, and then use that to design safe ones,” but I think the problems are analogous. What we normally think of as “understand how NNs work,” especially in the context of mech interp, is a very specific genre of understanding—it’s not omniscience, it’s the ability to give certain sorts of mechanistic explanations for canonical explananda. And then using that understanding to design safe AI is an independent problem not solved just by solving the first one. Meanwhile, there are other ways to reason about the safety of AI (e.g. statistical arguments about the plausibility of gradient hacking) that use “understanding,” but not of the mech interp sort.
Yes, blue sky research is good. But we can simultaneously use our brains about what sorts of explanations we think are promising to find. Understanding doesn’t just go into a big bucket labeled “Understanding” from which we draw to make things happen. If I’m in charge of scaling up superconductor production, and I say we should do less micro-level explanation and more phenomenology, telling me about the value of blue sky research is the “wrong type of reasoning.”
The tricky part being that in the AGI alignment discourse, if you believe in self-improvement runaway feedback loops, there is no good. There is only perfect, or extinction. This might be a bit extreme but we don’t really know that for sure either.
Note that a wrench current paradigms throw in this is that self-improvement processes would not look uniquely recursive, since all training algorithms sort of look like “recursive self improvement”. Instead, RSI is effectively just “oh no, the training curve was curved differently on this training run”, which is most likely to happen in open-world RL. But I agree, open-world RL can be suddenly surprising in capability growth, and there wouldn’t be much of an opportunity to notice the problem unless we’ve already solved how to intentionally bound capabilities in RL.
There has been some interesting work on bounding capability growth in safe RL already, though. I haven’t looked closely at it; I wonder if any of it is particularly good.
Edit: note that I am in fact claiming that after MIRI deconfuses us, it’ll turn out to apply to ordinary gradient updates.
Au contraire, the perfect future doesn’t exist, but good ones do.
This isn’t about “perfect futures” though, but about perfect AGIs specifically. Consider a future that goes like this:
the AI’s presence and influence over us evolves exponentially according to a law $\frac{d\,\mathrm{AI}}{dt} = \gamma\,\mathrm{AI}$,
the exponent $\gamma$ expresses the amount of misalignment; if the AI is aligned and fully under our control, $\gamma = 0$, otherwise $\gamma > 0$,
then in that future, anything less than perfect alignment ends with us overwhelmed by the AI, sooner or later. This is super simplistic, but the essence is that if you keep around something really powerful that might just decide to kill you, you probably want to be damn sure it won’t. That’s what “perfect” here means; it’s not fine if it just wants to kill you a little bit. So if your logic is correct (and indeed, I do agree with you on general matters of ethics), then perhaps we just shouldn’t build AGI at all, because we can’t get it perfect, and if it’s not perfect it’ll probably be in too precarious a balance with us for it to persist for long.
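Spelling out the toy law above, as a sketch under its own simplifying assumptions (constant $\gamma$, plus a hypothetical threshold $K$ at which we count as "overwhelmed", which I am introducing only for illustration):

$$\frac{d\,\mathrm{AI}}{dt} = \gamma\,\mathrm{AI} \quad\Longrightarrow\quad \mathrm{AI}(t) = \mathrm{AI}(0)\,e^{\gamma t}$$

$$\gamma = 0:\ \mathrm{AI}(t) = \mathrm{AI}(0) \text{ for all } t, \qquad \gamma > 0:\ \mathrm{AI}(t) \ge K \text{ for all } t \ge t^{*} = \frac{1}{\gamma}\,\ln\frac{K}{\mathrm{AI}(0)}$$

So for any starting influence $\mathrm{AI}(0) > 0$ and any threshold $K > \mathrm{AI}(0)$, the crossing time $t^{*}$ is finite whenever $\gamma > 0$. A smaller $\gamma$ only postpones it, with $t^{*} \to \infty$ as $\gamma \to 0$.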
Ah, I see more of what you mean. I agree an AI’s influence being small is unstable. And this means that the chance of death by AI being small is also unstable.
But I think the risk is one-time, not compounding over time. A high-influence AI might kill you, but if it doesn’t, you’ll probably live a long and healthy life (because of arguments like stability of value being a convergent instrumental goal). It’s not that once an AI becomes high-influence, there’s an exponential decay of humans, as every day it makes a new random mutation to its motivations.
I don’t think that’s necessarily true. There’s two ways in which I think it can compound:
if the AGI will self-upgrade, or design more advanced AGI, the problem repeats, and the AGI can make mistakes, same as us, though probably less obvious mistakes
it is possible to imagine an AGI that stays generally aligned but has a certain probability of being triggered into some runaway loop in which it loses its alignment. It would come up with pretty aligned solutions most of the time, but there is some kind of problem or situation so far out-of-domain that it sends the AGI down a path of insanity, unrecoverably, and we don’t know how or when that might occur.
Also, it might simply be probabilistic: any non-fully deterministic AGI probably wouldn’t literally have no access to non-aligned strategies, but would merely assign them very small logits. So in theory there’s still a small but non-zero probability that it goes down some kind of “kill all humans” strategy path. And even if you interpret this as one-shot (did you align it right or not on creation?), the effects might not be visible right away.
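To put the one-time versus compounding distinction in numbers, here is a small sketch. It assumes a constant, independent per-period failure probability, and the 1% figure is purely hypothetical (neither assumption comes from the comments above).

```python
def survival_one_time(initial_risk: float) -> float:
    """Risk is paid once at creation; survival does not depend on the horizon."""
    return 1.0 - initial_risk

def survival_compounding(per_period_risk: float, periods: int) -> float:
    """The same risk is re-rolled independently every period."""
    return (1.0 - per_period_risk) ** periods

# Hypothetical 1% risk, compared over several horizons.
for periods in (1, 100, 1000):
    print(periods, survival_one_time(0.01), survival_compounding(0.01, periods))
```

Under these assumptions a 1% one-time risk leaves a 99% chance of survival however long you wait, while a 1% risk re-rolled every period leaves roughly 37% after 100 periods and essentially nothing after 1000. Which picture applies is exactly what is being disputed here.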
Now that I think about it, this is the main problem with a lot of LW thinking and posting: it implicitly assumes that only a perfect, watertight solution to alignment is sufficient to guarantee human survival. Most solutions to problems don’t have to be perfect to work, and even when we do face an adversary, imperfect but fast solutions win out over perfect but very slow ones. In particular, it ignores that multiple solutions to alignment can fundamentally stack.
In general, I feel like the biggest flaw of LW is its perfectionism, and the big reason why Michael Nielsen pointed out that alignment is extremely accelerationist in practice is that OpenAI implements a truth that LWers like Nate Soares and Eliezer Yudkowsky, as well as the broader community, don’t: alignment approaches don’t need to be perfect to work, and having an imperfect safety and alignment plan is much better than no plan at all.
Links are below:
https://www.lesswrong.com/posts/8Q7JwFyC8hqYYmCkC/link-post-michael-nielsen-s-notes-on-existential-risk-from
https://www.beren.io/2023-02-19-The-solution-to-alignment-is-many-not-one/
It’s literally point −2 in List of Lethalities that we don’t need a “perfect” alignment solution; we just don’t have any.
I basically just disagree with this entirely, unless you don’t count stuff like RLHF or DPO as alignment.
More generally, if we grant that we don’t need perfection, or arbitrarily good alignment, at least early on, then I think this implies that alignment should be really easy, and the p(Doom) numbers are almost certainly way too high, primarily because it’s often doable to solve problems if you don’t need perfect or arbitrarily good solutions.
So I basically just disagree with Eliezer here.
It seems really easy to spell out worldviews where “we don’t need perfection, or arbitrarily good alignment” holds and yet “alignment should be really easy” does not. To give a somewhat silly example based on the OP, I could buy Enumerative Safety in principle: if we can check all the features for safety, we can 100% guarantee the safety of the model. It then follows that if we can check 95% of the features (sampled randomly) then we get something like a 95% safety guarantee (depending on priors).
But I might also think that properly “checking” even one feature is really, really hard.
So I don’t buy the claimed implication: “we don’t need perfection” does not imply “alignment should be really easy”. Indeed, I think the implication quite badly fails.
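As a rough illustration of the "depending on priors" caveat in the silly example, here is a sketch that assumes (hypothetically) a model with 100 features, a uniformly random sample of 95 of them checked without replacement, and a known number of unsafe features. None of these numbers come from the OP.

```python
from math import comb

def prob_all_unsafe_checked(n_features: int, n_unsafe: int, n_checked: int) -> float:
    """Chance that a uniformly random set of checked features contains every unsafe one."""
    if n_checked < n_unsafe:
        return 0.0
    # Count the checked sets that include all unsafe features, over all checked sets.
    favourable = comb(n_features - n_unsafe, n_checked - n_unsafe)
    return favourable / comb(n_features, n_checked)

for n_unsafe in (0, 1, 3):
    print(n_unsafe, prob_all_unsafe_checked(100, n_unsafe, 95))
```

With exactly one unsafe feature the guarantee is the advertised 95%, but with three it drops to about 86%, so the headline number depends on how many unsafe features you expect. And none of this touches the cost of properly checking even a single feature.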
I’ll admit I overstated it here, but my claim is that once you remove the requirement for arbitrarily good/perfect solutions, it becomes easier to solve the problem. Sometimes, it’s still impossible to solve the problem, but it’s usually solvable once you drop a perfectness/arbitrarily good requirement, primarily because it loosens a lot of constraints.
I agree it isn’t a logical implication, but I suspect your example is very misleading, and that more realistic imperfect solutions won’t have this failure mode, so I’m still quite comfortable with using it as an implication that isn’t 100% accurate, but more like 90-95+% accurate.
I mean, yeah, I agree with all of this as generic statements if we ignore the subject at hand.
I agree the example sucks and only serves to prove that it is not a logical implication.
A better example would be, like, the Goodhart model of AI risk, where any loss function that we optimize hard enough to get into superintelligence would probably result in a large divergence between what we get and what we actually want, because optimization amplifies. Note that this still does not assume that we need to prove 100% safety; rather, it argues from stated assumptions that it will be hard to get any safety at all from loss functions which merely coincide somewhat well with what we want.
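For what "optimization amplifies" can look like in the mildest case, here is a toy sketch of regressional Goodhart. The assumptions are mine, not part of the argument above: each candidate has a true value and an independent error, both standard normal, the proxy is their sum, and "optimizing harder" means picking the best proxy score out of more candidates.

```python
import random

def mean_goodhart_gap(n_candidates: int, n_trials: int, rng: random.Random) -> float:
    """Average (proxy - true value) of the candidate with the highest proxy score."""
    total_gap = 0.0
    for _ in range(n_trials):
        best_proxy = None
        best_true = None
        for _ in range(n_candidates):
            true_value = rng.gauss(0.0, 1.0)
            proxy = true_value + rng.gauss(0.0, 1.0)  # proxy = truth + independent error
            if best_proxy is None or proxy > best_proxy:
                best_proxy, best_true = proxy, true_value
        total_gap += best_proxy - best_true
    return total_gap / n_trials

rng = random.Random(0)
weak_gap = mean_goodhart_gap(1, 2000, rng)       # no selection pressure
strong_gap = mean_goodhart_gap(1000, 200, rng)   # best of 1000 by proxy score
print(weak_gap, strong_gap)
```

In this Gaussian toy the gap between proxy and truth grows with selection pressure, but the true value of the selected candidate also rises, so it only shows divergence, not catastrophe. The stronger claim above rests on further assumptions about how loss functions relate to what we want.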
I still think the List of Lethalities is a pretty good reply to your overall line of reasoning, i.e. it clearly flags that the problem is not achieving perfection, but rather achieving any significant probability of safety, and it gives a bunch of concrete reasons why this is hard, i.e. it provides arguments rather than some kind of blind assumption like you seem to be indicating.
You are doing a reasonable thing by trying to provide some sort of argument for why these conclusions seem wrong, but “things tend to be easy when you lift the requirement of perfection” is just an extremely weak argument which seems to fall apart the moment we contemplate the specific case of AI alignment at all.
The problem with RLHF/DPO is not that they don’t work, period; the problem is that we don’t know whether they work. I can imagine that we can just scale to superintelligence, apply RLHF and get aligned ASI, but this would imply a bunch of things about reality like “even at a high level of capability, reasonable RLHF data contains overwhelmingly good value-shaped thought-patterns”, and I just don’t think that we know enough about reality to make such statements.
I think this might be a crux, actually. I think it’s surprisingly common in history for things to work out well empirically even though we either don’t understand how they work, or took a long time to understand how they work.
AI development is the most central example, but I’d argue the invention of steel is another good example.
To put it another way, I’m relying on the fact that there have been empirically successful interventions where we either simply don’t know why it works, or it takes a long time to get a useful theory out of the empirically successful intervention.
Are you mostly looking for where there is useful empirical feedback?
That sounds like a shot in the dark.
A concern I have:
I cannot conceptually distinguish these continued empirical investigations of methods to build maybe-aligned AGI from how medieval researchers tried to build perpetual motion machines. It took sound theory to rule out, once and for all, the possibility of perpetual motion machines.
I agree with Charbel-Raphaël that the push for mechanistic interpretability is in effect promoting the notion that there must be possibilities available here to control potentially very dangerous AIs to stay safe in deployment. It is much easier to spread the perception of safety, than to actually make such systems safe.
This is happening while there is no sound theoretical basis for claiming that scaling mechanistic interpretability could form the basis of such a control method, nor for claiming that any control method could keep “AGI” safe.
Rather, mechint is fundamentally limited in the extent it could be used to safely control AGI.
See posts:
The limited upside of interpretability by Peter S. Park
Why mechanistic interpretability does not and cannot contribute to long-term AGI safety by me
Besides theoretical limits, there are plenty of practical arguments (as listed in Charbel-Raphaël’s post) for why scaling the utilisation of mechint would be net harmful.
So there is no rigorous basis for claiming that the use of mechint would “open up possibilities” for long-term safety.
But there are plenty of possibilities for corporate marketers to chime in on mechint’s hypothetical big breakthroughs.
In practice, we may help AI labs again – accidentally – to safety-wash their AI products.
It does seem like a large proportion of disagreements in this space can be explained by how hard people think alignment will be. It seems like your view is actually more pessimistic about the difficulty of alignment than Eliezer’s, because he at least thinks it’s possible for mechinterp to help in principle.
I think that being confident in this level of pessimism is wildly miscalibrated, and such a big disagreement that it’s probably not worth discussing much further. Though I reply indirectly to your point here.
I personally think pessimistic vs. optimistic misframes it, because it frames a question about the world in terms of personal predispositions.
I would like to see reasoning.
Your reasoning in the comment thread you linked to is: “history is full of cases where people dramatically underestimated the growth of scientific knowledge, and its ability to solve big problems”
That’s a broad reference-class analogy to use. I think it carries little to no weight on the question of whether there would be sufficient progress on the specific problem of “AGI” staying safe over the long term.
I wrote why that specifically would not be a solvable problem.