High level point especially for folks with less context: I stopped doing theory for a while because I wanted to help get applied work going, and now I’m finally going back to doing theory for a variety of reasons; my story is definitely not that I’m transitioning back from applied work to theory because I now believe the algorithms aren’t ready.
I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?
I feel like a story is basically plausible until proven implausible, so I have a pretty low bar.
What changed that made that sound sufficiently like a failure story that you started working on a different algorithm?
I don’t think iterated amplification was ever at a point where we couldn’t tell a story about how it might fail (perhaps peak optimism was in the middle of writing the ALBA post? but by the time I finished that post I think I basically had a story about how it could fail). In this case the distinction seems more like “what is a solution going to look like?”, and there aren’t clean lines between “big changes to this algorithm” and “a new algorithm.”
I guess the question is why I was as optimistic as I was. For example, all the way until mid-2017 I thought it was plausible that something like iterated amplification would work without too many big changes (that’s a bit of a simplification, but you can see how I talked about it e.g. here).
Some thoughts on that:
I first remember discussing the translation example at a workshop in (I think) 2016. My view at that time was that the learning process might be implemented by amplification (i.e. that learning could occur within the big implicit HCH tree). If that’s the case then the big open question seemed to be preserving alignment within that kind of learning process (which I did think was a big/ambitious problem). Around this time I started thinking about amplification as at-best implementing an “enlightened prior” that would then handle updating on the basis of evidence.
I don’t think it’s super obvious that this approach can’t work, and even now I don’t think we’ve written up very clean arguments. The big issue is that the intermediate results of the learning process (those needed to justify the final output, e.g. summary statistics from each half of the dataset) are both too large to be memorized by the model and too computationally complex to be reproduced by the model at test time. On top of that, it seems like amplification/debate probably can’t work with very large implicit trees even if they can be memorized (based on the kinds of issues raised in Beth’s report; that also should have been obvious in advance, but there were too many possibly-broken things to give each one the attention it deserved).
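For folks with less context, here’s a rough sketch of the “big implicit HCH tree” picture I’m referring to above; the decompose/answer/combine interface is just illustrative shorthand I’m making up here, not part of any actual proposal:

```python
# Purely illustrative sketch: a question is answered by recursively consulting
# (copies of) a human who can break it into sub-questions. The method names on
# `human` are hypothetical placeholders for whatever the decomposition looks like.

def hch(question, human, depth):
    """Answer `question` via recursive decomposition, up to `depth` levels."""
    if depth == 0:
        # Leaf node: the human answers directly, without further help.
        return human.answer_directly(question)
    subquestions = human.decompose(question)
    subanswers = [hch(q, human, depth - 1) for q in subquestions]
    # The human combines the sub-answers into an answer to the original question.
    return human.combine(question, subanswers)
```

The worry in the previous paragraph is roughly that if the learning has to happen somewhere inside this tree, the intermediate results the tree produces are too big and too expensive for the distilled model to memorize or recompute at test time.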
In 2016 I transitioned to doing more applied work for a while, and that’s a lot of why my own views stopped changing so rapidly. You could debate whether this was reasonable in light of the amount of theoretical uncertainty. I know plenty of people who think that it was both unreasonable to transition into applied work at that time and that it would be unreasonable for me to transition back now.
I still spent some time thinking about theory and by early 2018 I thought that we needed more fundamental extra ingredients. I started thinking about this more and talking with OpenAI folks about it. (Largely Geoffrey, though I think he was never as worried about these issues as I was, in part because of the methodological difference where I’m more focused on the worst case and he has the more typical perspective of just wanting something that’s good enough to work in practice.)
Part of why these problems were less clear to me back in 2016-2017 is that it was less clear to me what exactly we needed to do in order to be safe (in the worst case). I had the vague idea of informed oversight, but hadn’t thought through what parts of it were hard or necessary or what it really looked like. This all feels super obvious to me now but at the time it was pretty murky. I had to work through a ton of examples (stuff like implicit extortion) to develop a clear enough sense that I felt confident about it. This led to posts like ascription universality and strategy stealing.
The more recent round of writeups was largely about catching up in public communication, though my thinking is a lot clearer than it was 1-2 years before, so it would have been even more of a mess if I’d tried to do it as I went. Imitative generalization was a significant simplification/improvement over the kind of algorithm I’d been thinking about for a while to handle these problems. (I don’t think it’s the end of the line at all; if I were still doing applied work, maybe we’d have a similar discussion in a while when I described a different algorithm that I thought worked better, but given that I’m focusing on theory right now, the timeline will probably be much shorter.)
Cool, that makes sense, thanks!