Thanks. For time/brevity, I’ll just say which things I agree / disagree with:
> sufficiently capable and general AI is likely to have property X as a strong default [...]
I generally agree with this, although for certain important values of X (such as “fooling humans for instrumental reasons”) I’m probably more optimistic than you that there will be a robust effort to get not-X, including by many traditional ML people. I’m also probably more optimistic (but not certain) that those efforts will succeed.
[inside view, modest epistemology]: I don’t have a strong take on either of these. My main take on inside views is that they are great for generating interesting and valuable hypotheses, but usually wrong on the particulars.
> less weight on reasoning like ‘X was true about AI in 1990, in 2000, in 2010, and in 2020; therefore X is likely to be true about AGI when it’s developed’
I agree, see my post On the Risks of Emergent Behavior in Foundation Models. In the past I think I put too much weight on this type of reasoning, and also think most people in ML put too much weight on it.
> MIRI thinks AGI is better thought of as ‘a weird specific sort of AI’, rather than as ‘like existing AI but more so’.
Probably disagree but hard to tell. I think there will both be a lot of similarities and a lot of differences.
> AGI is mostly insight-bottlenecked (we don’t know how to build it), rather than hardware-bottlenecked
Seems pretty wrong to me. We probably need both insight and hardware, but the insights themselves are hardware-bottlenecked: once you can easily try lots of stuff and see what happens, insights are much easier. See Crick on x-ray crystallography for historical support (ctrl+f for Crick).
> I’d want to look at more conceptual work too, where I’d guess MIRI is also more pessimistic than you
I’m more pessimistic than MIRI about HRAD, though that has selection effects. I’ve found conceptual work to be pretty helpful for pointing to where problems might exist, but usually relatively confused about how to address them or how specifically they’re likely to manifest. (Which is to say, overall highly valuable, but consistent with my take above on inside views.)
[experiments are either predictable or uninformative]: Seems wrong to me. As a concrete example: Do larger models have better or worse OOD generalization? I’m not sure if you’d pick “predictable” or “uninformative”, but my take is:
* The outcome wasn’t predictable: within ML there are many people who would have taken each side. (I personally was on the wrong side, i.e. predicting “worse”.)
* It’s informative, for two reasons: (1) It shows that NNs “automatically” generalize more than I might have thought, and (2) Asymptotically, we expect the curve to eventually reverse, so when does that happen and how can we study it?
See also my take on Measuring and Forecasting Risks from AI, especially the section on far-off risks.
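To make the OOD question above concrete, here is a minimal sketch of how such a comparison could be tabulated. It is not taken from any experiment mentioned in this discussion: the names `accuracy` and `ood_report`, the `(input, label)` example format, and the stand-in "models" keyed by an arbitrary size number are all hypothetical, and the toy values carry no empirical meaning about which way real models generalize.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical toy format: one example is (input, label).
Example = Tuple[float, int]
Model = Callable[[float], int]


def accuracy(model: Model, data: Sequence[Example]) -> float:
    """Fraction of examples the model labels correctly."""
    if not data:
        raise ValueError("empty evaluation set")
    return sum(model(x) == y for x, y in data) / len(data)


def ood_report(
    models: Dict[int, Model],
    in_dist: Sequence[Example],
    out_dist: Sequence[Example],
) -> List[Tuple[int, float, float, float]]:
    """Return (size, in-dist accuracy, OOD accuracy, gap) rows, sorted by size.

    The gap is in-distribution accuracy minus OOD accuracy, so a gap that
    shrinks as size grows would correspond to "better" OOD generalization.
    """
    rows = []
    for size in sorted(models):
        id_acc = accuracy(models[size], in_dist)
        ood_acc = accuracy(models[size], out_dist)
        rows.append((size, id_acc, ood_acc, id_acc - ood_acc))
    return rows


if __name__ == "__main__":
    # Stand-in threshold classifiers; the "sizes" 1 and 2 are arbitrary labels.
    toy_models = {1: lambda x: int(x > 0.5), 2: lambda x: int(x > 0.4)}
    in_dist = [(0.1, 0), (0.2, 0), (0.8, 1), (0.9, 1)]
    out_dist = [(0.45, 0), (0.55, 1), (-1.0, 0), (2.0, 1)]
    for row in ood_report(toy_models, in_dist, out_dist):
        print(row)
```

In a real study the stand-in models would be trained networks of increasing parameter count, and the interesting quantity is how the gap column moves with size, including whether and where the trend reverses.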
> Most ML experiments either aren’t about interpretability and ‘cracking open the hood’, or they’re not approaching the problem in a way that MIRI’s excited by.
Would agree with “most”, but I think you probably meant something like “almost all”, which seems wrong. There are lots of people working on interpretability, and some of the work seems quite good to me (aside from Chris, I think Noah Goodman, Julius Adebayo, and some others are doing pretty good work).
I’m not (retroactively in imaginary prehindsight) excited by this problem because neither of the 2 possible answers (3 possible if you count “the same”) had any clear-to-my-model relevance to alignment, or even AGI. AGI will have better OOD generalization on capabilities than current tech, basically by the definition of AGI; and then we’ve got less-clear-to-OpenPhil forces which cause the alignment to generalize more poorly than the capabilities did, which is the Big Problem. Bigger models generalizing better or worse doesn’t say anything obvious to any piece of my model of the Big Problem. Though if larger models start generalizing more poorly, then it takes longer to stupidly-brute-scale to AGI, which I suppose affects timelines some, but that just takes timelines from unpredictable to unpredictable sooo.
If we qualify an experiment as interesting when it can tell anyone about anything, then there’s an infinite variety of experiments “interesting” in this sense and I could generate an unlimited number of them. But I do restrict my interest to experiments which can not only tell me something I don’t know, but tell me something relevant that I don’t know. There is also something to be said for opening your eyes and staring around, but even then, there’s an infinity of blank faraway canvases to stare at, and the trick is to go wandering with your eyes wide open someplace you might see something really interesting. Others will be puzzled and interested by different things and I don’t wish them ill on their spiritual journeys, but I don’t expect the vast majority of them to return bearing enlightenment that I’m at all interested in, though now and then Miles Brundage tweets something (from far outside of EA) that does teach me some little thing about cognition.
I’m interested at all in Redwood Research’s latest project because it seems to offer a prospect of wandering around with our eyes open asking questions like “Well, what if we try to apply this nonviolence predicate OOD, can we figure out what really went into the ‘nonviolence’ predicate instead of just nonviolence?” or if it works maybe we can try training on corrigibility and see if we can start to manifest the tiniest bit of the predictable breakdowns, which might manifest in some different way.
Do larger models generalize better or more poorly OOD? It’s a relatively basic question as such things go, and no doubt of interest to many, and may even update our timelines from ‘unpredictable’ to ‘unpredictable’, but… I’m trying to figure out how to say this, and I think I should probably accept that there’s no way to say it that will stop people from trying to sell other bits of research as Super Relevant To Alignment… it’s okay to have an understanding of reality which makes narrower guesses than that about which projects will turn out to be very relevant.
Trying to rephrase it in my own words (which will necessarily lose some details): are you interested in Redwood’s research because it might plausibly generate alignment issues and problems that are analogous to the real problem within the safer regime and technology we have now? That might tell us, for example, “which aspects of these predictable problems crop up first, and why?”
It potentially sheds light on small subpieces of things that are particular aspects that contribute to the Real Problem, like “What actually went into the nonviolence predicate instead of just nonviolence?” Much of the Real Meta-Problem is that you do not get things analogous to the full Real Problem until you are just about ready to die.