The book aimed at a general audience does not do enough justice to my unpublished plan for pitting AIs against AIs
FWIW that’s not at all what I mean (and I don’t know of anyone who’s said that). What I mean is much more like what Ryan said here:
I expect that by default superintelligence is built after a point where we have access to huge amounts of non-superintelligent cognitive labor, so it’s unlikely that we’ll be using current methods and current understanding (unless humans have already lost control by this point, which seems totally plausible, but not overwhelmingly likely nor argued convincingly for by the book). Even just looking at capabilities, I think it’s pretty likely that automated AI R&D will result in us operating in a totally different paradigm by the time we build superintelligence—this isn’t to say this other paradigm will be safer, just that a narrow description of “current techniques” doesn’t include the default trajectory.
I think the online resources touch on that in the “more on making AIs solve the problem” subsection here, with the main thrust being: I’m skeptical that you can stack lots of dumb labor into an alignment solution, and skeptical that identifying issues will allow you to fix them, and skeptical that humans can tell when something is on the right track. (All of which is one branch of a larger disjunctive argument, with the two disjuncts mentioned above — “the world doesn’t work like that” and “the plan won’t survive the gap between Before and After on the first try” — also applying in force, on my view.)
(Tbc, I’m not trying to insinuate that everyone should’ve read all of the online resources already; they’re long. And I’m not trying to say y’all should agree; the online resources are geared more towards newcomers than to LWers. I’m not even saying that I’m getting especially close to your latest vision; if I had more hope in your neck of the woods I’d probably investigate harder and try to pass your ITT better. From my perspective, there are quite a lot of hopes and copes to cover, mostly from places that aren’t particularly Redwoodish in their starting assumptions. I am merely trying to evidence my attempts to reply to what I understand to be the counterarguments, subject to constraints of targeting this mostly towards newcomers.)
FWIW, I have read those parts of the online resources.
You can obviously summarize me however you like, but my favorite summary of my position is something like “A lot of things will have changed about the situation by the time that it’s possible to build ASI. It’s definitely not obvious that those changes mean that we’re okay. But I think that they are a mechanically important aspect of the situation to understand, and I think they substantially reduce AI takeover risk.”
Ty. Is this a summary of a more-concrete reason you have for hope? (Have you got alternative more-concrete summaries you’d prefer?)
“Maybe huge amounts of human-directed weak intelligent labor will be used to unlock a new AI paradigm that produces more comprehensible AIs that humans can actually understand, which would be a different and more-hopeful situation.”
(Separately: I acknowledge that if there’s one story for how the playing field might change for the better, then there might be a bunch more stories too, which would make “things are gonna change” an argument that supports the claim that the future will have a much better chance than we’d have if ChatGPT-6 was all it took.)
I would say my summary for hope is more like:
It seems pretty likely to be doable (with lots of human-directed weak AI labor and/or controlled stronger AI labor) to use iterative and prosaic methods within roughly the current paradigm to sufficiently align AIs which are slightly superhuman. In particular, AIs which are capable enough to be better than humans at safety work (while being much faster and having other AI advantages), but not much more capable than this. This also requires doing a good job eliciting capabilities and making the epistemics of these AIs reasonably good.
Doable doesn’t mean easy or going to happen by default.
If we succeeded in aligning these AIs and handing off to them, they would be in a decent position to continue the ongoing work of solving alignment (e.g. aligning a somewhat smarter successor which itself aligns its successor, and so on, or scalably solving alignment) and also in a decent position to buy more time for solving alignment.
I don’t think this is all of my hope, but if I felt much less optimistic about these pieces, that would substantially change my perspective.