I think the online resources touch on that in the “more on making AIs solve the problem” subsection here. With the main thrust being: I’m skeptical that you can stack lots of dumb labor into an alignment solution, skeptical that identifying issues will allow you to fix them, and skeptical that humans can tell when something is on the right track. (All of which is one branch of a larger disjunctive argument, with the two disjuncts mentioned above — “the world doesn’t work like that” and “the plan won’t survive the gap between Before and After on the first try” — also applying in force, on my view.)
(Tbc, I’m not trying to insinuate that everyone should’ve read all of the online resources already; they’re long. And I’m not trying to say y’all should agree; the online resources are geared more towards newcomers than to LWers. I’m not even saying that I’m getting especially close to your latest vision; if I had more hope in your neck of the woods I’d probably investigate harder and try to pass your ITT better. From my perspective, there are quite a lot of hopes and copes to cover, mostly from places that aren’t particularly Redwoodish in their starting assumptions. I am merely trying to show that I’ve attempted to reply to what I understand to be the counterarguments, subject to the constraint of targeting this mostly towards newcomers.)
FWIW, I have read those parts of the online resources.
You can obviously summarize me however you like, but my favorite summary of my position is something like “A lot of things will have changed about the situation by the time that it’s possible to build ASI. It’s definitely not obvious that those changes mean that we’re okay. But I think that they are a mechanically important aspect of the situation to understand, and I think they substantially reduce AI takeover risk.”
Ty. Is this a summary of a more-concrete reason you have for hope? (Have you got alternative more-concrete summaries you’d prefer?)
“Maybe huge amounts of human-directed weak intelligent labor will be used to unlock a new AI paradigm that produces more comprehensible AIs that humans can actually understand, which would be a different and more-hopeful situation.”
(Separately: I acknowledge that if there’s one story for how the playing field might change for the better, then there might be a bunch more stories too, which would make “things are gonna change” an argument that supports the claim that the future will have a much better chance than we’d have if ChatGPT-6 was all it took.)
I would say my summary for hope is more like:

It seems pretty likely to be doable (with lots of human-directed weak AI labor and/or controlled stronger AI labor) to use iterative and prosaic methods within roughly the current paradigm to sufficiently align AIs which are slightly superhuman. In particular, AIs which are capable enough to be better than humans at safety work (while being much faster and having other AI advantages), but not much more capable than this. This also requires doing a good job eliciting capabilities and making the epistemics of these AIs reasonably good.
Doable doesn’t mean easy or going to happen by default.
If we succeeded in aligning these AIs and handing off to them, they would be in a decent position to continue solving alignment (e.g., aligning a somewhat smarter successor which itself aligns its successor, and so on, or scalably solving alignment), and also in a decent position to buy more time for solving alignment.
I don’t think this is all of my hope, but if I felt much less optimistic about these pieces, that would substantially change my perspective.