I am a fan of Yudkowsky and it was nice hearing him on Ezra Klein's show, but I have to say that, for my part, the arguments didn't feel very tight in this one. Less so than in IABED (which I thought was good, not great).
Ezra seems to contend that we have evidence we can at least roughly align current systems to basically what we usually want, most of the time. I think this is reasonable. He contends that that level of “mostly works”, plus the opportunity to gradually give feedback and iterate on current systems, seems like it'll get us pretty far. That also seems reasonable to me.
As I understand it, Yudkowsky probably sees LLMs as vaguely anthropomorphic at best, but not meaningfully aligned in a way that would be safe/okay if current systems were more “coherent” and powerful. Not even close. I think he contended that if you just gave loads of power to ~current LLMs, they would optimize for something considerably different from the “true moral law”. Because of the “fragility of value”, he also believes that most kinds of pseudo-alignment are likely not worthwhile. Honestly, that part felt undersubstantiated in a “why should I trust that this guy knows the personality of GPT-9” sort of way; I mean, Claude seems reasonably nice, right? And also, of course, there's the “you can't retrain a powerful superintelligence” problem / the stop button problem / the anti-naturalness of corrigible agency, which undercut a lot of Ezra's pitch but which they didn't really get into.
So yeah, I gotta say, it was hardly a slam-dunk case / discussion for a high p(doom | superintelligence).