like, I agree that there are some safety lessons to be learned here, but it seems somewhat clear to me that people are working on WebGPT and InstructGPT primarily for capabilities reasons, not for existential-risk-from-AI reasons
This also seems like an odd statement—it seems reasonable to say “I think the net effect of InstructGPT is to boost capabilities” or even “If someone was motivated by x-risk it would be poor prioritisation/a mistake to work on InstructGPT”. But it feels like you’re assuming some deep insight into the intention behind the people working on it, and making a much stronger statement than “I think OpenAI’s alignment team is making bad prioritisation decisions”.
Like, reading the author list of InstructGPT, there are obviously a number of people on there who care a lot about safety, including, I believe, the first two authors—it seems pretty uncharitable and hostile to say that they were motivated by a desire to boost capabilities, even if you think that was a net result of their work.
(Note: My personal take is to be somewhat confused, but to speculate that InstructGPT was mildly good for the world? And that a lot of the goodness comes from field building, i.e. getting more people to invest in good-quality RLHF.)
Yeah, I agree that I am reasoning about people’s motivations here, which is iffy, and given the pushback I will be a bit more hesitant to do so. But in this case reasoning about people’s motivations is really important, because what I care about is what the people working at OpenAI will actually do when they have extremely powerful AI in their hands, and that will depend a lot on their motivations.
I am honestly a bit surprised to see that WebGPT was driven as much as it apparently was by people who I know reasonably well and who seem to be motivated primarily by safety concerns, since the case for it strikes me as so weak and the risk as fairly obviously high. I am still trying to process that and will probably make some kind of underlying update.
I do think overall I’ve had much better success at predicting the actions of the vast majority of people at OpenAI, including a lot of safety work, by thinking of them as being motivated by doing cool capability things, sometimes with a thin safety veneer on top, rather than as being motivated primarily by safety. For example, I currently think that OpenAI’s release strategy for its GPT models is much better explained by OpenAI wanting a moat around their language model product than by safety concerns. I spent many hours trying to puzzle over the reasons why they chose this release strategy, and ultimately concluded that the motivation was primarily financial/competitive-advantage related, and not related to safety (despite people at OpenAI claiming otherwise).
I also agree overall that trying to analyze people’s motivations is fraught and difficult, but I also feel pretty strongly that it’s now been many years of people trying to tell a story of OpenAI leadership being motivated by safety, with very little action to actually back that up (and a massive amount of harm in terms of capability gains), and I do want to be transparent that I no longer really believe the stated intentions of many people working there.