The latest recruitment ad from Aiden McLaughlin says a lot about OpenAI’s internal views on model training:
My interpretation of OpenAI’s worldview, as implied by this, is:
1. Inner alignment is not really an issue. Training objectives (evals) relate to behaviour in a straightforward and predictable way.
2. Outer alignment kinda matters, but it’s not that hard. Deciding the parameters of desired behaviour is something that can be done without serious philosophical difficulties.
3. Designing the right evals is hard; you need lots of technical skill and high taste to make good enough evals to get the right behaviour.
4. Oversight is important; in fact, oversight is the primary method for ensuring that the AIs are doing what we want. Oversight is tractable and doable.
None of this dramatically conflicts with what I already thought OpenAI believed, but it’s interesting to get another angle on it.
It’s quite possible that Point 1 is predicated on technical alignment work being done in other parts of the company (though their superalignment team no longer exists) and it’s just not seen as the purview of the evals team. If so, it’s still very optimistic. If there isn’t such a team, then it’s suicidally optimistic.
For Point 2, I think the above ad does imply that the evals/RL team is handling all of the questions of “how should a model behave”, and that they’re mostly not looking at it from the perspective of moral philosophy a la Amanda Askell at Anthropic. If questions of how models behave are entirely being handled by people selected only for artistic talent + autistic talent, then I’m concerned these won’t be done well either.
Point 3 seems correct in that well-designed evals are hard to make and you need more than just technical skill. Nice, but it’s telling that they’re doing well on the issue that brings in immediate revenue, and badly on the issues that get us killed at some point in the future.
Point 4 is kinda contentious. Some very smart people take oversight very seriously, but it also seems kinda doomed as an agenda when it comes to ASI. Seems like OpenAI are thinking about at least one not-kill-everyoneism plan, but a marginally promising one at best. Still, if we somehow make miraculous progress on oversight, perhaps OpenAI will take up those plans.
Finally, I find the mention of “The Gravity of AGI” to be quite odd since I’ve never got the sense that Aiden feels the gravity of AGI particularly strongly. As an aside I think that “feeling the AGI” is like enlightenment, where everyone behind you on the journey is a naive fool and everyone ahead of you is a crazy doomsayer.
EDIT: a fifth implication: little to no capabilities generalization. Seems like they expect each individual capability to be produced by a specific high-quality eval, rather than for their models to generalize broadly to a wide range of tasks.
Re: 1, during my time at OpenAI I also strongly got the impression that inner alignment was way underinvested. The Alignment team’s agenda seemed basically about better values/behavior specification IMO, not making the model want those things on the inside (though this is now 7 months out of date). (Also, there are at least a few folks within OAI who I’m sure know and care about these issues.)
Link is here, if anyone else was wondering too.