i mostly agree that large sections of work at labs fall into traps somewhat like this. however, i have some key disagreements that make me more optimistic about it being possible to do (very) good work at a lab:
it is possible to simply disagree with the lab’s actions in general, rather than finding some galaxy-brained justification for why everything is well intentioned. if you’re so principled that you are incapable of even existing inside any system which is somewhat misaligned, you’re going to have a rough time finding anywhere in this world that makes you happy, other than an isolated cabin in the woods.
i guess empirically some people do struggle to disagree with the lab’s actions while at the lab. i agree those people should just not work at a lab. i think a prerequisite to doing good safety work at a lab is having a strong internal compass for what you believe is correct, and not being easily swayed by social pressure.
the openai exoduses are a real thing, but it’s useful to view them in the context of an industry where it’s pretty rare for someone to stay at the same place for many years, and a company which has had a higher-than-average turnover rate in general, not only for safety people.
i think you’re broadly right that safety stuff can’t cut too much into the capabilities stuff, but i contend that the main resource here is serial time rather than raw resources. this is important because it opens up some positive sum trades: capabilities people really hate it when their release dates get set back by e.g. safety reviews (which is unfortunate), but they mind a lot less if you’re taking 5% of compute (which is a huge amount of absolute compute, and which you should value a lot!), and they mind even less if you have an entire team doing good work somewhere out of their way (which you should also value a lot!).
i think it’s unproductive to think of the relation with labs as purely zero sum. there are areas where you can get some of what you value at low cost to them, and they can get some of what they want at low cost to you.
relatedly, imo the “good alignment work is good PR, which is bad” argument is also classic zero sum thinking. good technical alignment work is not the maximally efficient PR investment; that would be using AI to help cute puppies. or to cure cancer. or to help cure cancer in cute puppies. you obviously don’t want to work on that stuff. but if you did actually good alignment work, and the company got some good PR out of it, is that the end of the world? as an extreme limit case, suppose openai solved interpretability tomorrow (or, if you think interp is just capabilities, pick whatever field you like). this would probably be very net good, and you wouldn’t mind openai having a tiny sliver more goodwill in exchange.
you, hypothetical lab employee, have no obligation to say untrue things in public to improve your company’s PR. just say what you truly believe, within reason, and people will correctly model your beliefs.
maybe part of why this works for me is that i truly do respect capabilities people a lot, even if i disagree strongly with them about what is good for the world. i’m honestly quite similar to lab capabilities people in a bunch of ways, and i’m probably a lot more sympathetic than the average person on LW to the position that maybe i’m wrong about all of this alignment stuff and capabilities people are just being mostly reasonable (though note that this is a statement about whether i can emotionally imagine being persuaded given the right facts; i don’t allow myself to use that emotional imaginability as an excuse to work on capabilities, and in fact i have not).
> > this is important because it opens up some positive sum trades: capabilities people really hate it when their release dates get set back by e.g. safety reviews (which is unfortunate), but they mind a lot less if you’re taking 5% of compute (which is a huge amount of absolute compute, and you should value a lot!), and they mind even less if you have an entire team doing good work somewhere out of their way (which you should also value a lot!).
>
> If the safety people never actually slow down or materially dent the model release schedule (which in your model is what triggers conflict), what you are describing is not a sequence of positive sum trades between the capabilities and safety teams. Rather, it’s a model of how companies can devote relatively small amounts of resources to throw up a safety-coated PR shield around their core capabilities research effort. IRL we’ve also seen safety teams get dissolved after staking big claims on funding which could conceivably dent the release schedule (the 20% of compute for superalignment being a good example).
why is it not positive sum if you never slow down the model release schedule? positive sum doesn’t mean “nobody compromises on anything.” the entire point of a positive sum trade is that you find things which are compromises for the other party but ultimately not very costly for them, and very beneficial to you; and in exchange, you do something that you would rather not do, but that is not that costly for you and is beneficial for the other party. on net, both parties get something they value more than the thing they sacrificed. so the potential trade here (toy numbers in the sketch after this list) is:
capabilities teams sacrifice x% of their compute and $y of salary money. they grumble a bit, because on the margin another x% of compute could have made them very slightly faster, but at the end of the day it’s not make or break for them.
alignment teams hire lots of really good alignment researchers and do a huge amount of good work. alignment teams allow the lab to get some good PR. maybe they even let their alignment work have the side effect of making the models very slightly better, in ways that don’t cost much serial time (though i think this requires more care than the PR does).
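to make the shape of this trade concrete, here’s a minimal sketch with entirely made-up utility numbers (none of these figures come from any real lab; only the orderings matter):

```python
# toy payoff numbers for the trade above. all values are invented
# illustrations; only the ordering matters, not the magnitudes.

# status quo: no trade
capabilities_no_trade = 100  # full compute, full salary budget
safety_no_trade = 10         # little compute or headcount

# with the trade: capabilities gives up x% of compute and $y of salary
# (a small marginal cost, given diminishing returns to parallel compute),
# and gets some goodwill/PR from real safety work in return
capabilities_with_trade = 100 - 1 + 2  # -1 compute/salary, +2 goodwill
safety_with_trade = 10 + 40            # large gain: researchers + compute

# positive sum: both sides prefer the trade to the status quo,
# even though each side gave something up
assert capabilities_with_trade > capabilities_no_trade
assert safety_with_trade > safety_no_trade
```

the point of the toy numbers is just that “positive sum” is about each party’s net position, not about whether anyone compromised.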
of course, this is all assuming that you actually do good work with the resources, which is pretty hard. if you’re doing useless work, then it doesn’t matter whether you’re spending philanthropic money or lab money or whatever; you should stop doing bad work and do good work instead.
the superalignment situation is very unfortunate and reflects poorly on openai, but i think the entire episode was also net negative from a perfectly spherical openai’s perspective: the negative PR from breaking the commitment far outweighed any benefit from making it in the first place. a perfectly spherical rational openai should have either not made the commitment at all, made a smaller commitment in the beginning, or upheld it. in reality, the shift in position is of course easily explained by the board situation.
> > but i contend that the main resource here is serial time rather than raw resources.
>
> I’m very skeptical that this is currently the bottleneck for much of the safety work at OpenAI. Are you saying that safety / alignment / interp all have sufficient headcount and everything is bottlenecked by serial time?
no, i’m saying serial time is more of a bottleneck for capabilities than raw resources are. diverting 5% of parallel resources is much less costly for capabilities than slowing serial speed by 5%.
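to illustrate, here’s a minimal sketch under an assumed power-law model of returns to parallel compute (progress rate proportional to compute^alpha with alpha < 1; the exponent is a made-up illustration, not a claim about real training runs):

```python
# toy comparison of a 5% parallel-compute cut vs a 5% serial slowdown,
# under an assumed power-law returns model: progress rate ~ compute**alpha.
# alpha here is purely illustrative; the real exponent is unknown.

alpha = 0.3  # assumed diminishing returns to parallel compute

# diverting 5% of compute scales the progress rate by 0.95**alpha
parallel_hit = 1 - 0.95**alpha  # ~1.5% slower progress at alpha = 0.3
# slowing serial speed by 5% delays everything downstream by the full 5%
serial_hit = 0.05

print(f"5% less parallel compute -> ~{parallel_hit:.1%} slower progress")
print(f"5% serial slowdown       -> {serial_hit:.0%} slower progress")
```

under any alpha < 1, the parallel cut costs less than the serial slowdown, which is the asymmetry the trade relies on.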