Edoardo Pona comments on Some background for reasoning about dual-use alignment research

Edoardo Pona 29 Jun 2023 18:54 UTC
1 point
RLHF is already a counterexample. It’s had an impact because of alignment’s inherent dual use—is that cheating? No: RLHF is economically valuable, but Paul still put work into it for alignment before profit-seeking researchers got to it. It could have been otherwise.
I think this works as a counterexample in the case of Paul, because we assume that he is not part of the ‘profit-seeking’ group (regardless of whether that’s true, which depends on what ‘profit-seeking’ means). However, as you pointed out earlier, alignment itself is crucial for the ‘profit-seeking’ case.
Let’s pretend for a minute that alignment and capabilities are two separate things. As you point out, a capable but misaligned system (misaligned with respect to its creators’ goals) is not profit-generating. This means alignment as a whole is ‘profitable’, regardless of whether an individual researcher is motivated by profit or not. The converse can be said about capability: it is not necessarily the case that a capability researcher is motivated by profit (and if they are, maybe they should focus on alignment at some point instead).