I guess these days the safety argument has shifted to “inner optimizers”, which I think means “OK fine, we can probably specify human values well enough that LLMs understand us. But what if the system learns some weird approximation of those values—or, worse, conspires to fool us while secretly pursuing other goals?” I don’t understand this well enough to have confident takes on it, but it feels like... a pretty conjunctive worry: a possibility worth guarding against, but not a knockdown argument.