You mean on more general algorithms being good vs. bad?
Yes.
I haven’t thought about this enough to have a very mature opinion. On one hand, being more general means you’re liable to Goodhart more (i.e., with enough deeply general processing power, you understand that manipulating the market to start World War 3 will make your stock portfolio grow, so you act misaligned). On the other hand, being less general means that AIs are more liable to “partially memorize” how to act aligned in familiar situations, and go off the rails when sufficiently out-of-distribution situations are encountered. I think this is related to the question of how general humans are, and of how stable human values are to becoming much more or much less general.
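(A minimal toy sketch to make the Goodhart worry concrete; the strategy names and payoff numbers below are invented purely for illustration, not taken from anywhere. The point is just that the same proxy-maximizing rule looks fine over a narrow strategy space and goes badly wrong once the space is general enough to contain the extreme option.)

```python
# Toy example (made-up payoffs): an agent scored on a proxy (portfolio growth)
# that comes apart from what we actually value at the extremes.

ACTIONS = {
    # strategy: (proxy reward = portfolio growth, what we actually value)
    "index_fund":           (0.05,  0.05),
    "clever_arbitrage":     (0.15,  0.15),
    "market_manipulation":  (0.40, -1.00),
    "engineer_world_war_3": (3.00, -1e9),
}

def best_by_proxy(available):
    # The agent simply maximizes the proxy over whatever strategies it can conceive of.
    return max(available, key=lambda a: ACTIONS[a][0])

narrow_agent = ["index_fund", "clever_arbitrage"]   # limited strategy space
general_agent = list(ACTIONS)                       # general enough to find the extreme option

for name, available in [("narrow", narrow_agent), ("general", general_agent)]:
    action = best_by_proxy(available)
    proxy, true_value = ACTIONS[action]
    print(f"{name:>7} agent picks {action!r}: proxy={proxy}, true value={true_value}")
```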
I guess I’m mostly thinking about the regime where AIs are more capable and general than humans.
It seems at first glance that the latter failure mode is more of a capability failure, something one would expect to go away as AI truly surpasses humans. It doesn’t seem core to the alignment problem to me.
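(For concreteness, a toy sketch of the “partial memorization” failure mode under discussion, with curve fitting standing in for learned behaviour; the setup is invented for illustration. A high-capacity model matches the desired behaviour on the training range and diverges wildly outside it.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "acting aligned in familiar situations": fit a high-degree
# polynomial to samples of a simple target rule, then query it out of range.
x_train = rng.uniform(-1.0, 1.0, 20)
y_train = np.sin(3 * x_train)                    # the behaviour we want everywhere
coeffs = np.polyfit(x_train, y_train, deg=9)     # enough capacity to match the samples closely

for x in [0.5, 2.0, 5.0]:                        # in-distribution, then increasingly far out
    print(f"x={x:4.1f}  wanted={np.sin(3 * x):7.2f}  model={np.polyval(coeffs, x):14.2f}")
```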
Maybe a reductive summary is “general is good if outer alignment is easy but inner alignment is hard, but bad in the opposite case”
Isn’t it the other way around?
If inner alignment is hard then general is bad, because applying less selection pressure (i.e. more generality, more simplicity prior) means more daemons/gremlins.