"generalizing human values to superintelligent level"
Trying to learn human values as an explicit concept is already alarming. At least right now, a breakdown of robustness is also a breakdown of capability, so a system that loses alignment tends to lose competence along with it. But if there are multiple subsystems, or if the training data is mostly generated by the system itself, then capability might survive while the other subsystems fail, resulting in a demonstration of the orthogonality thesis.
Even more generally, many alignment proposals are more worrying than some by-default future GPT-n, provided it is likewise not fine-tuned too heavily.