The main problem here is "how to elicit simulacra of superhuman aligned intelligence while avoiding the Waluigi effect". We don't have aligned superintelligence in the training data, and any attempt to elicit superintelligence from an LLM could be fatal.
If alignment-by-default works for AGI, then we will have thousands of AGIs providing examples of aligned intelligence. This new, massive dataset of aligned behavior could then be used to train even more capable and robustly aligned models, each of which would add to the training data in turn, until we have data for aligned superintelligence.
If alignment-by-default doesn't work for AGI, then we will probably die before ASI arrives.