Epistemic Status: Low. Very likely wrong but would like to understand why.
It seems easier to intent-align a human-level or slightly-above-human-level AI (HLAI) than a massively smarter-than-human AI.
Some new research options become available to us once we have aligned HLAI, including:
- The HLAI might be able to directly help us do alignment research and solve the general alignment problem.
- We could run experiments on the HLAI and get experimental evidence much closer to the domain we are actually trying to solve.
- We could use the HLAI to start a training procedure, a la IDA.
These schemes seem fragile, because 1) if any of the HLAIs are not aligned, we lose, and 2) if the process of training up to superintelligence fails, whether due to some unknown unknown, the HLAI being misaligned, or any of the known failure modes, we lose.
However, 1) seems like a much easier problem than aligning an AI of arbitrary intelligence. Even though something could likely go wrong while aligning an HLAI, it also seems likely that something goes wrong if we try to align an AI of arbitrary intelligence. (This seems related to security mindset… in the best-case world we do just solve the general case of alignment, but that seems hard.)
For 2), the process of training up to superintelligence seems like one where an HLAI would help more than it hurts. If the HLAI is actually intent-aligned, this seems like having a fully uploaded alignment researcher, which feels less like getting Godzilla to fight for us and more like getting a Jaeger to protect Tokyo.