What Leogao means is that we should improve our ability to verify alignment ALONG with our ability to verify capabilities. The latter is easy, at least until the model comes up with a galaxy-brained scheme that lets it leverage its own failures to gain influence over the world. Verifying alignment or misalignment is fairly straightforward with CoT-based AIs (especially if one applies paraphrasers to prevent steganography), but not with neuralese AIs. The WORST aspect is that we are likely to arrive at a super-capable architecture AT THE COST of our ability to understand the AI's thoughts.
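For concreteness, here is a minimal sketch of the paraphraser defense mentioned above. All names here are hypothetical placeholders, not any lab's actual setup: the point is just that each CoT step gets rewritten by an independent, trusted model before it is logged or fed back, so any signal hidden in the exact wording does not survive the round trip, while the semantic content a monitor needs stays readable.

```python
# Sketch of a CoT paraphraser defense (all helper names are hypothetical).
# Idea: re-express each chain-of-thought step with an independent model so that
# token-level steganographic channels are scrambled while the semantics needed
# to verify alignment are preserved.

from typing import Callable, List

Paraphraser = Callable[[str], str]  # any trusted text-rewriting model


def monitored_generation(
    generate_step: Callable[[List[str]], str],  # the untrusted model's next-step function
    paraphrase: Paraphraser,
    max_steps: int = 10,
) -> List[str]:
    """Run a CoT loop where the model only ever sees paraphrased versions of its
    own previous steps, so information hidden in exact wording cannot round-trip."""
    visible_history: List[str] = []
    for _ in range(max_steps):
        raw_step = generate_step(visible_history)
        clean_step = paraphrase(raw_step)   # scramble any covert token-level channel
        visible_history.append(clean_step)  # only the paraphrase is persisted/logged
        if "FINAL ANSWER" in clean_step.upper():
            break
    return visible_history


if __name__ == "__main__":
    # Toy stand-ins: a "model" that reasons for two steps, a "paraphraser" that normalizes text.
    def toy_model(history: List[str]) -> str:
        n = len(history)
        return f"step {n}: keep reasoning" if n < 2 else "step 2: FINAL ANSWER 42"

    def toy_paraphrase(text: str) -> str:
        return " ".join(text.lower().split())  # placeholder for a real rewriting model

    print(monitored_generation(toy_model, toy_paraphrase))
```

The key design choice is that only the paraphrased steps are persisted and fed back, so the defense degrades gracefully: even if the paraphraser misses some content, the attacker's covert channel has to survive an independent rewrite on every step.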
What I don’t understand is how alignment involves high-level philosophy, or how a bad high-level philosophy causes astronomical waste. Astronomical waste is most likely to be caused by humanity or the AIs making the wrong decision on whether to fill the planets in the lightcone with human colonies instead of the lifeforms that could have grown on those planets, since that mechanism is already known to be possible.