Thanks for the commentary; it helps in understanding why some people are pessimistic about ASI and why others are optimistic.
However, I’m not convinced by the argument, quoted below, that an alignment technique will consistently produce an AI with human-compatible values:
> An alignment technique that works 99% of the time to produce an AI with human compatible values is very close to a full alignment solution[5]. If you use this technique once, gradient descent will not thereafter change its inductive biases to make your technique less effective. There’s no creative intelligence that’s plotting your demise.
When we have self-improving AI, architectures may change and grow more complex. How, then, do we know that the alignment of the stronger AIs is still sufficient? How can we tell a sycophant or a schemer from a saint?
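To make the worry concrete, here is a toy calculation (the independence assumption and the code are mine, not from your post): if each round of self-improvement independently preserves alignment with the quoted 99% probability, the chance that every successive generation stays aligned decays geometrically.

```python
# Toy model (my assumption): each self-improvement step independently
# preserves alignment with probability p, so the chance that every
# generation up to step n is still aligned is p**n.
p = 0.99  # the per-use success rate from the quoted passage

for n in (1, 10, 50, 100):
    print(f"after {n:3d} self-improvement steps: P(all aligned) = {p**n:.2f}")
```

After 100 generations the cumulative probability falls below 0.37, which is why a 99% per-use guarantee looks very different from a guarantee over an open-ended self-improvement process.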