This is a great explanation. I basically agree, and this is exactly why I expect alignment-by-default to most likely fail even conditional on the natural abstractions hypothesis holding up.
Also, I have another strange idea that might increase the probability of this working.
If you could temporarily remove proxies based on what people say, then this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of “true human values”?
I don’t think it’s likely to work, but thought I’d share anyway.
Thanks!
Is this why you put the probability as “10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values”? Or have you updated your probabilities since writing this post?
Yup, this is basically where that probability came from. It still feels about right.