This is a great explanation. I basically agree, and this is exactly why I expect alignment-by-default to most likely fail even conditional on the natural abstractions hypothesis holding up.
Also, I have another strange idea that might increase the probability of this working.
If you could temporarily remove proxies based on what people say, then this would seem to greatly increase the chance of it hitting the actual embedded representation of human values. Maybe identifying these proxies is easier than identifying the representation of “true human values”?
I don’t think it’s likely to work, but thought I’d share anyway.
Thanks!
Is this why you put the probability as “10-20% chance of alignment by this path, assuming that the unsupervised system does end up with a simple embedding of human values”? Or have you updated your probabilities since writing this post?
Yup, this is basically where that probability came from. It still feels about right.