One of my old blog posts I never wrote (I did not even list it in a “posts I will never write” document) is one about how corrigibility is anti-correlated with goal security.
Something like: if you build an AI that doesn’t resist someone trying to change its goals, it will also not try to stop bad actors from changing its goals. (I don’t think this particular worry applies to Paul’s version of corrigibility, but this blog post idea was from before I learned about his definition.)