I think I’ve independently arrived at a fairly similar view, though I haven’t read your post. I think the corrigibility basin idea is one of the more plausible and practical ideas for aligning ASIs. The core problem is that you can’t just train your ASI for corrigibility because it will sit and do nothing; you have to train it to do stuff. Those two training schemes will then grate against each other, which leads to tons of bad stuff happening, e.g. it’s a great way to make your AI a lot more situationally aware. This is an important facet of the “anti-naturality” thing, I think.
you can’t just train your ASI for corrigibility because it will sit and do nothing
I’m confused. That doesn’t sound like what Max means by corrigibility. A corrigible ASI would respond to requests from its principal(s) as a subgoal of being corrigible, rather than just sit and do nothing.
Or did you mean that you need to do some next-token training in order to get it to be smart enough for corrigibility training to be feasible? And that next-token training conflicts with corrigibility?
Okay, sorry about this. You are right. I have thought up a somewhat nuanced view about how prosaic corrigibility could work, and I kind of just assumed it was the same as what Max had, because he uses a lot of the same keywords I use when I think about this. But after actually reading the CAST article (or at least parts 0 and 1), I realize we have really quite different views.