As Vadim pointed out, at the moment we don’t even know what we mean by “aligned versions” of algorithms, so we wouldn’t know whether we’re succeeding or failing.
If we fail to make the intuition about aligned versions of algorithms more crisp than it currently is, then it’ll be pretty clear that we failed. It seems reasonable to be skeptical that we can make our intuitions about “aligned versions of algorithms” crisp and then go on to design competitive and provably aligned versions of all AI algorithms in common use. But it does seem like we will know if we succeed at this task, and even before then we’ll have indications of progress, such as success or failure at formalizing and solving scalable AI control in successively more complex toy environments. (It seems like I have intuitions about what would constitute progress that are hard to convey over text, so I would not be surprised if you aren’t convinced that it’s possible to measure progress.)
“Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.”
It seems like “value loading is very hard/costly” has to imply that the proposal in this comment thread is going to be very hard/costly, e.g. because one of Wei Dai’s objections to it proves fatal. But it seems like arguments of the form “human values are complex and hard to formalize” or “humans don’t know what we value” are insufficient to establish this; Wei Dai’s objections in the thread are mostly not about value learning. (Sorry if you aren’t arguing “value loading is hard because human values are complex and hard to formalize” and I’m misinterpreting you.)