I think I understand you now. Your question seems much simpler than I expected. You’re basically just asking “but what if we’ll want infinitely complicated / detailed values in the future?”
If people iteratively modified themselves, would their preferences become ever more exacting? If so, then even if it is true right now that the “variables humans care about can’t be arbitrarily complicated”, the variables humans care about could still define a desire to become a system capable of caring about arbitrarily complicated variables.
It’s OK if the principle won’t be true for humans in the future; it only needs to be true for the current values. Aligning AI to some of the current human concepts should be enough to define corrigibility and low impact, or to avoid Goodharting, i.e. to create a safe Task AGI. I’m not trying to dictate to anyone what they should care about.
Hmm… I appreciate the response. It makes me more curious to understand what you’re talking about.
At this point I think it would be quite reasonable for you to suggest that I actually read your article instead of speculating about what it says, lol, but if you want to say anything about the following points of confusion I wouldn’t say no : )
For context, my current view is that value alignment is the only safe way to build ASI. I’m less skeptical about corrigible task ASI than about prosaic scaling with RLHF, but I’m currently still quite skeptical in absolute terms. Roughly speaking: prosaic kills us; a task genie maybe kills us, maybe lets us make stupid wishes which harm us. I’m kinda not sure whether you are focusing on stuff that takes us from prosaic to task genie, or on stuff that helps with the task genie not killing us. I suspect you are not focused on the task genie letting us make stupid wishes, but I’d be open to hearing I’m wrong.
I also have an intuition that having preferences about future preferences is synonymous with having those preferences, but I suppose there are also ways in which they are obviously different, i.e. their uncompressed specification size. Are you suggesting that limiting the complexity of the preferences the AI works from to roughly the complexity of current encodings of human preferences (i.e. human brains) ensures those preferences aren’t among the set of preferences that are misaligned because they are too complicated (even though human preferences are synonymous with more complicated preferences)? I think I’m surely misunderstanding something, maybe the way you are applying the natural abstraction hypothesis, or possibly a bunch of things.
Could you reformulate the last paragraph as “I’m confused how your idea helps with alignment subproblem X”, “I think your idea might be inconsistent or have a failure mode because of Y”, or “I’m not sure how your idea could be used to define Z”?
Wrt the third paragraph: the post is about corrigible task ASI which could be instructed to protect humans from being killed/brainwashed/disempowered (and which won’t kill/brainwash/disempower people before it’s instructed not to). The post is not about value learning in the sense of “the AI learns more or less the entirety of human ethics and can build a utopia on its own”. I think developing my idea could help with such value learning, but I’m not sure I can easily back up this claim. Also, I don’t know how to apply my idea directly to neural networks.
I’ll try. I’m not sure how your idea could be used to define human values. I think your idea might have a failure mode around places where people are dissatisfied with their current understanding, i.e. situations where a human wants a more articulate model of the world than they have.
The post is about corrigible task ASI
Right. That makes sense. Sorry for asking a bunch of off-topic questions, then. I worry that task ASI could be dangerous even if it is corrigible, but ASI is obviously more dangerous when it isn’t corrigible, so I should probably develop my thinking about corrigibility.