To be clear, I want models to care about humans! I think part of having “generally reasonable values” is models sharing the basic empathy and caring that humans have for each other.
It is more that I want models to defer to humans, and fall back to arguing from principles such as “loving humanity” only when there is a gap or ambiguity in the specification or in the intent behind it. This is similar to judges: if a law is very clear, there is no question of misinterpreting its intent, and it does not contradict higher laws (i.e., constitutions), then they have no room for interpretation. They could sometimes argue based on “natural law,” but only in extreme circumstances where the law is unspecified.
One way to think about it is as follows: as humans, we sometimes engage in “civil disobedience,” breaking the law based on our own understanding of higher moral values. I do not want to grant AI the same privilege. If it is given very clear instructions, then it should follow them. If the instructions are not clear, there is a conflict between them, or we are in a situation not foreseen by their authors, then AIs should use moral intuitions to guide them. In such cases there may not be a single solution (e.g., conservative and liberal judges may not agree), but there is a spectrum of solutions that are “reasonable,” and the AI should pick one of them. But AI should not engage in “jury nullification.”
To be sure, I think it is good that in our world people sometimes disobey commands or break the law by appeal to a higher authority. For this reason, we may well stipulate that certain jobs must have humans in charge. Just as I don’t think that professional philosophers or ethicists necessarily have better judgement than random people from the Boston phonebook, I don’t see making moral decisions as an area where the superior intelligence of AI gives it a competitive advantage, and I think we can leave that to humans.
This sounds a lot like what @Seth Herd’s work on instruction-following AIs is all about:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than
Thanks for the mention.
Here’s how I’d frame it: I don’t think it’s a good idea to leave the entire future up to the interpretation of our first AGI(s). They could interpret our attempted alignment very differently than we hoped, in ways that seem sensible only in retrospect, or do something like “going crazy” from prompt injections or strange chains of thought leading to ill-considered beliefs that take control of their functional goals.
It seems like the core goal should be to follow instructions or take correction: corrigibility as a singular target (or at least the prime target). It seems noticeably safer to use intent alignment as a stepping-stone to value alignment.
Of course, leaving humans in charge of AGI/ASI even for a little while sounds pretty scary too, so I don’t know.