Upvoted for an interesting direction of exploration, but I’m not sure I agree with (or understand, perhaps) the underlying assumption that “natural-feeling” is more likely to be safe or good. This seems a little different from the common naturalistic fallacy (what’s natural is always good, what’s artificial is always bad). It’s more a glossing over the underlying problem that we have no Safe Natural Intelligence—people are highly variable and many of them are terrifying and horrible.
The thing underlying the intuition is more something like: We have a method of feedback that humans understand and that works fairly well, and is adapted to the way values are stored in human brains. If we try to have humans give feedback in ways that are not adapted to that, I expect information to be lost. The fact that it “feels natural” is a proxy for “the method of feedback to machines is adapted to the way humans normally give feedback to other humans” without which I am at least concerned about information loss (not claiming it’s inevitable). I don’t inherently care about the “feeling” of naturalness.
Regarding no Safe Natural Intelligence: I agree that there is no such thing, but this is not really a strong argument against? This doesn’t make me somehow suddenly feel comfortable about “unnatural” (I need a better term) methods for humans to provide feedback to AI agents. The fact that there are bad people doesn’t negate the idea that the only source of information about what is good seems to be stored in brains and that we need to extract this information in a way that is adapted to how those brains normally express that information.
Maybe I should have called it “human-adapted methods of human feedback” or something.
Regarding no Safe Natural Intelligence: I agree that there is no such thing, but this is not really a strong argument against?
I think it’s a pretty strong argument. There are no humans I’d trust with the massively expanded capabilities that AI will bring, so I have to believe that the training methods for humans are insufficient.
We WANT divergence from “business as usual” human beliefs and actions, and one of the ways to get there is by different specifications and training mechanisms. The hard part is we don’t yet know how to specify precisely how we want it to differ.
I dunno, I’m not at all sure what “naturalness” is supposed to be doing here below the appearance level—how are the algorithms different?
I haven’t specified anything about the algorithms, though they may well have to be different. The point is that the format of the human feedback is different. Really this post is about the format in which humans provide feedback rather than about the structure of the AI systems (i.e. a difference in the method of generating the training signal rather than a difference in the learning algorithm).
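To make the distinction concrete, here is a minimal illustrative sketch (not from the post; all names like `PairwiseComparison` and `NaturalLanguageJudgment` are hypothetical): two different human-feedback formats feed the same downstream learner, so only the method of generating the training signal changes, not the learning algorithm.

```python
# Illustrative sketch: the learning algorithm consumes (output, score)
# pairs either way; only the format of the human feedback differs.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PairwiseComparison:
    """Machine-oriented format: the human picks the better of two outputs."""
    preferred: str
    rejected: str


@dataclass
class NaturalLanguageJudgment:
    """Human-adapted format: free-form commentary, as between humans."""
    output: str
    commentary: str  # e.g. "helpful and kind", "too blunt"


def signal_from_comparisons(data: List[PairwiseComparison]) -> List[Tuple[str, float]]:
    """Turn comparisons into (output, score) training pairs."""
    signal: List[Tuple[str, float]] = []
    for c in data:
        signal.append((c.preferred, 1.0))
        signal.append((c.rejected, 0.0))
    return signal


def signal_from_commentary(data: List[NaturalLanguageJudgment]) -> List[Tuple[str, float]]:
    """Toy stand-in for extracting a score from free-form feedback.

    A real system would need a model to interpret the commentary;
    the keyword match below is only a placeholder for that step.
    """
    positive_words = {"helpful", "kind", "good"}
    signal: List[Tuple[str, float]] = []
    for j in data:
        score = 1.0 if positive_words & set(j.commentary.lower().split()) else 0.0
        signal.append((j.output, score))
    return signal


def train_step(signal: List[Tuple[str, float]]) -> float:
    """The learning algorithm is unchanged in both cases: it only ever
    sees (output, score) pairs. Here we merely average the scores."""
    return sum(s for _, s in signal) / len(signal)
```

Both front ends funnel into the same `train_step`; the concern in the thread is about how much value information survives the first step, not about what happens in the second.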