What do you have in mind with a “human flourishing” motivation?
An AI that sees human language will certainly learn the human concept “human flourishing”, since after all it needs to understand what humans mean when they utter that specific pair of words. So then you can go into the AI and put super-positive valence on (whatever neural activations are associated with “human flourishing”). And bam, now the AI thinks that the concept “human flourishing” is really great, and if we’re lucky / skillful then the AI will try to actualize that concept in the world. There are a lot of unsolved problems and things that could go wrong with that (further discussion here), but I think something like that is not entirely implausible as a long-term alignment research vision.
I guess the anthropomorphic analog would be: try to think of who is the coolest / highest-status-to-you / biggest-halo-effect person in your world. (Real or fictional.) Now imagine that this person says to you: “You know what’s friggin awesome? The moon. I just love it. The moon is the best.” You stand there with your mouth agape. “Wow, huh, the moon, yeah, I never thought about it that way.” (But 100× more so. Maybe you’re on some psychedelic at the time, or whatever.)
How would that event change your motivations? Well, you’re probably going to spend a lot more time gazing at the moon when it’s in the sky. You’re probably going to be much more enthusiastic about anything associated with the moon. If there are moon trading cards, maybe you would collect them. If NASA is taking volunteers to train as astronauts for a lunar exploration mission, maybe you would be first in line. If a supervillain is planning to blow up the moon, you’ll probably be extremely opposed to that.
Now by the same token, imagine we do that kind of thing for an extremely powerful AGI and the concept of “human flourishing”. What actions will this AGI then take? Umm, I don’t know really. It seems very hard to predict. But it seems to me that there would be a decent chance that its actions would be good, or even great, as judged by me.
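To make the valence-editing idea a bit more concrete, here is a toy sketch in Python. Everything in it is a made-up assumption for illustration: the concept names, the activation numbers, the `valence` and `pick_plan` helpers, and the premise that a plan gets scored as a valence-weighted sum of concept activations. A real AGI would not come with tidy labeled concepts like this, and finding the right activations is one of the unsolved problems mentioned above.

```python
# Toy illustration only: concept names, activations, and numbers are hypothetical.

def valence(activations, concept_valence):
    """Score a plan as the valence-weighted sum of its concept activations."""
    return sum(concept_valence.get(c, 0.0) * a for c, a in activations.items())


def pick_plan(plans, concept_valence):
    """Return the name of the plan whose concept activations score highest."""
    return max(plans, key=lambda name: valence(plans[name], concept_valence))


# Valence attached to each concept before the intervention (made-up values).
concept_valence = {"human_flourishing": 0.1, "paperclips": 0.5}

# Each candidate plan is described by how strongly it activates each concept.
plans = {
    "build_paperclip_factory": {"paperclips": 0.9, "human_flourishing": 0.1},
    "fund_public_health": {"paperclips": 0.0, "human_flourishing": 0.8},
}

before = pick_plan(plans, concept_valence)

# The "intervention": overwrite the valence attached to a single concept.
edited_valence = dict(concept_valence, human_flourishing=10.0)
after = pick_plan(plans, edited_valence)

print(before, "->", after)
```

In this toy setup the chosen plan flips from the paperclip factory to the public-health plan once the valence on "human_flourishing" is overwritten. It says nothing about whether a real system would generalize that way.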
I read your linked comment as an argument for why social instincts are probably not going to contribute to capabilities—but I think that doesn’t establish the opposite direction of “might capabilities be necessary for social instincts” or “might capabilities research contribute to social instincts”?
Sorry, that’s literally true, but they’re closely related. If answering the question “What reward function leads to human-like social instincts?” is unhelpful for capabilities, as I claim it is, then it implies both (1) my publishing such a reward function would not speed capabilities research, and (2) current & future capabilities researchers will probably not try to answer that question themselves, let alone succeed. The comment I linked was about (1), and this conversation is about (2).
If my model above is right, that there’s a relatively natural representation of PF that will emerge with any AI systems that are trained to predict and try to fulfill human preferences, then that kind of a representation should emerge from capabilities researchers trying to train AIs to better fulfill our preferences.
Sure, but “the representation is somewhere inside this giant neural net” doesn’t make it obvious what reward function we need, right? If you think LLMs are a good model for future AGIs (as most people around here do, although I don’t), then I figure those representations that you mention are already probably present in GPT-3, almost definitely to a much larger extent than they’re present in human toddlers. For my part, I expect AGI to be more like model-based RL, and I have specific thoughts about how that would work, but those thoughts don’t seem to be helping me figure out what the reward function should be. If I had a trained model to work with, I don’t think I would find that helpful either. With future interpretability advances maybe I would say “OK cool, here’s PF, I see it inside the model, but man, I still don’t know what the reward function should be.” Unless of course I use the very-different-from-biology direct interpretability approach (analogous to the “human flourishing” thing I mentioned above).
Update: writing this comment made me realize that the first part ought to be a self-contained post; see Plan for mediocre alignment of brain-like [model-based RL] AGI. :)