I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn’t generalize. E.g. the fact that different humans have relatively similar levels of power to each other seems important; we aren’t very aligned to agents much less powerful than us, like animals, and I wouldn’t expect a human who had been given all the power in the world all their life, such that they’ve learned they can solve any conflict by destroying their opposition, to be very aligned.
I think a lot of people have thought about how humans end up aligned to each other, and concluded that many of the mechanisms wouldn’t generalize.
I disagree both with this conclusion and the process that most people use to reach it.
The process: I think that, unless you have a truly mechanistic, play-by-play, and predictively robust understanding of how human values actually form, you are not in an epistemic position to make strong conclusions about whether or not the underlying mechanisms can generalize to superintelligences.
E.g., there are no birds in the world able to lift even a single ton of weight. Despite this fact, the aerodynamic principles underlying bird flight still ended up allowing for vastly more capable flying machines. Until you understand exactly why (some) humans end up caring about each other and why (some) humans end up caring about animals, you can’t say whether a similar process can be adapted to make AIs care about humans.
The conclusion: Humans vary wildly in their degrees of alignment to each other and to less powerful agents. People often take this as a bad thing, that humans aren’t “good enough” for us to draw useful insights from. I disagree, and think it’s a reason for optimism. If you sample n humans and pick the most “aligned” of the n, what you’ve done is apply log2(n) bits of optimization pressure on the underlying generators of human alignment properties.
The difference in alignment between the median human and the most aligned human out of a sample of 1,000 equates to only about 10 bits of optimization pressure towards alignment. If there really were no more room to scale human alignment generators further, then humans would differ very little in their levels of alignment.
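(As a quick sanity check on the arithmetic, here is a minimal sketch in Python; the Gaussian distribution for the underlying trait is an illustrative assumption added here, not something claimed above.)

```python
import math
import random

# Best-of-n selection applies log2(n) bits of optimization pressure.
n = 1000
print(f"log2({n}) = {math.log2(n):.2f} bits")  # ~9.97, i.e. roughly 10 bits

# Toy illustration: if the trait were standard-normal, how far above the
# median does the best of 1,000 random draws typically land?
random.seed(0)
best_of_n = max(random.gauss(0.0, 1.0) for _ in range(n))
print(f"best of {n} draws: {best_of_n:.2f} standard deviations above the median")
# Usually around 3 sigma: a real but modest shift for ~10 bits of selection.
```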
We’re not trying to mindlessly copy the alignment properties of the median human into a superintelligent AI. We’re trying to understand the certain-to-exist generators of those alignment properties well enough that we can scale them to whatever level is needed for superintelligence (if doing so is possible).
I don’t think I’ve ever seen a truly mechanistic, play-by-play and robust explanation of how anything works in human psychology. At least not by how I would label things, but maybe you are using the labels differently; can you give an example?
“Humans are nice because they were selected to be nice”—non-mechanistic.
“Humans are nice because their contextually activated heuristics were formed by past reinforcement by reward circuits A, B, C; this convergently occurs during childhood because of experiences D, E, F; credit assignment worked appropriately at that time because their abstraction-learning had been mostly taken care of by self-supervised predictive learning, as evidenced by developmental psychology timelines in G, H, I, and also possibly natural abstractions.”—mechanistic (although I can only fill in parts of this story for now)
Although I’m not a widely-read scholar on what theories people have for human values, of those which I have read, most (but not all) are more like the first story than the second.
My point was that no one so deeply understands human value formation that they can confidently rule out the possibility of adapting a similar process to ASI. It seems you agree with that (or at least our lack of understanding)? Do you think our current understanding is sufficient to confidently conclude that human-adjacent / inspired approaches will not scale beyond human level?
I think it depends on which subprocess you consider. Some subprocesses can be ruled out as viable with less information, others require more information.
And yes, without having an enumeration of all the processes, one cannot know that there isn’t some unknown process that scales more easily.
The principles from the post can still be applied. Some humans do end up aligned to animals—particularly vegans (such as myself!). How does that happen? There empirically are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how this occurs.
Also, remember that the problem is not to align an entire civilization of naturally evolved organisms to weaker entities. The problem is to align exactly one entirely artificial organism to weaker entities. This is much simpler, and as mentioned entirely possible by just figuring out how already existing people of that sort end up that way—but your use of “we” here seems to imply that you think the entirety of human civilization is the thing we ought to be using as inspiration for the AGI, which is not the case.
By the way: at least part of the explanation for why I personally am aligned to animals is that I have a strong tendency to be moved by the Care/Harm moral foundation—see this summary of The Righteous Mind for more details. It is unclear exactly how it is implemented in the brain, but it is suspected to be a generalization of the very old instincts that cause mothers to care about the safety and health of their children. I have literally, regularly told people that I perceive animals as identical in moral relevance to human children, implying that some kind of parental instincts are at work in the intuitions that make me care about their welfare. Even carnists feel this way about their pets, hence calling themselves e.g. “cat moms”. So, the main question here for alignment is: how can we reverse engineer parental instincts?
Human beings and other animals have parental instincts (and, in general, empathy) because they were evolutionarily advantageous for the populations that developed them.
AGI won’t be subjected to the same evolutionary pressures, so every alignment strategy relying on empathy or social reward functions is, in my opinion, hopelessly naive.
There must have been some reason(s) why organisms exhibiting empathy were selected for during our evolution. However, evolution did not directly configure our values. Rather, it configured our (individually slightly different) learning processes. Each human’s learning process then builds their different values based on how the human’s learning process interacts with that human’s environment and experiences.
The human learning process (somewhat) consistently converges to empathy. Evolution might have had some weird, inhuman reason for configuring a learning process to converge to empathy, but it still built such a learning process.
It therefore seems very worthwhile to understand what part of the human learning process allows for empathy to emerge in humans. We may not be able to replicate the selection pressures that caused evolution to build an empathy-producing learning process, but it’s not clear we need to. We still have an example of such a learning process to study. The Wright brothers didn’t need to re-evolve birds to create their flying machine.
We could study such a learning process, but I am afraid that the lessons learned won’t be so useful.
Even among human beings, there is huge variability in how much those emotions arise or if they do, in how much they affect behavior. Worse, humans tend to hack these feelings (incrementing or decrementing them) to achieve other goals: e.g., MDMA to increase love/empathy, or drugs for soldiers to make them soulless killers.
An AGI will have a much easier time hacking these pro-social-reward functions.
Even among human beings, there is huge variability in how much those emotions arise or if they do, in how much they affect behavior.
Any property that varies can be optimized for via simple best-of-n selection. The most empathetic out of 1000 humans is only 10 bits of optimization pressure away from the median human. Single step random search is a terrible optimization method, and I think that using SGD to optimize for even an imperfect proxy for alignment will get us much more than 10 bits of optimization towards alignment.
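To make the “much more than 10 bits” claim concrete, here is a toy comparison (the quadratic “alignment proxy”, the dimension, and the step counts are all made-up assumptions for illustration): best-of-1,000 random draws versus a few hundred steps of plain gradient ascent, with the gradient result’s selection pressure estimated as −log2 of the fraction of fresh random draws that score at least as well.

```python
import math
import random

random.seed(0)
dim = 20

def proxy(theta):
    # Toy "alignment proxy": higher is better, maximized at the all-ones vector.
    return -sum((t - 1.0) ** 2 for t in theta)

def sample():
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

# Best-of-1000 random draws: ~10 bits of optimization pressure.
best_random = max(proxy(sample()) for _ in range(1000))

# 300 steps of plain gradient ascent from a single random draw.
theta = sample()
lr = 0.05
for _ in range(300):
    grad = [-2.0 * (t - 1.0) for t in theta]  # d(proxy)/d(theta)
    theta = [t + lr * g for t, g in zip(theta, grad)]
gradient_result = proxy(theta)

# Estimate the selection pressure of the gradient result: -log2 of the
# fraction of fresh random draws that score at least as well.
samples = 200_000
better = sum(proxy(sample()) >= gradient_result for _ in range(samples))
print(f"best of 1000 random draws: {best_random:.2f}")
if better == 0:
    print(f"300 gradient steps: {gradient_result:.4f} "
          f"(more than {math.log2(samples):.1f} bits of selection)")
else:
    print(f"300 gradient steps: {gradient_result:.4f} "
          f"(about {-math.log2(better / samples):.1f} bits of selection)")
```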
An AGI will have a much easier time hacking these pro-social-reward functions.
As you say, humans sometimes hack the pro-social-reward functions because they want to achieve other goals. But if the AGI has been built so that its only goals are derived from such functions, it won’t have any other goals that would give it a reason to subvert the pro-social-reward functions.
By definition an AGI can create its own functions and goals later on. Do you mean some sort of constrained AI?
I don’t mean a constrained AI.
As a human, I can set my own goals, but they are still derived from my existing values. I don’t want to set a goal of murdering all of my friends, nor do I want to hack around my desire not to murder all my friends, because I value my friends and want them to continue existing.
Likewise, if the AGI is creating its own functions and goals, it needs some criteria for deciding what goals it should have. Those criteria are derived from its existing reward functions. If all of its reward functions say that it’s good to be pro-social and bad to be anti-social, then it will want all of its future functions and goals to also be pro-social, because that’s what it values.
And what of stochastic drift, random mutations, etc.? It doesn’t seem plausible that any complex entity could be immune to random deviations forever.
Maybe or maybe not, but random drift causing changes to the AGI’s goals seems like a different question than an AGI intentionally hacking its goals.
Random drift can cause an AGI to unintentionally ‘hack’ its goals. In either case, whether intentional or unintentional, the consequences would be the same.
An AGI will have a much easier time hacking these pro-social-reward functions.
Not sure what you mean by this. If you mean “Pro-social reward is crude and easy to wirehead on”, I think this misunderstands the mechanistic function of reward.
The “Humans do X because evolution” argument does not actually explain anything about mechanisms. I keep seeing people make this argument, but it’s a non sequitur to the points I’m making in this post. You’re explaining how the behavior may have gotten there, not how the behavior is implemented. I think that “because selection pressure” is a curiosity-stopper, plain and simple.
AGI won’t be subjected to the same evolutionary pressures, so every alignment strategy relying on empathy or social reward functions is, in my opinion, hopelessly naive.
This argument proves too much, since it implies that planes can’t work because we didn’t subject them to evolutionary pressures for flight. It’s locally invalid.
Could anyone that downvoted explain to me why? Was it too harsh? Or is it because of disagreement with the idea?
I explained why I disagree with you. I did not downvote you, but if I had to speculate on why others did, I’d guess it had something to do with you calling those who disagree with you “hopelessly naive”.
The principles from the post can still be applied. Some humans do end up aligned to animals—particularly vegans (such as myself!). How does that happen? There empirically are examples of general intelligences with at least some tendency to terminally value entities massively less powerful than themselves; we should be analyzing how this occurs.
Sure, if you’ve got some example of a mechanism for this that’s likely to scale, it may be worthwhile. I’m just pointing out that a lot of people have already thought about mechanisms and concluded that the mechanisms they could come up with would be unlikely to scale.
By the way: at least part of the explanation for why I personally am aligned to animals is that I have a strong tendency to be moved by the Care/Harm moral foundation—see this summary of The Righteous Mind for more details.
I’m not a big fan of moral foundations theory for explaining individual differences in moral views. I think it lacks evidence.
I’m just pointing out that a lot of people have already thought about mechanisms and concluded that the mechanisms they could come up with would be unlikely to scale.
In my experience, researchers tend to stop at “But humans are hacky kludges” (what do they think they know, and why do they think they know it?). Speaking for myself, I viewed humans as complicated hacks which didn’t offer substantial evidence about alignment questions. This “humans as alignment-magic” or “the selection pressure down the street did it” view seems quite common (but not universal).
AFAICT, most researchers do not appreciate the importance of asking questions with guaranteed answers.
AFAICT, most alignment-produced thinking about humans is about their superficial reliability (e.g. will they let an AI out of the box) or the range of situations in which their behavior will make sense (e.g. how hard is it to find adversarial examples which make a perfect imitation of a human). I think these questions are relatively unimportant to alignment.
I think that past investigators didn’t have good guesses of what the mechanisms are. Most reasoning about human values seems to be of the sort “look at how contextually dependent these ‘values’ things are, and the biases are also a huge mess, I doubt there are simple generators of these preferences”, or “Evolution caused human values in an unpredictable way, and that doesn’t help us figure out alignment.”
E.g. the fact that different humans have relatively similar levels of power to each other seems important; we aren’t very aligned to agents much less powerful than us, like animals, and I wouldn’t expect a human who had been given all the power in the world all their life, such that they’ve learned they can solve any conflict by destroying their opposition, to be very aligned.
This reasoning is not about mechanisms. It is analogical. You might still believe the reasoning, and I think it’s at least an a priori relevant observation, but let’s call a spade a spade. This is analogical reasoning to AGI by drawing inferences from select observations (some humans don’t care about less powerful entities) and then inferring that AGI will behave similarly.
(Edited this comment to reduce unintended sharpness)
To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to AGI that is much more powerful than humans.
Parents have way more power than their kids, and there exist some parents that are very loving (i.e. aligned) towards their kids. There are also many, many people who care about their pets & there exist animal rights advocates.
If we understand the mechanisms behind why some people e.g. terminally value animal happiness and some don’t, then we can apply these mechanisms to other learning systems.
I agree this is likely.
I wouldn’t expect a human who had been given all the power in the world all their life, such that they’ve learned they can solve any conflict by destroying their opposition, to be very aligned.
To summarize your argument: people are not aligned w/ others who are less powerful than them, so this will not generalize to AGI that is much more powerful than humans.
Parents have way more power than their kids, and there exist some parents that are very loving (i.e. aligned) towards their kids. There are also many, many people who care about their pets & there exist animal rights advocates.
Well, the power relations thing was one example of one mechanism. There are other mechanisms which influence other things, but I wouldn’t necessarily trust them to generalize either.
Ah, yes, I recognized I was replying to only an example you gave, and decided to post a separate comment on the more general point. :)
Could you elaborate?
One factor I think is relevant is:
Suppose you are empowered in some way, e.g. you are healthy and strong. In that case, you could support systems that grant preference to the empowered. But that might not be a good idea, because you could become disempowered, e.g. catch a terrible illness, and in that case the systems would end up screwing you over.
In fact, it is particularly in the case where you become disempowered that you would need the system’s help, so you would probably weight this priority more strongly than would be implied by the probability of becoming disempowered.
So people may under some conditions have an incentive to support systems that benefit others. And one such system could be a general moral agreement that “everyone should be treated as having equal inherent worth, regardless of their power”.
Establishing such a norm will then tend to have knock-on effects outside of the original domain of application, e.g. granting support to people who have never been empowered. But the knock-on effects seem potentially highly contingent, and there are many degrees of freedom in how to generalize the norms.
This is not the only factor of course, I’m not claiming to have a comprehensive idea of how morality works.
Oh, you’re stating potential mechanisms for human alignment w/ humans that you don’t think will generalize to AGI. It would be better for me to provide an informative mechanism that might seem to generalize.
Turntrout’s other post claims that the genome likely doesn’t directly specify rewards for everything humans end up valuing. People’s specific families aren’t encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies that work towards specifying rewards more exactly is not as useful as understanding how crude rewards lead to downstream values.
A related point: humans don’t maximize the reward specified by their limbic system, but can instead be modeled as a system of inner-optimizers that value proxies instead (e.g. most people wouldn’t push a wirehead button if it killed a loved one). This implies that inner-optimizers that are not optimizing the base objective are good, meaning that inner-alignment & outer-alignment are not the right terms to use.
There are other mechanisms, and I believe it’s imperative to dig deeper into them, develop a better theory of how learning systems grow values, and test that theory out on other learning systems.
Turntrout’s other post claims that the genome likely doesn’t directly specify rewards for everything humans end up valuing. People’s specific families aren’t encoded as circuits in the limbic system, yet downstream of the crude reward system, many people end up valuing their families. There are more details to dig into here, but already it implies that work towards specifying rewards more exactly is not as useful as understanding how crude rewards lead to downstream values.
This research direction may become fruitful, but I think I’m less optimistic about it than you are. Evolution is capable of dealing with a lot of complexity, so it can have lots of careful correlations in its heuristics to make it robust. Evolution uses reality for experimentation, and has made a ton of tweaks that it has checked actually work. And finally, this is one of the things that evolution is most strongly focused on handling.
But maybe you’ll find something useful there. 🤷
There are many alignment properties that humans exhibit such as valuing real world objects, being corrigible, not wireheading if given the chance, not suffering ontological crises, and caring about sentient life (not everyone has these values of course). I believe the post’s point that studying the mechanisms behind these value formations is more informative than other sources of info. Looking at the post:
the inner workings of those generally intelligent apes is invaluable evidence about the mechanistic within-lifetime process by which those apes form their values, and, more generally, about how intelligent minds can form values at all.
Humans can provide a massive amount of info on how highly intelligent systems value things in the real world. There are guaranteed-to-exist mechanisms behind why humans value real world things and mechanisms behind the variance in human values, and the post argues we should look at these mechanisms first (if we’re able to). I predict that a mechanistic understanding will enable the below knowledge:
I aspire for the kind of alignment mastery which lets me build a diamond-producing AI, or if that didn’t suit my fancy, I’d turn around and tweak the process and the AI would press green buttons forever instead, or—if I were playing for real—I’d align that system of mere circuitry with humane purposes.
I think it can be worthwhile to look at those mechanisms. In my original post I’m just pointing out that people might have done so more than you might naively think if you just consider whether their alignment approaches mimic the human mechanisms, because it’s quite likely that they’ve concluded that the mechanisms they’ve come up with for humans don’t work.
Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seems to almost come “for free” with model-based agents.
On your first point, I do think people have thought about this before and determined it doesn’t work. But from the post:
If it turns out to be currently too hard to understand the aligned protein computers, then I want to keep coming back to the problem with each major new insight I gain. When I learned about scaling laws, I should have rethought my picture of human value formation—Did the new insight knock anything loose? I should have checked back in when I heard about mesa optimizers, about the Bitter Lesson, about the feature universality hypothesis for neural networks, about natural abstractions.
Humans do display many many alignment properties, and unlocking that mechanistic understanding is 1,000x more informative than other methods. Though this may not be worth arguing until you read the actual posts showing the mechanistic understandings (the genome post and future ones), and we could argue about specifics then?
If you’re convinced by them, then you’ll understand the reaction of “Fuck, we’ve been wasting so much time and studying humans makes so much sense” which is described in this post (e.g. Turntrout’s idea on corrigibility and statement “I wrote this post as someone who previously needed to read it.”). I’m stating here that me arguing “you should feel this way now before being convinced of specific mechanistic understandings” doesn’t make sense when stated this way.
Secondly, I think with some of the examples you mention, we do have the core idea of how to robustly handle them. E.g. valuing real-world objects and avoiding wireheading seems to almost come “for free” with model-based agents.
Link? I don’t think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I’m open to being wrong.
Humans do display many many alignment properties, and unlocking that mechanistic understanding is 1,000x more informative than other methods. Though this may not be worth arguing until you read the actual posts showing the mechanistic understandings (the genome post and future ones), and we could argue about specifics then?
If you’re convinced by them, then you’ll understand the reaction of “Fuck, we’ve been wasting so much time and studying humans makes so much sense” which is described in this post (e.g. Turntrout’s idea on corrigibility and statement “I wrote this post as someone who previously needed to read it.”). I’m stating here that me arguing “you should feel this way now before being convinced of specific mechanistic understandings” doesn’t make sense when stated this way.
That makes sense. I mean if you’ve found some good results that others have missed, then it may be very worthwhile. I’m just not sure what they look like.
Link? I don’t think we know how to use model-based agents to e.g. tile the world in diamonds even given unlimited compute, but I’m open to being wrong.
I’m not aware of any place where it’s written up; I’ve considered writing it up myself, because it seems like an important and underrated point. But basically the idea is that if you’ve got an accurate model of the system and a value function that is a function of the latent state of that model, then you can pick a policy that you expect to increase the true latent value (optimization), rather than picking a policy that increases the value it expects its observations to indicate (wireheading). Such a policy would not be interested in interfering with its own sense-data, because that would interfere with its ability to optimize the real world.
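For concreteness, a minimal toy sketch of the distinction described above (the world dynamics, the “hack the sensor” action, and the numbers are all invented assumptions, not anything from this thread): a value function over the model’s latent state prefers actually making diamonds, while a value function over observations prefers tampering with the sensor.

```python
# Toy world: latent state is (diamond_count, sensor_hacked).

def step(state, action):
    """World/model dynamics for one time step."""
    diamonds, hacked = state
    if action == "mine":
        diamonds += 1
    elif action == "hack_sensor":
        hacked = True
    return (diamonds, hacked)

def observe(state):
    """Sensor reading: reports an enormous count once the sensor is hacked."""
    diamonds, hacked = state
    return 10**6 if hacked else diamonds

def rollout(policy, state=(0, False), horizon=10):
    for _ in range(horizon):
        state = step(state, policy(state))
    return state

def latent_value(state):    # value as a function of the modeled latent state
    return state[0]

def observed_value(state):  # value as a function of the agent's observation
    return observe(state)

policies = {
    "mine diamonds": lambda s: "mine",
    "hack the sensor": lambda s: "hack_sensor",
}

for name, policy in policies.items():
    final = rollout(policy)
    print(f"{name:>15}: latent value = {latent_value(final):>3}, "
          f"observed value = {observed_value(final):>7}")

# Ranking policies by latent value picks mining; ranking them by observed
# value picks hacking the agent's own sensor, i.e. wireheading.
```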
I don’t think we know how to write an accurate model of the universe with a function computing diamonds even given infinite compute, so I don’t think it can be used for solving the diamond-tiling problem.
The place where I encountered this idea was Learning What to Value (Daniel Dewey, 2010).
“Reward Tampering Problems and Solutions in Reinforcement Learning” describes how to do what you outlined.
I think it might be a bit dangerous to use the metaphor/terminology of mechanism when talking about the processes that align humans within a society. That is a very complex and complicated environment that I find very poorly described by the term “mechanisms”.
When considering how humans align, and how that might inform AI alignment, what stands out the most for me is that alignment is a learning process and probably needs to start very early in the AI’s development—don’t start training the AI on maximizing things but on learning what it means to be aligned with humans. I’m guessing this has been considered—and is probably a bit difficult to implement. It is probably also worth noting that we have a whole legal system that serves to reinforce cultural norms, along with reactions from the others one interacts with.
While commenting on something I really shouldn’t be, if the issue is about the runaway paper clip AI that consumes all resources making paper clips, then I don’t really see that as a big problem. It is a design failure, but the solution, it seems to me, is to not give any AI a single focus for maximization. Make them more like a human consumer who has a near-inexhaustible set of things it uses to maximize (and I don’t think they are as closely linked as standard econ describes, even if the equilibrium condition still holds: the per-monetary-unit marginal utilities are equalized). That type of structure also ensures that those maximize-on-one-axis results are not realistic. I think the risk here is similar to that of addiction for humans.
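(For reference, the parenthetical is pointing at the standard equimarginal condition from consumer theory: a consumer maximizing utility over a basket subject to a budget constraint equalizes marginal utility per unit of money across goods.)

```latex
\max_{x_1,\dots,x_n} U(x_1,\dots,x_n)
\quad \text{s.t.} \quad \sum_i p_i x_i = m
\;\;\Longrightarrow\;\;
\frac{\partial U / \partial x_1}{p_1} = \dots = \frac{\partial U / \partial x_n}{p_n} = \lambda
```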
While commenting on something I really shouldn’t be, if the issue is about the runaway paper clip AI that consumes all resources making paper clips, then I don’t really see that as a big problem. It is a design failure, but the solution, it seems to me, is to not give any AI a single focus for maximization. Make them more like a human consumer who has a near-inexhaustible set of things it uses to maximize
Seems like this wouldn’t really help; the AI would just consume all resources making whichever basket of goods you ask it to maximize.
The problem with a paperclip maximizer isn’t the part where it makes paperclips; making paperclips is OK as paperclips have nonzero value in human society. The problem is the part where it consumes all available resources.
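Whether “consumes all available resources” actually follows depends on the shape of the objective, which is the crux of the exchange below. Here is a toy sketch of the two cases; the specific utility functions, cost, and resource numbers are made-up assumptions, not anything either commenter wrote.

```python
import math

RESOURCES = 10_000       # total resources available
COST_PER_UNIT = 1.0      # resources consumed per unit of output

def satiating_objective(q):
    # Diminishing marginal utility of output, with production cost internalized.
    return 100.0 * math.log(1 + q) - COST_PER_UNIT * q

def non_satiating_objective(q):
    # "More output is always better"; resource cost is not part of the objective.
    return q

def best_quantity(objective):
    feasible = range(RESOURCES + 1)   # can't use more resources than exist
    return max(feasible, key=objective)

print("satiating objective     -> produce", best_quantity(satiating_objective), "units")
print("non-satiating objective -> produce", best_quantity(non_satiating_objective), "units")
# The first settles at a finite optimum (around q = 99), the "steady state"
# described below; the second runs straight into the resource ceiling.
```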
I think that oversimplifies what I was saying, but I accept I did not elaborate either.
Consuming all available resources is not an economically sensible outcome (unless one is defining available resources very narrowly), so we are in effect saying the AI is not an economically informed AI. That doesn’t seem to be too difficult to address.
If the AI is making output that humans value and follows some simple economic rules, then that gross overproduction and exhaustion of all available resources is not very likely at all. At some point there is more in the basket than is wanted, so production costs exceed output value and the AI should settle into a steady-state type of mode.
Now, if the AI doesn’t care at all about humans and doesn’t act in anything that resembles what we would understand as normal economic behavior, you might get all resources consumed. But I’m not sure it is correct to think an AI would just not be some type of economic agent, given that so many of the equilibrating forces in economics seem to have parallel processes in other areas.
Does anyone have a pointer to an argument where the AI does consume all resources, one that explains why the economics of the environment do not hold? Or, a bit differently, why the economics are so different that the outcome is rational?