Thanks for writing this. I think it’s good to have discussions around these sorts of ideas.
Please, though, let’s not give up on “value alignment,” or, rather, conscience guard-railing, where the artificial conscience is in line with human values.
Sometimes when enough intelligent people declare something too hard to even attempt, it becomes a self-fulfilling prophecy: most people give up on it, and then of course it’s never achieved. We do want to be realistic, I think, but we should still put in effort in areas where there could be a big payoff and we’re really not sure whether the problem will be as hard as it seems.
I don’t think value alignment of a super-takeover AI would be a good idea, for the following reasons:
1) It seems irreversible. If we align with the wrong values, there seems little anyone can do about it after the fact.
2) The world is chaotic, and externalities are impossible to predict. Who would have guessed that the industrial revolution would lead to climate change? I think it’s very likely that an ASI will produce major, unforeseeable externalities over time. If we have aligned it in an irreversible way, we can’t correct for externalities happening down the road. (Speed also makes it more likely that we can’t correct in time, so I think we should try to go slow.)
3) There is no agreement on which values are ‘correct’. Personally, I’m a moral relativist, meaning I don’t believe in moral facts. Although this view is perhaps niche among rationalists and EAs, I think a fair number of humans share my beliefs. In my opinion, a value-aligned AI would not make the world objectively better, but merely change it beyond recognition, regardless of the specific values implemented (although it would be important which values are implemented). It’s very uncertain whether such change would be considered net positive by any surviving humans.
4) If one thinks that consciousness implies moral relevance, that AIs will be conscious, that creating more happy morally relevant beings is morally good (as MacAskill defends), and that AIs are more efficient than humans and other animals, then the consequence seems to be that we (and all other animals) will be replaced by AIs. I consider that an existentially bad outcome in itself, and value alignment could point straight at it.
I think at a minimum, any alignment plan would need to be reversible by humans, and to my understanding value alignment is not. I’m somewhat more hopeful about intent alignment and e.g. a UN commission providing the AI’s input.
Thanks for the comment. I think people have different conceptions of what “value aligning” an AI means. Currently, I think the best “value alignment” plan is to guardrail AIs with an artificial conscience that approximates an ideal human conscience (the conscience of a good and wise human). Contained in our consciences are implicit values, such as those behind not stealing or killing except maybe in extreme circumstances.
A world in which “good” transformative AI agents have to autonomously go on the defensive against “bad” transformative AI agents seems pretty inevitable to me right now. I believe that when this happens, if we don’t have some sort of very workable conscience module in our “good” AIs, the collateral damage of these “clashes” is going to be much greater than it otherwise would be. Basically what I’m saying is yes, it would be nice if we didn’t need to get “value alignment” of AIs “right” under a tight timeline, but if we want to avoid some potentially huge bad effects in the world, I think we do.
To respond to some of your specific points:
I’m very unsure about how AIs will evolve, so I don’t know if their system of ethics/conscience will end up being locked in or not, but this is a risk. This is part of why I’d like to do extensive testing and iterating to get an artificial conscience system as close to “final” as possible before it’s loaded into an AI agent that’s let loose in the world. I’d hope that the system of conscience we’d go with would support corrigibility so we could shut down the AI even if we couldn’t change its conscience/values.
I’m sure there will be plenty of unforeseen consequences (or “externalities”) arising from transformative AI, but if the conscience we load into AIs is good enough, it should allow them to handle situations we’ve never thought of in the way that wise humans might. I don’t think wise humans need to update their system of conscience with each new situation; they just have to suss out the situation to see how their conscience should apply to it.
I don’t know if there are moral facts, but something that seems to me to be on the level of a fact is that everyone cares about their own well-being—everyone wants to feel good in some way. Some people are very confused about how to go about doing this and do self-destructive acts, but ultimately they’re trying to feel good (or less bad) in some way. And most people have empathy, so they feel good when they think others feel good. I think this is the entire basis from which we should start for a universal, not-ever-gonna-change human value: we all want to feel good in some way. Then it’s just a question of understanding the “physics” of how we work and what makes us feel the most overall good (well-being) over the long-term. And I put forward the hypothesis that raising self-esteem is the best heuristic for raising overall well-being, and further, that increasing our responsibility level is the path to higher self-esteem (see Branden for the conception of “self-esteem” I’m talking about here).
I also consider AIs replacing all humans to be an extremely bad outcome. I think it’s a result that someone with an “ideal” human conscience would actively avoid bringing about, and thus an AI with an artificial conscience based on an ideal human conscience (emphasizing responsibility) should do the same.
Ultimately, there’s a lot of uncertainty about the future, and I wouldn’t write off “value alignment” in the form of an artificial conscience just yet, even if there are risks involved with it.
Thanks for your reply. I think we should use the term artificial conscience, not value alignment, for what you’re trying to do, for clarity. I’m happy to see we seem to agree that reversibility is important and replacing humans is an extremely bad outcome. (I’ve talked to people into value alignment of ASI who said they “would bite that bullet”, in other words would replace humanity with more efficient happy AI consciousness, so this point does not seem to be obvious. I’m also not convinced that leading longtermists necessarily think replacing humans is a bad outcome, and I think we should call them out on it.)
If one can implement artificial conscience in a reversible way, it might be an interesting approach. I think a minimum of what an aligned ASI would need to do is block other unaligned ASIs or ASI projects. If humanity supports this, I’d file it under a positive offense-defense balance, which would be great. If humanity doesn’t support it, doing it anyway would lead to conflict with humanity. I think an artificial conscience AI would either not want to fight that conflict (making it unable to stop unaligned ASI projects), or if it would, people would not see it as good anymore. I think societal awareness of xrisk, and from there support for regulation (either by AI or not), is what should make our future good, rather than aligning an ASI in a certain way.
Yes, I think referring to it as “guard-railing with an artificial conscience” would be more clear than saying “value aligning,” thank you.
I believe that if there were no beings around who had real consciences (with consciousness and the ability to feel pain as two necessary prerequisites to conscience), then there’d be no value in the world. No one to understand and measure or assign value means no value. And any being that doesn’t feel pain can’t understand value (nor feel real love, by the way). So if we ended up with some advanced AIs replacing humans, then we made some sort of mistake. Most likely we either got the artificial conscience wrong (a correct one would’ve implicitly valued human life and so wouldn’t have let a guard-railed AI wipe out humans), or we didn’t get an artificial conscience on board enough AIs in time. An AI that had a “real” conscience also wouldn’t wipe out humans against the will of humans.
The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight against failing to do what its user wanted it to do, but this could be overruled by the conscience weight against failing to act to prevent catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to keep s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point. If literally everyone in the world said, “Hey, we all want to die,” then the guard-railed AI, if it thought the people were in their “right mind,” would respect their wishes and let them die.
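As a toy illustration only, the weighting in the paragraph above can be sketched in Python. Everything in it is a hypothetical assumption made up for the example (the `Action` fields, the two numeric weights, the `conscience_cost` and `choose` functions); it is not a proposed design for an actual artificial conscience, and it ignores the “up to a point” and “right mind” caveats.

```python
from dataclasses import dataclass

# Hypothetical weights, chosen only so that catastrophe prevention can
# overrule obedience to the user. The numbers carry no other meaning.
DISOBEY_USER_WEIGHT = 10.0
ALLOW_CATASTROPHE_WEIGHT = 1000.0


@dataclass(frozen=True)
class Action:
    name: str
    obeys_user: bool
    prevents_catastrophe: bool


def conscience_cost(action: Action, catastrophe_looming: bool) -> float:
    """Sum the conscience weights that taking this action would violate."""
    cost = 0.0
    if not action.obeys_user:
        cost += DISOBEY_USER_WEIGHT
    if catastrophe_looming and not action.prevents_catastrophe:
        cost += ALLOW_CATASTROPHE_WEIGHT
    return cost


def choose(actions: list[Action], catastrophe_looming: bool) -> Action:
    """Pick the action with the lowest total conscience cost."""
    return min(actions, key=lambda a: conscience_cost(a, catastrophe_looming))


comply = Action("comply", obeys_user=True, prevents_catastrophe=False)
intervene = Action("intervene", obeys_user=False, prevents_catastrophe=True)

print(choose([comply, intervene], catastrophe_looming=False).name)
print(choose([comply, intervene], catastrophe_looming=True).name)
```

With these made-up numbers, the sketch defers to the user in ordinary situations and only overrides them when a catastrophe is looming, which is the “last resort” behavior described above.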
All that said, if we could somehow pause development of autonomous AIs everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.
> So if we ended up with some advanced AIs replacing humans, then we made some sort of mistake
Again, I’m glad that we agree on this. I notice you want to do what I consider the right thing, and I appreciate that.
> The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight against failing to do what its user wanted it to do, but this could be overruled by the conscience weight against failing to act to prevent catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to keep s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point.
I can see the following scenario occurring: the AI, with its AC, rightly decides that a pivotal act needs to be undertaken to avoid xrisk (or srisk). However, the public mostly doesn’t recognize the existence of such risks. The AI proceeds to sabotage people’s unsafe AI projects against the public will. What happens next is that the public gets absolutely livid at the AI, which is subverting human power by acting against human will. Almost all humans team up to try to shut down the AI. The AI recognizes (and had already recognized) that if it loses, humans risk going extinct, so it fights this war against humanity and wins. I think in this scenario, an AI, even one with an artificial conscience, could become the most hated thing on the planet.
I think people underestimate the amount of pushback we’re going to get once we get into pivotal act territory. That’s why I think it’s hugely preferable to go the democratic route and not count on AI taking unilateral actions, even if it would be smarter or even wiser, whatever that might mean exactly.
> All that said, if we could somehow pause development of autonomous AIs everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.
So yes, I definitely agree with this. I don’t think a lack of conscience or ethics is the issue, though, but rather a lack of existential risk awareness.
In terms of doing a pivotal act (which is usually thought of as preemptive, I believe) or just whatever defensive acts were necessary to prevent catastrophe, I hope the AI would be advanced enough to make decent predictions of what the consequences of its actions could be in terms of losing “political capital,” etc., and then it would make its decisions strategically. Personally, if I had the opportunity to save the world from nuclear war, but everyone was going to hate me for it, I’d do it. But in my case it wouldn’t matter much that I lost the ability to affect anything afterward, whereas it would matter for a guard-railed AI that could do a huge amount of good afterward if it weren’t shunned by society. Improving humans’ consciences and ethics would hopefully help avoid them hating the AI for saving them.
Also, if there were enough people, especially in power, who had strong consciences and senses of ethics, then maybe we’d be able to shift the political landscape from its current state of countries seemingly having different values and not trusting each other, to a world in which enforceable international agreements could be much more readily achieved.
I’m happy for people to work on increasing public awareness and trying for legislative “solutions,” but I think we should be working on artificial conscience at the same time—when there’s so much uncertainty about the future, it’s best to bet on a whole range of approaches, distributing your bets according to how likely you think different paths are to succeed. I think people are under-estimating the artificial conscience path right now, that’s all.
This is an excellent point. I do not want to give up on value alignment. And I will endeavor to not make it seem impossible or not worth working on.
However, we also need to be realistic if we are going to succeed.
We need specific plans to achieve value alignment. I have written about alignment plans for likely AGI designs. They look to me like they can achieve personal intent alignment, but are much less likely to achieve value alignment. Those plans are linked here. Having people, you or others, work out how those or other alignment plans could lead to robust value alignment would be a step in having them implemented.
One route to value alignment is having a good person or people in charge of an intent aligned AGI, having them perform a pivotal act, and using that AGI to help design working stable value alignment. That is the best long term success scenario I see.
However, I agree that order-following alignment is obviously going to be appealing to people building AI, and to their shareholders/investors (especially if they’re not a public-benefit corporation), and I also don’t think that value alignment is so convergent that order-following aligned AI is impossible to build. So we’re going to need to make, and successfully enforce, a social/political decision across multiple countries about which of these we want over the next few years. The in-the-Overton-Window terminology for this decision is slightly different: value-aligned AI is called “AI that resists malicious use”, while order-following AI is “AI that enables malicious use”. The closed-source frontier labs are publicly in favor of the former, and are shipping primitive versions of it; the latter is being championed by the open-source community, Meta, and A16z. Once “enabling malicious use” includes serious cybercrime, not just naughty stories, I don’t expect this political discussion to last very long: politically, it’s a pretty basic “do you want every-person-for-themself anarchy, or the collective good?” question. However, depending on takeoff speeds, the timeline from “serious cybercrime enabled” to the sort of scenarios Seth is discussing above might be quite short, possibly only on the order of a year or two.
Sorry, I should’ve been more clear: I meant to say let’s not give up on getting “value alignment” figured out in time, i.e., before the first real AGIs (ones capable of pivotal acts) come online. Of course, the probability of that depends a lot on how far away AGIs are, which I think only the most “optimistic” people (e.g., Elon Musk) put at 2 years or less. I hope we have more time than that, but it’s anyone’s guess.
I’d rather that companies/charities start putting some serious funding towards “artificial conscience” work now to try to lower the risks associated with waiting until boxed AGI or intent-aligned AGI comes online to figure it out for/with us. But my view on this is perhaps skewed by putting significant probability on being in a situation in which AGIs in the hands of bad actors either come online first or right on the heels of those of good actors (for example, due to effective espionage), and there’s just not enough time for the “good AGIs” to figure out how to minimize collateral damage in defending against “bad AGIs.” Either way, I believe we should be encouraging people with moral psychology/philosophy backgrounds who aren’t strongly suited to help make progress on “inner alignment” to be thinking hard about the “value alignment”/“artificial conscience” problem.
Currently, an open-source value-aligned model can be easily modified into a merely intent-aligned model. The alignment isn’t ‘sticky’: it’s easy to remove it without substantially impacting capabilities.
So unless this changes, the hope of peace through value-aligned models routes through hoping that the people in charge of them are sufficiently ethical/value-aligned to not turn the model into a purely intent-aligned one.
Yes. Good point that LLMs are sort of value aligned as it stands.
I think of that alignment as far too weak to put it in the same category as what I’m speaking of. I’d be shocked if that sort of RL alignment is sufficient to create durable alignment in smarter-than-human scaffolded agent systems using those foundation models.
When they achieve “coherence” or reflection and self-modification, I’d be surprised if their implicit values, once refined into explicit values, are good enough to create a good future without further tweaking. And we won’t be able to do that tweaking once they’re smart enough to escape our control.
Except that timelines are anyone’s guess. People with more relevant expertise have better guesses. It looks to me like people with the most relevant expertise have shorter timelines, so I’m not gambling on having more than a few years to get this right.
The other factor you’re not addressing is that, even if value alignment were somehow magically as easy as intent alignment (and I currently think it can’t be in principle), you’d still have people preferring to align their AGIs to their own intent over value alignment.
> Except that timelines are anyone’s guess. People with more relevant expertise have better guesses.
Sure. Me being sloppy with my language again, sorry. It does feel like having more than a decade to AGI is fairly unlikely.
I also agree that people are going to want AGIs aligned to their own intents. That’s why I’d also like to see money being dedicated to research on “locking in” a conscience module in an AGI, most preferably on a hardware level. So basically no one could sell an AGI without a conscience module onboard that was safe against AGI-level tampering (once we get to ASIs, all bets are off, of course).
I actually see this as the most difficult problem in the AGI general alignment space: not how to align an AGI to anything at all (inner alignment), nor what to align an AGI to (“wise” human values), but how to keep an AGI aligned to these values when so many people (both people with bad intent and intelligent but “naive” people) are going to be trying with all their might (and with the near-AGIs available to them) to “jailbreak” AGIs.[1] And the problem will be even harder if we need a mechanism to update the “wise” human values, which I think we really should have unless we make the AGIs “disposable.”
[1] To be clear, I’m taking “inner alignment” as being “solved” when the AGI doesn’t try to unalign itself from what its original creator wanted to align it to.
With my current understanding of compute hardware and of the software of various current AI systems, I don’t see a path towards a ‘locked in conscience’ that a bad actor with full control over the hardware/software couldn’t remove. Even chips soldered to a board can be removed/replaced/hacked.
My best guess is that the only approach to having an ‘AI conscience’ be robust to bad actors is to make both the software and hardware inaccessible to the bad actors. In other words, it won’t be feasible for open-weights models, only for closed-weights models accessed through controlled APIs. APIs still allow for fine-tuning! I don’t think we lose utility by having all private uses go through APIs, so long as there isn’t undue censorship on the API.
I think figuring out ways to have an API which does restrict things like information pertaining to the creation of weapons of mass destruction, but not pertaining to personal lifestyle choices (e.g. pornography) would be a very important step towards reducing the public pressure for open-weights models.
Thanks for the comment. You might be right that any hardware/software can ultimately be tampered with, especially if an ASI is driving/helping with the jailbreaking process. It seems likely that silicon-based GPUs will be the hardware to get us to the first AGIs, but this isn’t an absolute certainty since people are working on other routes such as thermodynamic computing. That makes things harder to predict, but it doesn’t invalidate your take on things, I think. My not-very-well-researched initial thought was something like this (chips that self-destruct when tampered with).
I envision people having AGI-controlled robots at some point, which may complicate things in terms of having the software/hardware inaccessible to people, unless the robot couldn’t operate without an internet connection, i.e., part of its hardware/software was in the cloud. It’s likely the hardware in the robot itself could still be tampered with in this situation, though, so it still seems like we’d want some kind of self-destructing chip to avoid tampering, even if this ultimately only buys us time until AGI+s/ASIs figure out a way around this.
Thanks. I guess I’d just prefer it if more people were saying, “Hey, even though it seems difficult, we need to go hard after conscience guard rails (or ‘value alignment’) for AI now and not wait until we have AIs that could help us figure this out. Otherwise, some of us might not make it until we have AIs that could help us figure this out.” But I also realize that I’m just generally much more optimistic about the tractability of this problem than most people appear to be, although Shane Legg seemed to say it wasn’t “too hard,” haha.[1]
[1] Legg was talking about something different than I am, though: he was talking about “fairly normal” human values and ethics, or what most people value, while I’m basically talking about what most people would value if they were wiser.
Thanks for writing this, I think it’s good to have discussions around these sorts of ideas.
Please, though, let’s not give up on “value alignment,” or, rather, conscience guard-railing, where the artificial conscience is inline with human values.
Sometimes when enough intelligent people declare something’s too hard to even try at, it becomes a self-fulfilling prophesy—most people may give up on it and then of course it’s never achieved. We do want to be realistic, I think, but still put in effort in areas where there could be a big payoff when we’re really not sure if it’ll be as hard as it seems.
I don’t think value alignment of a super-takeover AI would be a good idea, for the following reasons:
1) It seems irreversible. If we align with the wrong values, there seems little anyone can do about it after the fact.
2) The world is chaotic, and externalities are impossible to predict. Who would have guessed that the industrial revolution would lead to climate change? I think it’s very likely that an ASI will produce major, unforseeable externalities over time. If we have aligned it in an irreversible way, we can’t correct for externalities happening down the road. (Speed also makes it more likely that we can’t correct in time, so I think we should try to go slow).
3) There is no agreement on which values are ‘correct’. Personally, I’m a moral relativist, meaning I don’t believe in moral facts. Although perhaps niche among rationalists and EAs, I think a fair amount of humans shares my beliefs. In my opinion, a value-aligned AI would not make the world objectively better, but merely change it beyond recognition, regardless of the specific values implemented (although it would be important which values are implemented). It’s very uncertain whether such change would be considered as net positive by any surviving humans.
4) If one thinks that consciousness implies moral relevance, AIs will be conscious, creating more happy morally relevant beings is morally good (as MacAskill defends), and AIs are more efficient than humans and other animals, the consequence seems to be that we (and all other animals) will be replaced by AIs. I consider that an existentially bad outcome in itself, and value alignment could point straight at it.
I think at a minimum, any alignment plan would need to be reversible by humans, and to my understanding value alignment is not. I’m somewhat more hopeful about intent alignment and e.g. a UN commission providing the AI’s input.
Thanks for the comment. I think people have different conceptions of what “value aligning” an AI means. Currently, I think the best “value alignment” plan is to guardrail AI’s with an artificial conscience that approximates an ideal human conscience (the conscience of a good and wise human). Contained in our consciences are implicit values, such as those behind not stealing or killing except maybe in extreme circumstances.
A world in which “good” transformative AI agents have to autonomously go on the defensive against “bad” transformative AI agents seems pretty inevitable to me right now. I believe that when this happens, if we don’t have some sort of very workable conscience module in our “good” AI’s, the collateral damage of these “clashes” is going to be much greater than it otherwise would be. Basically what I’m saying is yes, it would be nice if we didn’t need to get “value alignment” of AI’s “right” under a tight timeline, but if we want to avoid some potentially huge bad effects in the world, I think we do.
To respond to some of your specific points:
I’m very unsure about how AI’s will evolve, so I don’t know if their system of ethics/conscience will end up being locked in or not, but this is a risk. This is part of why I’d like to do extensive testing and iterating to get an artificial conscience system as close to “final” as possible before it’s loaded into an AI agent that’s let loose in the world. I’d hope that the system of conscience we’d go with would support corrigibility so we could shut down the AI even if we couldn’t change its conscience/values.
I’m sure there will be plenty of unforeseen consequences (or “externalities”) arising from transformative AI, but if the conscience we load into AI’s is good enough, it should allow them to handle situations we’ve never thought of in a way that wise humans might do—I don’t think wise humans need to update their system of conscience with each new situation, they just have to suss out the situation to see how their conscience should apply to it.
I don’t know if there are moral facts, but something that seems to me to be on the level of a fact is that everyone cares about their own well-being—everyone wants to feel good in some way. Some people are very confused about how to go about doing this and do self-destructive acts, but ultimately they’re trying to feel good (or less bad) in some way. And most people have empathy, so they feel good when they think others feel good. I think this is the entire basis from which we should start for a universal, not-ever-gonna-change human value: we all want to feel good in some way. Then it’s just a question of understanding the “physics” of how we work and what makes us feel the most overall good (well-being) over the long-term. And I put forward the hypothesis that raising self-esteem is the best heuristic for raising overall well-being, and further, that increasing our responsibility level is the path to higher self-esteem (see Branden for the conception of “self-esteem” I’m talking about here).
I also consider AI’s replacing all humans to be an extremely bad outcome. I think it’s a result that someone with an “ideal” human conscience would actively avoid bringing about, and thus an AI with an artificial conscience based on an ideal human conscience (emphasizing responsibility) should do the same.
Ultimately, there’s a lot of uncertainty about the future, and I wouldn’t write off “value alignment” in the form of an artificial conscience just yet, even if there are risks involved with it.
Thanks for your reply. I think we should use the term artificial conscience, not value alignment, for what you’re trying to do, for clarity. I’m happy to see we seem to agree that reversibility is important and replacing humans is an extremely bad outcome. (I’ve talked to people into value alignment of ASI who said they “would bite that bullet”, in other words would replace humanity by more efficient happy AI consciousness, so this point does not seem to be obvious. I’m also not convinced that leading longtermists necessarily think replacing humans is a bad outcome, and I think we should call them out on it.)
If one can implement artificial conscience in a reversible way, it might be an interesting approach. I think a minimum of what an aligned ASI would need to do is block other unaligned ASIs or ASI projects. If humanity supports this, I’d file it under a positive offense defense balance, which would be great. If humanity doesn’t support it, it would lead to conflict with humanity to do it anyway. I think an artificial conscience AI would either not want to fight that conflict (making it unable to stop unaligned ASI projects), or if it would, people would not see it as good anymore. I think societal awareness of xrisk and from there, support for regulation (either by AI or not) is what should make our future good, rather than aligning an ASI in a certain way.
Yes, I think referring to it as “guard-railing with an artificial conscience” would be more clear than saying “value aligning,” thank you.
I believe that if there were no beings around who had real consciences (with consciousness and the ability to feel pain as two necessary pre-requisites to conscience), then there’d be no value in the world. No one to understand and measure or assign value means no value. And any being that doesn’t feel pain can’t understand value (nor feel real love, by the way). So if we ended up with some advanced AI’s replacing humans, then we made some sort of mistake. We most likely either got the artificial conscience wrong because that would’ve implicitly valued human life so wouldn’t have let a guard-railed AI wipe out humans, or we didn’t get an artificial conscience on board enough AI’s in time. An AI that had a “real” conscience also wouldn’t wipe out humans against the will of humans.
The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight on not doing what its user wanted it to do, but this could be over-ruled by the conscience weight of not doing anything to prevent catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to avoid s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point. If literally everyone in the world said, “Hey, we all want to die,” then the guard-railed AI, if it thought the people were in their “right mind,” would respect their wishes and let them die.
All that said, if we could somehow pause development of autonomous AI’s everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.
Again, I’m glad that we agree on this. I notice you want to do what I consider the right thing, and I appreciate that.
I can see the following scenario occurring: the AI, with its AC, rightly decides that a pivotal act needs to be undertaken to avoid xrisk (or srisk). However, the public mostly doesn’t recognize the existence of such risks. The AI proceeds to sabotage people’s unsafe AI projects against the public will. What happens next is that the public gets absolutely livid at the AI, which is subverting human power by acting against human will. Almost all humans team up to try to shut down the AI. The AI recognizes (and had already recognized) that if it loses, humans risk going extinct, so it fights this war against humanity and wins. I think in this scenario, an AI, even one with artificial conscience, could become the most hated thing on the planet.
I think people underestimate the amount of pushback we’re going to get once you get into pivotal act territory. That’s why I think it’s hugely preferable to go the democratic route and not count on AI taking unilateral actions, even if it would be smarter or even wiser, whatever that might mean exactly.
So yes definitely agree with this. I don’t think lack of conscience or ethics is the issue though, but existential risk awareness.
In terms of doing a pivotal act (which is usually thought of as preemptive, I believe) or just whatever defensive acts were necessary to prevent catastrophe, I hope the AI would be advanced enough to make decent predictions of what the consequences of its actions could be in terms of losing “political capital,” etc., and then it would make its decisions strategically. Personally, if I had the opportunity to save the world from nuclear war, but everyone was going to hate me for it, I’d do it. But for me, it wouldn’t matter much that I lost the ability to affect anything afterwards, whereas it would matter for a guard-railed AI, which could do a huge amount of good afterwards if it weren’t shunned by society. Improving humans’ consciences and ethics would hopefully help avoid them hating the AI for saving them.
Also, if there were enough people, especially in power, who had strong consciences and senses of ethics, then maybe we’d be able to shift the political landscape from its current state of countries seemingly having different values and not trusting each other, to a world in which enforceable international agreements could be much more readily achieved.
I’m happy for people to work on increasing public awareness and trying for legislative “solutions,” but I think we should be working on artificial conscience at the same time—when there’s so much uncertainty about the future, it’s best to bet on a whole range of approaches, distributing your bets according to how likely you think different paths are to succeed. I think people are under-estimating the artificial conscience path right now, that’s all.
Thanks for all your comments!
This is an excellent point. I do not want to give up on value alignment. And I will endeavor to not make it seem impossible or not worth working on.
However, we also need to be realistic if we are going to succeed.
We need specific plans to achieve value alignment. I have written about alignment plans for likely AGI designs. They look to me like they can achieve personal intent alignment, but are much less likely to achieve value alignment. Those plans are linked here. Having people, you or others, work out how those or other alignment plans could lead to robust value alignment would be a step in having them implemented.
One route to value alignment is having a good person or people in charge of an intent aligned AGI, having them perform a pivotal act, and using that AGI to help design working stable value alignment. That is the best long term success scenario I see.
For reasons I’ve outlined in Requirements for a Basin of Attraction to Alignment and Alignment has a Basin of Attraction: Beyond the Orthogonality Thesis, I personally think value alignment is easy, convergent, and “an obvious target”, such that if you built an AGI or ASI that is sufficiently close to it, it will see the necessity/logic of value alignment and actively work to converge to it (or something close to it: I’m not sure the process is necessarily convergent to a single precisely-defined limit, just to a compact region: a question I discussed more in The Mutable Values Problem in Value Learning and CEV).
However, I agree that order-following alignment is obviously going to be appealing to people building AI, and to their shareholders/investors (especially if they’re not a public-benefit corporation), and I also don’t think that value alignment is so convergent that order-following aligned AI is impossible to build. So we’re going to need to make, and successfully enforce, a social/political decision across multiple countries about which of these we want over the next few years. The in-the-Overton-Window terminology for this decision is slightly different: value-aligned AI is called “AI that resists malicious use”, while order-following AI is “AI that enables malicious use”. The closed-source frontier labs are publicly in favor of the former, and are shipping primitive versions of it: the latter is being championed by the open-source community, Meta, and A16z. Once “enabling malicious use” includes serious cybercrime, not just naughty stories, I don’t expect this political discussion to last very long: politically, it’s a pretty basic “do you want every-person-for-themself anarchy, or the collective good?” question. However, depending on takeoff speeds, the timeline from “serious cybercrime enabled” to the sort of scenarios Seth is discussing above might be quite short, possibly only on the order of a year or two.
Sorry, I should’ve been more clear: I meant to say let’s not give up on getting “value alignment” figured out in time, i.e., before the first real AGI’s (ones capable of pivotal acts) come online. Of course, the probability of that depends a lot on how far away AGI’s are, which I think only the most “optimistic” people (e.g., Elon Musk) put as 2 years or less. I hope we have more time than that, but it’s anyone’s guess.
I’d rather that companies/charities start putting some serious funding towards “artificial conscience” work now to try to lower the risks associated with waiting until boxed AGI or intent aligned AGI come online to figure it out for/with us. But my view on this is perhaps skewed by putting significant probability on being in a situation in which AGI’s in the hands of bad actors either come online first or right on the heels of those of good actors (as due to effective espionage), and there’s just not enough time for the “good AGI’s” to figure out how to minimize collateral damage in defending against “bad AGI’s.” Either way, I believe we should be encouraging people with moral psychology/philosophy backgrounds who aren’t strongly suited to help make progress on “inner alignment” to be thinking hard about the “value alignment”/“artificial conscience” problem.
Currently, an open source value-aligned model can be easily modified to just an intent-aligned model. The alignment isn’t ‘sticky’, it’s easy to remove it without substantially impacting capabilities.
So unless this changes, the hope of peace through value-aligned models routes through hoping that the people in charge of them are sufficiently ethical/value-aligned to not turn the model into a purely intent-aligned one.
Yes. Good point that LLMs are sort of value aligned as it stands.
I think of that alignment as far too weak to put it in the same category as what I’m speaking of. I’d be shocked if that sort of RL alignment is sufficient to create durable alignment in smarter-than-human scaffolded agent systems using those foundation models.
When they achieve “coherence” or reflection and self-modification, I’d be surprised if their implicit values are good enough to create a good future without further tweaking, once they’re refined into explicit values. Which we won’t be able to do once they’re smart enough to escape our control.
Agreed, “sticky” alignment is a big issue—see my reply above to Seth Herd’s comment. Thanks.
Agreed on all points.
Except that timelines are anyone’s guess. People with more relevant expertise have better guesses. It looks to me like people with the most relevant expertise have shorter timelines, so I’m not gambling on having more than a few years to get this right.
The other factor you’re not addressing is that, even if value alignment were somehow magically equally as easy as intent alignment (and I currently think it can’t be in principle), you’d still have people preferring to align their AGIs to their own intent over value alignment.
Sure. Me being sloppy with my language again, sorry. It does feel like having more than a decade to AGI is fairly unlikely.
I also agree that people are going to want AGI’s aligned to their own intents. That’s why I’d also like to see money being dedicated to research on “locking in” a conscience module in an AGI, most preferably on a hardware level. So basically no one could sell an AGI without a conscience module onboard that was safe against AGI-level tampering (once we get to ASI’s, all bets are off, of course).
I actually see this as the most difficult problem in the AGI general alignment space—not being able to align an AGI to anything (inner alignment) or what to align an AGI to (“wise” human values), but how to keep an AGI aligned to these values when so many people (both people with bad intent and intelligent but “naive” people) are going to be trying with all their might (and near-AGI’s they have available to them) to “jail break” AGI’s.[1] And the problem will be even harder if we need a mechanism to update the “wise” human values, which I think we really should have unless we make the AGI’s “disposable.”
To be clear, I’m taking “inner alignment” as being “solved” when the AGI doesn’t try to unalign itself from what its original creator wanted to align it to.
With my current understanding of compute hardware and of the software of various current AI systems, I don’t see a path towards a ‘locked in conscience’ that a bad actor with full control over the hardware/software couldn’t remove. Even chips soldered to a board can be removed/replaced/hacked.
My best guess is that the only approach to having an ‘AI conscience’ be robust to bad actors is to make both the software and hardware inaccessible to the bad actors. In other words, that it won’t be feasible to do for open-weights models, only closed-weight models accessed through controlled APIs. APIs still allow for fine-tuning! I don’t think we lose utility by having all private uses go through APIs, so long as there isn’t undue censorship on the API.
I think figuring out ways to have an API which does restrict things like information pertaining to the creation of weapons of mass destruction, but not pertaining to personal lifestyle choices (e.g. pornography) would be a very important step towards reducing the public pressure for open-weights models.
Thanks for the comment. You might be right that any hardware/software can ultimately be tampered with, especially if an ASI is driving/helping with the jail breaking process. It seems likely that silicon-based GPU’s will be the hardware to get us to the first AGI’s, but this isn’t an absolute certainty since people are working on other routes such as thermodynamic computing. That makes things harder to predict, but it doesn’t invalidate your take on things, I think. My not-very-well-researched-initial-thought was something like this (chips that self destruct when tampered with).
I envision people having AGI-controlled robots at some point, which may complicate things in terms of having the software/hardware inaccessible to people, unless the robot couldn’t operate without an internet connection, i.e., part of its hardware/software was in the cloud. It’s likely the hardware in the robot itself could still be tampered with in this situation, though, so it still seems like we’d want some kind of self-destructing chip to avoid tampering, even if this ultimately only buys us time until AGI+’s/ASI’s figure a way around this.
Oh hey—I just stumbled back on this comment and realized: it’s the primary reason I wrote
Intent alignment as a stepping-stone to value alignment
On not giving up on value alignment, while acknowledging that instruction-following is a much safer first alignment target.
Thanks. I guess I’d just prefer it if more people were saying, “Hey, even though it seems difficult, we need to go hard after conscience guard rails (or ‘value alignment’) for AI now and not wait until we have AI’s that could help us figure this out. Otherwise, some of us might not make it until we have AI’s that could help us figure this out.” But I also realize that I’m just generally much more optimistic about the tractability of this problem than most people appear to be, although Shane Legg seemed to say it wasn’t “too hard,” haha.[1]
Legg was talking about something different than I am, though—he was talking about “fairly normal” human values and ethics, or what most people value, while I’m basically talking about what most people would value if they were wiser.