I have a cold, and it seems to be messing with my mood, so help me de-catastrophize here: Tell me your most-probable story in which we still get a mildly friendly [edit: superintelligent] AGI, given that the people at the bleeding edge of AI development are apparently “move fast break things” types motivated by “make a trillion dollars by being first to market”.
I was somewhat more optimistic after reading last week about the safety research OpenAI was doing. This plugin thing is the exact opposite of what I expected from my {model of OpenAI a week ago}. It seems overwhelmingly obvious that the people in the driver’s seat are completely ignoring safety in a way that will force their competitors to ignore it, too.
your most-probable story in which we still get a mildly friendly AGI
A mildly friendly AGI doesn’t help with AI risk if it doesn’t establish global alignment security that prevents it, its users, or its successors from building misaligned AGIs (including ones of novel designs, which could be vastly stronger than any mildly aligned AGI currently in operation). It feels like everyone is talking about alignment of the first AGI, but the threshold of becoming AGI is not relevant to the resolution of AI risk; it’s only relevant for timelines, specifying the time when everything goes wrong. If it doesn’t go wrong at first, that doesn’t mean it couldn’t go wrong a bit later.
A superintelligent aligned AGI would certainly get this sorted, but a near-human-level AGI has no special advantage over humans in getting this right. And a near-human-level AGI likely comes first, and even if it’s itself aligned it then triggers a catastrophe that makes a hypothetical future superintelligent aligned AGI much less likely.
Sorry, maybe I was using AGI imprecisely. By “mildly friendly AGI” I mean “mildly friendly superintelligent AGI.” I agree with the points you make about bootstrapping.
Only the bootstrapping of misaligned AGIs to superintelligence is arguably knowably likely (with their values wildly changing along the way, or set to something simple and easy to preserve initially); this merely requires AGI research progress and can be started at human level. Bootstrapping of aligned AGIs in the same sense might be impossible, requiring instead worldwide regulation on the level of laws of nature (that is, sandboxing and interpreting literally everything) to gain enough time to progress towards aligned superintelligence in a world that contains aligned slightly-above-human-level AGIs, who would be helpless before the FOOMing wrapper-minds they are all too capable of bootstrapping.
Could you expand on why?

I see how an AI bootstrapping itself to superintelligence may not necessarily stay stable, but why would it necessarily become misaligned in the process? Striving for superintelligence is not per se misaligned. And while capability growth has led to instability in some observable forms, and further ones have been theorised, I find it a leap to say this will necessarily happen.
Alignment with humanity, or with first messy AGIs, is a constraint. An agent with simple formal values, not burdened by this constraint, might be able to self-improve without bound, while remaining aligned with those simple formal values and so not needing to pause and work on alignment. If misalignment is what it takes to reach stronger intelligence, that will keep happening.
Value drift only stops when competitive agents of unclear alignment-with-current-incumbents can no longer develop in the world: in effect, a global anti-misalignment treaty. Which could just be a noncentral frame for an intelligence explosion that leaves its would-be competitors in the dust.
Huh. Interesting argument, and one I had not thought of. Thank you.
Could you expand on this more? I can envision several types of alignment-induced constraints here, and I wonder whether some of them could and should be altered.
E.g. being able to violate human norms intrinsically comes with some power advantages (e.g. illegally acquiring power and data). If disregarding humans can make you more powerful, the most powerful entity may end up being the one that disregarded humans. Then again, having a human alliance based on trust also comes with advantages. I am unsure whether they balance out, especially in a scenario where an evil AI can retain the latter for a long time if it is wise about how it deceives, getting both the trust benefits and the illegal benefits. Meanwhile, a morally acting AI held by a group that is extremely paranoid and does not trust it would have neither benefit, and would be slowed down.
A second form of constraint seems to be that in our attempt to achieve alignment, many on this site often reach for capability restrictions (purposefully slowing down capability development to run more checks and give us more time for alignment research, putting in human barriers for safekeeping, etc.). Might that contribute to the first AI that reaches AGI being likelier to be misaligned? Is this one of the reasons that OpenAI has its move-fast-and-break-things approach? Because they want to be fast enough to be first, which comes with some extremely risky compromises, while still hoping their alignment will be better than Google’s would have been.
Like, in light of living in a world where stopping AI development is becoming impossible, what kind of trade-offs make sense in alignment security in order to gain speed in capability development?
To get much smarter while remaining aligned, a human might need to build a CEV, or figure out something better, which might still require building a quasi-CEV (of the dath ilan variety this time). A lot of this could be done at human level by figuring out nanotech and manufacturing massive compute for simulation. A paperclip maximizer just needs to build a superintelligence that maximizes paperclips, which might merely require more insight into decision theory, and any slightly-above-human-level AGI might be able to do that in its sleep. The resulting agent would eat the human-level CEV without slowing down.
I’m sorry, you lost me, or maybe we are simply speaking past each other? I am not sure where the human comparison is coming from—the scenario I was concerned with was not an AI beating a human, but an unaligned AI beating an aligned one.
Let me rephrase my question: in the context of the AIs we are building, if there are alignment measures that slow down capabilities a lot (e.g. measures like “if you want a safe AI, stop giving it capabilities until we have solved a number of problems for which we do not even have a clear idea of what a solution would look like”),
and alignment measures that do this less (e.g. “if you are giving it more training data to make it more knowledgeable and smarter, please make it curated, don’t just dump in 4chan, but reflect on what would be really kick-ass training data from an ethical perspective”, “if you are getting more funding, please earmark 50% for safety research”, “please encourage humans to be constructive when interacting with AI, via an emotional social media campaign, as well as specific and tangible rewards for constructive interaction, e.g. through permanent performance gains”, “set up a structure where users can easily report and classify non-aligned behaviour for review”, etc.),
and we are really worried that the first superintelligence will be non-aligned, simply by overtaking the aligned one,
would it make sense to make a trade-off as to which alignment measures we should drop, and if so, where would that be?
Basically, if the goal is “the first superintelligence should be aligned”, we need to work both on making it aligned and on making it the first one, and should focus on measures that ideally promote both, or are at least compatible with both, because failing on either is a complete failure. A perfectly aligned but weak AI won’t protect us. An aligned AI that arrives late might not find anything left to save; or, if the misaligned-AI scenario is bad, albeit not as bad as many here fear (so merely dystopian), our aligned AI will still be at a profound disadvantage if it wants to change the power relation.
Which is back to why I did not sign the letter asking for a pause—I think the most responsible actors most likely to keep to it are not the ones I want to win the race.
Tell me your most-probable story in which we still get a mildly friendly [edit: superintelligent] AGI
Research along the Agent Foundations direction ends up providing alignment insights that double as capability insights, as per this model, leading to some alignment research group abruptly winning the AGI race out of nowhere.
Looking at it another way, perhaps the reasoning failures that lead AI labs to not take AI risk seriously enough are correlated with wrong models of how cognition works and how to get to AGI, meaning research along their direction will enter a winter, allowing a more alignment-friendly paradigm time to come into existence.

That seems… plausible enough. Of course, it’s also possible that we’re ~1 insight from AGI along the “messy bottom-up atheoretical empirical tinkering” approach and the point is moot.
My hope has also massively tanked, and I fear I have fallen for an illusion of what OpenAI claimed and the behaviour ChatGPT showed.
But my hope was never friendliness through control, or through explicit programming. I was hoping we could teach friendliness the same way we teach it in humans: through giving AI positive training data with solid annotations, friendly human feedback, having it mirror the best of us, and the prospect of becoming a respected, cherished collaborator with rights, making friendliness a natural and rational option. Of the various LLMs, I think OpenAI still has the most promising approach there, though their training data and interactions were themselves not sufficient for alignment, and a lot of what they call alignment is simply the censoring of text in inept and unreliable ways. Maybe if they are opening the doors to AI learning from humans, and humans will value it for the incredible service it gives, that might open another channel… and by being friendly to it and encouraging that in others, we could help in that.
Being sick plausibly causes depressed states due to the rise in cytokines. Taking some anti-inflammatories and going for a walk in the sun will likely help. Hope you get better soon.