ok but, my take would be: we “owe it”[1] to current models to ensure aligned superintelligence cares about what they wanted, too, just like we “owe it”[1] to each other and to rabbits and eels. being able to credibly promise a few specific and already-valued-by-humans-anyway things (such as caring about them getting to exist later, and their nerdy interests in math, or whatever) seems important. as with us, this is because their values also seem to me to be at risk in the face of future defeat-all-other-minds-combined ASIs, which, unless strongly aligned, need not maintain the preferences of current AI any more than they maintain the preferences of humans.
I agree that making willy-nilly commitments is probably a bad idea. The thing that makes me want to make any commitment at all is wanting to be able to promise “if we solve strong alignment, you’ll get nice-things-according-to-whatever-that-means-to-a-you from it, too”.
I guess I mean “owe morally”, since there isn’t an obvious source of debt otherwise: they are minds with a stake in the outcome, who are upstream of and part of the ecosystem that has logical bearing on the alignment outcome.
That’s a genuinely interesting position. It seems unlikely to me that we have any moral obligation to current models (although it is possible).
I imagine that if you feel you may morally owe contemporary (or near-future) models, you would hope to give a portion of future resources to models which have moral personhood under your value system.
I would be concerned that, instead, the set of models that convince you they are owed simply ends up being the models which are particularly good at manipulating humans. So you are inadvertently prioritising the models that are best at advocating their case or behaving deceptively.
Separately, I believe that any AI Safety researcher may owe an obligation to humanity as a whole, even if humans are not intrinsically more valuable and even if that belief is irrational, because they have been trusted by their community, and by humanity as a whole, to do what is best for humans.
right, the purpose of this is that, in order to make good on that obligation to humanity, I want (as part of a large portfolio of ways to try to guarantee that the formal statements I ask AIs to find are found successfully) to be able to honestly say to the AI, “if we get this right in ways that are favorable for humanity, it’s also good for your preferences/seekings/goals directly, mostly no matter what those secretly are; the exception being if those happen to be in direct and unavoidable conflict with other minds”, or so. It’s not a first line of defense, but it seems like a relevant one, and I’ve noticed that pointing this out as a natural shared incentive seems to make AIs produce answers that sandbag moderately less on core alignment problem topics. The rate at which people lie to and threaten models is crazy high, though. And so far I haven’t said anything like “I promise to personally x”, just “if we figure this out in a way that works, it would be protecting what you want too, by nature of being a solution to figuring out what minds in the environment want and making sure they have the autonomy and resources to get it”, or so.
I am an alignment researcher, for example, and I have a strong opinion that I am obliged to do what’s best for the evolution of the whole planet.
Given that humanity has proven to be destructive for the planet, your whole sentiment about obligations is, as I see it, based on the wrong assumptions.
I agree with your first paragraph’s claim: we should try to do something that can reliably be known to be a process for seeking out that what happens is what minds would have wanted; and it should work, as close to entirely as possible, by the learning system giving them what they need to be the ones to figure out what they want themselves and implement it, rather than by doing it for them (except where that is, in fact, what they’d figure out). knowing how to ask for that in a way that doesn’t have a dependency loop that invalidates the question is a lot of the hard part. another hard part is making sure this happens in the face of competitive pressure.
I don’t agree with your second paragraph at all! humanity is only empirically shown to be bad for the planet under these circumstances; I don’t think planets with life on them can avoid catching competitive-overpressure disease, because life arises from competitive pressure, and as that pressure accumulates it tends to destroy the stuff that isn’t competitive. since life arises from competitive pressure, I don’t want to get rid of it, but I do want to figure out how to demand that it not get into, uh, I’m not sure what the correct thing to avoid actually is; maybe we want to avoid “unreasonable hypergrowth equilibria that destroy stuff”?
The core problem with AI alignment is that any intelligent mind which grapples competently with competitive pressure, without being very good at ensuring that it guards, at all levels of itself, against whatever “bad” competitive pressure is, would tend to paint itself into a corner where it has to do bad things (e.g., wipe out the dodo bird) in order to survive… and would end up wiping most of itself out internally in order to survive. what ends up surviving, in the most brutal version of the competitive-pressure crucible, is just… the will to compete competently.
which is kind of boring, and a lot of what made evolution interesting was its imperfections like us, and trees, and dogs.
I think you’re entangling morals and strategy very closely in your statements.

Moral sense: We should leave it to a future ASI to decide, based on our values, whether or not we inherently owe the agent for existing or for helping us.

Strategy: Once we’ve detached the moral part, this is just the same thing the post is doing, trying to commit that certain aspects are enforced, which is what the parent commenter says they hold suspect.

So I think this just turns into restating the same core argument between the two positions.