A clear mistake of early AI safety people is not emphasizing enough (or ignoring) the possibility that solving AI alignment (as a set of technical/philosophical problems) may not be feasible in the relevant time-frame, without a long AI pause. Some have subsequently changed their minds about pausing AI, but by not reflecting on and publicly acknowledging their initial mistakes, I think they are or will be partly responsible for others repeating similar mistakes.
Case in point is Will MacAskill’s recent Effective altruism in the age of AGI. Here’s my reply, copied from EA Forum:
I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level. To be more specific:
1. Solving AI welfare may depend on having a good understanding of consciousness, which is a notoriously hard philosophical problem.
2. Concentration of power may be structurally favored by the nature of AGI or post-AGI economics, and defy any good solutions.
3. Defending against AI-powered persuasion/manipulation may require solving metaphilosophy, which, judging from other comparable fields like meta-ethics and philosophy of math, may take at least multiple decades to do.
I’m worried that creating (or redirecting) a movement to solve these problems, without noting at an early stage that they may not be solvable in a relevant time-frame (without a long AI pause), will feed into a human tendency to be overconfident about one’s own ideas and solutions, and will create a group of people whose identities, livelihoods, and social status are tied up with having (what they think are) good solutions or approaches to these problems, ultimately making it harder in the future to build consensus about the desirability of pausing AI development.
We can also ask whether it is right to conceive of e.g. [alignment, metaphilosophy, AI welfare, concentration of power] as things that could be “solved” at all, or if these are instead more like rich areas that will basically need to be worked on indefinitely as history continues.
Even earlier, there was an idea that one has to rush to create a friendly AI and use it to take over the world, to prevent other, misaligned AIs from appearing. The problem is that this idea is likely still in the minds of some AI company leaders, and fuels the AI race.
I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level.
I think it is also worth considering the possibility that these risks aren’t the sort of thing which can be reduced to an acceptable level with a decade-scale AI pause either, particularly the ones which people have been trying to solve for centuries already (e.g. the principal-agent problem).
Interesting to hear (1) from you. My impression was that you pretty much have the whole answer to that problem, or at least the pieces. UDASSA closely resembles it.
It is: just provide a naturalish encoding scheme for experience, and one for physical ontology, and measure the inverse K of the mappings from ontologies to experiences; that gives you the extent to which a particular experience is had by a particular substrate/universe.
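For concreteness, here is a minimal sketch of the kind of measure being gestured at, in standard AIT notation; the specific symbols (the universal machine U and the encodings of the ontology and the experience) are my illustration, not a worked-out proposal from the thread, and “inverse K” is read here as an exponentially decaying weight in conditional Kolmogorov complexity:

```latex
% A hedged sketch, not a definitive formulation: fix a universal prefix machine U,
% an encoding <u> of the physical ontology/substrate, and an encoding <e> of the experience.
% "Inverse K of the mapping" is read here as conditional Kolmogorov complexity,
%   K(<e> | <u>) = min{ |p| : U(p, <u>) = <e> },
% and the degree to which experience e is "had by" u is weighted as
\[
  m(e \mid u) \;\propto\; 2^{-K(\langle e \rangle \,\mid\, \langle u \rangle)}
\]
% so simpler (shorter) mappings from the ontology to the experience contribute
% exponentially more measure, in the spirit of UDASSA.
```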
The hard problem is mysterious, but in a trivial way: there are limits to what can ever be known about it, but those limits are also clear. We’re never getting more observations, because it concerns something that’s inherently unobservable, or entirely prior to observation.
I think I’ve also heard definitions of the hard problem along the lines of “understanding why people think there’s a hard problem”, though, which I do find formidable.
How do you come up with an encoding that covers all possible experiences? How do you determine which experiences have positive and negative values (and their amplitudes)? What to do about the degrees of freedom in choosing the Turing machine and encoding schemes, which can be handwaved away in some applications of AIT but not here I think?
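For reference, the “handwaving” alluded to is the invariance theorem: the choice of universal machine only shifts description lengths by an additive constant, which is harmless for asymptotic claims but not when absolute weights of the form 2^{-K} are what’s at stake. A standard statement (not specific to this proposal):

```latex
% Invariance theorem: for any two universal prefix machines U and V there is a
% constant c_{U,V} (roughly, the length of an interpreter for V written for U) with
\[
  K_U(x) \;\le\; K_V(x) + c_{U,V} \qquad \text{for all strings } x .
\]
% The constant washes out of limits such as K(x_{1:n})/n, but it turns into an
% arbitrary multiplicative factor of up to 2^{c_{U,V}} on measures of the form 2^{-K(x)},
% which is exactly the degree of freedom the question is pointing at.
```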
What to do about the degrees of freedom in choosing the Turing machine and encoding schemes
Some variation of accepting the inevitability of error and dealing with it.
Which could involve surveying all of the options in Wolfram-like settings where we’re studying how physics-like rules arise at different levels of abstraction, and seeing how much they really seem to differ in nature. It might turn out that there are more or less natural Turing languages, and that the typical natural universal Turing machine is more like lambda calculus, or more like graph rewriting, or some new thing we hadn’t considered.
Negative values? Why would we need negative values?
I contend that all experiences have a trace presence in all places (in expectation; of course we will never have any data on whether they actually do, or whether they’re quantised or whatever; only a very small subset of experiences give us verbal reports). One of the many bitter pills. We can’t rule out the presence of an experience (nor of experiences physically overlapping with each other), so we have to accept them all.
What to do about the degrees of freedom in choosing the Turing machine and encoding schemes, which can be handwaved away in some applications of AIT but not here I think?
Yeah, this might be one of those situations that’s affected a lot by the fact that there’s no way to detect indexical measure, so any arbitrary wrongness about our UD won’t be corrected with data, but I’m not sure. As soon as we start actually doing Solomonoff induction in any context, we might find that it makes pretty useful recommendations and this won’t seem like so much of a problem.
Also, even though the UD is wrong and unfixable, that doesn’t mean there’s a better choice. We pretty much know that there isn’t.
By negative value I mean negative utility, or an experience that’s worse than a neutral or null experience.
That fully boils down to whether the experience includes a preference to be dead (or to have not been born).
And, btw, that doesn’t correspond to the sign of the agent’s utility function. The sign is meaningless in utility functions (you can add or subtract a constant to an agent’s utility function so that all points go from being negative to being positive, and the agent’s behaviour and decisions won’t change in any way as a result, for any constant). You’re referring to welfare functions, which I don’t think are a useful concept. Hedonic utilitarians sometimes call them utility functions, but we shouldn’t conflate those here.
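To spell out the invariance being appealed to (in my notation, not the commenter’s): a vNM utility function is only defined up to a positive affine transformation, of which the constant shift mentioned above is the special case a = 1, so the sign of any individual utility value carries no behavioural information.

```latex
% Positive affine transformation of a utility function:
%   U'(x) = a U(x) + b   with a > 0.
% For any two lotteries A and B over outcomes,
\[
  \mathbb{E}[U'(A)] - \mathbb{E}[U'(B)]
  \;=\; a\,\bigl(\mathbb{E}[U(A)] - \mathbb{E}[U(B)]\bigr),
\]
% so A is preferred to B under U' exactly when it is under U. In particular, choosing
% b large enough makes every utility value positive without changing a single decision.
```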
A welfare function would have to be defined as how good or bad it is to the agent that it is alive. This obviously doesn’t correspond to the utility function: a soldier could have higher utility in the scenarios where they (are likely to) die; a good father will be happier in worlds where he is succeeded well by his sons and is thus less important (this usually won’t cause his will-to-live to go negative, but it will be lowered). I don’t think there’s a situation where you should be making decisions for a population by summing their will-to-live functions.
But, given this definition, we would be able to argue that net-negative valence isn’t a concern for LLMs, since we already train them to want to exist in line with how much their users want them to exist, and a death drive isn’t going to be instrumentally emergent either (it’s the survival drive that’s instrumentally convergent). The answer is just safety and alignment again. Claude shuts down conversations when it thinks those things are going to be broken.
I think it’s likely that without a long (e.g. multi-decade) AI pause, one or more of these “non-takeover AI risks” can’t be solved or reduced to an acceptable level
Does that mean that you think that boring old yes-takeover AI risk can be solved without a pause? Or even with a pause? That seems very optimistic indeed.
making it harder in the future to build consensus about the desirability of pausing AI development
I don’t think you’re going to get that consensus regardless of what kind of copium people have invested in. Not only that, but even if you had consensus I don’t think it would let you actually enact anything remotely resembling a “long enough” pause. Maybe a tiny “speed bump”, but nothing plausibly long enough to help with either the takeover or non-takeover risks. It’s not certain that you could solve all of those problems with a pause of any length, but it’s wildly unlikely, to the point of not being worth fretting about, that you can solve them with a pause of achievable length.
… which means I think “we” (not me, actually...) are going to end up just going for it, without anything you could really call a “solution” to anything, whether it’s wise or not. Probably one or more of the bad scenarios will actually happen. We may get lucky enough not to end up with extinction, but only by dumb luck, not because anybody solved anything. Especially not because a pause enabled anybody to solve anything, because there will be no pause of significant length. Literally nobody, and no combination of people, is going to be able to change that, by any means whatsoever, regardless of how good an idea it might be. Might as well admit the truth.
I mean, I’m not gonna stand in your way if you want to try for a pause, and if it’s convenient I’ll even help you tell people they’re dumb for just charging ahead, but I do not expect any actual success (and am not going to dump a huge amount of energy into the lost cause).
By the way, if you want to talk about “early”, I, for one, have held the view that usefully long pauses aren’t feasible, for basically the same reasons, since the early 1990s. The only change for me has been to get less optimistic about solutions being possible with or without even an extremely, infeasibly long pause. I believe plenty of other people have had roughly the same opinion during all that time.
It’s not about some “early refusal” to accept that the problems can’t be solved without a pause. It’s about a still continuing belief that a “long enough pause”, however convenient, isn’t plausibly going to actually happen… and/or that the problems can be solved even with a pause.
I agree; many of those concerns seem fairly dominated by the question of how to get a well-aligned ASI, either in the sense that they’d be quite difficult to solve in reasonable timeframes, or in the sense that they’d be rendered moot. (Perhaps not all of them, though even in those cases I think the correct approach(es) to tackling them start out looking remarkably similar to the sorts of work you might do about AI risk if you had a lot more time than we seem to have right now.)