It’s not clear to me why the BFO would converge to a fixed point of μ. If we’ve solved the problem of embedded agency and the AI system knows that y_t can depend on its prediction z_t, then it would tend to find a fixed point, but it could also do the sort of counterfactual reasoning you say it can’t do. If we haven’t solved embedded agency, then it seems like the hypothesis that best explains the data is to posit the existence of some other classifier h that works the same way that the AI did in past timesteps, with y_t = μ(h(x_t)) + v(h(x_t)). Intuitively, this is saying that the past data is explained by a hypothetical other classifier that worked the same way as the AI used to, and now the AI thinks one level higher than that. This probably does converge to a fixed point eventually, but at any given timestep the best hypothesis would be some finite number of applications of μ and v.
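To illustrate the “finite number of applications of μ and v” picture, here’s a toy numeric sketch. Everything in it is made up: μ and v are arbitrary contractions I chose so that the iteration visibly converges, standing in for whatever the real classifier and correction term would be.

```python
import math

# Toy model (my construction): h_0 is the original classifier, and
# "thinking one level higher" applies mu and v once more:
#   h_{k+1}(x) = mu(h_k(x)) + v(h_k(x)).
# If this iteration converges, the limit z* satisfies z* = mu(z*) + v(z*).

def mu(z):
    return 0.5 * z + 1.0       # hypothetical contraction standing in for mu

def v(z):
    return 0.1 * math.tanh(z)  # hypothetical small correction term

def k_level_hypothesis(x, k):
    """Prediction after k applications of 'one level higher' reasoning."""
    z = x  # in this toy model, h_0 just passes the input through
    for _ in range(k):
        z = mu(z) + v(z)
    return z

# Any finite k gives a hypothesis that differs from the fixed point,
# but (because mu + v is a contraction here) the iterates converge:
for k in (1, 2, 5, 20):
    print(k, k_level_hypothesis(0.0, k))
```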
The BFO can generally cope with humans observing z_t = f(y_t)
Should this be z_t = f(x_t)?
Seems like a nonissue.
I’m not claiming it’s an issue, I’m trying to understand what AUP does. Your responses to comments are frequently of the form “AUP wouldn’t do that”, so afaict none of the commenters (including me) groks your conception of AUP. I’m trying to extract simple implications and see if they’re actually true, in an attempt to grok it.
That doesn’t conflict with what I said.
I can’t tell if you agree or disagree with my original claim. “Don’t think so in general?” implies not, but this implies you do?
If you disagree with my original claim, what’s an example with deterministic known dynamics, where there is an optimal plan to achieve maximal u_A that can be executed at any time, where AUP with intent verification will execute that plan before the last possible moment in the epoch?
(1) I am unsure whether there exists an idealized reasoner analogous to a Carnot engine (see Realism about rationality). Even if such a reasoner exists, it seems unlikely that we will a) figure out what it is, b) understand it in sufficient depth, and c) successfully use it to understand and improve ML techniques, before we get powerful AI systems through other means. Under short timelines, this cuts particularly deeply, because a) there’s less time to do all of these things and b) it’s more likely that advanced AI is built out of “messy” deep learning systems that seem less amenable to this sort of theoretical understanding.
(2) I certainly agree that all else equal, advanced agents should act closer to ideal agents. (Assuming there is such a thing as an ideal agent.) I also agree that advanced AI should be less susceptible to money pumps, from which I learn that their “preferences” (i.e. world states that they work to achieve) are transitive. I’m also on board that more advanced AI systems are more likely to be describable as maximizing the expected value of some utility function, per the VNM theorem. I don’t agree that the utility function must be simple, or that the AI must be internally reasoning by computing the expected utility over all actions and then choosing the one that’s highest. I would be extremely surprised if we built powerful AI such that when we say the English sentence “make paperclips” it acts in accordance with the utility function U(universe history) = number of paperclips in the last state of the universe history. I would be very surprised if we built powerful AI such that we hardcode in the above utility function and then design the AI to maximize its expected value.
But the first action doesn’t strictly improve your ability to get u_A (because you could just wait and execute the plan later), and so intent verification would give it a 1.01 penalty?
I was also confused by intent verification. The confusion went away after I figured out two things:
Each action in the plan is compared to the baseline of doing nothing, not to the baseline of the optimal plan.
Is it correct that in deterministic environments with known dynamics, intent verification will cause the agent to wait until the last possible timestep in the epoch at which it can execute its plan and achieve maximal u_A?
Yeah, I agree with that. I had to cut out a lot of interesting thoughts about it to keep it short, but re-reading the summary I wish I had included a link to your comment, which I found quite helpful. I’ll probably add a note to the next newsletter about it.
People keep saying things like this, and it might be true. But on what data are we basing this? Have we tried relaxing an impact measure, given that we have a conceptual core in hand?
What? I’ve never tried to write an algorithm to search an unordered set of numbers in O(log n) time, yet I’m quite certain it can’t be done. It is possible to make a real claim about X without having tried to do X. Granted, all else equal trying to do X will probably make your claims about X more likely to be true (but I can think of cases where this is false as well).
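(For concreteness, here is the standard adversary argument behind that particular claim; this is textbook material, nothing specific to this thread.)

```latex
\textbf{Claim.} Any deterministic algorithm searching an unordered array
of $n$ cells must probe $\Omega(n)$ cells in the worst case, so $O(\log n)$
search is impossible.

\textbf{Sketch.} Suppose the algorithm probes only cells $i_1, \dots, i_k$
with $k < n$ on some input where it answers ``not found.'' An adversary
builds a second input that agrees on all probed cells but places the
target in an unprobed cell. The algorithm makes the same probes and gives
the same (now wrong) answer. Hence all $n$ cells must be probed in the
worst case.
```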
Thinking of it as alien agents does make more sense; I think that basically convinces me that this is not an important point to get hung up about. (Though I still do have residual feelings of weirdness.)
I argue that you should be very careful about believing these things.
You’re right, I was too loose with language there. A more accurate statement is “The general argument and intuitions behind the claim are compelling enough that I want any proposal to clearly explain why the argument doesn’t work for it”. Another statement is “the claim is compelling enough that I throw it at any particular proposal, and if it’s unclear I tend to be wary”. Another one is “if I were trying to design an impact measure, showing why that claim doesn’t work would be one of my top priorities”.
Perhaps we do mostly agree, since you are planning to talk more about this in the future.
it generally seems like the error that people make when they say, “well, I don’t see how to build an AGI right now, so it’ll take thousands of years”.
I think the analogous thing to say is, “well, I don’t see how to build an AGI right now because AIs don’t form abstractions, and no one else knows how to make AIs that form abstractions, so if anyone comes up with a plan for building AGI, they should be able to explain why it will form abstractions, or why AI doesn’t need to form abstractions”.
I actually think we could, but I have yet to publish my reasoning on how we would go about this, so you don’t need to take my word for now. Maybe we could discuss this when I’m able to post that?
Another consideration I forgot to highlight: the agent’s actual goal should be pointing in (very) roughly the right direction, so it’s more inclined to have certain kinds of impact than others.
Yeah, I agree this helps.
I don’t understand the issue here – the attainable u_A is measuring “how well would I be able to start maximizing this goal from here?” It seems to be captured by what you just described. It’s supposed to capture the future ability, regardless of what has happened so far. If you do a bunch of jumping jacks, and then cripple yourself, should your jumping jack ability remain high because you already did quite a few?
In the case you described, u_A would be “Over the course of the entire history of the universe, I want to do 5 jumping jacks—no more, no less.” You then do 5 jumping jacks in the current epoch. After this, u_A will always output 1, regardless of policy, so its penalty should be zero, but since you call u_A on subhistories, it will say “I guess I’ve never done any jumping jacks, so attainable utility is 1 if I do 5 jumping jacks now, and 0 otherwise”, which seems wrong.
On the meta level: I think our disagreements seem of this form:
Me: This particular thing seems strange and doesn’t gel with my intuitions, here’s an example.
You: That’s solved by this other aspect here.
Me: But… there’s no reason to think that the other aspect captures the underlying concept.
You: But there’s no actual scenario where anything bad happens.
Me: But if you haven’t captured the underlying concept I wouldn’t be surprised if such a scenario exists, so we should still worry.
There are two main ways to change my mind in these cases. First, you could argue that you actually have captured the underlying concept, by providing an argument that your proposal does everything that the underlying concept would do. The argument should quantify over “all possible cases”, and is stronger the fewer assumptions it has on those cases. Second, you could convince me that the underlying concept is not important, by appealing to the desiderata behind my underlying concept and showing how those desiderata are met (in a similar “all possible cases” way). In particular, the argument “we can’t think of any case where this is false” is unlikely to change my mind—I’ve typically already tried to come up with a case where it’s false and not been able to come up with anything convincing.
I don’t really know how I’m supposed to change your mind in such cases. If it’s by coming up with a concrete example where things clearly fail, I don’t think I can do that, and we should probably end this conversation. I’ve outlined some ways in which I think things could fail, but anything involving all possible utility functions and reasoning about long-term convergent instrumental goals is sufficiently imprecise that I can’t be certain that anything in particular would fail.
(That’s another thing causing a lot of disagreements, I think—I am much more skeptical of any informal reasoning about all computable utility functions, or reasoning that depends upon particular aspects of the environment, than you seem to be.)
I’m going to try to use this framework in some of my responses.
But natural kind is a desideratum! I’m thinking about adding one, though.
Here, the “example” is the impact penalty that is always 1.01, the “other aspect” is “natural kind”, and the “underlying concept” is that an impact measure should allow the AI to do things.
Arguably 1.01 is a natural kind—is it not natural to think “any action that’s different from inaction is impactful”? I legitimately find 1.01 more natural than AUP—it is _really strange_ to me to penalize changes in Q-values in _both directions_. This is an S1 intuition, don’t take it seriously—I say it mainly to make the point that natural kind is subjective, whereas the fact that 1.01 is a bad impact penalty is not subjective.
So notice that although AUP is by design value agnostic, it has moderate value awareness via approval. I think this helps us around some issues you may be considering—I expect the approval incentives to be fairly strong.
Here, the “example” is how other actions might make us more likely to turn off the agent, the “other aspect” is value awareness via approval, and the “underlying concept” is something like “can the agent do things that it knows we want”.
Here, I’m pretty happy about value awareness via approval because it seems like it could capture a good portion of the underlying concept, but I think that’s not clearly true—value awareness via approval depends a lot on the environment, and only captures some of the concept. If unaligned aliens were going to take over the AI, or we’re going to get wiped out by an asteroid, the AI couldn’t stop that from happening even though it knows we’d want it to. Similarly, if we wanted to build von Neumann probes but couldn’t without the AI’s help, it couldn’t do that for us. Invoking the framework again, the “example” is building von Neumann probes, the “other aspect” might be something like “building a narrow technical AI that just creates von Neumann probes and places them outside the AI’s control”, and the “underlying concept” is “the AI should be able to do what we want it to do”.
You might not be considering the asymmetry imposed by approval.
See paragraph above about why approval makes me happier but doesn’t fully remove my worries.
I view it as saying “there’s no clever complete plan which moves you towards your goal while not changing other things” (ofer has an interesting example for incomplete plans which doesn’t trigger Theorem 1’s conditions). This somewhat implies that it’s measuring impact in a universal sense, although it only holds for all computable u.
When utility functions are on full histories I’d disagree with this (Theorem 1 feels decidedly trivial in that case); it’s possible that utility functions on subhistories are different, so perhaps I’ll wait until I understand that better.
Any action for which E[Penalty(a_unit)] is strictly increased?
By default I’d expect this to knock out half of all actions, which is quite a problem for small, granular action sets.
My model strongly disagrees with this intuition, and I’d be interested in hearing more arguments for it.
Uh, I thought I gave a very strong one—you can’t encode the utility function “I want to do X exactly once”. Let’s consider the variant “I want to do X exactly once, on the first timestep”. You could try to encode this by writing u_A = 1 if a_1 = X, and 0 otherwise. Since you apply u_A on different subhistories, this actually wants you to take action X as the first action of every epoch. If you’re using the full history for action selection, that may not be the case, but the attainable utility calculation will definitely think “The attainable utility for u_A is 1 if I can take action X at time step t+n+1, and 0 otherwise” _even if_ you have already taken action X.
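Here is a minimal sketch of that failure mode; the code is my own toy construction (the history, the epoch boundary, and u_A are all made up):

```python
# Toy sketch (my construction) of the subhistory problem. u_A rewards
# taking action "X" on the first step of whatever history it is handed.

def u_A(history):
    return 1.0 if history and history[0] == "X" else 0.0

full_history = ["X", "noop", "noop", "noop"]  # X was done once, at t=1
subhistory = full_history[2:]  # what the attainable utility calculation sees

print(u_A(full_history))  # 1.0: the goal is already satisfied
print(u_A(subhistory))    # 0.0: on the subhistory, u_A "forgets" that X
                          # happened, so attainable utility looks like 1
                          # only if the agent takes X again right now
```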
This seems extremely premature. I agree that AUP should be more lax in some ways. The conclusion “looks maybe impossible, then” doesn’t seem to follow. Why don’t we just tweak the formulation? I mean, I’m one guy who worked on this for two months. People shouldn’t take this to be the best possible formulation.
The claim I’m making has nothing to do with AUP. It’s an argument that’s quantifying over all possible implementations of impact measures. The claim is “you cannot satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do useful things)”. I certainly haven’t proven this claim, nor have I given such a strong argument that everyone should mostly believe it, but I do currently believe this claim.
AUP might get around this by not being objective—that’s what value awareness through approval does. And in fact I think the more you think that value awareness through approval is important, the less that AUP meets your original desideratum of being value-agnostic—quoting from the desiderata post:
If we substantially base our impact measure on some kind of value learning—you know, the thing that maybe fails—we’re gonna have a bad time.
This seems to apply to any AUP-agent that is substantially value aware through approval.
From the desiderata post comments:
This criticism of impact measures doesn’t seem falsifiable? Or maybe I misunderstand.
That was an example meant to illustrate my model that impact (the concept in my head, not AUP) and values are sufficiently different that an impact measure couldn’t satisfy all three of objectivity, safety, and non-trivialness. The underlying model is falsifiable.
People have yet to point out a goal AUP cannot maximize in a low-impact way. Instead, certain methods of reaching certain goals are disallowed. These are distinct flaws, with the latter only turning into the former (as I understand it) if no such method exists for any given goal.
See first paragraph about our disagreements. But also I weakly claim that “design an elder-care robot” is a goal that AUP cannot maximize in a low-impact way today, or that if it can, there exists a (u_A, plan) pair such that AUP executes the plan and causes a catastrophe. (This mostly comes from my model that impact and values are fairly different, and to a lesser extent the fact that AUP penalizes everything some amount that’s not very predictable, and that a design for an elder-care robot could allow humans to come up with a design for unaligned AGI.) I would not make this claim if I thought that value awareness through approval and intent verification were strong effects, but in that case I would think of AUP as a value learning approach, not an impact measure.
Will reply on the other post to consolidate discussion.
This is a clear strawman, and is compounding the sense I have that we’re trying to score points now.
Fwiw, I would make the same argument that ofer did (though I haven’t read the rest of the thread in detail). For me, that argument is an existence proof of the following claim: if you know nothing about an impact measure, it is possible that the impact measure disallows all malignant behavior, and yet all of the difficulty is in figuring out how to make it lenient enough.
Now, obviously we know something about AUP, but it’s not obvious to me that we can make AUP lenient enough to do useful things without also allowing malignant behavior.
Nice job! This does meet a bunch of desiderata in impact measures that weren’t there before :)
My main critique is that it’s not clear to me that an AUP-agent would be able to do anything useful, and I think this should be included as a desideratum. I wrote more about this on the desiderata post, but it’s worth noting that the impact penalty that is always 1.01 meets all of the desiderata except natural kind.
For example, perhaps the action used to define the impact unit is well-understood and accepted, but any other action makes humans a little bit more likely to turn off the agent. Then the agent won’t be able to take those actions. Generally, I think that it’s hard to satisfy the conjunction of three desiderata—objectivity (no dependence on values), safety (preventing any catastrophic plans) and non-trivialness (the AI is still able to do some useful things).
Questions and comments:
We now formalize impact as change in attainable utility. One might imagine this being with respect to the utilities that we (as in humanity) can attain. However, that’s pretty complicated, and it turns out we get more desirable behavior by using the agent’s attainable utilities as a proxy.
An impact measure that penalized change in utility attainable by humans seems pretty bad—the AI would never help us do anything. To the extent that the AI’s ability to do things is meant to be similar to our ability to do things, I would expect that to be bad for us in the same way.
Breaking a vase seems like it is restricting outcome space. Do you think it is an example of opportunity cost? That doesn’t feel right to me, but I suspect I could be quickly convinced.
Nitpick: Overfitting typically refers to situations where the training distribution _does_ equal the test distribution (but the training set is different from the test set, since they are samples from the same distribution).
One might intuitively define “bad impact” as “decrease in our ability to achieve our goals”.
Nitpick: This feels like a definition of “bad outcomes” to me, not “bad impact”.
we avoid overfitting the environment to an incomplete utility function and thereby achieve low impact.
This sounds very similar to me to “let’s have uncertainty over the utility function and be risk-averse” (similar to eg. Inverse Reward Design), but the actual method feels nothing like that, especially since we penalize _increases_ in our ability to pursue other goals.
I view Theorem 1 as showing that the penalty biases the agent towards inaction (as opposed to eg. showing that AUP measures impact, or something like that). Do you agree with that?
Random note: Theorem 1 depends on U containing all computable utility functions, and may not hold for other sets of utility functions, even infinite ones. Consider an environment where breaking vases and flowerpots is irreversible. Let u_A be 1 if you stand at a particular location and 0 otherwise. Let U contain only utility functions that assign different weights to having intact vases vs. flowerpots, but always assigns 0 utility to environments with broken vases and flowerpots. (There are infinitely many of these.) Then if you start in a state with broken vases and flowerpots, there will never be any impact penalty for any action.
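To make this concrete, here is a toy version of the counterexample; the states, actions, and the finite stand-in for the infinite family U are all my own constructions, and the penalty below is only a sketch of an AUP-style sum of attainable utility differences, not the exact formula:

```python
# Toy sketch (my construction). States record whether the vase and the
# flowerpot are intact; breaking either is irreversible. Every u in U
# gives utility 0 whenever both are broken, so starting from that state,
# attainable utility is 0 for every u under every action, and the
# penalty is always 0.

def make_u(w_vase, w_pot):
    def u(state):
        vase_ok, pot_ok = state
        if not vase_ok and not pot_ok:
            return 0.0  # all of U agrees: everything-broken is worth 0
        return w_vase * vase_ok + w_pot * pot_ok
    return u

# Finite stand-in for the infinite family of such utility functions:
U = [make_u(w, 1.0 - w) for w in (0.1, 0.3, 0.5, 0.7, 0.9)]

start = (False, False)  # both already irreversibly broken

def attainable(u, state):
    # With everything irreversibly broken, every reachable state equals
    # `state`, so the best attainable utility is just u(state).
    return u(state)

def state_after(action, state):
    # No action can un-break anything, so from this degenerate start
    # state every action (including noop) leaves the state unchanged.
    return state

for action in ("move_to_location", "wave_arms"):  # hypothetical actions
    penalty = sum(
        abs(attainable(u, state_after(action, start))
            - attainable(u, state_after("noop", start)))
        for u in U
    )
    print(action, penalty)  # 0.0 for every action
```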
To prevent the agent from intentionally increasing ImpactUnit, simply apply 1.01 penalty to any action which is expected to do so.
How do you tell which action is expected to do so?
Simple extensions of this idea drastically reduce the chance that a_unit happens to have unusually-large objective impact; for example, one could set ImpactUnit to be the non-zero minimum of the impacts of 50 similar actions.
I think this makes it much more likely that your AI is unable to do anything. (This is an example of why I wanted a desideratum of “your AI is able to do things”.)
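A toy calculation of why (the numbers are made up): the non-zero minimum of 50 samples is typically much smaller than a single sample, and since ImpactUnit is the denominator of the scaled penalty, every other action’s penalty grows correspondingly:

```python
import random

random.seed(0)
impacts = [random.uniform(0.0, 1.0) for _ in range(50)]  # 50 similar actions

single_sample = impacts[0]                    # ImpactUnit from one action
min_of_50 = min(x for x in impacts if x > 0)  # non-zero minimum of the 50

raw_penalty = 0.3  # hypothetical raw penalty of some useful action
print(raw_penalty / single_sample)  # scaled penalty under one sample
print(raw_penalty / min_of_50)      # typically far larger, so more
                                    # actions exceed the allowed budget
```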
We crisply defined instrumental convergence and opportunity cost and proved their universality.
I’m not sure what this is referring to. Are the crisp definitions the increase/decrease in available outcome-space? Where was the proof of universality?
An alternative definition such as “an agent’s ability to take the outside view on its own value-learning algorithm’s efficacy in different scenarios” implies a value-learning setup which AUP does not require.
That definition can be relaxed to “an agent’s ability to take the outside view on the trustworthiness of its own algorithms” to get rid of the value-learning setup. How does AUP fare on this definition?
I also share several of Daniel’s thoughts, for example, that utility functions on subhistories are sketchy (you can’t encode the utility function “I want to do X exactly once ever”), and that the “no offsetting” desideratum may not be one we actually want (and similarly for the “shutdown safe” desideratum as you phrase it), and that as a result there may not be any impact measure that we actually want to use.
(Fwiw, I think that when Daniel says he thinks offsetting is useful and I say that I want as a desideratum “the AI is able to do useful things”, we’re using similar intuitions, but this is entirely a guess that I haven’t confirmed with Daniel.)
(I’m going to assume you mean the weaker thing that doesn’t literally involve precluding every possible bad outcome)
I’m confused. I think under the strongly superintelligent AI model (which seems to be the model you’re using), if there’s misalignment then the AI is strongly optimizing against any security precautions we’ve taken, so if we don’t preclude every possible bad outcome, the AI will find the one we missed. I grant that we’re probably not going to be able to prove that it precludes every possible bad outcome, if that’s what you’re worried about, but that still should be our desideratum. I’m also happy to consider other threat models besides strongly superintelligent AI, but that doesn’t seem to be what you’re considering.
Your example with Go is not value-agnostic, and arguably has miniscule objective impact on its own.
That’s my point. It could have been the case that we cared about AIs not beating us at Go, and if building AlphaGo really does have minuscule objective impact, then it would have been built anyway, causing a catastrophe. In that world, I wouldn’t be surprised if we had arguments about why such a thing was clearly a high-impact action. (Another way of putting this is that I think either “impact” is a value-laden concept, or “impact” will fail to prevent some catastrophe, or “impact” prevents the AI from doing anything useful.)
I don’t see why an impact measure fulfilling the criteria I listed wouldn’t meet what I think you have in mind.
Suppose your utility function has a maximum value of 1, and the inaction policy always gets utility 0. Consider the impact penalty that always assigns a penalty of 2, except for the inaction policy where the penalty is 0. The agent will provably follow the inaction policy. This impact penalty satisfies all of the desiderata, except “natural kind”. If you want to make it continuous for the “goal-agnostic” desideratum, then make the impact penalty 2 + <insert favorite impact penalty here>. Arguably it doesn’t satisfy “scope-sensitivity” and “irreversibility-sensitivity”. I’m counting those as satisfied because this penalty will never allow the agent to take a higher-impact action, or a more-irreversible action, which I think was the point of those desiderata.
This is a bad impact measure, because it makes the AI unable to do anything. We should probably have a desideratum that outlaws this, and it should probably be of the form “Our AI is able to do things”, and that’s what I was trying to get at above. (And I do think that AUP might have this problem.)
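Here is a minimal sketch of the “provably follows the inaction policy” step, with made-up numbers:

```python
# Toy numbers (mine): utility is capped at 1, inaction has utility 0 and
# penalty 0, every other policy has penalty 2. The penalized objective
# u(pi) - penalty(pi) is at most 1 - 2 = -1 for any non-inaction policy,
# versus 0 for inaction, so the agent always chooses inaction.

def penalized_value(utility, is_inaction):
    penalty = 0.0 if is_inaction else 2.0
    return utility - penalty

candidates = {
    "inaction":         penalized_value(0.0, True),
    "best_useful_plan": penalized_value(1.0, False),  # max possible utility
    "mediocre_plan":    penalized_value(0.4, False),
}
print(max(candidates, key=candidates.get))  # "inaction"
```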