In theory, never (either hyperbolic time discounting is a bias, and never “should” be done, or it’s a value, but one that longtermists explicitly don’t share).
In practice, hyperbolic time discounting might be a useful heuristic: e.g. since we are bad at thinking of all the ways our plans can go wrong, we tend to overestimate how much stuff we’ll have in the future, and hyperbolic time discounting corrects for that.
For the record, the VNM theorem is about the fact that you are maximizing expected utility. All three of those words are important, not just the utility function part. The biggest constraint that the VNM theorem imposes is that, assuming there is a “true” probability distribution over outcomes (or that the agent has a well-calibrated belief over outcomes that captures all the information it has about the environment), the agent must choose actions in a way consistent with maximizing the expectation of some real-valued function of the outcome, which does in fact rule out some possibilities.
It’s only when you don’t have a probability distribution that the VNM theorem becomes contentless. So one check on whether it’s “reasonable” to invoke the VNM theorem is to see what happens in a deterministic environment that the agent can perfectly model: the VNM theorem shouldn’t add any force to the argument in that setting.
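For reference, the standard statement: if a preference relation $\succeq$ over lotteries satisfies completeness, transitivity, continuity, and independence, then there exists a real-valued function $u$ over outcomes such that for all lotteries $L, M$:

$$L \succeq M \iff \mathbb{E}_{o \sim L}[u(o)] \ge \mathbb{E}_{o \sim M}[u(o)]$$

The force comes from the expectation: preferences over lotteries must be linear in the probabilities, which rules out e.g. Allais-style preferences that weight probabilities nonlinearly.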
When I see the word “embedded” (at least when I’m reading LW/AF), I usually imagine the problems of embedded agency, where the central issue is that the agent is a part of its environment (in contrast to the Cartesian model, where there is a clear, bright line dividing the agent and the environment). Afaict, “embedded Naive Bayes” is something that makes sense in a Cartesian model, which I wasn’t expecting.
It’s not that big a deal, but if you want to avoid that confusion, you might want to change the word “embedded”. I kind of want to say “The Intentional Stance towards Naive Bayes”, but that’s not right either.
What do you mean by embedded here? It seems you are asking the question “does a particular input-output behavior / computation correspond to some Naive Bayes model”, which is not what I would intuitively think of as “embedded Naive Bayes”.
Fwiw having read this exchange, I think I approximately agree with Paul. Going back to the original response to my comment:
Isn’t HCH also such a multiagent system?
Yes, I shouldn’t have made a categorical statement about multiagent systems. What I should have said was that the particular multiagent system you proposed did not have a single thing it is “trying to do”, i.e. I wouldn’t say it has a single “motivation”. This allows you to say “the system is not intent-aligned”, even though you can’t say “the system is trying to do X”.
Another way of saying this is that it is an incoherent system and so the motivation abstraction / motivation-competence decomposition doesn’t make sense, but HCH is one of the few multiagent systems that is coherent. (Idk if I believe that claim, but it seems plausible.) This seems to map on to the statement:
For an incoherent system this abstraction may not make sense, and a system may be trying to do lots of things.
Also, I want to note strong agreement with this:
Of course, it also seems quite likely that AIs of the kind that will probably be built (“by default”) also fall outside of the definition-optimization framework. So adopting this framework as a way to analyze potential aligned AIs seems to amount to narrowing the space considerably.
Oh, I see, you’re talking about the system as a whole, whereas I was thinking of the human imitation specifically. That seems like a multiagent system and I wouldn’t apply single-agent reasoning to it, so I agree motivation-competence is not the right way to think about it (but if you insisted on it, I’d say it fails motivation, mostly because the system doesn’t really have a single “motivation”).
It doesn’t seem like the definition-optimization decomposition helps either? I don’t know whether I’d call that a failure of definition or optimization.
Or to put it another way, suppose AI safety researchers determined ahead of time what kinds of questions won’t cause the Oracle to perform malign optimizations. Would that not count as part of the solution to motivation / intent alignment of this system (i.e., the combination of human imitation and Oracle)?
I would say the human imitation was intent aligned, and this helped improve the competence of the human imitation. I mostly wouldn’t apply this framework to the system (and I also wouldn’t apply definition-optimization to the system).
I overall agree that this is a con. Certainly there are AI systems that are weak enough that you can’t talk coherently about their “motivation”. Probably all deep-learning-based systems fall into this category.
I also agree that (at least for now, and probably in the future as well) you can’t formally specify the “type signature” of motivation such that you could separately solve the competence problem without knowing the details of the solution to the motivation problem.
My hope here would be to solve the motivation problem and leave the competence problem for later, since on my view that solves most of the problem (I’m aware that you disagree with this).
I don’t agree that it’s not clean at the conceptual level. It’s perhaps less clean than the definition-optimization decomposition, but not much less.
For example, suppose we tried to increase the competence of the human imitation by combining it with a superintelligent Oracle, and it turns out the human imitation isn’t very careful and in most timelines destroys the world by asking unsafe questions that cause the Oracle to perform malign optimizations. Is this a failure of motivation or a failure of competence, or both?
This seems pretty clearly like a failure of competence to me, since the human imitation would (presumably) say that they don’t want the world to be destroyed, and they (presumably) did not predict that that was what would happen when they queried the oracle.
Update: A reader suggested that in the open-source implementation of PopArt, the PopArt normalization happens after the reward clipping, counter to my assumption. I no longer understand why PopArt is helping, beyond “it’s good for things to be normalized”.
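To make the two orderings concrete, here’s a minimal sketch (purely illustrative: the clip range, the EMA statistics, and all names are placeholders rather than the actual implementation, and I’m collapsing rewards and value targets into one array for brevity):

```python
import numpy as np

def normalize(targets, mu=0.0, nu=1.0, beta=3e-4):
    # First and second moments of the targets (a single PopArt-style EMA
    # update shown here), followed by standard whitening.
    mu = (1 - beta) * mu + beta * targets.mean()
    nu = (1 - beta) * nu + beta * (targets ** 2).mean()
    sigma = np.sqrt(max(nu - mu ** 2, 1e-8))
    return (targets - mu) / sigma

raw = 100.0 * np.random.randn(128)  # fake targets with a large scale

# Ordering I had assumed: normalize the raw targets, then clip, so the
# statistics get to see (and adapt to) the raw scale.
a = np.clip(normalize(raw), -1.0, 1.0)

# Ordering reportedly in the open-source implementation: clip first, then
# normalize the already-clipped values, so the statistics never see the
# raw scale.
b = normalize(np.clip(raw, -1.0, 1.0))
```

Under the second ordering, whatever benefit PopArt provides can’t be about recovering the reward-scale information that clipping destroys.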
Do you have any links related to this?
No, I haven’t read much about Bayesian updating. But I can give an example.
Consider the following game. I choose a coin. Then we play N rounds. In each round, you bet on whether the coin will come up Heads or Tails, at 1:2 odds that I must take (i.e. if you’re right I give you $2, and if I’m right you give me $1). Then I flip the coin and the bet resolves.
If your hypothesis space is “the coin has some bias b of coming up Heads or Tails”, then you will eagerly accept this game for large enough N—you will quickly learn the bias b from experiments, and then you can keep getting money in expectation.
However, if it turns out I am capable of making the coin come up Heads or Tails as I choose, then I will win every round. If you keep doing Bayesian updating on your misspecified hypothesis space, your estimate of the bias will flip-flop between Heads and Tails while quickly converging to near-certainty that the bias is 50% (since the pattern will be HTHTHT...), and yet I will be taking a dollar from you every round. Even if you have the option of quitting, you will never exercise it, because you keep thinking that the EV of the next round is positive.
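Here’s a minimal simulation of this game (a sketch: the bettor’s strategy below is the obvious one I’ve filled in for illustration, not from any particular source):

```python
# The bettor bets on whichever side the posterior favors. At 1:2 odds a
# bet looks positive-EV whenever you assign it probability > 1/3, and
# max(p, 1 - p) >= 0.5 > 1/3, so the bettor always wants to keep playing.
heads, tails = 0, 0   # Beta(1, 1) posterior counts over the bias b
bankroll = 0.0

for t in range(1000):
    p_heads = (heads + 1) / (heads + tails + 2)   # posterior predictive
    guess_heads = p_heads >= 0.5
    # I control the flip, so I simply make your guess wrong every time.
    outcome_heads = not guess_heads
    bankroll += 2.0 if guess_heads == outcome_heads else -1.0
    heads += int(outcome_heads)
    tails += int(not outcome_heads)

print(f"posterior mean bias: {(heads + 1) / (heads + tails + 2):.3f}")  # ~0.5
print(f"bettor's winnings: ${bankroll:.0f}")                            # -1000
```

Running it, the posterior mean ends up near 0.5 while the bettor loses $1 every round; meanwhile, at every step the bettor’s subjective EV for the next round is 3q - 1 with q >= 0.5, i.e. at least $0.50.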
Noise parameters can help (though the bias b is kind of like a noise parameter here, and it didn’t help). I don’t know of a general way to use noise parameters to avoid issues like this.
I think it’s worth pointing out one technical ‘caveat’
Yes, good point. I think I was assuming an infinite horizon (i.e. no terminal states), for which either construction works.
My main point, however, is that I think you could do some steelmanning here and recover most of the arguments you are criticizing (based on complexity arguments).
That’s the next post in the sequence, though the arguments are different from the ones you bring up.
But I think there are still good arguments for intelligence strongly suggesting some level of “goal-directed behavior”: e.g. it’s probably physically impossible to implement policies (over histories) that are effectively random, since they amount to look-up tables larger than the physical universe.
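To put rough numbers on that counting argument: with just two possible observations per timestep, there are $2^T$ distinct histories of length $T$, and a policy with no compressed description needs a separate table entry for each one; at $T = 300$ that is $2^{300} \approx 10^{90}$ entries, which already exceeds the roughly $10^{80}$ atoms in the observable universe.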
I mean, you could have the randomly twitching robot. But I agree with the broader point, I think, to the extent that it is the “economic efficiency” argument in the next post.
Eliezer has a nice analogy in a comment on one of Paul’s posts (I think), about an agent that behaves like it understands math, except that it thinks 2+2=5.
It seems likely the AI’s beliefs would be logically coherent whenever the corresponding human beliefs are logically coherent. This seems quite different from arguing that the AI has a goal.
Re: convergent rationality, I don’t buy it (specifically the “convergent” part).
Re: fragility of human values, I do buy the notion of a broad basin of corrigibility, which presumably is less fragile.
But really my answer is “there are lots of ways you can get confidence in a thing that are not proofs”. I think the strongest argument against is “when you have an adversary optimizing against you, nothing short of proofs can give you confidence”, which seems to be somewhat true in security. But then I think there are ways that you can get confidence in “the AI system will not adversarially optimize against me” using techniques that are not proofs.
(Note the alternative to proofs is not trial and error. I don’t use trial and error to successfully board a flight, but I also don’t have a proof that my strategy is going to cause me to successfully board a flight.)
I agree with a). c) seems to me to be very optimistic, but that’s mostly an intuition, I don’t have a strong argument against it (and I wouldn’t discourage people who are enthusiastic about it from working on it).
The argument in b) makes sense; I think the part that I disagree with is:
moving from utility maximizers to other types of AIs is just replacing something that is relatively easy to reason about with something that is harder to reason about, thereby obscuring the problems (that are still there).
The counterargument is “current AI systems don’t look like long term planners”, but of course it is possible to respond to that with “AGI will be very different from current AI systems”, and then I have nothing to say beyond “I think AGI will be like current AI systems”.
Or are you just trying to see if anyone can defeat the epistemic humility “trump card”?
Partly (I’m surprised by how confident people generally seem to be, but that could just be a misinterpretation of their position), but also on my inside view the empirical claim is not true and I wanted to see if there were convincing arguments for it.
But maybe it’s enough to have reasons for putting non-trivial weight on the empirical claim to be able to answer the other questions meaningfully?
Yeah, I’d be interested in your answers anyway.
I’d also argue against the empirical claim in that setting; do you agree with the empirical claim there?
That observation also cuts against the argument you make about warning signs, I think, as it suggests that we might significantly underestimate an AI’s (e.g. vastly superhuman) skill in some areas, if it still fails at some things we think are easy.
Nobody denies that AI is really good at extracting patterns out of statistical data (e.g. image classification, speech-to-text, and so on), even though AI is absolutely terrible at many “easy” things. This, and the linked comment from Eliezer, seem to be drastically underselling the competence of AI researchers. (I could imagine it happening with strong enough competitive pressures though.)
I also predict that there will be types of failure we will not notice, or will misinterpret. [...]
All of this assumes some very good long-term planning capabilities. I expect long-term planning to be one of the last capabilities that AI systems get. If I thought they would get them early, I’d be more worried about scenarios like these.
I’m uncertain about weaponization of AI (and did say “if we ignore military applications” in the OP).
I just don’t know whether I agree with your assertion that e.g. AUP “defines” what not to do.
I think I mostly meant that it is not learned.
I kind of want to argue that this means the effect of not-learned things can be traced back to researchers’ brains, rather than to experience with the real world. But that’s not exactly right, because the actual impact penalty can depend on properties of the world, even if it doesn’t use learning.
How pessimistic are you about this concern for this idea?
I don’t know; it feels too early to say. I think if the norms end up in some hardcoded form such that they never update over time, nearest unblocked strategies feel very likely. If the norms are evolving over time, then it might be fine. The norms would need to evolve at the same “rate” as the rate at which the world changes.