It can, if there is a unique line. There isn’t a unique line in general—you can draw several lines, getting different probability directions for each.
Well… I agree with all of the “that’s peculiar” implications there. To answer your question:
The assignment of probabilities to actions doesn’t influence the final decision here. We just need to assign probabilities to everything. They could be anything, and the decision would come out the same.
The magic correlation is definitely weird. Before I worked out an example for this post, I thought I had a rough idea of what Jeffrey-Bolker rotation does to the probabilities and utilities, but I was wrong.
I see the epistemic status of this as “counterintuitive fact” rather than “using the metaphor wrong”. The vector-valued measure is just a way to visualize it. You can set up axioms in which the Jeffrey-Bolker rotation is impossible (like the Savage axioms), but in my opinion they’re cheating to rule it out. In any case, this weirdness clearly follows from the Jeffrey-Bolker axioms of decision theory.
I thought you were arguing, “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you.” (Yes, having written it down I can see that is not what you actually said, but that’s the interpretation I originally ended up with.)
I would correct it to “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this in expectation according to some prior is still not aligned with you.”
I would now rephrase your claim as “Even assuming we know the true utility function, optimizing it is hard.”
This part is tricky for me to interpret.
On the one hand, yes: specifically, even if you have all the processing power you need, you still need to optimize via a particular prior (AIXI optimizes via Solomonoff induction) since you can’t directly see what the consequences of your actions will be. So, I’m specifically pointing at an aspect of “optimizing it is hard” which is about having a good prior. You could say that “utility” is the true target, and “expected utility” is the proxy which you have to use in decision theory.
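A toy sketch of the "expected utility is a proxy for utility" point. Everything here (the two actions, the two worlds, the numbers) is a made-up illustration and not taken from the post: the utility function is known exactly, yet the chosen action depends on the prior, and a poor prior leads to a poor realized outcome.

```python
# Hypothetical toy problem: the utility function is known exactly.
# utility[action][world] is the utility of taking `action` in `world`.
utility = {
    "safe": {"w1": 1.0, "w2": 1.0},
    "risky": {"w1": 3.0, "w2": -5.0},
}


def expected_utility(action, prior):
    """Expected utility of an action under a prior over worlds."""
    return sum(prior[world] * utility[action][world] for world in prior)


def best_action(prior):
    """The action that maximizes expected utility under the given prior."""
    return max(utility, key=lambda action: expected_utility(action, prior))


# Two agents share the same utility function but differ in their priors.
good_prior = {"w1": 0.5, "w2": 0.5}
bad_prior = {"w1": 0.95, "w2": 0.05}

# Assumption for illustration: the world we are actually in is w2.
actual_world = "w2"

for name, prior in [("good prior", good_prior), ("bad prior", bad_prior)]:
    action = best_action(prior)
    realized = utility[action][actual_world]
    print(f"{name}: chooses {action!r}, realized utility {realized}")
```

Both agents optimize the same utility function perfectly in expectation; only the prior differs, and that alone decides whether the outcome is good.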
On the other hand, this might be a misleading way of framing the problem. It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
The Jeffrey-Bolker rotation (mentioned in the post) gives me some reason to think of the prior and the utility function as one object, so that it doesn’t make sense to think about “the true human utility function” in isolation. None of my choice behavior (be it revealed preferences or verbally claimed preferences etc) can differentiate between me assigning small probability to a set of possibilities (but caring moderately about what happens in those possibilities) and assigning a moderate probability (but caring very little what happens one way or another in those worlds). So, I’m not even sure it is sensible to think of UH alone as capturing human preferences; maybe UH doesn’t really make sense apart from PH.
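A minimal numerical sketch of this, under assumptions not spelled out above: a finite set of worlds, preferences over events given by conditional expected utility, and the rotation written as p'_i = p_i (c u_i + d), u'_i = (a u_i + b) / (c u_i + d). The world probabilities, utilities, and parameter values are hypothetical. The check is that the rotated probabilities and utilities differ from the originals while the preference ordering over events stays the same.

```python
from itertools import combinations

# Hypothetical finite model: four worlds with probabilities p and utilities u.
p = [0.1, 0.2, 0.3, 0.4]
u = [-1.0, 0.0, 1.0, 2.0]


def event_value(event, p, u):
    """Conditional expected utility of an event (a tuple of world indices)."""
    mass = sum(p[i] for i in event)
    return sum(p[i] * u[i] for i in event) / mass


def rotate(p, u, a, b, c):
    """Apply the rotation; d is chosen so the new probabilities sum to 1."""
    mean_u = sum(pi * ui for pi, ui in zip(p, u))
    d = 1.0 - c * mean_u
    if any(c * ui + d <= 0 for ui in u):
        raise ValueError("rotation would produce non-positive probabilities")
    if a * d - b * c <= 0:
        raise ValueError("need a*d - b*c > 0 to preserve the ordering")
    p_new = [pi * (c * ui + d) for pi, ui in zip(p, u)]
    u_new = [(a * ui + b) / (c * ui + d) for ui in u]
    return p_new, u_new


def same_ordering(p1, u1, p2, u2, tol=1e-9):
    """True if every strict preference between events under (p1, u1)
    is also a strict preference in the same direction under (p2, u2)."""
    n = len(p1)
    events = [e for k in range(1, n + 1) for e in combinations(range(n), k)]
    for ev_a, ev_b in combinations(events, 2):
        d1 = event_value(ev_a, p1, u1) - event_value(ev_b, p1, u1)
        d2 = event_value(ev_a, p2, u2) - event_value(ev_b, p2, u2)
        if abs(d1) > tol and d1 * d2 <= 0:
            return False
    return True


p2, u2 = rotate(p, u, a=1.0, b=0.0, c=0.3)
print("rotated probabilities:", p2)
print("rotated utilities:", u2)
print("same preferences over events:", same_ordering(p, u, p2, u2))
```

In this sketch the lowest-utility world ends up with a smaller probability and a more extreme utility, which is the "probability traded against caring" ambiguity described above.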
So, to summarize,
1. I agree that “even assuming we know the true utility function, optimizing it is hard” -- but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
2. Even under the idealized assumption that humans are perfectly coherent decision-theoretic agents, I’m not sure it makes sense to say there’s a “true human utility function” -- the VNM theorem only gets a UH which is unique up to such-and-such by assuming a fixed notion of probability. The Jeffrey-Bolker representation theorem, which justifies rational agents having probability and utility functions in one theorem rather than justifying the two independently, shows that we can do this “rotation” which shifts which parts of the preferences are represented in the probability vs in the utility, without changing the underlying preferences.
3. If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which parts of the preferences are “probability” vs “utility”.
Yeah. I’ve edited it a bit for clarity.
I’ll try and write up a proof that it can do what I think it can.
I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior.
You may not recognize it as such, especially if Omega is using a different axiom system than you. So, you can still be ignorant of what you’ll do while knowing what Omega’s prediction of you is. This makes it impossible for your probability distribution to treat the two as correlated anymore.
but if that’s taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does
Yeah, that’s the problem here.
And if it’s not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.
Only if the agent takes that one proof out of the prior, but still has enough structure in the prior to see how the decision problem plays out. This is the problem of constructing a thin prior. You can (more or less) solve any decision problem by making the agent sufficiently updateless, but you run up against the problem of making it too updateless, at which point it behaves in absurd ways (lacking enough structure to even understand the consequences of policies correctly).
Hence the intuition that the correct prior to be updateless with respect to is the human one (which is, essentially, the main point of the post).
I think there are interesting connections between HCH/IDA and policy approval, which I hope to write more about some time.
I’m not even sure whether you are closer or further from understanding what I meant, now. I think you are probably closer, but stating it in a way I wouldn’t. I see that I need to do some careful disambiguation of background assumptions and language.
Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.
This part, at least, is getting at the same intuition I’m coming from. However, I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies. (I am thinking I’ll write another post to make that connection clearer.)
I will have to think harder about the difference between how you’re framing things and how I would frame things, to try to clarify more.
Does the above paragraph mean that people with unique preferences and crazy beliefs eventually end up without having their preferences respected (whereas someone with unique preferences and accurate beliefs would still have their preferences respected)?
Yes. This might be too harsh. The “libertarian” argument in favor of it is: who are you to keep someone from betting away all of their credit in the system? If you make a rule preventing this, agents will tend to want to find some way around it. If you just give some free credit to agents who are completely out, this harms the calibration of the system by reducing the incentive to be sane about your bets.
On the other hand, there may well be a serious game-theoretic reason why it is “too harsh”: someone who is getting no cooperation from the system has no reason to cooperate in turn. I’m curious if a CCT-adjacent formalism could capture this (or some other reason to be gentler). That would be the kind of thing which might have interesting analogues when we try to import insights back into decision theory.
Also, do we have to treat the agents as well-calibrated across all domains? Or is the system able to learn that their thoughts should be given weight in some circumstances and not others?
In the formalism, no, you just win or lose points across all domains. Realistically, it seems prudent to introduce stuff like that.
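A rough sketch of what "win or lose points across all domains" can look like. It assumes a simple multiplicative rule (each agent's credit is scaled by the probability it assigned to what actually happened, then renormalized); that rule, the agent names, and the numbers are illustrative assumptions, not the exact formalism under discussion.

```python
def update_credit(credit, beliefs, outcome):
    """Scale each agent's credit by the probability it gave to the observed
    outcome, then renormalize so that credit sums to 1."""
    raw = {name: credit[name] * beliefs[name][outcome] for name in credit}
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


# Hypothetical agents: one roughly calibrated, one with "crazy" beliefs.
credit = {"calibrated": 0.5, "crazy": 0.5}
beliefs = {
    "calibrated": {"rain": 0.7, "dry": 0.3},
    "crazy": {"rain": 0.01, "dry": 0.99},
}

# A single scalar of credit per agent: there is no per-domain bookkeeping,
# so bad bets in one area cost influence everywhere.
for outcome in ["rain", "rain", "dry", "rain"]:
    credit = update_credit(credit, beliefs, outcome)

print(credit)
```

After a few observations the badly calibrated agent has almost no credit left, which is the sense in which its preferences stop being respected. A domain-sensitive variant would need separate credit per subject area, which this sketch deliberately omits.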
A possible fix to the above is that individual agents could do this subject-specific evaluation of other agents and would update their credences based on partially-accurate agents, thus the information still gets preserved.
That’s exactly what could happen in a logical-induction-like setting.
could there be a double-counting when both Critch’s mechanism and other agents pick up on the accuracy of an agent?
There might temporarily be all sorts of crazy stuff like this, but we know it would (somehow) self-correct eventually.
I’ll see what I can do! (I’ve been completely ignoring that up to now, so yeah, probably can do better.)
Ah, yeah, you’re right. The separating hyperplane theorem only gives us ≤, and I was assuming <.
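For reference, the standard finite-dimensional statement, written with generic sets C and D (symbols chosen here, not taken from the post):

```latex
\textbf{Separating hyperplane theorem.}
Let $C, D \subseteq \mathbb{R}^n$ be nonempty, disjoint, convex sets.
Then there exist a nonzero $v \in \mathbb{R}^n$ and a constant $c \in \mathbb{R}$ such that
\[
  \langle v, x \rangle \;\le\; c \;\le\; \langle v, y \rangle
  \qquad \text{for all } x \in C,\ y \in D .
\]
Strict separation,
\[
  \langle v, x \rangle \;<\; c \;<\; \langle v, y \rangle ,
\]
requires additional assumptions, for example that $C$ and $D$ are both closed
and at least one of them is compact.
```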
I think “admissible if and only if non-dogmatic” may still hold as I stated it, because I don’t see how to set up an example like the one you give when A is finite. I’m editing the post anyway, since (1) I don’t know how to show that at the moment, and (2) the if and only if falls apart for infinite action sets as in your example anyway, which makes it kind of feel “wrong in spirit”.
In the context in the post, I’m saying that before we get rid of an assumption of probabilities, it is easier to first get rid of the assumption that we have a real-valued L (proving it from other assumptions instead). We do this by applying VNM, still assuming mixed strategies are described by probabilities. I finally get rid of that assumption a few paragraphs later. I edited the post a little to try and clarify.
I basically agree with this criticism, and would like to understand what the alternative to Bayesian decision theory which comes out of the analogy would be.
Editing to clarify.
Another part of the idea (not fully explained in Scott’s post I referenced earlier) is that nonexploited bargaining (AKA bargaining away from the Pareto frontier AKA cooperating with agents with different notions of fairness) provides a model of why agents should not just take Pareto improvements all the time, and may therefore be a seed of “non-Bayesian” decision theory (insofar as Bayes is about taking Pareto improvements).
That seems good to understand better in itself, but it isn’t a crux for the argument. Whether you’ve got “probabilities” or a “caring measure” or just raw utility which doesn’t reduce to anything like that, it still seems like you’re justifying it with Pareto-type arguments. Scott’s claim is that Pareto-type arguments won’t apply if you correctly take into account the way in which you have control over certain things. I’m not sure if that makes any sense, but basically the question is whether CCT can make sense in a logical setting where you may have self-referential sentences and so on.