I agree that “functional time” makes sense, but somehow, I like “logical time” better. It brings out the paradox: logical truth is timeless, but any logical system must have a proof ordering, which brings out a notion of time based on what follows from what.
Since recently reading The Power of Now, which thoroughly and viscerally describes the perspective in which track-back would be harmful (because it distracts from the now), I want to elaborate on this some more.
The Power of Now did something interesting to my moment-to-moment awareness, but at least in the short term, it seemed to wreck my productivity. Returning to the track-back movement rather than the “now”-type movement seems to help bring me back quite a bit.
No moment exists in isolation; anything which you can mentally label as a moment is an extended moment. Furthermore, although you can cultivate what feels like heightened states of awareness (as in The Power of Now), the only way to verify awareness is by checking whether you perceive more detail, and remember more accurately. To simultaneously be the observer and the observed is an illusion; you are always only observing a previous self, even if the delay is very slight.
So, checking whether you can remember what happened in your head over the past few seconds is a check of awareness. Furthermore, Shinzen Young suggests that awareness after-the-fact is in some sense as good for your practice, and more compatible with intellectual work. Finally, I find that paying attention to the train of thought/feeling which led to distraction is really helpful for maintaining focus and motivation.
I’m phrasing this as in opposition to “being in the now”, but, I’m not sure to what extent I really mean that. I do think I learned things from The Power of Now; and, I’m not deeply experienced in either style.
If you have no memory, how can you learn? I recognize that you can draw a formal distinction, allowing learning without allowing the strategies being learned to depend on the previous games. But you are still allowing the agent itself to depend on the previous games, which means that “learning” methods which bake in more strategy will perform better. For example, a learning method could learn to always go straight in a game of chicken by checking to see whether going straight causes the other player to learn to swerve. I.e., it doesn’t seem like a principled distinction.
Furthermore, I don’t see the motivation for trying to do well in a single-shot game via iterated play. What kind of situation is it trying to model? This is discussed extensively in the paper I mentioned in the post, “If multi-agent learning is the answer, what is the question?”
Yeah, one might think I’m going against the grain, recommending something that more experienced meditators warn against. On the other hand (and imho), we could take it as a warning against ordinary distractedness and ordinary unmindful involvement in thoughts. Focusing intentionally on one specific mental motion is very different.
Of course, that’s for the goal of the book, which is about mindfulness meditation, which involves stabilizing your attention and strengthening your peripheral awareness.
The goal of mindfulness might be interpreted in different ways (and I haven’t read that book yet), but under the interpretation of defusion, I think there’s nothing particularly harmful about the track-back exercise. It is possible that it can get you caught up in history and therefore fused with the thoughts, but it is also possible that looking at the history helps put thoughts at a little distance.
For example, someone using mindfulness to deal with cigarette cravings (trying to quit) is supposed to pay mindful attention to the craving, and “ride the wave” until the craving is over. It is possible that tracking back to what gave rise to the craving helps contextualize it and thus put it at a remove (“I was stressed just now, and then I started having the craving”). It is also possible that it takes you away from moment-to-moment presence, and the next thing you know, you find yourself reaching for a cigarette. I don’t know for sure.
If your goal is related to debiasing, though, I think it’s a pretty good form of mindfulness: the question “why did I have that thought?” is closely related to epistemic hygiene. “Why am I thinking this plan is bad? Ah, I started out being annoyed at Ellen for her bad driving, and then she mentioned this plan. But, her driving is unrelated to this plan...”
This is similar to, but slightly different from, the story in Bowling Alone. (Disclaimer: I haven’t read Bowling Alone, only had several discussions about it with someone who has.)
One very interesting question is: why were good citizenship norms at their peak in the 1930s-1950s?
According to Bowling Alone the answer is that there was a massive club-formation burst from the late 1800s to the early 1900s. These clubs created the strong social fabric which allowed trust in the society overall to be high.
Why was there a burst of club creation? I don’t know.
When you think about the problem this way, there are no counterfactuals, only state evolution. It can be applied to the past, to the present or to the future.
This doesn’t give very useful answers when the state evolution is nearly deterministic, such as an agent made of computer code.
For example, consider an agent trying to decide whether to turn left or turn right. Suppose for the sake of argument that it actually turns left, if you run physics forward. Also suppose that the logical uncertainty has figured that out, so that the best-estimate macrostate probabilities are mostly on that. Now, the agent considers whether to turn left or right.
Since the computation (as pure math) is deterministic, counterfactuals which result from supposing the state evolution went right instead of left mostly consist of computer glitches in which the hardware failed. This doesn’t seem like what the agent should be thinking about when it considers the alternative of going right instead of left. For example, the grocery store it is trying to get to could be on the right-hand path. The potential bad results of a hardware failure might outweigh the desire to turn toward the grocery store, so that the agent prefers to turn left.
For this story to make sense, the (logical) certainty that the abstract algorithm decides to turn left in this case has to be higher than the confidence that hardware will not fail, so that turning right seems likely to imply hardware failure. This can happen due to Löb’s theorem: the whole above argument, as a hypothetical argument, suggests that the agent would turn left on a particular occasion if it happened to prove ahead of time that its abstract algorithm would turn left (since it would then be certain that turning right implied a hardware failure). But this means a proof of left-turning results in left-turning. So, by Löb’s theorem, left-turning is indeed provable.
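To make the relative confidences concrete, here is a toy Bayesian calculation (the specific numbers are made up for illustration): when the agent is more certain of its own algorithm’s output than of its hardware, conditioning on the other action concentrates almost all probability on hardware failure.

```python
# Toy model of the agent's beliefs about its own action.
# The agent is MORE confident that its abstract algorithm outputs "left"
# (say, via a proof) than it is that the hardware works.
p_alg_left = 1 - 1e-12   # near-certainty from a (hypothetical) proof
p_hw_fail = 1e-6         # ordinary hardware unreliability

# The physical action matches the algorithm's output unless the hardware
# fails; on failure, assume either action is equally likely.
p_right_and_hw_ok = (1 - p_alg_left) * (1 - p_hw_fail)
p_right_and_hw_fail = p_hw_fail * 0.5
p_right = p_right_and_hw_ok + p_right_and_hw_fail

# Conditional probability of hardware failure GIVEN a right turn:
p_hw_fail_given_right = p_right_and_hw_fail / p_right
print(p_hw_fail_given_right)  # very close to 1
```

So under these (illustrative) confidences, “turning right” is evidence almost entirely for hardware failure, which is exactly the reasoning the agent should arguably not be doing when weighing the right-hand path.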
The Newcomb’s-problem example you give also seems problematic. Again, if the agent’s algorithm is deterministic, it does basically one thing as long as the initial conditions are such that it is in Newcomb’s problem. So, essentially all of the uncertainty about the agent’s action is logical uncertainty. I’m not sure exactly what your intended notion of counterfactual is, but, I don’t see how reasoning about microstates helps the agent here.
What are the biggest issues that haven’t been solved for UDT or FDT?
UDT was a fairly simple and workable idea in classical Bayesian settings with logical omniscience (or with some simple logical uncertainty treated as if it were empirical uncertainty), but it was always intended to utilize logical uncertainty at its core. Logical induction, our current-best theory of logical uncertainty, doesn’t turn out to work very well with UDT so far. The basic problem seems to be that UDT required “updates” to be represented in a fairly explicit way: you have a prior which already contains all the potential things you can learn, and an update is just selecting certain possibilities. Logical induction, in contrast, starts out “really ignorant” and adds structure, not just content, to its beliefs over time. Optimizing via the early beliefs doesn’t look like a very good option, as a result.
FDT requires a notion of logical causality, which hasn’t appeared yet.
What is a co-ordination problem that hasn’t been solved?
Taking logical uncertainty into account, all games become iterated games in a significant sense, because players can reason about each other by looking at what happens in very close situations. If the players have T seconds to think, they can simulate the same game but given t<<T time to think, for many t. So, they can learn from the sequence of “smaller” games.
This might seem like a good thing. For example, the single-shot prisoner’s dilemma has only one Nash equilibrium: mutual defection. Iterated play has cooperative equilibria, such as tit-for-tat.
Unfortunately, the folk theorem of game theory implies that there are a whole lot of fairly bad equilibria for iterated games as well. It is possible for the players to enforce a cooperative equilibrium via tit-for-tat-like strategies. However, it is just as possible for players to end up in a mutual blackmail double bind, as follows:
Both players initially have some suspicion that the other player is following strategy X: “cooperate 1% of the time if and only if the other player is playing consistently with strategy X; otherwise, defect 100% of the time.” As a result of this suspicion, both players play via strategy X in order to get the 1% cooperation rather than 0%.
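To see why following strategy X is (just barely) incentivized, here is a sketch of the expected per-round payoffs, using standard illustrative prisoner’s-dilemma payoff values (the specific numbers are my own choice, not from any source):

```python
# Standard illustrative prisoner's dilemma payoffs:
# mutual cooperation R=3, mutual defection P=1,
# defecting against a cooperator T=5, cooperating against a defector S=0.
R, P, T, S = 3, 1, 5, 0

def expected_payoff(p_me_coop, p_other_coop):
    """Expected payoff when each side cooperates independently per round."""
    return (p_me_coop * p_other_coop * R
            + p_me_coop * (1 - p_other_coop) * S
            + (1 - p_me_coop) * p_other_coop * T
            + (1 - p_me_coop) * (1 - p_other_coop) * P)

# Mutual blackmail: both follow strategy X, cooperating 1% of the time.
payoff_following_X = expected_payoff(0.01, 0.01)
# Deviating from X: the other player then defects 100% of the time,
# so the best response is also to defect 100% of the time.
payoff_deviating = expected_payoff(0.0, 0.0)
# The good equilibrium (e.g. sustained by tit-for-tat): full cooperation.
payoff_cooperative = expected_payoff(1.0, 1.0)

print(payoff_following_X, payoff_deviating, payoff_cooperative)
```

Following X yields slightly more than mutual defection (which is all a deviator can get), so neither player has an incentive to leave, even though both are doing far worse than under the cooperative equilibrium.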
Ridiculously bad “coordination” like that can be avoided via cooperative oracles, but that requires everyone to somehow have access to such a thing. Distributed oracles are more realistic in that each player can compute them just by reasoning about the others, but players using distributed oracles can be exploited.
So, how do you avoid supremely bad coordination in a way which isn’t too badly exploitable?
And what still isn’t known about counterfactuals?
The problem of specifying good counterfactuals sort of wraps up any and all other problems of decision theory into itself, which makes this a bit hard to answer. Different potential decision theories may lean more or less heavily on the counterfactuals. If you lean toward EDT-like decision theories, the problem with counterfactuals is mostly just the problem of making UDT-like solutions work. For CDT-like decision theories, it is the other way around; the problem of getting UDT to work is mostly about getting the right counterfactuals!
The mutual-blackmail problem I mentioned in my “coordination” answer is a good motivating example. How do you ensure that the agents don’t come to think “I have to play strategy X, because if I don’t, the other player will cooperate 0% of the time?”
It can, if there is a unique line. There isn’t a unique line in general—you can draw several lines, getting different probability directions for each.
Well… I agree with all of the “that’s peculiar” implications there. To answer your question:
The assignment of probabilities to actions doesn’t influence the final decision here. We just need to assign probabilities to everything. They could be anything, and the decision would come out the same.
The magic correlation is definitely weird. Before I worked out an example for this post, I thought I had a rough idea of what Jeffrey-Bolker rotation does to the probabilities and utilities, but I was wrong.
I see the epistemic status of this as “counterintuitive fact” rather than “using the metaphor wrong”. The vector-valued measure is just a way to visualize it. You can set up axioms in which the Jeffrey-Bolker rotation is impossible (like the Savage axioms), but in my opinion they’re cheating to rule it out. In any case, this weirdness clearly follows from the Jeffrey-Bolker axioms of decision theory.
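Here is a minimal sketch of the vector-valued-measure picture, with made-up numbers: each world gets the vector (probability, probability × utility), every vector is rotated by the same angle, and the probabilities and utilities read back off the rotated vectors induce the same preference ordering over worlds (at least in this example, where all vectors stay in the right half-plane).

```python
import math

# Toy worlds: each world w has a probability P and a utility U.
worlds = {"a": (0.5, 2.0), "b": (0.3, -1.0), "c": (0.2, 0.5)}

# Vector-valued measure: world w gets the vector (P(w), P(w)*U(w)).
vectors = {w: (p, p * u) for w, (p, u) in worlds.items()}

def rotate(v, theta):
    x, y = v
    return (x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta))

# Jeffrey-Bolker rotation: rotate every world's vector by the same angle.
theta = 0.1
rotated = {w: rotate(v, theta) for w, v in vectors.items()}

# Read new probabilities and utilities off the rotated vectors,
# renormalizing so the probabilities sum to 1.
total = sum(x for x, _ in rotated.values())
new_P = {w: x / total for w, (x, _) in rotated.items()}
new_U = {w: y / x for w, (x, y) in rotated.items()}

# A world's desirability corresponds to the slope of its vector, and
# rotation preserves the ordering of slopes here, so the preference
# ordering over worlds is unchanged even though P and U both changed:
old_order = sorted(worlds, key=lambda w: worlds[w][1])
new_order = sorted(worlds, key=lambda w: new_U[w])
print(old_order == new_order)  # True: same preferences, different P and U
```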
I thought you were arguing, “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this true utility function is still not aligned with you.” (Yes, having written it down I can see that is not what you actually said, but that’s the interpretation I originally ended up with.)
I would correct it to “Suppose we knew your true utility function exactly, with no errors. An AI that perfectly optimizes this in expectation according to some prior is still not aligned with you.”
I would now rephrase your claim as “Even assuming we know the true utility function, optimizing it is hard.”
This part is tricky for me to interpret.
On the one hand, yes: specifically, even if you have all the processing power you need, you still need to optimize via a particular prior (AIXI optimizes via Solomonoff induction) since you can’t directly see what the consequences of your actions will be. So, I’m specifically pointing at an aspect of “optimizing it is hard” which is about having a good prior. You could say that “utility” is the true target, and “expected utility” is the proxy which you have to use in decision theory.
On the other hand, this might be a misleading way of framing the problem. It suggests that something with a perfect prior (magically exactly equal to the universe we’re actually in) would be perfectly aligned: “If you know the true utility function, and you know the true state of the universe and consequences of alternative actions you can take, then you are aligned.” This isn’t necessarily objectionable, but it is not the notion of alignment in the post.
If the AI magically has the “true universe” prior, this gives humans no reason to trust it. The humans might reasonably conclude that it is overconfident, and want to shut it down. If it justifiably has the true universe prior, and can explain why the prior must be right in a way that humans can understand, then the AI is aligned in the sense of the post.
The Jeffrey-Bolker rotation (mentioned in the post) gives me some reason to think of the prior and the utility function as one object, so that it doesn’t make sense to think about “the true human utility function” in isolation. None of my choice behavior (be it revealed preferences or verbally claimed preferences etc) can differentiate between me assigning small probability to a set of possibilities (but caring moderately about what happens in those possibilities) and assigning a moderate probability (but caring very little what happens one way or another in those worlds). So, I’m not even sure it is sensible to think of U_H alone as capturing human preferences; maybe U_H doesn’t really make sense apart from P_H.
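A toy example of this indistinguishability (with made-up numbers): rescaling the utilities to compensate for a change in probabilities leaves every expected-utility comparison, and hence every choice, unchanged.

```python
# Two actions, two worlds. Utilities are indexed U[world][action].
actions = ["stay", "go"]

# Decomposition 1: world w2 is unlikely, but I care a lot about it.
P1 = {"w1": 0.99, "w2": 0.01}
U1 = {"w1": {"stay": 1.0, "go": 2.0},
      "w2": {"stay": 500.0, "go": 0.0}}

# Decomposition 2: w2 is fairly likely, but I care much less about what
# happens there -- utilities rescaled so each product P(w)*U(w,a) is unchanged.
P2 = {"w1": 0.75, "w2": 0.25}
U2 = {w: {a: P1[w] * U1[w][a] / P2[w] for a in actions} for w in P1}

def expected_utility(P, U, a):
    return sum(P[w] * U[w][a] for w in P)

# Every expected-utility comparison is identical, so choice behavior
# cannot tell the two decompositions apart:
best1 = max(actions, key=lambda a: expected_utility(P1, U1, a))
best2 = max(actions, key=lambda a: expected_utility(P2, U2, a))
print(best1, best2)  # the same action under both decompositions
```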
So, to summarize,
1. I agree that “even assuming we know the true utility function, optimizing it is hard” -- but I am specifically pointing at the fact that we need beliefs to supplement utility functions, so that we can maximize expected utility as a proxy for utility. And this proxy can be bad.
2. Even under the idealized assumption that humans are perfectly coherent decision-theoretic agents, I’m not sure it makes sense to say there’s a “true human utility function” -- the VNM theorem only gets a U_H which is unique up to such-and-such by assuming a fixed notion of probability. The Jeffrey-Bolker representation theorem, which justifies rational agents having probability and utility functions in one theorem rather than justifying the two independently, shows that we can do this “rotation” which shifts which part of the preferences are represented in the probability vs in the utility, without changing the underlying preferences.
3. If we think of the objective as “building AI such that there is a good argument for humans trusting that the AI has human interest in mind” rather than “building AI which optimizes human utility”, then we naturally want to solve #1 in a way which takes human beliefs into account. This addresses the concern from #2; we don’t actually have to figure out which part of preferences are “probability” vs “utility”.
Yeah. I’ve edited it a bit for clarity.
I’ll try and write up a proof that it can do what I think it can.
I think assuming that you have access to the proof of what Omega does means that you have already determined your own behavior.
You may not recognize it as such, especially if Omega is using a different axiom system than you. So, you can still be ignorant of what you’ll do while knowing what Omega’s prediction of you is. This makes it impossible for your probability distribution to treat the two as correlated anymore.
but if that’s taken to be _part of the prior_, then it seems you no longer have the chance to (acausally) influence what Omega does
Yeah, that’s the problem here.
And if it’s not part of the prior, then I think a value-learning agent with a good decision theory can get the $500.
Only if the agent takes that one proof out of the prior, but still has enough structure in the prior to see how the decision problem plays out. This is the problem of constructing a thin prior. You can (more or less) solve any decision problem by making the agent sufficiently updateless, but you run up against the problem of making it too updateless, at which point it behaves in absurd ways (lacking enough structure to even understand the consequences of policies correctly).
Hence the intuition that the correct prior to be updateless with respect to is the human one (which is, essentially, the main point of the post).
I think there are interesting connections between HCH/IDA and policy approval, which I hope to write more about some time.
I’m not even sure whether you are closer or further from understanding what I meant, now. I think you are probably closer, but stating it in a way I wouldn’t. I see that I need to do some careful disambiguation of background assumptions and language.
Instead of trying to value learn and then optimize, just go straight for the policy instead, which is safer than relying on accurately decomposing a human into two different things that are both difficult to learn and have weird interactions with each other.
This part, at least, is getting at the same intuition I’m coming from. However, I can only assume that you are confused why I would have set up things the way I did in the post if this was my point, since I didn’t end up talking much about directly learning the policies. (I am thinking I’ll write another post to make that connection clearer.)
I will have to think harder about the difference between how you’re framing things and how I would frame things, to try to clarify more.