Research Lead at CORAL. Director of AI research at ALTER. PhD student in Shay Moran’s group in the Technion (my PhD research and my CORAL/ALTER research are one and the same). See also Google Scholar and LinkedIn.
E-mail: {first name}@alter.org.il
the computational complexity of individual hypotheses in the hypothesis class cannot be the thing that characterizes the hardness of learning, but rather it has to be some measure of how complex the entire hypothesis class is.
This is true, of course, but mostly immaterial. Outside of contrived examples, it’s rare for the hypothesis class to be feasible to learn while containing hypotheses that are infeasible to evaluate. It seems extremely implausible that you can find a hypothesis class that is simultaneously (i) possible to specify in practice [1] (ii) feasible to learn and (iii) contains a hypothesis which is an exact description of the real universe. Therefore, non-realizability is unavoidable.
By which I mean, we can construct the learning algorithm without being something akin to omniscient beings that already know everything about the universe and are able to hardcode this knowledge into the algorithm. Indeed, the reasons why we need a learning algorithm at all are (i) we don’t know a lot of what we want the agent to know (ii) it’s too labor-intensive to hardcode even the things that we do know. Therefore, we need a hypothesis class that is extremely broad and mostly uninformative.
This idea was described in a presentation I gave in ’23, but wasn’t written down anywhere.
Here is a formalization of recursive self-improvement (more precisely, recursive metalearning) in the metacognitive agent framework.
Let
Let
Consider any symbolic representation of an element of
Define
Given
We now say that an agent is recursively metalearning (w.r.t. the choices involved), if (i) it satisfies a “good enough” regret bound w.r.t.
Intuitively, this reflects the idea that if
For simplicity, we assume that
Just don’t. I understand the frustration of not getting engagement, but don’t spam the site.
Halpern and Leung propose the “minimax weighted expected regret” (MWER) decision-rule, which is a generalization of the minimax-expected-regret (MER) decision-rule. In contrast, our decision rule is a weighted generalization of maximin-expected-utility (MMEU). The problem with MER is that it doesn’t work very well with learning. The closest thing to doing learning with MER is adversarial bandits. However, adversarial regret is statistically intractable for Markov Decision Processes. And even with bandits there is a hidden obliviousness assumption if you try to interpret it in a principled decision-theoretic way.
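For reference, here is a rough statement of the three decision rules being compared (my paraphrase in generic notation: $\mathcal{C}$ is the credal set of distributions the agent considers possible, $u$ is the utility, and $\alpha$ is MWER's weight function; see Halpern and Leung for the precise definitions):

```latex
% Maximin expected utility (MMEU): best worst-case expected utility.
a_{\mathrm{MMEU}} \in \arg\max_{a}\, \min_{P \in \mathcal{C}} \mathbb{E}_P[u(a)]
% Minimax expected regret (MER): smallest worst-case shortfall relative to
% the best action under each P.
a_{\mathrm{MER}} \in \arg\min_{a}\, \max_{P \in \mathcal{C}} \Big(\max_{a'} \mathbb{E}_P[u(a')] - \mathbb{E}_P[u(a)]\Big)
% Minimax weighted expected regret (MWER): as MER, but each P's regret is
% scaled by its weight \alpha(P).
a_{\mathrm{MWER}} \in \arg\min_{a}\, \max_{P \in \mathcal{C}} \alpha(P)\Big(\max_{a'} \mathbb{E}_P[u(a')] - \mathbb{E}_P[u(a)]\Big)
```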
The truth is outside of my hypothesis class, but my hypothesis class probably contains a non-trivial law that is a coarsening of the truth, which is the whole point.
For example, you can imagine that you start with some kind of intractable simplicity prior. Then, for each hypothesis you choose a tractable law that coarsens it. You end up with a probability distribution over laws.
A different way to view this: it’s just a way to force your policy to have low regret w.r.t. all/most hypotheses while weighing complex hypotheses less. For a complex hypothesis, you naturally expect learning it to be harder, so you’re weighing its regret less. Typically, it’s only possible to have a uniform regret bound if you impose a bound on the complexity of hypotheses in some sense. Absent such a bound, your regret bound must be non-uniform. You can formalize it by explicitly allowing the per-hypothesis regret to depend on some complexity parameter, but the Bayes approach is an alternative. (Also, Bayes regret obviously implies per-hypothesis non-uniform regret with a 1/probability coefficient.)
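To spell out that last parenthetical (in notation I’m introducing here): if $\zeta$ is the prior over hypotheses and $\mathrm{Reg}_h(\pi) \ge 0$ denotes the regret of policy $\pi$ on hypothesis $h$, then, since every term in the Bayes average is non-negative,

```latex
\mathbb{E}_{h \sim \zeta}\left[\mathrm{Reg}_h(\pi)\right] \le \epsilon
\quad \Longrightarrow \quad
\mathrm{Reg}_h(\pi) \le \frac{\epsilon}{\zeta(h)} \;\;\text{for every hypothesis } h.
```

So a Bayes regret bound is automatically a per-hypothesis regret bound whose coefficient degrades like $1/\zeta(h)$: complex (low-prior) hypotheses get correspondingly weaker guarantees, which is exactly the non-uniformity described above.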
First, Bayes-regret and worst-case-regret are standard concepts in classical RL theory, and the infra-versions are straightforward analogs.
Second, you don’t have to focus on the Bayes-regret necessarily. In fact, in our papers, we focus entirely on uniform (worst-case) regret bounds.
Third, instead of an ordinary prior over laws you can consider an infraprior over laws (i.e. have ambiguity in hypothesis-space and not just in outcome-space). The resulting notion of “infra-Bayes-regret” has both Bayes-regret and worst-case-regret as special cases.
Fourth, the justification is quite straightforward. If you have an (unambiguous i.e. ordinary probability distribution) prior over laws, and your performance metric is the Bayes-infra-expected utility, then the Bayes-regret is just the difference between the performance of your policy and the performance of an optimal policy that magically knows the true hypothesis. So it’s a very natural measure of your policy’s ability to learn the hypothesis.
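Spelled out (again in notation of my own choosing): for a prior $\zeta$ over laws/hypotheses $h$, and $U_h(\pi)$ the (infra-)expected utility of policy $\pi$ under hypothesis $h$,

```latex
\mathrm{BayesReg}(\pi) \;=\; \mathbb{E}_{h \sim \zeta}\!\left[\,\sup_{\pi^*} U_h(\pi^*) \;-\; U_h(\pi)\right]
```

i.e. the expected gap between your policy and the policy you would have used had you magically known the true hypothesis.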
I like the overall vibe. Two issues:
It says “Top Posts” and the mouse-over text is “by karma”, however in reality I can choose which posts to put there. Now, I like it that I can choose which posts to put there, but once I customized them, the mouse-over becomes a lie.
The “recent comments” disappeared. This is really bad because I use that to find my recent comments when I want to edit them. (For example now I wanted to find this comment to add this second bullet but had to do it manually.) OK, I now see I can find them under “feed” but this might be confusing.
[Context: I’m not a digital minimalist but I am somewhat of a “digital reducetarian”: I don’t have social media (besides LinkedIn) and have a browser plugin that reduces my access to particular websites (like LessWrong).]
Cool post :)
For me, there’s something “strange” here (not surprising, but unlike my own experience), where the implication is that people have huge swaths of “free time” that they use for scrolling and the like (which you instead use for what’s described in this post). I spend the vast majority of my time either working or doing something with kids/lovers/friends. (I did read this post in bed preparing to start my day, and am sneaking in this comment between breakfast and work.) Plus short breaks from work, and a short time in bed before sleeping, during which I read fiction books (admittedly using digital means, but in principle I could use physical books just as well, if I could fit them all into my apartment).
It’s fun to hear about your experience talking to random strangers! Catalogued it under “I would never do this but I’m glad some people do”.
A metacognitive agent is not really an RL algorithm in the usual sense. To first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge on the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore frameworks that explicitly impose a prior (such as metacognitive agents) are useful for understanding and avoiding these failure modes. (But, this has little to do with the OP, the way I see it.)
The problem is, even if the argument that wireheading doesn’t happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying “your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won’t happen, good news!” But it’s not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket was launched towards the Moon with a faulty navigation system, that does not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an “ingredient in a plan” to get the rocket to Mars.
I also don’t think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent in which case you can come to model drugs as a “trap” (and there can be many additional complications).
What do you mean “randomly come upon A”? RL is not random. Why wouldn’t it find A?
Let the proxy reward function we use to train the AI be $R_{\mathrm{proxy}}$ and the “true” reward function that we intend the AI to follow be $R_{\mathrm{true}}$. Supposedly, these functions agree on some domain $D$ but catastrophically diverge outside of it. Then, if all the training data lies inside $D$, which reward function is selected depends on the algorithm’s inductive bias (and possibly also on luck). The “cancellation” hope is then that the inductive bias favors $R_{\mathrm{true}}$ over $R_{\mathrm{proxy}}$.
But why would that be the case? Realistically, the inductive bias is something like “simplicity”. And human preferences are very complex. On the other hand, something like “the reward is such-and-such bits in the input” is very simple. So instead of cancelling out, the problem is only aggravated.
And that’s under the assumption that $R_{\mathrm{proxy}}$ and $R_{\mathrm{true}}$ actually agree on $D$, which is in itself wildly optimistic.
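For concreteness, here is a toy sketch of the situation (entirely my own illustration: the reward functions, the domain $D$, and the crude “simplicity-biased learner” are all invented for the example, and real deep learning is of course far messier). Both rewards agree on the training domain, and the learner locks onto a “reward is such-and-such bit of the input” rule that matches $R_{\mathrm{proxy}}$ and misgeneralizes outside $D$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: inputs are 8-bit vectors. The "proxy" reward just reads bit 0;
# the "true" reward agrees with it on the training domain D (bit 7 == 0)
# but differs outside D.
def r_proxy(x):
    return float(x[0])

def r_true(x):
    if x[7] == 0:           # inside D: agrees with the proxy
        return float(x[0])
    return float(1 - x[0])  # outside D: catastrophically different

# All training data lies inside D.
X_train = rng.integers(0, 2, size=(200, 8))
X_train[:, 7] = 0
y_train = np.array([r_true(x) for x in X_train])  # equals r_proxy on D

# A learner with a crude simplicity bias: pick the single input bit that
# best predicts the reward. (A stand-in for "inductive bias favors simple
# hypotheses", nothing more.)
errors = [np.mean((X_train[:, i] - y_train) ** 2) for i in range(8)]
best_feature = int(np.argmin(errors))
print("learned rule: reward = bit", best_feature)  # -> bit 0, i.e. r_proxy

# Outside D the learned rule keeps tracking r_proxy and gets r_true wrong.
X_test = rng.integers(0, 2, size=(5, 8))
X_test[:, 7] = 1
for x in X_test:
    print(x, "learned:", x[best_feature], "true:", r_true(x))
```

The point is not that real reward models are single-bit probes, only that when the training data cannot distinguish the two reward functions, the simpler one wins by default.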
In the post Richard Ngo talks about delineating “alignment research” vs. “capability research”, i.e. understanding what properties of technical AI research make it, in expectation, beneficial for reducing AI-risk rather than harmful. He comes up with a taxonomy based on two axes:
Cognitivist vs. Behaviorist, i.e. focused on internals vs. external behavior. Arguably, net-beneficial research tends to be on the cognitivist side.
Worst-case vs. Average-case, i.e. focused on rare failures vs. “usual” behavior. Arguably, net-beneficial research tends to be on the worst-case side.
I think that Ngo raises an important question, and his answers are pointing in the right direction. For my part, I would like to slightly reframe his parameters and add two more axes:
Instead of “cognitivist vs. behaviorist”, I would say “gears-level vs. surface-level”. We want research that explains the actual underlying mechanisms rather than just empirically registering particular phenomena or trends. This is definitely similar to “cognitivist vs. behaviorist”, but research that takes into account the internals of the algorithm can still be mostly surface-level: e.g. maybe it’s just saying, if we tweak this parameter in the algorithm, the performance goes up. I think that Ngo might object that tweaking a parameter is very different from talking e.g. about “beliefs” or “goals” that the system has, and I would agree, but I think that gears vs. surface might be a clearer delineation.
Instead of “worst-case vs. average-case”, I would say “robust vs. fragile”. This is because it’s not entirely clear what distribution we are “averaging” over, and because it’s important that rare failure modes can arise due to systematic reasons rather than just bad luck. The way I think about it: “fragile” methods are methods that can work if you can afford failures, so that every time there is a failure you amend the system until the result is satisfying. “Robust” methods are methods that you need if you can’t afford even one failure.
Another axis I would add is “two-body vs. one-body”. This is related to Ngo’s remark at the end that “further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds”. The point is, alignment is fundamentally a two-body problem. We are aligning AI to a human (or many humans). And humans are already confused about what their preferences are, or about what it would mean to solve a problem without “undesirable side effects”. Therefore, we need research that illuminates the human side of things as well as the AI side of things. The way I envision it is by creating a theory of agents that is applicable to AIs and humans alike. Other approaches might treat those two sides more asymmetrically, but they do have to address both sides.
Additionally, I would add “precise vs. vague”. This is the difference between making vague, informal statements, and making precise, mathematical, hopefully quantitative statements. Being precise is certainly not sufficient: e.g. scaling laws can be precise, but fail to be gears-level. But it does seem like an important desideratum. Maybe this doesn’t need to be its own axis: precision seems necessary for achieving robustness. But, I think it’s a useful criterion for assessing research that stands on its own[1].
Of course, on most of those axes, going “left” is useful for capabilities and not just alignment. As Ngo justly points out, a lot of research is inevitably dual use. However, approaches that lean “right” are often sufficient to advance capabilities and are unlikely to be sufficient to solve alignment, making them clearly the worse option overall.
In earlier times, I would also add here the consideration of applicability. When assembling a dangerous machine, it seems best to plug in the parts that make it dangerous last, if at all possible. Similarly, it’s better to start developing our understanding of agents from parts that don’t immediately allow building agents. Today, this is still true to some extent, however the urgency of the problem might make it moot. Unless the Pause-AI efforts are massively successful, we might have to make our theories of alignment applicable quite soon, and might not have the luxury of not parallelizing this research as much as possible.
Finally, I state a relatively minor quibble: Ngo seems to put a lot of the emphasis here on understanding deep learning. I would not go so far, for two reasons: one is the two-body desideratum I mentioned before, but the other is that deep learning might not be The Way. It’s possible that it’s better to find a different path towards AI altogether, one designed on better understanding from the start. This might seem overly ambitious, but I do have some leads.
There are certainly examples of research which is at least trying to be robust, while still failing to be very precise (e.g. some of Paul Christiano’s work falls in this category). Such research can be a good starting point for investigation, but should become precise at some stage for it to truly produce robust solutions.
I am separately worried about “Carefully Controlled Moderate Superintelligences that we’re running at scale, each instance of which is not threatening, but, we’re running a lot of them...”
I think that this particular distinction is not the critical one. What constitutes an “instance” is somewhat fuzzy. (A single reasoning thread? A system with a particular human/corporate owner? A particular source code? A particular utility function?) I think it’s more useful to think in terms of machine intelligence suprasystems with strong internal coordination capabilities. That is, if we’re somehow confident that the “instances” can’t or won’t coordinate either causally or acausally, then they are arguably truly “instances”, but the more they can coordinate the more we should be thinking of them in the aggregate. (Hence, the most cautious risk estimate comes from comparing the sum total of all machine intelligence against the sum total of all human intelligence[1].)
More precisely, not even the sum total of all human intelligence, but the fraction of human intelligence that humans can effectively coordinate. See also comment by Nisan.
There seem to be two underlying motivations here, which are best kept separate.
One motivation is having a good vocabulary to talk about fine-grained distinctions. I’m on board with this one. We might want to distinguish e.g.:
Smarter than a median human along all AI-risk-relevant axes
Smarter than the smartest human along all AI-risk-relevant axes
Smarter than all of humanity put together along all AI-risk-relevant axes
Smart enough to have a 50% probability of successfully killing all humans if it chooses to, given the current level of countermeasures
Smart enough to have a 50% probability of successfully killing all humans if it chooses to, even if best-case countermeasures are in place (this particular distinction inspired by Buck’s comments on this thread)
But then, first, it is clear that existing AI is not superintelligence according to any of the above interpretations. Second, I see no reason not to use catchy words like “hyperintelligence”, per One’s suggestion. (Although I agree that there is an advantage to choosing more descriptive terms.)
Another motivation is staying ahead of the hype cycles and epistemic warfare on twitter or whatnot. This one I take issue with.
I don’t have an account on twitter, and I hope that I never will. Twisting ourselves into pretzels with ridiculous words like “AIdon’tkilleveryoneism” is incompatible with creating a vocabulary optimized for actually thinking and having productive discussions among people who are trying to be the adults in the room. Let the twitterites use whatever anti-language they want. To the people trying to do beneficial politics there: I sincerely wish you luck, but I’m laboring in a different trench; let’s use the proper tool for each task separately.
I understand that there can be practical difficulties such as, what if LW ends up using a language so different from the outside world that it will become inaccessible to outsiders, even when those outsiders would otherwise make valuable contributions. There are probably some tradeoffs that are reasonable to make with such considerations in mind. But let’s at least not abandon any linguistic position at the slightest threatening gesture of the enemy.
This post is an overview of Steven Byrnes’ AI alignment research programme, which I think is interesting and potentially very useful.
In a nutshell, Byrnes’ goal is to reverse engineer the human utility function, or at least some of its central features. I don’t think this will succeed in the sense of, we’ll find an explicit representation that can be hard-coded into AI. However, I believe that this kind of research is useful for two main reasons:
Bridging brain science and agent theory is a promising way to make sure that we build a theory of agents broad enough to include humans. The latter is crucial in order to formally define alignment (since alignment is between the AI-agent and the human-agent), which is needed to have formal alignment guarantees. In particular, it is needed for value learning to become possible, such as in my COSI proposal.
While ideally we might wish for alignment guarantees to assume as little as possible, it might be difficult or even impossible to design a competitive AI system which is robustly aligned with a completely uninformed prior. As a conservative example, we might discover that one or several scalar parameters of humans should be approximately known (e.g. parameters related to amount of computing resources[1]). In this case, we would need to reverse engineer these parameters from brain science, which requires having a reliable dictionary between brain science and agent theory.
I hope that in the future this programme makes more direct contact with the mathematical formalism of agent theory, of the sort the LTA is constructing. However, I realize that this is a difficult challenge.
Why are we giving up on plain “superintelligence” so quickly? According to Wikipedia:
A superintelligence is a hypothetical agent that possesses intelligence surpassing that of the most gifted human minds. Philosopher Nick Bostrom defines superintelligence as “any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest”.
According to Google AI Overview:
Superintelligence (or Artificial Superintelligence—ASI) is a hypothetical AI that vastly surpasses human intellect in virtually all cognitive domains, possessing superior scientific creativity, general wisdom, and social skills, operating at speeds and capacities far beyond human capability, and potentially leading to profound societal transformation or existential risks if not safely aligned with human goals.
I don’t think I saw anyone use “superintelligence” to mean “better than a majority of humans on some specific tasks” before very recently. (Was Deep Blue a superintelligence? Is a calculator a superintelligence?)
This is a deeply confused post.
In this post, Turner sets out to debunk what he perceives as “fundamentally confused ideas” which are common in the AI alignment field. I strongly disagree with his claims.
In section 1, Turner quotes a passage from “Superintelligence”, in which Bostrom talks about the problem of wireheading. Turner declares this to be “nonsense” since, according to Turner, RL systems don’t seek to maximize a reward.
First, Bostrom (AFAICT) is describing a system which (i) learns online (ii) maximizes long-term consequences. There are good reasons to focus on such a system: these are properties that are desirable in an AI defense system, if the system is aligned. Now, the LLM+RLHF paradigm which Turner puts at the center is, at least superficially, not like that. However, this is no argument against Bostrom: today’s systems already went beyond LLM+RLHF (introducing RL over chain-of-thought) and tomorrow’s systems are likely to be even more different. And, if a given AI design does not somehow acquire properties i+ii even indirectly (e.g. via in-context learning), then it’s not clear how it would be useful for creating a defense system.
Second, Turner might argue that even granted i+ii, the AI would still not maximize reward because the properties of deep learning would cause it to converge to some different, reward-suboptimal, model. While this is often true, it is hardly an argument why not to worry.
While deep learning is not known to guarantee convergence to the reward-optimal policy (we don’t know how to prove almost any guarantees about deep learning), RL algorithms are certainly designed with reward maximization in mind. If your AI is unaligned even under best-case assumptions about learning convergence, it seems very unlikely that deviating from these assumptions would somehow cause it to be aligned (while remaining highly capable). To argue otherwise is akin to hoping for the rocket to reach the moon because our equations of orbital mechanics don’t account for some errors, rather than despite them.
After this argument, Turner adds that “as a point of further fact, RL approaches constitute humanity’s current best tools for aligning AI systems today”. This observation seems completely irrelevant. It was indeed expected that RL would be useful in the subhuman regime, when the system cannot fail catastrophically simply because it lacks the capabilities. (Even when it convinces some vulnerable person to commit suicide, OpenAI’s legal department can handle it.) I would expect it to have been obvious to Bostrom even back then, and it doesn’t invalidate his conclusions in the slightest.
In section 3, Turner proceeds to attack the so-called “counting argument” for misalignment. The counting argument goes: since there are many more misaligned minds/goals than aligned minds/goals, even conditional on “good” behavior in training, it seems unlikely that current methods will produce an aligned mind. Turner (quoting Belrose and Pope) counters this argument by way of analogy. Deep learning successfully generalizes even though most models that perform well on the training data don’t perform well on the test data. Hence, (they argue) the counting argument must be fallacious.
The major error that Turner, Belrose and Pope are making is that of confusing aleatoric and epistemic uncertainty. There is also a minor error of being careless about what measure the counting is performed over.
If we did not know anything about some algorithm except that it performs well on the training data, we would indeed have at most a weak expectation of it performing well on the test data. However, deep learning is far from random in this regard: it was selected by decades of research to be that sort of algorithm that does generalize well. Hence, the counting argument in this case gives us a perfectly reasonable prior.
(The minor point is that, w.r.t. a simplicity prior, even a random algorithm has some bounded-from-below probability of generalizing well.)
The counting argument is not premised on a deep understanding of how deep learning works (which at present doesn’t exist), but on a reasonable prior about what we should expect from our vantage point of ignorance. It describes our epistemic uncertainty, not the aleatoric uncertainty of deep learning. We can imagine that, if we knew how deep learning really works in the context of typical LLM training data etc., we would be able to confidently conclude that, say, RLHF has a high probability to eventually produce agents that primarily want to build astronomical superstructures in the shape of English letters, or whatnot. (It is ofc also possible we would conclude that LLM+RLHF will never produce anything powerful enough to be dangerous or useful-as-defense-system.) That would not be inconsistent with the counting argument as applied from our current state of knowledge.
The real question is then, conditional on our knowledge that deep learning often generalizes well, how confident are we that it will generalize aligned behavior from training to deployment, when scaled up to highly capable systems. Unfortunately, I don’t think this update is strong enough to make us remotely safe. The fact that deep learning generalizes implies that it implements some form of Occam’s razor, but Occam’s razor doesn’t strongly select for alignment, as far as we can tell. Our current (more or less) best model of Occam’s razor is Solomonoff induction, which Turner dismisses as irrelevant to neural networks: but here again, the fact that our understanding is flawed just pushes us back towards the counting-argument-prior, not towards safety.
Also, we should keep in mind that deep learning doesn’t always generalize well empirically, it’s just that when it fails we add more data until it starts generalizing. But, if the failure is “kill all humans”, there is nobody left to add more data.
Turner’s conclusion is “it becomes far easier to just use the AIs as tools which do things we ask”. The extent to which I agree with this depends on the interpretation of the vague term “tools”. Certainly modern AI is a tool that does approximately what we ask (even though when using AI for math, I’m already often annoyed at its attempts to cheat and hide the flaws of its arguments). However, I don’t think we know how to safely create “tools” that are powerful enough to e.g. nearly-autonomously do alignment research or otherwise make substantial steps toward building an AI defense system.
This post contains an interesting mathematical result: that the machinery of natural latents can be transferred from classical information theory to algorithmic information theory. I find it intriguing for multiple reasons:
It updates me towards natural latents being a useful concept for foundational questions in agent theory, as opposed to being some artifact of overindexing on Bayesian networks as the “right” ontology.
The proof technique involves defining an algorithmic information theory analogue of Bayesian networks, which is something I haven’t seen before and seems quite interesting in itself.
It would be interesting to see whether any of this carries over to the efficiently computable counterparts of Kolmogorov complexity I recently invented[1].
The main thing this post is missing is rigorous examples or existence proofs of these AIT natural latents. I’m guessing that the following construction should work:
Choose a universal Turing machine .
Choose to be a -program for a total recursive function s.t. .
Choose to be random strings of length .
Set .
Then, with high probability, is a natural latent for the . (I think?)
It would be nice to see something like that in the post.
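To gesture at what such an existence proof might look like, here is one family that I’d guess works (this is my own guess at a fully explicit instance, in my own notation, not a reconstruction of the sketch above):

```latex
% Guess at an explicit family of AIT natural latents (all notation mine):
% let \Lambda, E_1, \dots, E_n be independent uniformly random strings of length m, and set
X_i := (\Lambda, E_i), \qquad i = 1, \dots, n.
% Redundancy: \Lambda is a literal component of each X_i, so
K(\Lambda \mid X_i) = O(1).
% Mediation: given \Lambda, the X_i reduce to the E_i, which with high probability
% are approximately algorithmically independent, so
I(X_i : X_j \mid \Lambda) \approx 0 \quad (i \neq j).
```

If something like this works, it would be the algorithmic analogue of the classical toy example where a latent variable is copied into each observable alongside independent noise.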
These ideas seem conceptually close to concepts like sophistication in algorithmic statistics, and the connection might be worth investigating.
Now, about the stated motivation: the OP claims that natural latents capture how “reasonable” agents choose to define categories about the world. The argument seems somewhat compelling, although some further justification is required for the claim that
If you’ve been wondering why on Earth we would ever expect to find such simple structures in the complicated real world, conditioning on background knowledge is the main answer.
That said, I think that real-world categorizations are also somewhat value-laden: depending on the agent’s preferences, and on the laws of the universe in which they find themselves, there might be particular features they care about much more than other features. (Since they are more decision-relevant.) The importance of these features will likely influence which categories are useful to define. This fact cannot be captured in a formalism on the level of abstraction in this post. (Although maybe we can get some of the way there by drawing on rate-distortion theory?)
Still unpublished.
This is an idea I came up with and presented at the Agent Foundations 2025 conference at CMU.
Here is a nice simple formalism for decision theory, that in particular supports the decision theory coming out of infra-Bayesianism. I now call the latter decision theory “Disambiguative Decision Theory”, since the counterfactuals work by “disambiguating” the agent’s belief.
Formalism
Let be the agent’s event space and the space of possible policies
[1]
. Let be the agent’s loss function. For each , we are given some .
[2]
This represents the event “the agent’s behavior is consistent with policy ”. We assume that
This data is common for all decision theories, but the rest of the details depend on the theory:
Functional Decision Theory (FDT)
We are given a mapping . The distribution represents the logical counterfactual associated with . It is also possible to consider the more general “robust” version , but we will avoid it here for simplicity. The decision rule is then
We will call an FDT problem “formally causal” when for any , the measures and agree when restricted to . That is, for any measurable , we require
Causal Decision Theory (CDT)
CDT has the same formal form as FDT, but we always require the problem to be formally causal. Moreover, the interpretation of is different: it now represents the causal counterfactual associated with . The decision rule is also formally the same:
Given an FDT problem , we can translate it to a CDT problem, if we specify the agent’s belief about its own policies and causal interpretation: the kernel . Here is a copy of that represents the factual policy and is a copy of that represents the counterfactual policy. We require that , and that is formally causal in the second argument.
Normally, comes from a causal graph, where we apply the do-operator for the counterfactual policy and condition on the factual policy (i.e. condition on what the policy would have been if not for the do-operator).
Given this data, we define the translation
Extensive Form and Evidential Decision Theory (EDT)
Extensive Form
To formalize EDT, we need to assume the decision process is given in “extensive” form. That is, we have a set of decision points, for each a set of actions , and a mapping , that defines the previous decision point and action. Here, we use the notation
We assume that is acyclic and hence makes into the vertices of a forest whose edges are labeled by .
We define a policy to be s.t.
For every , there is at most one s.t. .
For every , if then there exists some s.t. .
We further assume that there is a mapping (representing the last action taken) s.t. for all
Here, stands for iterating in the obvious way.
For any , we can use the notation
This represents the event “the decision point actually takes place”.
EDT
So far, this notion of extensive form decision problem is useful not just for EDT. Specifically for EDT, we add the assumption that we’re given the agent’s belief . We can now state the EDT decision rule. We define recursively. Always, .
For every s.t. , we set
Thus, the agent conditions both on following policy and observing decision point .
Given an FDT problem in extensive form, we can translate it to an EDT problem, if we specify the agent’s belief about its own policies . We define the translation
Disambiguative Decision Theory (DDT)
We are given the agent’s belief . Here, refers to supracontributions. The decision rule is
Here, is the characteristic function of the set . Equivalently, we can define by
We then have
This is the reason for the name “disambiguative”: is a “disambiguated” version of , where the policy is made unambiguous.
Given an FDT problem , we can translate it to a DDT problem without any further data:
That is, is the supracontribution hull of the distributions when ranges over .
DDT does have the odd property of non-invariance w.r.t. shifting the loss function by a constant, as opposed to all other decision theories considered. There might be some story about how this non-invariance is an inevitable consequence of learning (where imposing bounds on the loss function is important), but I’m not ready to tell it.
Comparison
Now, let’s look into how different decision theories compare. We will be using FDT as the “gold standard” throughout, when it comes to choosing the correct policy. Note, though, that FDT assumes we somehow assign strict meaning to the logical counterfactuals, which is unclear how to accomplish. On the other hand, DDT makes the substantially weaker assumption that we can define the supracontribution belief. In particular, it is consistent with learning, as was explained here.
Proposition 1: Consider a formally causal FDT problem . Assume that the causal interpretation takes the form . Then, .
Proposition 2: Consider a formally causal FDT problem in extensive form. Then, .
Proposition 3: Consider a formally causal FDT problem. Then, .
Thus, in the strictly causal case all decision theories coincide: but even here DDT requires the least precise assumptions for that to work (compared to CDT and EDT). More importantly, DDT allows us to go far beyond the formally causal case. However, we do need a mild assumption about the problem:
Definition 1: An FDT problem is called pseudocausal when for any , if then .
It’s easy to see that any formally causal problem is pseudocausal, but there are many counterexamples to the converse.
Essentially, pseudocausality means that the outcome cannot depend on decisions in situations of probability 0. Notice that in reality the agent is never absolutely certain about the decision problem, hence observing a situation of probability 0 should cause it to believe it is in a different decision problem altogether. This makes the pseudocausality condition very natural.
Pseudocausality has the nice property of not depending on the loss function. If we do allow dependence on the loss function, we can make do with an even weaker condition.
Definition 2: An FDT problem is called stable when there exists an FDT-optimal s.t. for any , if then is also FDT-optimal.
It’s obvious that any pseudocausal problem is stable. Naturally, the converse is false.
Neither pseudocausality nor stability is sufficient to guarantee that DDT and FDT give identical recommendations. However, it becomes true when we iterate the problem.
Definition 3: Given a decision problem and , we define its -th power as follows. The event space is just the ordinary power . The policy space is . The loss function is
Given , we define by
For FDT, for any we define the kernel by . We then define the logical counterfactuals
For DDT, we take the belief to be .
Note that iterating a problem commutes with converting it from FDT to DDT.
Theorem 4: For a stable FDT problem, there exists s.t. for any , DDT and FDT agree on the problem .
The requirement to iterate doesn’t seem like a terrible cost, since in a learning context some kind of iteration is necessary anyway. It can also be understood as a natural result of the need for stability: problems that are close to being unstable require more iterations.
Examples
All these examples besides the last one have natural extensive forms with one decision point.
Newcomb
This problem is formally causal, however the usual causal interpretation is non-trivial:
As a result, .
XOR Blackmail
The problem is pseudocausal but not formally causal. Nevertheless, CDT agrees with FDT thanks to the following causal interpretation:
Counterfactual Mugging
The problem is pseudocausal but not formally causal.
Empty-Dependent Transparent Newcomb
For simplicity, we postulate that the agent is forced to two-box when seeing a full box, since this choice is a “no-brainer” for all decision theories.
The problem is stable but not pseudocausal.
Full-Dependent Transparent Newcomb
As above, we postulate that the agent is forced to two-box when seeing an empty box.
The problem is not stable. DDT is indifferent between one-boxing and two-boxing (upon seeing a full box), but it’s possible to construct a variant where DDT is strictly FDT-suboptimal.
Full-Dependent Transparent Newcomb with Noise
We now assume Omega has some positive probability of filling the box even when the agent two-boxes.
The problem is pseudocausal, but not formally causal of course. DDT converges to FDT after finitely many iterations.
Self-Coordination
Here’s an interesting example of a problem with two decision points. Omega flips a coin and shows the result to the agent. The agent then has to choose between buttons A, B and C. Button C always yields 3 dollars. Buttons A and B yield 4 dollars if Omega predicts the agent would choose the same button in the other coin counterfactual, and 0 dollars otherwise.
The rest of the definitions are clear and we won’t write them out. The problem is pseudocausal but not formally causal. CDT and EDT agree here, with their behavior depending on the agent’s self-belief. For a uniform self-belief they choose the FDT-suboptimal policy of always pressing C. Moreover, there is an “equilibrium” where they keep choosing C even for a “calibrated” self-belief (i.e. one that puts most of the probability mass on choosing C).
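To make the payoffs concrete, here is a minimal Python sketch (my own illustration; representing a policy as a map from coin result to button is an assumption of the sketch) that enumerates the deterministic policies and scores them, assuming Omega’s prediction of the other-counterfactual choice is exact:

```python
from itertools import product

BUTTONS = ["A", "B", "C"]
COINS = ["heads", "tails"]

def payoff(policy):
    """Average payoff of a deterministic policy: coin result -> button.

    Button C always yields 3 dollars. Buttons A and B yield 4 dollars if the
    agent would press the same button in the other coin counterfactual
    (assumed to be predicted exactly by Omega), and 0 dollars otherwise.
    """
    total = 0
    for coin in COINS:
        other = "tails" if coin == "heads" else "heads"
        choice = policy[coin]
        if choice == "C":
            total += 3
        elif choice == policy[other]:
            total += 4
    return total / len(COINS)  # fair coin

for heads_choice, tails_choice in product(BUTTONS, repeat=2):
    policy = {"heads": heads_choice, "tails": tails_choice}
    print(policy, payoff(policy))

# Output: the FDT-optimal policies press the same button (A or B) on both
# coin results and get 4; always pressing C gets 3; miscoordinated policies
# get at most 1.5. With a uniform self-belief, CDT/EDT value A and B at
# 4 * (1/3) < 3 each, so they press C -- the FDT-suboptimal policy from the
# discussion above.
```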
It is simplest to think of both as finite sets, but they can also be compact Polish spaces.
In the topological case, is required to be closed.