Research Lead at CORAL. Director of AI research at ALTER. PhD student in Shay Moran’s group in the Technion (my PhD research and my CORAL/ALTER research are one and the same). See also Google Scholar and LinkedIn.
E-mail: {first name}@alter.org.il
A few more observations.
The definition of iteration we had before implicitly assumes that the agent can observe the full outcome of previous iterations. We don’t have to make this assumption. Instead, we can assume a set of possible observations
I believe that Theorem 4 remains valid.
As we remarked before, DDT is not invariant under adding a constant to the loss function. It is interesting to consider what happens when we add an increasingly large constant. In the limit, DDT converges to something I dubbed “Idealized Disambiguative Decision Theory” (IDDT)[1], which works as follows.
For IDDT, it is sufficient to let
For problems coming from unambiguous FDT,
The decision rule is then
Notice that it is now invariant w.r.t. adding constants to
Proposition 5: For any stable problem, it holds that (i) any IDDT-optimal policy is FDT-optimal (ii) there is an FDT-optimal policy which is IDDT-optimal. For any pseudocausal problem, it also holds that any FDT-optimal policy is IDDT-optimal.
One might think, based on this proposition, that IDDT is a superior decision theory to DDT. However, I think that IDDT is incompatible with learning, because of its discontinuous dependence on probabilities.
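To make the discontinuity worry concrete, here is a toy sketch of my own (not from the text above): if we model a support-only rule — one that, like IDDT plausibly does in the large-constant limit, cares only about which outcomes have nonzero probability — then an arbitrarily small perturbation of the belief can flip the decision, whereas an expected-loss rule varies continuously. All names and numbers here are illustrative assumptions.

```python
# Two actions, two states. Action "a" is great in s1 but terrible in s2;
# action "b" is a safe middle option.
LOSS = {  # LOSS[action][state]
    "a": {"s1": 0.0, "s2": 10.0},
    "b": {"s1": 1.0, "s2": 1.0},
}

def expected_loss_choice(p_s2):
    """Pick the action minimizing expected loss under P(s2) = p_s2."""
    ev = {act: (1 - p_s2) * LOSS[act]["s1"] + p_s2 * LOSS[act]["s2"]
          for act in LOSS}
    return min(ev, key=ev.get)

def support_worst_case_choice(p_s2):
    """Pick the action minimizing the worst loss over states of *nonzero*
    probability -- a stand-in for a support-only rule like IDDT."""
    support = [s for s, p in [("s1", 1 - p_s2), ("s2", p_s2)] if p > 0]
    worst = {act: max(LOSS[act][s] for s in support) for act in LOSS}
    return min(worst, key=worst.get)

# The expected-loss rule is continuous around p = 0:
print(expected_loss_choice(0.0), expected_loss_choice(1e-9))          # a a
# The support-only rule jumps as soon as s2 has any probability at all:
print(support_worst_case_choice(0.0), support_worst_case_choice(1e-9))  # a b
```

An ε-sized change in the belief flips the support-only recommendation from "a" to "b", which is the kind of discontinuous dependence on probabilities that a learning process cannot track.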
(Based on Aumann, Hart and Perry.) We will operationalize the problem by assuming the agent’s decision may deterministically depend on observing a coin flip. To simplify the presentation, we assume a single coin flip per intersection, which limits the resulting probabilities to
Denote by
Denote by
Consistently with our source, we set the loss function to be
This problem is formally causal. However, as opposed to all previous examples, it has no extensive form! Hence, EDT in the sense we defined it is ill-posed: to apply EDT reasoning here we need to at least supplement it by a theory of anthropic probabilities. CDT’s counterfactuals agree with FDT’s if we posit that the do-operator is constrained to choosing among “absent-minded” policies.
Previously we described the self-coordination problem, but perhaps self-PD is a more striking example.
Here,
Using the obvious notations
The loss is the usual PD loss of the “factual” player.
This problem is not formally causal, because e.g.
The natural CDT interpretation is the one where the factual policy controls the counterfactual player and the counterfactual policy controls the factual player. (Alas, the terminology gets confusing here: in one case the words “factual” and “counterfactual” refer to the agent’s policy, and in the other case to the coin’s outcome.) Both CDT and EDT play
IDDT is related to the old idea of “surmeasures” from the original infra-Bayesianism sequence.
We can also imagine equipping the agent with a “self-belief”
What you propose here doesn’t address the issue of non-realizability at all. For example, let’s say
This is an idea I came up with and presented at the Agent Foundations 2025 conference at CMU.
Here is a nice simple formalism for decision theory, that in particular supports the decision theory coming out of infra-Bayesianism. I now call the latter decision theory “Disambiguative Decision Theory”, since the counterfactuals work by “disambiguating” the agent’s belief.
Let
This data is common for all decision theories, but the rest of the details depend on the theory:
We are given a mapping
We will call an FDT problem “formally causal” when for any
CDT has the same formal form as FDT, but we always require the problem to be formally causal. Moreover, the interpretation of
Given an FDT problem
Normally,
Given this data, we define the translation
To formalize EDT, we need to assume the decision process is given in “extensive” form. That is, we have a set
We assume that
We define a policy to be
For every
For every
We further assume that there is a mapping
Here,
For any
This represents the event “the decision point
So far, this notion of extensive form decision problem is useful not just for EDT. Specifically for EDT, we add the assumption that we’re given the agent’s belief
For every
Thus, the agent conditions both on following policy
Given an FDT problem
We are given the agent’s belief
Here,
We then have
This is the reason for the name “disambiguative”:
Given an FDT problem
That is,
DDT does have the odd property of non-invariance w.r.t. shifting
Now, let’s look into how different decision theories compare. We will be using FDT as the “gold standard” throughout, when it comes to choosing the correct policy. Note, though, that FDT assumes we can somehow assign strict meaning to the logical counterfactuals, and it is unclear how to accomplish this. On the other hand, DDT makes the substantially weaker assumption that we can define the supracontribution belief. In particular, it is consistent with learning, as was explained here.
Proposition 1: Consider a formally causal FDT problem
Proposition 2: Consider a formally causal FDT problem in extensive form. Then,
Proposition 3: Consider a formally causal FDT problem. Then,
Thus, in the strictly causal case all decision theories coincide; but even here, DDT requires the least precise assumptions for that to work (compared to CDT and EDT). More importantly, DDT allows us to go far beyond the formally causal case. However, we do need a mild assumption about the problem:
Definition 1: An FDT problem is called pseudocausal when for any
It’s easy to see that any formally causal problem is pseudocausal, but there are many counterexamples to the converse.
Essentially, pseudocausality means that the outcome cannot depend on decisions in situations of probability 0. Notice that in reality the agent is never absolutely certain about the decision problem, hence observing a situation of probability 0 should cause it to believe it is in a different decision problem altogether. This makes the pseudocausality condition very natural.
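Reading the informal condition above literally — the loss may not depend on what the policy does at observations of probability 0 — we can write a brute-force checker for small finite problems. This is a sketch based on my informal reading, not on the formal Definition 1; the function and argument names are my own.

```python
from itertools import product

def is_pseudocausal(observations, actions, obs_prob, loss_of_policy):
    """Check (by brute force) that the loss does not depend on what the
    policy does at observations of probability zero.
    `loss_of_policy` maps a deterministic policy (dict obs -> action) to a
    real loss; `obs_prob` gives each observation's probability."""
    positive = {o for o in observations if obs_prob[o] > 0}
    policies = [dict(zip(observations, choice))
                for choice in product(actions, repeat=len(observations))]
    for p1 in policies:
        for p2 in policies:
            # Policies agreeing on all positive-probability observations
            # must receive the same loss.
            if all(p1[o] == p2[o] for o in positive):
                if loss_of_policy(p1) != loss_of_policy(p2):
                    return False
    return True

obs, acts = ["day", "dream"], ["L", "R"]
prob = {"day": 1.0, "dream": 0.0}
# Loss depends only on the action at the probability-1 observation: pseudocausal.
assert is_pseudocausal(obs, acts, prob, lambda pi: 0.0 if pi["day"] == "L" else 1.0)
# A loss that peeks at the probability-0 observation "dream" is not.
assert not is_pseudocausal(obs, acts, prob, lambda pi: 0.0 if pi["dream"] == "L" else 1.0)
```

The second example is exactly the pathology the paragraph above rules out: the outcome hinges on a decision the agent would make in a situation it assigns probability 0.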
Pseudocausality has the nice property of not depending on the loss function. If we do allow dependence on the loss function, we can make do with an even weaker condition.
Definition 2: An FDT problem is called stable when there exists an FDT-optimal
It’s obvious that any stable problem is pseudocausal. Naturally, the converse is false.
Neither pseudocausality nor stability is sufficient to guarantee that DDT and FDT give identical recommendations. However, it becomes true when we iterate the problem.
Definition 3: Given a decision problem and
Given
For FDT, for any
For DDT, we take the belief to be
Note that iterating a problem commutes with converting it from FDT to DDT.
Theorem 4: For a stable FDT problem, there exists
The requirement to iterate doesn’t seem like a terrible cost, since in a learning context some kind of iteration is necessary anyway. It can also be understood as a natural result of the need for stability: problems that are close to being unstable require more iterations.
All these examples besides the last one have natural extensive forms with one decision point.
This problem is formally causal, however the usual causal interpretation is non-trivial:
As a result,
The problem is pseudocausal but not formally causal. Nevertheless, CDT agrees with FDT thanks to the following causal interpretation:
The problem is pseudocausal but not formally causal.
For simplicity, we postulate that the agent is forced to two-box when seeing a full box, since this choice is a “no-brainer” for all decision theories.
The problem is stable but not pseudocausal. EDT is ill-posed because
As above, we postulate that the agent is forced to two-box when seeing an empty box.
The problem is not stable. EDT is ill-posed because
We now assume Omega has a probability
The problem is pseudocausal, but not formally causal of course. EDT is well-posed and
Here’s an interesting example of a problem with two decision points. Omega flips a coin and shows the result to the agent. The agent then has to choose between buttons A, B and C. Button C always yields 3 dollars. Buttons A and B yield 4 dollars if Omega predicts the agent would choose the same button in the other coin counterfactual, and 0 dollars otherwise.
The rest of the definitions are clear and we won’t write them out. The problem is pseudocausal but not formally causal. CDT and EDT agree here, with their behavior depending on the agent’s self-belief
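Since this example is fully specified in prose, we can enumerate the nine deterministic policies (maps from coin outcome to button) and compute expected payoffs, under the natural assumption that Omega’s prediction of the “other counterfactual” choice is simply what the policy does at the other coin outcome. The code is my own illustration of the example above.

```python
from itertools import product

BUTTONS = ["A", "B", "C"]

def payoff(policy, coin):
    """Dollars received given the coin result. Omega's prediction is taken
    to be what this deterministic policy chooses at the other coin outcome."""
    choice = policy[coin]
    other = policy["T" if coin == "H" else "H"]
    if choice == "C":
        return 3            # button C always pays 3
    return 4 if other == choice else 0   # A/B pay 4 only on self-coordination

# Expected payoff of every deterministic policy under a fair coin.
results = {}
for h, t in product(BUTTONS, repeat=2):
    pol = {"H": h, "T": t}
    results[(h, t)] = 0.5 * payoff(pol, "H") + 0.5 * payoff(pol, "T")

best = max(results.values())
print(best, sorted(k for k, v in results.items() if v == best))
# -> 4.0 [('A', 'A'), ('B', 'B')]
```

The coordinating policies “always A” and “always B” earn 4, “always C” earns a safe 3, and any policy that lets its button depend on the coin earns at most 1.5 in expectation — which is why the interesting question is how each decision theory handles the self-coordination.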
the computational complexity of individual hypotheses in the hypothesis class cannot be the thing that characterizes the hardness of learning, but rather it has to be some measure of how complex the entire hypothesis class is.
This is true, of course, but mostly immaterial. Outside of contrived examples, it’s rare for the hypothesis class to be feasible to learn while containing hypotheses that are infeasible to evaluate. It seems extremely implausible that you can find a hypothesis class that is simultaneously (i) possible to specify in practice [1] (ii) feasible to learn and (iii) contains a hypothesis which is an exact description of the real universe. Therefore, non-realizability is unavoidable.
By which I mean, we can construct the learning algorithm without being something akin to omniscient beings that already know everything about the universe and are able to hardcode this knowledge into the algorithm. Indeed, the reasons why we need a learning algorithm at all are (i) we don’t know a lot of what we want the agent to know (ii) it’s too labor-intensive to hardcode even the things that we do know. Therefore, we need a hypothesis class that is extremely broad and mostly uninformative.
This idea was described in a presentation I gave in ’23, but wasn’t written down anywhere.
Here is a formalization of recursive self-improvement (more precisely, recursive metalearning) in the metacognitive agent framework.
Let
Let
Consider any symbolic representation of an element of
Define
Given
We now say that an agent is recursively metalearning (w.r.t. the choices involved), if (i) it satisfies a “good enough” regret bound w.r.t.
Intuitively, this reflects the idea that if
For simplicity, we assume that
Just don’t. I understand the frustration of not getting engagement, but don’t spam the site.
Halpern and Leung propose the “minimax weighted expected regret” (MWER) decision rule, which is a generalization of the minimax-expected-regret (MER) decision rule. In contrast, our decision rule is a weighted generalization of maximin-expected-utility (MMEU). The problem with MER is that it doesn’t work very well with learning. The closest thing to doing learning with MER is adversarial bandits. However, adversarial regret is statistically intractable for Markov Decision Processes. And even with bandits there is a hidden obliviousness assumption if you try to interpret it in a principled decision-theoretic way.
The truth is outside of my hypothesis class, but my hypothesis class probably contains a non-trivial law that is a coarsening of the truth, which is the whole point.
For example, you can imagine that you start with some kind of intractable simplicity prior. Then, for each hypothesis you choose a tractable law that coarsens it. You end up with a probability distribution over laws.
A different way to view this is: it is just a way to force your policy to have low regret w.r.t. all/most hypotheses while weighing complex hypotheses less. For a complex hypothesis, you naturally expect learning it to be harder, so you weigh its regret less. Typically, it’s only possible to have a uniform regret bound if you impose a bound on the complexity of hypotheses in some sense. Absent such a bound, your regret bound must be non-uniform. You can formalize it by explicitly allowing the per-hypothesis regret to depend on some complexity parameter, but the Bayes approach is an alternative. (Also, Bayes regret obviously implies per-hypothesis non-uniform regret with a 1/probability coefficient.)
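The parenthetical claim has a one-line proof: since Bayes regret is a prior-weighted sum of nonnegative per-hypothesis regrets, dropping all but one term gives Regret(h) ≤ BayesRegret / p(h). Here is a numeric sanity check with toy numbers of my own:

```python
# Toy numbers (hypothetical): a prior over three hypotheses and their
# nonnegative per-hypothesis regrets.
prior = {"h1": 0.6, "h2": 0.3, "h3": 0.1}
regret = {"h1": 0.5, "h2": 1.0, "h3": 4.0}

# BayesRegret = sum_h p(h) * Regret(h)
bayes_regret = sum(prior[h] * regret[h] for h in prior)  # = 1.0 here

for h in prior:
    # Dropping the other (nonnegative) terms of the sum yields the bound
    # Regret(h) <= BayesRegret / p(h): the 1/probability coefficient.
    assert regret[h] <= bayes_regret / prior[h]
```

Note how the bound degrades for low-probability hypotheses: h3 has prior 0.1, so a Bayes regret of 1 only guarantees its individual regret is below 10 — exactly the non-uniformity described above.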
First, Bayes-regret and worst-case-regret are standard concepts in classical RL theory, and the infra-versions are straightforward analogs.
Second, you don’t have to focus on the Bayes-regret necessarily. In fact, in our papers, we focus entirely on uniform (worst-case) regret bounds.
Third, instead of an ordinary prior over laws you can consider an infraprior over laws (i.e. have ambiguity in hypothesis-space and not just in outcome-space). The resulting notion of “infra-Bayes-regret” has both Bayes-regret and worst-case-regret as special cases.
Fourth, the justification is quite straightforward. If you have an (unambiguous i.e. ordinary probability distribution) prior over laws, and your performance metric is the Bayes-infra-expected utility, then the Bayes-regret is just the difference between the performance of your policy and the performance of an optimal policy that magically knows the true hypothesis. So it’s a very natural measure of your policy’s ability to learn the hypothesis.
I like the overall vibe. Two issues:
It says “Top Posts” and the mouse-over text is “by karma”, however in reality I can choose which posts to put there. Now, I like it that I can choose which posts to put there, but once I customized them, the mouse-over becomes a lie.
The “recent comments” disappeared. This is really bad because I use that to find my recent comments when I want to edit them. (For example now I wanted to find this comment to add this second bullet but had to do it manually.) OK, I now see I can find them under “feed” but this might be confusing.
[Context: I’m not a digital minimalist but I am somewhat of a “digital reducetarian”: I don’t have social media (besides LinkedIn) and have a browser plugin that reduces my access to particular websites (like LessWrong).]
Cool post :)
For me, there’s something “strange” here (not surprising, but unlike my own experience), where the implication is that people have huge swaths of “free time” that they use for scrolling and the like (which you instead use for what’s described in this post). I spend the vast majority of my time either working or doing something with kids/lovers/friends. (I did read this post in bed preparing to start my day, and am sneaking in this comment between breakfast and work.) Plus short breaks from work, and a short time in bed before sleeping, during which I read fiction books (admittedly using digital means, but in principle I could use physical books just as well, if I could fit them all into my apartment).
It’s fun to hear about your experience talking to random strangers! Catalogued it under “I would never do this but I’m glad some people do”.
A metacognitive agent is not really an RL algorithm in the usual sense. To a first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge on the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore, frameworks that explicitly impose a prior (such as metacognitive agents) are useful for understanding and avoiding these failure modes. (But this has little to do with the OP, the way I see it.)
The problem is, even if the argument that wireheading doesn’t happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying “your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won’t happen, good news!” But it’s not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket was launched towards the Moon with a faulty navigation system, that does not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an “ingredient in a plan” to get the rocket to Mars.
I also don’t think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent in which case you can come to model drugs as a “trap” (and there can be many additional complications).
What do you mean “randomly come upon A”? RL is not random. Why wouldn’t it find A?
Let the proxy reward function we use to train the AI be and the “true” reward function that we intend the AI to follow be . Supposedly, these functions agree on some domain but diverge catastrophically outside of it. Then, if all the training data lies inside , which reward function is selected depends on the algorithm’s inductive bias (and possibly also on luck). The “cancellation” hope is then that the inductive bias favors over .
But why would that be the case? Realistically, the inductive bias is something like “simplicity”. And human preferences are very complex. On the other hand, something like “the reward is such-and-such bits in the input” is very simple. So instead of cancelling out, the problem is only aggravated.
And that’s under the assumption that and actually agree on , which is in itself wildly optimistic.
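A toy model (my own construction, with hypothetical numbers) makes the simplicity point concrete: two reward functions that agree on the training domain, where the proxy just reads bits of the input and the intended reward encodes an extra side condition standing in for complex human preferences.

```python
def proxy_reward(x):
    # "The reward is such-and-such bits in the input": maximally simple.
    return x[0]

def true_reward(x):
    # Agrees with the proxy only while a side condition holds; the condition
    # is a stand-in for everything about human preferences the proxy ignores.
    return x[0] if x[1] == 0 else -x[0]

train_D = [(0, 0), (1, 0), (2, 0)]   # inside the training domain: x[1] == 0
deploy = [(2, 1), (5, 1)]            # off-distribution inputs

# Both reward functions fit the training data perfectly...
assert all(proxy_reward(x) == true_reward(x) for x in train_D)
# ...so the data cannot distinguish them, and a simplicity bias favors the
# proxy. Off-distribution, the two disagree badly:
print([(proxy_reward(x), true_reward(x)) for x in deploy])  # [(2, -2), (5, -5)]
```

Nothing in the training signal penalizes the simpler hypothesis, so "inductive bias as simplicity" selects the proxy — the aggravation, not cancellation, described above.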
In the post Richard Ngo talks about delineating “alignment research” vs. “capability research”, i.e. understanding what properties of technical AI research make it, in expectation, beneficial for reducing AI-risk rather than harmful. He comes up with a taxonomy based on two axes:
Cognitivist vs. Behaviorist, i.e. focused on internals vs. external behavior. Arguably, net-beneficial research tends to be on the cognitivist side.
Worst-case vs. Average-case, i.e. focused on rare failures vs. “usual” behavior. Arguably, net-beneficial research tends to be on the worst-case side.
I think that Ngo raises an important question, and his answers are pointing in the right way. On my part, I would like to slightly reframe his parameters and add two more axes:
Instead of “cognitivist vs. behaviorist”, I would say “gears-level vs. surface-level”. We want research that explains the actual underlying mechanisms rather than just empirically registering particular phenomena or trends. This is definitely similar to “cognitivist vs. behaviorist”, but research that takes into account the internals of the algorithm can still be mostly surface-level: e.g. maybe it’s just saying, if we tweak this parameter in the algorithm, the performance goes up. I think that Ngo might object that tweaking a parameter is very different from talking e.g. about “beliefs” or “goals” that the system has, and I would agree, but I think that gears vs. surface might be a clearer delineation.
Instead of “worst-case vs. average-case”, I would say “robust vs. fragile”. This is because it’s not entirely clear what distribution we are “averaging” over, and because it’s important that rare failure modes can arise from systematic reasons rather than just bad luck. The way I think about it: “fragile” methods are methods that can work if you can afford failures, so that every time there is a failure you amend the system until the result is satisfying. “Robust” methods are methods that you need if you can’t afford even one failure.
Another axis I would add is “two-body vs. one-body”. This is related to Ngo’s remark in the end that “further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds”. The point is, alignment is fundamentally a two-body problem. We are aligning AI to a human (or many humans). And humans are already confused about what their preferences are, or about what it would mean to solve a problem without “undesirable side effects”. Therefore, we need research that illuminates the human side of things as well as the AI side of things. The way I envision it is by creating a theory of agents that is applicable to AIs and humans alike. Other approaches might treat those two sides more asymmetrically, but they do have to address both sides.
Additionally, I would add “precise vs. vague”. This is the difference between making vague, informal statements, and making precise, mathematical, hopefully quantitative statements. Being precise is certainly not sufficient: e.g. scaling laws can be precise, but fail to be gears-level. But it does seem like an important desideratum. Maybe this doesn’t need to be its own axis: precision seems necessary for achieving robustness. But, I think it’s a useful criterion for assessing research that stands on its own[1].
Of course, on most of those axes, going “left” is useful for capabilities and not just alignment. As Ngo justly points out, a lot of research is inevitably dual use. However, approaches that lean “right” are often sufficient to advance capabilities and are unlikely to be sufficient to solve alignment, making them clearly the worse option overall.
In earlier times, I would also add here the consideration of applicability. When assembling a dangerous machine, it seems best to plug in the parts that make it dangerous last, if at all possible. Similarly, it’s better to start developing our understanding of agents from parts that don’t immediately allow building agents. Today, this is still true to some extent, however the urgency of the problem might make it moot. Unless the Pause-AI efforts are massively successful, we might have to make our theories of alignment applicable quite soon, and might not have the luxury of not parallelizing this research as much as possible.
Finally, I state a relatively minor quibble: Ngo seems to put a lot of the emphasis here on understanding deep learning. I would not go so far, for two reasons: one is the two-body desideratum I mentioned before, but the other is that deep learning might not be The Way. It’s possible that it’s better to find a different path towards AI altogether, one designed on better understanding from the start. This might seem overly ambitious, but I do have some leads.
There are certainly examples of research which is at least trying to be robust, while still failing to be very precise (e.g. some of Paul Christiano’s work falls in this category). Such research can be a good starting point for investigation, but should become precise at some stage for it to truly produce robust solutions.
I am separately worried about “Carefully Controlled Moderate Superintelligences that we’re running at scale, each instance of which is not threatening, but, we’re running a lot of them...”
I think that this particular distinction is not the critical one. What constitutes an “instance” is somewhat fuzzy. (A single reasoning thread? A system with a particular human/corporate owner? A particular source code? A particular utility function?) I think it’s more useful to think in terms of machine intelligence suprasystems with strong internal coordination capabilities. That is, if we’re somehow confident that the “instances” can’t or won’t coordinate either causally or acausally, then they are arguably truly “instances”, but the more they can coordinate the more we should be thinking of them in the aggregate. (Hence, the most cautious risk estimate comes from comparing the sum total of all machine intelligence against the sum total of all human intelligence[1].)
More precisely, not even the sum total of all human intelligence, but the fraction of human intelligence that humans can effectively coordinate. See also comment by Nisan.
There seem to be two underlying motivations here, which are best kept separate.
One motivation is having a good vocabulary to talk about fine-grained distinctions. I’m on board with this one. We might want to distinguish e.g.:
Smarter than a median human along all AI-risk-relevant axes
Smarter than the smartest human along all AI-risk-relevant axes
Smarter than all of humanity put together along all AI-risk-relevant axes
Smart enough to have a 50% success probability to kill all humans if it chooses to, given current level of countermeasures
Smart enough to have a 50% success probability to kill all humans if it chooses to, even if best-case countermeasures are in place (this particular distinction inspired by Buck’s comments on this thread)
But then, first, it is clear that existing AI is not superintelligence according to any of the above interpretations. Second, I see no reason not to use catchy words like “hyperintelligence”, per One’s suggestion. (Although I agree that there is an advantage to choosing more descriptive terms.)
Another motivation is staying ahead of the hype cycles and epistemic warfare on twitter or whatnot. This one I take issue with.
I don’t have an account on twitter, and I hope that I never will. Twisting ourselves into pretzels with ridiculous words like “AIdon’tkilleveryoneism” is incompatible with creating a vocabulary optimized for actually thinking and having productive discussions among people who are trying to be the adults in the room. Let the twitterites use whatever anti-language they want. To the people trying to do beneficial politics there: I sincerely wish you luck, but I’m laboring in a different trench; let’s use the proper tool for each task separately.
I understand that there can be practical difficulties such as, what if LW ends up using a language so different from the outside world that it will become inaccessible to outsiders, even when those outsiders would otherwise make valuable contributions. There are probably some tradeoffs that are reasonable to make with such considerations in mind. But let’s at least not abandon any linguistic position at the slightest threatening gesture of the enemy.
This post is an overview of Steven Byrnes’ AI alignment research programme, which I think is interesting and potentially very useful.
In a nutshell, Byrnes’ goal is to reverse engineer the human utility function, or at least some of its central features. I don’t think this will succeed in the sense of, we’ll find an explicit representation that can be hard-coded into AI. However, I believe that this kind of research is useful for two main reasons:
Bridging brain science and agent theory is a promising way to make sure that we build a theory of agents broad enough to include humans. The latter is crucial in order to formally define alignment (since alignment is between the AI-agent and the human-agent), which is needed to have formal alignment guarantees. In particular, it is needed for value learning to become possible, such as in my COSI proposal.
While ideally we might wish for alignment guarantees to assume as little as possible, it might be difficult or even impossible to design a competitive AI system which is robustly aligned with a completely uninformed prior. As a conservative example, we might discover that one or several scalar parameters of humans should be approximately known (e.g. parameters related to amount of computing resources[1]). In this case, we would need to reverse engineer these parameters from brain science, which requires having a reliable dictionary between brain science and agent theory.
I hope that in the future this programme makes more direct contact with the mathematical formalism of agent theory, of the sort the LTA is constructing. However, I realize that this is a difficult challenge.
Why are we giving up on plain “superintelligence” so quickly? According to Wikipedia:
A superintelligence is a hypothetical agent that possesses intelligence surpassing that of the most gifted human minds. Philosopher Nick Bostrom defines superintelligence as “any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest”.
According to Google AI Overview:
Superintelligence (or Artificial Superintelligence—ASI) is a hypothetical AI that vastly surpasses human intellect in virtually all cognitive domains, possessing superior scientific creativity, general wisdom, and social skills, operating at speeds and capacities far beyond human capability, and potentially leading to profound societal transformation or existential risks if not safely aligned with human goals.
I don’t think I saw anyone use “superintelligence” to mean “better than a majority of humans on some specific tasks” before very recently. (Was DeepBlue a superintelligence? Is a calculator a superintelligence?)
Star Trek
It’s supposed to look like the control panel of the Enterprise.