Vanessa Kosoy

Karma: 10,274

Research Lead at CORAL. Director of AI research at ALTER. PhD student in Shay Moran’s group at the Technion (my PhD research and my CORAL/ALTER research are one and the same). See also Google Scholar and LinkedIn.

E-mail: {first name}@alter.org.il

Vanessa Kosoy 9 Apr 2026 15:25 UTC
LW: 2 AF: 2
0
AF
in reply to: DavidHolmes’s comment on: [Paper] Stringological sequence prediction I
I haven’t tried LZP in practice, but you can guess what results to expect by looking at the size of the LZ77-compression of the text. I expect that any remotely decent text prediction algorithm would be based on stochastic process prediction. The deterministic setting is just a toy model.
Thanks for the catch!
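To make the LZ77 heuristic above concrete, here is a minimal sketch that counts the phrases in a greedy LZ77-style factorization of a string. The function name `lz77_phrases` and the choice to allow self-overlapping matches are assumptions made for illustration; the paper's exact notion of LZ77 size may differ, and the quadratic-time search is only meant for small inputs.

```python
def lz77_phrases(text: str) -> list[str]:
    """Greedy LZ77-style factorization (illustrative, quadratic time).

    Each phrase is the longest prefix of the remaining text that also
    occurs starting at an earlier position (overlap allowed); if there is
    no such prefix, the phrase is a single fresh character.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # Extend while text[i:i+length+1] has an occurrence starting before i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        step = max(length, 1)
        phrases.append(text[i:i + step])
        i += step
    return phrases


if __name__ == "__main__":
    sample = "abababab"
    print(len(lz77_phrases(sample)), lz77_phrases(sample))
```

A highly repetitive text yields few phrases, while a text with little repetition yields a count close to its length, so the phrase count is a rough proxy for how compressible the text is in this sense.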

[Paper] Stringological sequence prediction I

Vanessa Kosoy 7 Apr 2026 9:11 UTC

80 points

2 comments, 2 min read

(arxiv.org)

Vanessa Kosoy 1 Apr 2026 15:48 UTC
21 points
0
on: Lesswrong Liberated
Star Trek
It’s supposed to look like the control panel of the Enterprise.

Vanessa Kosoy 29 Mar 2026 9:55 UTC
LW: 2 AF: 2
0
AF
in reply to: Vanessa Kosoy’s comment on: Vanessa Kosoy’s Shortform
A few more observations.
Partially Observable Iteration
The definition of iteration we had before implicitly assumes that the agent can observe the full outcome of previous iterations. We don’t have to make this assumption. Instead, we can assume a set of possible observations and a mapping , in which case we define
I believe that Theorem 4 remains valid.
Idealized Disambiguative Decision Theory
As we remarked before, DDT is not invariant under adding a constant to the loss function. It is interesting to consider what happens when we add an increasingly large constant. In the limit, DDT converges to something I dubbed “Idealized Disambiguative Decision Theory” (IDDT)^[1], which works as follows.
For IDDT, it is sufficient to let be crisp (i.e. a credal set). We may allow supracontributions if we wish, but any problem defined via “unambiguous” FDT (i.e. as opposed to ) reduces to the crisp case. Define by
For problems coming from unambiguous FDT, , but IDDT is defined in full generality. For every , define by
The decision rule is then
Notice that it is now invariant w.r.t. adding constants to . Moreover,
Proposition 5: For any stable problem, it holds that (i) any IDDT-optimal policy is FDT-optimal (ii) there is an FDT-optimal policy which is IDDT-optimal. For any pseudocausal problem, it also holds that any FDT-optimal policy is IDDT-optimal.
One might think, based on this proposition, that IDDT is a superior decision theory to DDT. However, I think that IDDT is incompatible with learning, because of its discontinuous dependence on probabilities.
More Examples
Absent-Minded Driver
(Based on Aumann, Hart and Perry.) We will operationalize the problem by assuming the agent’s decision may deterministically depend on observing a coin flip. To simplify the presentation, we assume a single coin flip per intersection, which limits the resulting probabilities to , but it’s easy to generalize further.
Denote by and the constant policies. Denote by the policy
Denote by the remaining policy.
Consistently with our source, we set the loss function to be , , (it doesn’t depend on the coin flips).
This problem is formally causal. However, as opposed to all previous examples, it has no extensive form! Hence, EDT in the sense we defined it is ill-posed: to apply EDT reasoning here we need to at least supplement it by a theory of anthropic probabilities. CDT’s counterfactuals agree with FDT’s if we posit that the do-operator is constrained to choosing among “absent-minded” policies.
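As a rough numerical illustration of the coin-flip restriction described above, the sketch below evaluates the absent-minded driver under the payoffs of the classical formulation (0 for exiting at the first intersection, 4 for exiting at the second, 1 for continuing through both). These numbers, the fair coin, and the names used are assumptions for illustration; the loss function in this comment may differ in sign or scale.

```python
from fractions import Fraction

# Assumed payoffs from the classical formulation (not taken from the comment).
EXIT_FIRST, EXIT_SECOND, CONTINUE_BOTH = 0, 4, 1


def expected_payoff(p):
    """Expected payoff when the driver continues with probability p at each
    intersection, independently (the intersections are indistinguishable)."""
    return (1 - p) * EXIT_FIRST + p * (1 - p) * EXIT_SECOND + p * p * CONTINUE_BOTH


# With one fair coin flip per intersection and a deterministic response to it,
# the continuation probability can only be 0, 1/2 or 1.
coin_policies = {p: expected_payoff(p) for p in (Fraction(0), Fraction(1, 2), Fraction(1))}
best_coin_p = max(coin_policies, key=coin_policies.get)

if __name__ == "__main__":
    for p, value in coin_policies.items():
        print(f"p = {p}: expected payoff {value}")
    print("unrestricted optimum, p = 2/3:", expected_payoff(Fraction(2, 3)))
```

Under these assumed payoffs the coin-dependent policies (continuation probability 1/2) beat both constant policies, while allowing finer randomization would do slightly better still.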
Self-Prisoner’s Dilemma
Previously we described the self-coordination problem, but perhaps self-PD is a more striking example.
Here, and is the agent’s factual play, whereas and is the agent’s counterfactual play as predicted by Omega.
Using the obvious notations , we have
The loss is the usual PD loss of the “factual” player.
This problem is not formally causal, because e.g.
The natural CDT interpretation is the one where the factual policy controls the counterfactual player and the counterfactual policy controls the factual player. (Alas, the terminology gets confusing here: in one case the words “factual” and “counterfactual” refer to the agent’s policy, and in the other case to the coin’s outcome.) Both CDT and EDT play regardless of self-belief. However, the problem is pseudocausal and hence DDT converges to .
1. ^
  IDDT is related to the old idea of “surmeasures” from the original infra-Bayesianism sequence.
2. ^
  We can also imagine equipping the agent with a “self-belief” (not necessarily ) and setting , in which case also becomes relevant.

Vanessa Kosoy 27 Mar 2026 13:23 UTC
LW: 2 AF: 2
0
AF
in reply to: Vinayak Pathak’s comment on: An Introduction to Credal Sets and Infra-Bayes Learnability
What you propose here doesn’t address the issue of non-realizability at all. For example, let’s say is countable. Then any of the 3 regret criteria (uniform, Bayesian and your own “credal” proposal) implies that the algorithm would converge to a near-optimal policy for any given . This cannot work if some such is infeasible to optimize.

Vanessa Kosoy 26 Mar 2026 16:48 UTC
LW: 3 AF: 2
0
AF
in reply to: Vanessa Kosoy’s comment on: Vanessa Kosoy’s Shortform
This is an idea I came up with and presented at the Agent Foundations 2025 conference at CMU.

Here is a nice simple formalism for decision theory, that in particular supports the decision theory coming out of infra-Bayesianism. I now call the latter decision theory “Disambiguative Decision Theory”, since the counterfactuals work by “disambiguating” the agent’s belief.

Formalism

Let be the agent’s event space and the space of possible policies ^[1] . Let be the agent’s loss function. For each , we are given some . ^[2] This represents the event “the agent’s behavior is consistent with policy ”. We assume that

This data is common for all decision theories, but the rest of the details depend on the theory:

Functional Decision Theory (FDT)

We are given a mapping s.t. is supported on for all . The distribution represents the logical counterfactual associated with . It is also possible to consider the more general “robust” version , but we will avoid it here for simplicity. The decision rule is then

We will call an FDT problem “formally causal” when for any , the measures and agree when restricted to . That is, for any measurable , we require

Causal Decision Theory (CDT)

CDT has the same formal form as FDT, but we always require the problem to be formally causal. Moreover, the interpretation of is different: it now represents the causal counterfactual associated with . The decision rule is also formally the same:

Given an FDT problem , we can translate it to a CDT problem, if we specify the agent’s belief about its own policies and causal interpretation: the kernel . Here is a copy of that represents the factual policy and is a copy of that represents the counterfactual policy. We require that , and that is formally causal in the second argument.

Normally, comes from a causal graph, where we apply the do-operator for the counterfactual policy and condition on the factual policy (i.e. condition on what the policy would have been if not for the do-operator).

Given this data, we define the translation

Extensive Form and Evidential Decision Theory (EDT)

Extensive Form

To formalize EDT, we need to assume the decision process is given in “extensive” form. That is, we have a set of decision points, for each a set of actions , and a mapping , that defines the previous decision point and action. Here, we use the notation

We assume that is acyclic and hence makes into the vertices of a forest whose edges are labeled by .

We define a policy to be s.t.
2. For every , there is at most one s.t. .
3. For every , if then there exists some s.t. .
is now the set of policies defined in this way.

We further assume that there is a mapping (representing the last action taken) s.t. for all

Here, stands for iterating in the obvious way.

For any , we can use the notation

This represents the event “the decision point actually takes place”.

EDT

So far, this notion of extensive form decision problem is useful not just for EDT. Specifically for EDT, we add the assumption that we’re given the agent’s belief . We can now state the EDT decision rule. We define recursively. Always, .

For every s.t. , we set

Thus, the agent conditions both on following policy and observing decision point .

Given an FDT problem in extensive form, we can translate it to an EDT problem, if we specify the agent’s belief about its own policies . We define the translation

Disambiguative Decision Theory (DDT)

We are given the agent’s belief . Here, refers to supracontributions. The decision rule is

Here, is the characteristic function of the set . Equivalently, we can define by

We then have

This is the reason for the name “disambiguative”: is a “disambiguated” version of , where the policy is made unambiguous.

Given an FDT problem , we can translate it to a DDT problem without any further data:

That is, is the supracontribution hull of the distributions when ranges over .

DDT does have the odd property of non-invariance w.r.t. shifting by a constant, as opposed to all other decision theories considered. There might be some story about how this non-invariance is an inevitable consequence of learning (where imposing bounds on is important), but I’m not ready to tell it.

Comparison

Now, let’s look into how different decision theories compare. We will be using FDT as the “gold standard” throughout, when it comes to choosing the correct policy. Note, though, that FDT assumes we somehow assign strict meaning to the logical counterfactuals, and it is unclear how to accomplish that. On the other hand, DDT makes the substantially weaker assumption that we can define the supracontribution belief. In particular, it is consistent with learning, as was explained here.

Proposition 1: Consider a formally causal FDT problem . Assume that the causal interpretation takes the form . Then, .

Proposition 2: Consider a formally causal FDT problem in extensive form. Then, .

Proposition 3: Consider a formally causal FDT problem. Then, .

Thus, in the strictly causal case all decision theories coincide, but even here DDT requires the least precise assumptions for that to work (compared to CDT and EDT). More importantly, DDT allows us to go far beyond the formally causal case. However, we do need a mild assumption about the problem:

Definition 1: An FDT problem is called pseudocausal when for any , if then .

It’s easy to see that any formally causal problem is pseudocausal, but there are many counterexamples to the converse.

Essentially, pseudocausality means that the outcome cannot depend on decisions in situations of probability 0. Notice that in reality the agent is never absolutely certain about the decision problem, hence observing a situation of probability 0 should cause it to believe it is in a different decision problem altogether. This makes the pseudocausality condition very natural.

Pseudocausality has the nice property of not depending on the loss function. If we do allow dependence on the loss function, we can make do with an even weaker condition.

Definition 2: An FDT problem is called stable when there exists an FDT-optimal s.t. for any , if then is also FDT-optimal.

It’s obvious that any stable problem is pseudocausal. Naturally, the converse is false.

Neither pseudocausality nor stability is sufficient to guarantee that DDT and FDT give identical recommendations. However, it becomes true when we iterate the problem.

Definition 3: Given a decision problem and , we define its -th power as follows. The event space is just the ordinary power . The policy space is . The loss function is

Given , we define by

For FDT, for any we define the kernel by . We then define the logical counterfactuals

For DDT, we take the belief to be .

Note that iterating a problem commutes with converting it from FDT to DDT.

Theorem 4: For a stable FDT problem, there exists s.t. for any , DDT and FDT agree on the problem .

The requirement to iterate doesn’t seem like a terrible cost, since in a learning context some kind of iteration is necessary anyway. It can also be understood as a natural result of the need for stability: problems that are close to being unstable require more iterations.

Examples

All these examples besides the last one have natural extensive forms with one decision point.

Newcomb

This problem is formally causal, however the usual causal interpretation is non-trivial:

As a result, .

XOR Blackmail

The problem is pseudocausal but not formally causal. Nevertheless, CDT agrees with FDT thanks to the following causal interpretation:

Counterfactual Mugging

The problem is pseudocausal but not formally causal.

Empty-Dependent Transparent Newcomb

For simplicity, we postulate that the agent is forced to two-box when seeing a full box, since this choice is a “no-brainer” for all decision theories.

The problem is stable but not pseudocausal. EDT is ill-posed because , where is the unique decision point (that corresponds to seeing an empty box).

Full-Dependent Transparent Newcomb

As above, we postulate that the agent is forced to two-box when seeing an empty box.

The problem is not stable. EDT is ill-posed because , where is the unique decision point (that corresponds to seeing a full box). DDT is indifferent between and , but it’s possible to construct a variant where DDT is strictly FDT-suboptimal.

Full-Dependent Transparent Newcomb with Noise

We now assume Omega has a probability of filling the box even when the agent two-boxes.

The problem is pseudocausal, but not formally causal of course. EDT is well-posed and . DDT converges to FDT after iterations.

Self-Coordination

Here’s an interesting example of a problem with two decision points. Omega flips a coin and shows the result to the agent. The agent then has to choose between buttons A, B and C. Button C always yields 3 dollars. Buttons A and B yield 4 dollars if Omega predicts the agent would choose the same button in the other coin counterfactual, and 0 dollars otherwise.

The rest of the definitions are clear and we won’t write them out. The problem is pseudocausal but not formally causal. CDT and EDT agree here, with their behavior depending on the agent’s self-belief . For uniform they choose the FDT-suboptimal policy . Moreover, there is an “equilibrium” where they choose even for “calibrated” (i.e. that puts most of the probability mass on ).
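The payoffs of the self-coordination problem can be tabulated directly. The sketch below is a minimal illustration, assuming a fair coin, a perfectly accurate Omega, and dollars as utility; the function names are hypothetical. It enumerates all nine policies (one button per coin outcome) and also computes the value of each button when the predicted other-branch action is treated as fixed and uniformly distributed, which is a simplification of the CDT/EDT computation with a uniform self-belief.

```python
from fractions import Fraction
from itertools import product

BUTTONS = ("A", "B", "C")


def payoff(pressed, predicted_other):
    """Dollars received for pressing a button, given Omega's prediction of the
    button chosen in the other coin counterfactual."""
    if pressed == "C":
        return 3
    return 4 if pressed == predicted_other else 0


def policy_value(policy):
    """Expected payoff of a policy (button on heads, button on tails),
    assuming a fair coin and a perfectly accurate Omega."""
    heads, tails = policy
    return Fraction(payoff(heads, tails) + payoff(tails, heads), 2)


values = {policy: policy_value(policy) for policy in product(BUTTONS, repeat=2)}
best_value = max(values.values())
optimal = sorted(policy for policy, value in values.items() if value == best_value)


def button_value_under_belief(button, belief):
    """Value of a button when the other-branch action is held fixed and
    distributed according to `belief`."""
    return sum(prob * payoff(button, other) for other, prob in belief.items())


uniform = {button: Fraction(1, 3) for button in BUTTONS}

if __name__ == "__main__":
    print("optimal policies:", optimal, "with value", best_value)
    for button in BUTTONS:
        print(button, button_value_under_belief(button, uniform))
```

Coordinating on A or on B in both branches is worth 4 dollars, whereas under a uniform belief about the other branch each of A and B is worth only 4/3, so C (worth 3) looks better.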
1. ↩︎
  It is simplest to think of both as finite sets, but they can also be compact Polish spaces.
2. ↩︎
  In the topological case, is required to be closed.

Vanessa Kosoy 25 Mar 2026 19:19 UTC
LW: 2 AF: 2
0
AF
in reply to: Vinayak Pathak’s comment on: What is Inadequate about Bayesianism for AI Alignment: Motivating Infra-Bayesianism
the computational complexity of individual hypotheses in the hypothesis class cannot be the thing that characterizes the hardness of learning, but rather it has to be some measure of how complex the entire hypothesis class is.

This is true, of course, but mostly immaterial. Outside of contrived examples, it’s rare for the hypothesis class to be feasible to learn while containing hypotheses that are infeasible to evaluate. It seems extremely implausible that you can find a hypothesis class that is simultaneously (i) possible to specify in practice ^[1] (ii) feasible to learn and (iii) contains a hypothesis which is an exact description of the real universe. Therefore, non-realizability is unavoidable.
1. ↩︎
  By which I mean, we can construct the learning algorithm without being something akin to omniscient beings that already know everything about the universe and are able to hardcode this knowledge into the algorithm. Indeed, the reasons why we need a learning algorithm at all are (i) we don’t know a lot of what we want the agent to know (ii) it’s too labor-intensive to hardcode even the things that we do know. Therefore, we need a hypothesis class that is extremely broad and mostly uninformative.

Vanessa Kosoy 21 Mar 2026 17:32 UTC
LW: 6 AF: 4
0
AF
in reply to: Vanessa Kosoy’s comment on: Vanessa Kosoy’s Shortform
This idea was described in a presentation I gave in ’23, but wasn’t written down anywhere.

Here is a formalization of recursive self-improvement (more precisely, recursive metalearning) in the metacognitive agent framework.

Let be the set of programs or computations represented using some syntax. For example, might be a set of strings that serve as programs for a fixed Universal Turing Machine, but might also be something more structural, like the space of $\lambda$-terms or the space of PCF programs. Let be the space of semantic objects that we assign to computations. For example, we might just consider programs that output a single bit, in which case , but we can also consider rich semantic theories such as domain theory or game semantics. Let be the set of “weakly consistent” mappings from to . Here, “weakly consistent” means that we might impose some consistency conditions or none, but the conditions are weak enough to admit computationally tractable learning (in particular, these conditions fall short of picking out the true semantics). Thus, is our space of logical counterpossible worlds.

Let be the hypothesis class our agent is learning and an associated complexity function. With every we associate some which is a prior over refinements of . Moreover, we require a symbolic representation of , that is, some s.t. , where is the true semantics. For example, maybe there is some with a mapping (some semantic objects that can be interpreted as elements of ) and s.t. for all , , and then .

Consider any symbolic representation of an element of , that is, some . Then, it is possible to construct , which is an intrinsic version of . That is, corresponds, from the agent’s perspective, to the assertion “the belief represented by is true” (without committing to the explicit form of this belief). is defined as follows.

Define by

can be regarded as a multivalued mapping from to . ^[1] This mapping is Kakutani. Therefore, by Kakutani’s theorem, it has a non-empty set of fixed points . Moreover, it’s easy to see that is convex and closed, and therefore an element of .

Given we define the associated “metahypothesis” as

We now say that an agent is recursively metalearning (w.r.t. the choices involved), if (i) it satisfies a “good enough” regret bound w.r.t. and (where is allowed to appear in some way in the regret bound for ) and (ii) For every , we have and , where is some sufficiently slow-growing function, e.g. a polynomial.

Intuitively, this reflects the idea that if is true, the agent should be able to not just learn , but also learn to exploit for subsequent learning, where “subsequent learning” is operationalized as optimization under the prior .
1. ↩︎
  For simplicity, we assume that is just credal sets over . For supradistributions, use instead of .

Vanessa Kosoy 8 Mar 2026 20:06 UTC
17 points
19
on: How does LessWrong’s Ranking Algorithm Work?
Just don’t. I understand the frustration of not getting engagement, but don’t spam the site.

Vanessa Kosoy 22 Feb 2026 8:54 UTC
LW: 2 AF: 2
0
AF
in reply to: Cole Wyeth’s comment on: Formalizing Newcombian problems with fuzzy infra-Bayesianism
Halpern and Leung propose the “minimax weighted expected regret” (MWER) decision-rule, which is a generalization of the minimax-expected-regret (MER) decision-rule. In contrast, our decision rule is a weighted generalization of maximin-expected-utility (MMEU). The problem with MER is that it doesn’t work very well with learning. The closest thing to doing learning with MER is adversarial bandits. However, adversarial regret is statistically intractable for Markov Decision Processes. And even with bandits there is a hidden obliviousness assumption if you try to interpret it in a principled decision-theoretic way.
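For reference, the two unweighted rules being contrasted can be written as follows, in notation chosen here for illustration: $\mathcal{A}$ is the set of actions, $\mathcal{P}$ the set of candidate distributions and $u$ the utility.

$$\text{MMEU:}\quad a^{*} \in \arg\max_{a \in \mathcal{A}} \; \min_{P \in \mathcal{P}} \; \mathbb{E}_{P}[u(a)]$$

$$\text{MER:}\quad a^{*} \in \arg\min_{a \in \mathcal{A}} \; \max_{P \in \mathcal{P}} \Big( \max_{a' \in \mathcal{A}} \mathbb{E}_{P}[u(a')] - \mathbb{E}_{P}[u(a)] \Big)$$

MMEU compares actions by their worst-case expected utility, whereas MER compares them by their worst-case shortfall relative to the best action for each $P$. The weighted variants additionally attach a weight to each $P$ and are omitted here.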

Vanessa Kosoy 21 Feb 2026 17:19 UTC
LW: 10 AF: 6
0
AF
in reply to: Cole Wyeth’s comment on: An Introduction to Credal Sets and Infra-Bayes Learnability
The truth is outside of my hypothesis class, but my hypothesis class probably contains a non-trivial law that is a coarsening of the truth, which is the whole point.
For example, you can imagine that you start with some kind of intractable simplicity prior. Then, for each hypothesis you choose a tractable law that coarsens it. You end up with a probability distribution over laws.
A different way to view this: it is just a way to force your policy to have low regret w.r.t. all/most hypotheses while weighing complex hypotheses less. For a complex hypothesis, you naturally expect learning it to be harder, so you’re weighing its regret less. Typically, it’s only possible to have a uniform regret bound if you impose a bound on the complexity of hypotheses in some sense. Absent such a bound, your regret bound must be non-uniform. You can formalize it by explicitly allowing the per-hypothesis regret to depend on some complexity parameter, but the Bayes approach is an alternative. (Also, Bayes regret obviously implies per-hypothesis non-uniform regret with a 1/probability coefficient.)
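The parenthetical remark about the 1/probability coefficient can be spelled out in one line. The notation is chosen here for illustration: $\zeta$ is a prior over a countable hypothesis class, $\mathrm{Reg}(h) \ge 0$ is the regret under hypothesis $h$ (non-negative because it is measured against the optimal policy for $h$), and $R$ is a bound on the Bayes regret.

$$\sum_{h} \zeta(h)\,\mathrm{Reg}(h) \le R \quad\Longrightarrow\quad \zeta(h_{0})\,\mathrm{Reg}(h_{0}) \le R \quad\Longrightarrow\quad \mathrm{Reg}(h_{0}) \le \frac{R}{\zeta(h_{0})} \quad \text{for every } h_{0} \text{ with } \zeta(h_{0}) > 0.$$

The middle step drops the other terms of the sum, which is valid because each of them is non-negative.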

Vanessa Kosoy 21 Feb 2026 11:10 UTC
LW: 2 AF: 2
0
AF
in reply to: Cole Wyeth’s comment on: An Introduction to Credal Sets and Infra-Bayes Learnability
First, Bayes-regret and worst-case-regret are standard concepts in classical RL theory, and the infra-versions are straightforward analogs.
Second, you don’t have to focus on the Bayes-regret necessarily. In fact, in our papers, we focus entirely on uniform (worst-case) regret bounds.
Third, instead of an ordinary prior over laws you can consider an infraprior over laws (i.e. have ambiguity in hypothesis-space and not just in outcome-space). The resulting notion of “infra-Bayes-regret” has both Bayes-regret and worst-case-regret as special cases.
Fourth, the justification is quite straightforward. If you have an (unambiguous i.e. ordinary probability distribution) prior over laws, and your performance metric is the Bayes-infra-expected utility, then the Bayes-regret is just the difference between the performance of your policy and the performance of an optimal policy that magically knows the true hypothesis. So it’s a very natural measure of your policy’s ability to learn the hypothesis.
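In symbols, and with notation chosen here for illustration ($\zeta$ is the prior over laws, $U_{\Lambda}(\pi)$ the infra-expected utility of policy $\pi$ under law $\Lambda$), the Bayes-regret described in the fourth point is

$$\mathrm{BReg}(\pi) \;=\; \mathbb{E}_{\Lambda \sim \zeta}\Big[\max_{\pi'} U_{\Lambda}(\pi')\Big] \;-\; \mathbb{E}_{\Lambda \sim \zeta}\big[U_{\Lambda}(\pi)\big].$$

The first term is the performance of a policy that is told $\Lambda$ in advance, and the second is the Bayes-infra-expected utility of $\pi$ itself.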

Vanessa Kosoy 12 Feb 2026 10:02 UTC
3 points
0
in reply to: Joanna’s comment on: Joanna’s Shortform
I like the overall vibe. Two issues:
- It says “Top Posts” and the mouse-over text is “by karma”, however in reality I can choose which posts to put there. Now, I like it that I can choose which posts to put there, but once I customized them, the mouse-over becomes a lie.
- ~~The “recent comments” disappeared. This is~~ ~~really bad~~ ~~because I use that to find my recent comments when I want to edit them. (For example now I wanted to find this comment to add this second bullet but had to do it manually.)~~ OK, I now see I can find them under “feed” but this might be confusing.

Vanessa Kosoy 20 Jan 2026 7:39 UTC
8 points
2
on: “The first two weeks are the hardest”: my first digital declutter
[Context: I’m not a digital minimalist but I am somewhat of a “digital reducetarian”: I don’t have social media (besides LinkedIn) and have a browser plugin that reduces my access to particular websites (like LessWrong).]
Cool post :)
For me, there’s something “strange” here (not surprising, but unlike my own experience), where the implication is that people have huge swaths of “free time” that they use for scrolling and the like (which you instead use for what’s described in this post). I spend the vast majority of my time either working or doing something with kids/lovers/friends. (I did read this post in bed preparing to start my day, and am sneaking in this comment between breakfast and work.) Plus short breaks from work, and a short time in bed before sleeping, during which I read fiction books (admittedly using digital means, but in principle I could use physical books just as well, if I could fit them all into my apartment).
It’s fun to hear about your experience talking to random strangers! Catalogued it under “I would never do this but I’m glad some people do”.

[Closed] Apply to Vanessa’s mentorship at PIBBSS

Vanessa Kosoy 14 Jan 2026 9:15 UTC

39 points

0 comments, 2 min read

Vanessa Kosoy 7 Jan 2026 8:46 UTC
LW: 4 AF: 3
0
AF
in reply to: Steven Byrnes’s comment on: Many arguments for AI x-risk are wrong
A metacognitive agent is not really an RL algorithm in the usual sense. To a first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge on the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore frameworks that explicitly impose a prior (such as metacognitive agents) are useful for understanding and avoiding these failure modes. (But, this has little to do with the OP, the way I see it.)

Vanessa Kosoy 6 Jan 2026 18:09 UTC
LW: 4 AF: 4
0
AF
in reply to: Steven Byrnes’s comment on: Many arguments for AI x-risk are wrong
The problem is, even if the argument that wireheading doesn’t happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying “your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won’t happen, good news!” But it’s not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket was launched towards the Moon with a faulty navigation system, that does not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an “ingredient in a plan” to get the rocket to Mars.
I also don’t think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent in which case you can come to model drugs as a “trap” (and there can be many additional complications).

Vanessa Kosoy 6 Jan 2026 16:48 UTC
LW: 4 AF: 4
0
AF
in reply to: Steven Byrnes’s comment on: Many arguments for AI x-risk are wrong
What do you mean “randomly come upon A”? RL is not random. Why wouldn’t it find A?
Let the proxy reward function we use to train the AI be $r_{p}$ and the “true” reward function that we intend the AI to follow be $r_{t}$. Supposedly, these functions agree on some domain $D$ but diverge catastrophically outside of it. Then, if all the training data lies inside $D$, which reward function is selected depends on the algorithm’s inductive bias (and possibly also on luck). The “cancellation” hope is then that inductive bias favors $r_{t}$ over $r_{p}$.
But why would that be the case? Realistically, the inductive bias is something like “simplicity”. And human preferences are very complex. On the other hand, something like “the reward is such-and-such bits in the input” is very simple. So instead of cancelling out, the problem is only aggravated.
And that’s under the assumption that $r_{p}$ and $r_{t}$ actually agree on $D$, which is in itself wildly optimistic.
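The argument above can be mimicked in a toy setting. The sketch below is purely illustrative: the two reward functions, the domain, and especially the description lengths are hypothetical stand-ins chosen to mirror the claim that the proxy is simpler to describe than the intended reward. It shows that training data inside $D$ cannot tell the two apart, so a simplicity-driven selection falls on the proxy.

```python
# Training domain D: the only inputs on which rewards are observed.
D = range(10)


def r_true(x: int) -> int:
    """Stand-in for r_t, the intended reward (hypothetical)."""
    return x % 2


def r_proxy(x: int) -> int:
    """Stand-in for r_p: equals r_true on D, diverges outside it."""
    return r_true(x) if x in D else 1000


training_data = [(x, r_proxy(x)) for x in D]


def fits(reward_fn) -> bool:
    """True if the candidate reproduces every observed training reward."""
    return all(reward_fn(x) == y for x, y in training_data)


# Hypothetical description lengths standing in for a simplicity bias.
description_length = {"r_t": 50, "r_p": 5}
candidates = {"r_t": r_true, "r_p": r_proxy}

consistent = [name for name, fn in candidates.items() if fits(fn)]
selected = min(consistent, key=description_length.get)

if __name__ == "__main__":
    print("consistent with the data:", consistent)
    print("selected by simplicity:", selected)
    print("outside D:", r_true(11), "vs", r_proxy(11))
```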

Vanessa Kosoy 5 Jan 2026 8:43 UTC
LW: 27 AF: 12
0
AF
on: Defining alignment research
In the post Richard Ngo talks about delineating “alignment research” vs. “capability research”, i.e. understanding what properties of technical AI research make it, in expectation, beneficial for reducing AI-risk rather than harmful. He comes up with a taxonomy based on two axes:
- Cognitivist vs. Behaviorist, i.e. focused on internals vs. external behavior. Arguably, net-beneficial research tends to be on the cognitivist side.
- Worst-case vs. Average-case, i.e. focused on rare failures vs. “usual” behavior. Arguably, net-beneficial research tends to be on the worst-case side.
I think that Ngo raises an important question, and his answers are pointing in the right way. On my part, I would like to slightly reframe his parameters and add two more axes:
- Instead of “cognitivist vs. behaviorist”, I would say “gears-level vs. surface-level”. We want research that explains the actual underlying mechanisms rather than just empirically registering particular phenomena or trends. This is definitely similar to “cognitivist vs. behaviorist”, but research that takes into account the internals of the algorithm can still be mostly surface-level: e.g. maybe it’s just saying, if we tweak this parameter in the algorithm, the performance goes up. I think that Ngo might object that tweaking a parameter is very different from talking e.g. about “beliefs” or “goals” that the system has, and I would agree, but I think that gears vs. surface might be a clearer delineation.
- Instead of “worst-case vs. average-case”, I would say “robust vs. fragile”. This is because it’s not entirely clear what distribution we are “averaging” over, and because it’s important that rare failure modes can arise due to systematic reasons rather than just bad luck. The way I think about it: “fragile” methods are methods that can work if you can afford failures, so that every time there is a failure you amend the system until the result is satisfying. “Robust” methods are methods that you need if you can’t afford even one failure.
- Another axis I would add is “two-body vs. one-body”. This is related to Ngo’s remark in the end that “further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds”. The point is, alignment is fundamentally a two-body problem. We are aligning AI to a human (or many humans). And humans are already confused about what their preferences are, or about what it would mean to solve a problem without “undesirable side effects”. Therefore, we need research that illuminates the human side of things as well as the AI side of things. The way I envision it is by creating a theory of agents that is applicable to AIs and humans alike. Other approaches might treat those two sides more asymmetrically, but they do have to address both sides.
- Additionally, I would add “precise vs. vague”. This is the difference between making vague, informal statements, and making precise, mathematical, hopefully quantitative statements. Being precise is certainly not sufficient: e.g. scaling laws can be precise, but fail to be gears-level. But it does seem like an important desideratum. Maybe this doesn’t need to be its own axis: precision seems necessary for achieving robustness. But, I think it’s a useful criterion for assessing research that stands on its own^[1].
Of course, on most of those axes, going “left” is useful for capabilities and not just alignment. As Ngo justly points out, a lot of research is inevitably dual use. However, approaches that lean “right” are often sufficient to advance capabilities and are unlikely to be sufficient to solve alignment, making them clearly the worse option overall.
In earlier times, I would also add here the consideration of applicability. When assembling a dangerous machine, it seems best to plug in the parts that make it dangerous last, if at all possible. Similarly, it’s better to start developing our understanding of agents from parts that don’t immediately allow building agents. Today, this is still true to some extent, however the urgency of the problem might make it moot. Unless the Pause-AI efforts are massively successful, we might have to make our theories of alignment applicable quite soon, and might not have the luxury of not parallelizing this research as much as possible.
Finally, I state a relatively minor quibble: Ngo seems to put a lot of the emphasis here on understanding deep learning. I would not go so far, for two reasons: one is the two-body desideratum I mentioned before, but the other is that deep learning might not be The Way. It’s possible that it’s better to find a different path towards AI altogether, one designed on better understanding from the start. This might seem overly ambitious, but I do have some leads.
1. ^
  There are certainly examples of research which is at least trying to be robust, while still failing to be very precise (e.g. some of Paul Christiano’s work falls in this category). Such research can be a good starting point for investigation, but should become precise at some stage for it to truly produce robust solutions.

Vanessa Kosoy 3 Jan 2026 8:52 UTC
2 points
0
on: Overwhelming Superintelligence
I am separately worried about “Carefully Controlled Moderate Superintelligences that we’re running at scale, each instance of which is not threatening, but, we’re running a lot of them...
I think that this particular distinction is not the critical one. What constitutes an “instance” is somewhat fuzzy. (A single reasoning thread? A system with a particular human/corporate owner? A particular source code? A particular utility function?) I think it’s more useful to think in terms of machine intelligence suprasystems with strong internal coordination capabilities. That is, if we’re somehow confident that the “instances” can’t or won’t coordinate either causally or acausally, then they are arguably truly “instances”, but the more they can coordinate the more we should be thinking of them in the aggregate. (Hence, the most cautious risk estimate comes from comparing the sum total of all machine intelligence against the sum total of all human intelligence^[1].)
1. ^
  More precisely, not even the sum total of all human intelligence, but the fraction of human intelligence that humans can effectively coordinate. See also comment by Nisan.
