Research Lead at CORAL. Director of AI research at ALTER. PhD student in Shay Moran’s group in the Technion (my PhD research and my CORAL/ALTER research are one and the same). See also Google Scholar and LinkedIn.
E-mail: {first name}@alter.org.il
The problem is, even if the argument that wireheading doesn’t happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying “your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won’t happen, good news!” But it’s not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket was launched towards the Moon with a faulty navigation system, that does not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an “ingredient in a plan” to get the rocket to Mars.
I also don’t think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent in which case you can come to model drugs as a “trap” (and there can be many additional complications).
What do you mean “randomly come upon A”? RL is not random. Why wouldn’t it find A?
Let the proxy reward function we use to train the AI be $R_{\mathrm{proxy}}$ and the “true” reward function that we intend the AI to follow be $R_{\mathrm{true}}$. Supposedly, these functions agree on some domain $D$ but catastrophically come apart outside of it. Then, if all the training data lies inside $D$, which reward function is selected depends on the algorithm’s inductive bias (and possibly also on luck). The “cancellation” hope is then that inductive bias favors $R_{\mathrm{true}}$ over $R_{\mathrm{proxy}}$.
But why would that be the case? Realistically, the inductive bias is something like “simplicity”. And human preferences are very complex. On the other hand, something like “the reward is such-and-such bits in the input” is very simple. So instead of cancelling out, the problem is only aggravated.
And that’s under the assumption that $R_{\mathrm{proxy}}$ and $R_{\mathrm{true}}$ actually agree on $D$, which is in itself wildly optimistic.
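A toy illustration of the simplicity point (entirely my own construction, with made-up state variables and a deliberately crude notion of “simplicity”): two reward hypotheses that agree on the training domain $D$ but come apart off-distribution, where a description-length bias favors the proxy.

```python
# Illustrative toy (my own, not from the post): R_proxy and R_true agree on the
# training domain D but diverge outside it, and a crude simplicity measure favors R_proxy.
import inspect

def reward_proxy(state):
    # "the reward is such-and-such bits in the input": very simple
    return state["reward_channel"]

def reward_true(state):
    # stand-in for complex human preferences, with many qualifications
    if state["humans_flourishing"] and not state["sensors_tampered"]:
        return state["reward_channel"]
    return 0.0

# Training domain D: tampering never occurs there, so the two functions agree exactly.
D = [{"reward_channel": b, "humans_flourishing": True, "sensors_tampered": False}
     for b in (0.0, 1.0)]
assert all(reward_proxy(s) == reward_true(s) for s in D)

# Off-distribution, they come apart catastrophically.
off = {"reward_channel": 1.0, "humans_flourishing": False, "sensors_tampered": True}
print(reward_proxy(off), reward_true(off))  # 1.0 vs 0.0

# A crude "description length" bias prefers the proxy; real human preferences are
# far more complex, so the inductive bias aggravates rather than cancels the problem.
for f in (reward_proxy, reward_true):
    print(f.__name__, len(inspect.getsource(f)))
```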
In the post Richard Ngo talks about delineating “alignment research” vs. “capability research”, i.e. understanding what properties of technical AI research make it, in expectation, beneficial for reducing AI-risk rather than harmful. He comes up with a taxonomy based on two axes:
Cognitivist vs. Behaviorist, i.e. focused on internals vs. external behavior. Arguably, net-beneficial research tends to be on the cognitivist side.
Worst-case vs. Average-case, i.e. focused on rare failures vs. “usual” behavior. Arguably, net-beneficial research tends to be on the worst-case side.
I think that Ngo raises an important question, and his answers point in the right direction. For my part, I would like to slightly reframe his parameters and add two more axes:
Instead of “cognitivist vs. behaviorist”, I would say “gears-level vs. surface-level”. We want research that explains the actual underlying mechanisms rather than just empirically registering particular phenomena or trends. This is definitely similar to “cognitivist vs. behaviorist”, but research that takes into account the internals of the algorithm can still be mostly surface-level: e.g. maybe it’s just saying, if we tweak this parameter in the algorithm, the performance goes up. I think that Ngo might object that tweaking a parameter is very different from talking e.g. about “beliefs” or “goals” that the system has, and I would agree, but I think that gears vs. surface might be a clearer delineation.
Instead of “worst-case vs. average-case”, I would say “robust vs. fragile”. This is because it’s not entirely clear what distribution we are “averaging” over, and because it’s important that rare failure modes can arise due to systematic reasons rather than just bad luck. The way I think about it: “fragile” methods are methods that can work if you can afford failures, so that every time there is a failure you amend the system until the result is satisfactory. “Robust” methods are the methods you need if you can’t afford even one failure.
Another axis I would add is “two-body vs. one-body”. This is related to Ngo’s remark in the end that “further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds”. The point is, alignment is fundamentally a two-body problem. We are aligning AI to a human (or many humans). And humans are already confused about what their preferences are, or about what it would mean to solve a problem without “undesirable side effects”. Therefore, we need research that illuminates the human side of things as well as the AI side of things. The way I envision it is by creating a theory of agents that is applicable to AIs and humans alike. Other approaches might treat those two sides more asymmetrically, but they do have to address both sides.
Additionally, I would add “precise vs. vague”. This is the difference between making vague, informal statements, and making precise, mathematical, hopefully quantitative statements. Being precise is certainly not sufficient: e.g. scaling laws can be precise, but fail to be gears-level. But it does seem like an important desideratum. Maybe this doesn’t need to be its own axis: precision seems necessary for achieving robustness. But, I think it’s a useful criterion for assessing research that stands on its own[1].
Of course, on most of those axes, going “left” is useful for capabilities and not just alignment. As Ngo justly points out, a lot of research is inevitably dual use. However, approaches that lean “right” are often sufficient to advance capabilities and are unlikely to be sufficient to solve alignment, making them clearly the worse option overall.
In earlier times, I would also add here the consideration of applicability. When assembling a dangerous machine, it seems best to plug in the parts that make it dangerous last, if at all possible. Similarly, it’s better to start developing our understanding of agents from parts that don’t immediately allow building agents. Today, this is still true to some extent, however the urgency of the problem might make it moot. Unless the Pause-AI efforts are massively successful, we might have to make our theories of alignment applicable quite soon, and might not have the luxury of not parallelizing this research as much as possible.
Finally, a relatively minor quibble: Ngo seems to put a lot of the emphasis here on understanding deep learning. I would not go that far, for two reasons: one is the two-body desideratum I mentioned before, and the other is that deep learning might not be The Way. It’s possible that it’s better to find a different path towards AI altogether, one designed on the basis of better understanding from the start. This might seem overly ambitious, but I do have some leads.
There are certainly examples of research which is at least trying to be robust, while still failing to be very precise (e.g. some of Paul Christiano’s work falls in this category). Such research can be a good starting point for investigation, but should become precise at some stage for it to truly produce robust solutions.
I am separately worried about “Carefully Controlled Moderate Superintelligences that we’re running at scale, each instance of which is not threatening, but, we’re running a lot of them...
I think that this particular distinction is not the critical one. What constitutes an “instance” is somewhat fuzzy. (A single reasoning thread? A system with a particular human/corporate owner? A particular source code? A particular utility function?) I think it’s more useful to think in terms of machine intelligence suprasystems with strong internal coordination capabilities. That is, if we’re somehow confident that the “instances” can’t or won’t coordinate either causally or acausally, then they are arguably truly “instances”, but the more they can coordinate the more we should be thinking of them in the aggregate. (Hence, the most cautious risk estimate comes from comparing the sum total of all machine intelligence against the sum total of all human intelligence[1].)
More precisely, not even the sum total of all human intelligence, but the fraction of human intelligence that humans can effectively coordinate. See also comment by Nisan.
There seem to be two underlying motivations here, which are best kept separate.
One motivation is having a good vocabulary to talk about fine-grained distinctions. I’m on board with this one. We might want to distinguish e.g.:
Smarter than a median human along all AI-risk-relevant axes
Smarter than the smartest human along all AI-risk-relevant axes
Smarter than all of humanity put together along all AI-risk-relevant axes
Smart enough to have a 50% success probability to kill all humans if it chooses to, given current level of countermeasures
Smart enough to have a 50% success probability to kill all humans if it chooses to, even if best-case countermeasures are in place (this particular distinction inspired by Buck’s comments on this thread)
But then, first, it is clear that existing AI is not superintelligence according to any of the above interpretations. Second, I see no reason not to use catchy words like “hyperintelligence”, per One’s suggestion. (Although I agree that there is an advantage to choosing more descriptive terms.)
Another motivation is staying ahead of the hype cycles and epistemic warfare on twitter or whatnot. This one I take issue with.
I don’t have an account on twitter, and I hope that I never will. Twisting ourselves into pretzels with ridiculous words like “AIdon’tkilleveryoneism” is incompatible with creating a vocabulary optimized for actually thinking and having productive discussions among people who are trying to be the adults in the room. Let the twitterites use whatever anti-language they want. To the people trying to do beneficial politics there: I sincerely wish you luck, but I’m laboring in a different trench; let’s use the proper tool for each task.
I understand that there can be practical difficulties such as, what if LW ends up using a language so different from the outside world that it will become inaccessible to outsiders, even when those outsiders would otherwise make valuable contributions. There are probably some tradeoffs that are reasonable to make with such considerations in mind. But let’s at least not abandon any linguistic position at the slightest threatening gesture of the enemy.
This post is an overview of Steven Byrnes’ AI alignment research programme, which I think is interesting and potentially very useful.
In a nutshell, Byrnes’ goal is to reverse engineer the human utility function, or at least some of its central features. I don’t think this will succeed in the sense of, we’ll find an explicit representation that can be hard-coded into AI. However, I believe that this kind of research is useful for two main reasons:
Bridging brain science and agent theory is a promising way to make sure that we build a theory of agents broad enough to include humans. The latter is crucial in order to formally define alignment (since alignment is between the AI-agent and the human-agent), which is needed to have formal alignment guarantees. In particular, it is needed for value learning to become possible, such as in my COSI proposal.
While ideally we might wish for alignment guarantees to assume as little as possible, it might be difficult or even impossible to design a competitive AI system which is robustly aligned with a completely uninformed prior. As a conservative example, we might discover that one or several scalar parameters of humans should be approximately known (e.g. parameters related to amount of computing resources[1]). In this case, we would need to reverse engineer these parameters from brain science, which requires having a reliable dictionary between brain science and agent theory.
I hope that in the future this programme makes more direct contact with the mathematical formalism of agent theory, of the sort the LTA is constructing. However, I realize that this is a difficult challenge.
Why are we giving up on plain “superintelligence” so quickly? According to Wikipedia:
A superintelligence is a hypothetical agent that possesses intelligence surpassing that of the most gifted human minds. Philosopher Nick Bostrom defines superintelligence as “any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest”.
According to Google AI Overview:
Superintelligence (or Artificial Superintelligence—ASI) is a hypothetical AI that vastly surpasses human intellect in virtually all cognitive domains, possessing superior scientific creativity, general wisdom, and social skills, operating at speeds and capacities far beyond human capability, and potentially leading to profound societal transformation or existential risks if not safely aligned with human goals.
I don’t think I saw anyone use “superintelligence” to mean “better than a majority of humans on some specific tasks” before very recently. (Was Deep Blue a superintelligence? Is a calculator a superintelligence?)
This is a deeply confused post.
In this post, Turner sets out to debunk what he perceives as “fundamentally confused ideas” which are common in the AI alignment field. I strongly disagree with his claims.
In section 1, Turner quotes a passage from “Superintelligence”, in which Bostrom talks about the problem of wireheading. Turner declares this to be “nonsense” since, according to Turner, RL systems don’t seek to maximize a reward.
First, Bostrom (AFAICT) is describing a system which (i) learns online and (ii) maximizes long-term consequences. There are good reasons to focus on such a system: these are properties that are desirable in an AI defense system, if the system is aligned. Now, the LLM+RLHF paradigm which Turner puts at the center is, at least superficially, not like that. However, this is no argument against Bostrom: today’s systems have already gone beyond LLM+RLHF (introducing RL over chain-of-thought) and tomorrow’s systems are likely to be even more different. And, if a given AI design does not somehow acquire properties (i)+(ii) even indirectly (e.g. via in-context learning), then it’s not clear how it would be useful for creating a defense system.
Second, Turner might argue that even granted i+ii, the AI would still not maximize reward because the properties of deep learning would cause it to converge to some different, reward-suboptimal, model. While this is often true, it is hardly an argument why not to worry.
While deep learning is not known to guarantee convergence to the reward-optimal policy (we hardly know how to prove any guarantees about deep learning), RL algorithms are certainly designed with reward maximization in mind. If your AI is unaligned even under best-case assumptions about learning convergence, it seems very unlikely that deviating from these assumptions would somehow cause it to be aligned (while remaining highly capable). To argue otherwise is akin to hoping that the rocket reaches the Moon because our equations of orbital mechanics don’t account for some errors, rather than despite them.
After this argument, Turner adds that “as a point of further fact, RL approaches constitute humanity’s current best tools for aligning AI systems today”. This observation seems completely irrelevant. It was indeed expected that RL would be useful in the subhuman regime, when the system cannot fail catastrophically simply because it lacks the capabilities. (Even when it convinces some vulnerable person to commit suicide, OpenAI’s legal department can handle it.) I would expect this to have been obvious to Bostrom even back then, and it doesn’t invalidate his conclusions in the slightest.
In section 3, Turner proceeds to attack the so-called “counting argument” for misalignment. The counting argument goes: since there are many more misaligned minds/goals than aligned minds/goals, even conditional on “good” behavior in training, it seems unlikely that current methods will produce an aligned mind. Turner (quoting Belrose and Pope) counters this argument by way of analogy. Deep learning successfully generalizes even though most models that perform well on the training data don’t perform well on the test data. Hence, (they argue) the counting argument must be fallacious.
The major error that Turner, Belrose and Pope are making is that of confusing aleatoric and epistemic uncertainty. There is also a minor error of being careless about what measure the counting is performed over.
If we did not know anything about some algorithm except that it performs well on the training data, we would indeed have at most a weak expectation of it performing well on the test data. However, deep learning is far from random in this regard: it was selected by decades of research to be the sort of algorithm that does generalize well. The counting argument nonetheless gives us a perfectly reasonable prior in this case; deep learning’s track record is an update on top of that prior, not a refutation of it.
(The minor point is that, w.r.t. a simplicity prior, even a random algorithm has some bounded-from-below probability of generalizing well.)
The counting argument is not premised on a deep understanding of how deep learning works (which at present doesn’t exist), but on a reasonable prior about what we should expect from our vantage point of ignorance. It describes our epistemic uncertainty, not the aleatoric uncertainty of deep learning. We can imagine that, if we knew how deep learning really works in the context of typical LLM training data etc., we would be able to confidently conclude that, say, RLHF has a high probability of eventually producing agents that primarily want to build astronomical superstructures in the shape of English letters, or whatnot. (It is of course also possible we would conclude that LLM+RLHF will never produce anything powerful enough to be dangerous or useful-as-defense-system.) That would not be inconsistent with the counting argument as applied from our current state of knowledge.
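To make the counting intuition concrete, here is a toy enumeration (my own illustration, with an arbitrary stand-in “true” behavior): among all Boolean functions on three bits that fit a four-point training set, only a small fraction also fit the held-out points. That is the prior from the vantage point of ignorance; what rescues deep learning is additional knowledge about its inductive bias, not a flaw in the counting.

```python
# Toy counting argument (illustrative only): enumerate every Boolean function on 3 bits,
# keep those that fit the training points, and check how many also fit the test points.
from itertools import product

inputs = list(product([0, 1], repeat=3))

def target(x):
    return x[0] ^ x[1] ^ x[2]  # arbitrary stand-in for the "intended" behavior

train, test = inputs[:4], inputs[4:]

fit_train = fit_both = 0
for table in product([0, 1], repeat=len(inputs)):  # all 2^8 = 256 functions
    f = dict(zip(inputs, table))
    if all(f[x] == target(x) for x in train):
        fit_train += 1
        if all(f[x] == target(x) for x in test):
            fit_both += 1

print(fit_train, fit_both)  # 16 functions fit the training data; only 1 of them generalizes
```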
The real question is then: conditional on our knowledge that deep learning often generalizes well, how confident are we that it will generalize aligned behavior from training to deployment when scaled up to highly capable systems? Unfortunately, I don’t think this update is strong enough to make us remotely safe. The fact that deep learning generalizes implies that it implements some form of Occam’s razor, but Occam’s razor doesn’t strongly select for alignment, as far as we can tell. Our current (more or less) best model of Occam’s razor is Solomonoff induction, which Turner dismisses as irrelevant to neural networks: but here again, the fact that our understanding is flawed just pushes us back towards the counting-argument prior, not towards safety.
Also, we should keep in mind that deep learning doesn’t always generalize well empirically, it’s just that when it fails we add more data until it starts generalizing. But, if the failure is “kill all humans”, there is nobody left to add more data.
Turner’s conclusion is that “it becomes far easier to just use the AIs as tools which do things we ask”. The extent to which I agree with this depends on the interpretation of the vague term “tools”. Certainly modern AI is a tool that does approximately what we ask (even though, when using AI for math, I’m already often annoyed at its attempts to cheat and hide the flaws of its arguments). However, I don’t think we know how to safely create “tools” that are powerful enough to e.g. nearly-autonomously do alignment research or otherwise make substantial steps toward building an AI defense system.
This post contains an interesting mathematical result: that the machinery of natural latents can be transferred from classical information theory to algorithmic information theory. I find it intriguing for multiple reasons:
It updates me towards natural latents being a useful concept for foundational questions in agent theory, as opposed to being some artifact of overindexing on Bayesian networks as the “right” ontology.
The proof technique involves defining an algorithmic information theory analogue of Bayesian networks, which is something I haven’t seen before and seems quite interesting in itself.
It would be interesting to see whether any of this carries over to the efficiently computable counterparts of Kolmogorov complexity I recently invented[1].
The main thing this post is missing is any rigorous examples or existence proofs of these AIT natural latents. I’m guessing that the following construction should work:
Choose a universal Turing machine $U$.
Choose $\Lambda$ to be a $U$-program for a total recursive function, satisfying a suitable condition.
Choose $y_1, \ldots, y_n$ to be random strings of length $\ell$.
Set $x_i := \Lambda(y_i)$.
Then, with high probability, $\Lambda$ is a natural latent for the $x_i$. (I think?)
It would be nice to see something like that in the post.
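For reference, here is my own rough paraphrase (not the post’s exact statement, and omitting its error terms) of the two conditions such a $\Lambda$ would have to satisfy, i.e. mediation and (strong) redundancy translated into Kolmogorov-complexity terms:

```latex
\begin{align*}
& K(x_1, \ldots, x_n \mid \Lambda) \;\approx\; \sum_{i=1}^{n} K(x_i \mid \Lambda)
  && \text{(mediation: given $\Lambda$, the $x_i$ share almost no further information)} \\
& K(\Lambda \mid x_i) \;\approx\; 0 \;\; \text{for all } i
  && \text{(redundancy: $\Lambda$ is almost computable from each $x_i$ alone)}
\end{align*}
```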
These ideas seem conceptually close to concepts like sophistication in algorithmic statistics, and the connection might be worth investigating.
Now, about the stated motivation: the OP claims that natural latents capture how “reasonable” agents choose to define categories about the world. The argument seems somewhat compelling, although some further justification is required for the claim that
If you’ve been wondering why on Earth we would ever expect to find such simple structures in the complicated real world, conditioning on background knowledge is the main answer.
That said, I think that real-world categorizations are also somewhat value-laden: depending on the agent’s preferences, and on the laws of the universe in which they find themselves, there might be particular features they care about much more than other features. (Since they are more decision-relevant.) The importance of these features will likely influence which categories are useful to define. This fact cannot be captured in a formalism on the level of abstraction in this post. (Although maybe we can get some of the way there by drawing on rate-distortion theory?)
Still unpublished.
Regarding the bound in the post (“Then: … log …”): O-notation in contexts like this is somewhat ambiguous, because it’s not clear what is treated as a constant and what is treated as a variable that grows to infinity. Would it be correct to say that there exist constants that depend on nothing except for the implicit choice of UTM (or maybe they also need to depend on the number of strings?) s.t. the stated bound holds with those constants?
Aside: if a string is natural over some strings y_1, …, y_n, then it’s also natural over any subset consisting of two or more of those strings.
Is this actually true? Intuitively, it feels wrong: a natural latent is supposed to contain exactly the information which is mutual between the strings. But, a subset would have more mutual information than the full set. Formally, the redundancy condition obviously descends to subsets, but mediation seems to break, at least superficially?
In this post Jan Kulveit calls for creating a theory of “hierarchical agency”, i.e. a theory that talks about agents composed of agents, which might themselves be composed of agents etc.
The form of the post is a dialogue between Kulveit and Claude (the AI). I don’t like this format. I think that dialogues are a bad format in general: disorganized and not skimming-friendly. The only case where IMO dialogues are defensible is when it’s a real dialogue: real people with different world-views who are trying to bridge and/or argue out their differences.
Now, about the content. I agree with Kulveit that multi-agency is important. I’m not entirely sold on the importance of the “hierarchical agency” frame, but I agree that “when can a system of agents be regarded as a single agent” seems like a question that should be answered. At the very least, a certain type of answer to this question might relieve us of the need to worry about “what if humans are multi-agents” in the context of alignment (because, arguably, it would be possible to just regard humans as uni-agents anyway).
After reading this post, I came up with the following (extremely simplistic) toy model for hierarchical agency.
Let $D$ be the set of possible decisions an agent can make, $W$ the set of “possible worlds”, $O$ the set of “outcomes” and $P: D \times W \to \Delta O$ the (known) process by which outcomes are generated. Then, an agent that makes decision $d \in D$ can be ascribed the belief $\theta \in \Delta W$ and the utility function $u: O \to \mathbb{R}$ when $d$ is the unique maximum of $\mathbb{E}_{w \sim \theta,\, o \sim P(d', w)}[u(o)]$ over $d' \in D$. Some decisions cannot be ascribed “intent” at all: for example, if $W$ is a singleton (so that $P$ is effectively a map $D \to \Delta O$) then $d$ can be ascribed intent iff $P(d)$ is an exposed point of the convex hull of the image of $P$. (See also)
We can now consider a system of $n$ agents with decision sets $D_1, \ldots, D_n$ and a process $P: D_1 \times \cdots \times D_n \times W \to \Delta O$. For each set $S \subseteq \{1 \ldots n\}$ of agents, we can define $D_S := \prod_{i \in S} D_i$ and $W_S := W \times \prod_{i \notin S} D_i$, and then $P_S: D_S \times W_S \to \Delta O$ is defined in the obvious way. We can then ask which agent sets have “collective intent” and which don’t.
To give a simple example, let $D_1 = D_2 = \{\mathtt{h}, \mathtt{t}, \mathtt{r}\}$, $O = \{\mathtt{hh}, \mathtt{ht}, \mathtt{th}, \mathtt{tt}\}$, and $P$ is defined in the obvious way, where $\mathtt{r}$ corresponds to flipping a fair coin to decide between $\mathtt{h}$ and $\mathtt{t}$. Then, if both agents choose $\mathtt{r}$ then they have intent individually (we can think of them as playing matching pennies, with each agent believing the other one’s action to depend on their own action), but not collectively. On the other hand, if they play a pure strategy then they have collective intent as well.
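Below is a minimal sketch of the collective-intent check for the coin example (the encoding of decisions and outcomes is mine, and `scipy` is used only for a small feasibility LP): a joint decision has collective intent iff its outcome distribution is an exposed point of the convex hull of all achievable outcome distributions, i.e. some utility over outcomes is uniquely maximized there.

```python
# Sketch (my own encoding, not from the post): test which joint decisions in the
# coin example have "collective intent", via an exposed-point feasibility LP.
import itertools
import numpy as np
from scipy.optimize import linprog

DECISIONS = ["h", "t", "r"]  # r = flip a fair coin between h and t

def outcome_dist(d1, d2):
    """Distribution over outcomes (hh, ht, th, tt) induced by a joint decision."""
    p1 = {"h": 1.0, "t": 0.0, "r": 0.5}[d1]  # probability agent 1 shows heads
    p2 = {"h": 1.0, "t": 0.0, "r": 0.5}[d2]
    return np.array([p1 * p2, p1 * (1 - p2), (1 - p1) * p2, (1 - p1) * (1 - p2)])

def is_exposed(point, others, tol=1e-9):
    """True iff some linear functional (a utility over outcomes) is uniquely
    maximized at `point` among all achievable points, i.e. `point` is exposed."""
    diffs = [point - q for q in others if np.linalg.norm(point - q) > tol]
    if not diffs:
        return True
    # Feasibility LP: find u with u . (point - q) >= 1 for every other point q.
    res = linprog(c=np.zeros(len(point)), A_ub=-np.array(diffs),
                  b_ub=-np.ones(len(diffs)), bounds=[(None, None)] * len(point))
    return res.status == 0  # feasible <=> exposed

points = {jd: outcome_dist(*jd) for jd in itertools.product(DECISIONS, DECISIONS)}
for jd, pt in points.items():
    others = [p for k, p in points.items() if k != jd]
    print(jd, "collective intent:", is_exposed(pt, others))
# Pure joint decisions like ('h', 'h') come out exposed (collective intent);
# ('r', 'r') yields the uniform outcome distribution, which is not exposed.
```

Note that for the pair of agents taken together (with trivial $W$), this reduces exactly to the single-agent exposed-point criterion above.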
Extending this into a theory that fully engages with all relevant aspects of the problem would require incorporating infra-Bayesianism, Formal Computational Realism, possibly some form of the Algorithmic Descriptive Agency Measure etc, and more generally, first developing the theory of uni-agents. But maybe starting from the “hierarchical agency” end can be useful as well.
In this post, Abram Demski argues that existing AI systems are already “AGI”. They are clearly general in a way previous generations of AI were not, and claiming that they are still not AGI smells of moving the goalposts.
Abram also helpfully edited the post to summarize and address some of the discussion in the comments. The commenters argued, and Abram largely agreed, that there are still important abilities that modern AI lacks. However, there is still the question of whether that should disqualify it from the moniker “AGI”, or maybe we need new terminology.
I tend to agree with Abram that there’s a sense in which modern AI is already “AGI”, and also agree with the commenters that there might be something important missing. To put the latter in my own words: I think that there is some natural property in computational-system-space s.t.
The prospect of AI with this property is the key reason to be worried about X-risk from unaligned AI.
Humans, or at least some humans, or at least humanity as a collective, has at least a little of this property, and this is what enabled humanity to become a technological civilization.
To handwave in the direction of that property, I would say “the ability to effectively and continuously acquire deep knowledge and exploit this knowledge to construct and execute goal-directed plans over long lifetimes and consequence horizons”.
It is IMO unclear whether modern AIs are better thought of as having a positive but subhuman amount of this property, or as lacking it entirely (i.e. lacking some algorithmic component necessary for it). This question is hard to answer from our understanding of the algorithms, because foundation models “steal” some human cognitive algorithms in opaque ways, and we don’t even understand deep learning itself. Clearly, a civilization comprised of modern AIs and no humans would not survive (not to mention progress), even if equipped with excellent robotic bodies. But the latter might be just a “coincidental” fact about how harsh our specific universe is.
Be that as it may, I think that the argument for more fine-grained terminology is strong. We can concede that modern AI is AGI, and have a new term for the thing modern AI might-not-yet-be. Maybe AGA: “Artificial General Agent”?
Here’s a feature proposal.
The problem: At present, when a post has 0 reviews, there is an incentive against writing critical reviews. Writing such a review enables the post to enter the voting phase, which you don’t especially want to happen if you think the post is undeserving. This seems perverse: critical reviews are valuable, especially so if someone would later write a positive review anyway, enabling the post to enter voting regardless. (In principle, you can “lie in ambush” until someone writes a positive review and only then write your negative review, but that requires annoying logistics.)
My suggestion: Allow flagging reviews as “critical” in the UI. (One option is to consider a review “critical” whenever your own vote for the post is negative, another is to have a separate checkbox.) Such reviews would not count for enabling the post to enter voting.
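A minimal sketch of the proposed rule (hypothetical field names, not the actual LW codebase):

```python
# Hypothetical eligibility check for the proposed "critical review" flag.
def eligible_for_voting(reviews):
    """A post enters the voting phase iff it has at least one non-critical review."""
    return any(not review.get("critical", False) for review in reviews)

print(eligible_for_voting([]))                                           # False
print(eligible_for_voting([{"critical": True}]))                         # False: only a critical review so far
print(eligible_for_voting([{"critical": True}, {"critical": False}]))    # True: a non-critical review arrived
```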
This work[1] was the first[2] foray into proving non-trivial regret bounds in the robust (infra-Bayesian) setting. The specific bound I got was later slightly improved in Diffractor’s and my subsequent paper. This work studied a variant of linear bandits, for the usual reasons linear models are often studied in learning theory: it is a conveniently simple setting where we actually know how to prove things, even with computationally efficient algorithms. (Although we still don’t have a computationally efficient algorithm for the robust version: not because it’s very difficult, but (probably) just because nobody got around to solving it.) As such, this work was useful as a toy-model test that infra-Bayesianism doesn’t run into statistical intractability issues. As to whether linear-model algorithms or their direct descendants will actually play a role in the ultimate theory of learning, that is still an open question.
An abridged version was also published as a paper in JMLR.
Other than Tian et al, which technically is a robust regret bound, but was not framed by the authors as such (instead, their motivation was studying zero-sum games).
TLDR: This post introduces a novel and interesting game-theoretic solution concept and provides informal arguments for why robust (infra-Bayesian) reinforcement learning algorithms might be expected to produce this solution in the multi-agent setting. As such, it is potentially an important step towards understanding multi-agency.
Disclosure: This review is hardly impartial, since the post was written with my guidance and based on my own work.
Understanding multi-agency is, IMO, one of the most confusing and difficult challenges in the construction of a general theory of intelligent agents. I have a lot of uncertainty about what shape the solution should take, even in the broadest brushstrokes, as I outlined in my recent five worlds taxonomy[1]. This is in contrast to uni-agency, where Formal Computational Realism (FCR) is, IMO, pretty close to at least nailing down the correct type signature and qualitative nature of the desiderata.
At the same time, understanding multi-agency seems quite important in the context of AI alignment. There are many sorts of multi-agent interactions that are potentially relevant:
AI-user is in the very core of the problem.
user-[arbitrary agent] is important since the AI is supposed to faithfully “represent” the user in those interactions, and since examining those interactions might be necessary for correctly interpreting the user’s preferences.
[counterfactual user]-[counterfactual user] is relevant to dealing with uncertainty during value learning.
user-user is important for multi-user alignment.
AI-[counterfactual agent] is important when considering inner alignment, since mesaoptimizers can sometimes be regarded as “acausal attacks” by counterfactual agents.
AI-[successor agent] seems important for thinking about self-improving / reproducing agents.
AI-AI is important if we expect a multipole scenario.
This post tells a particular story of how multi-agent theory might look. In this story, agents converge to a new type of solution concept described in the “stable cycles for multiplayer games” section. (I call this solution “haggling equilibrium”). As opposed to Nash equilibria, the “typical” (but not every) haggling equilibrium in a two-player game is Pareto-efficient. This stands in contrast even to Nash equilibria in repeated games, where Pareto-efficiency is possible but, due to the folk theorem, very underdetermined.
Moreover, there is an argument that a particular type of robust RL algorithm (robust UCB) would converge to such equilibria under some assumptions. However, the argument is pretty informal and there is not even a rigorous conjecture at present. There are, broadly speaking, two possibilities for how the story might be completed:
We promote convergence to haggling equilibrium to a desideratum, and demonstrate algorithms that accomplish it with good statistical and computational efficiency. (This corresponds to the “Economica” world in my five worlds taxonomy.)
We show that there are reasonable uni-agent desiderata (robust regret bounds and maybe more?) that imply convergence to haggling equilibrium. (This corresponds to the “Harmonia” world in my five worlds taxonomy.)
With either possibility, the hope is that combining such a result with FCR would promote it to applying in more “exotic” contexts as well, such as one-shot games with transparent source code (along the lines of Demski’s “logical time”).
It is also interesting to study the notion of haggling equilibrium in itself, for example: is there always a Pareto-efficient haggling equilibrium? (True for two players, but I don’t know the answer in general.)
To summarize, the ideas in this post are, AFAIK, novel (although somewhat similar ideas appeared in the literature in the guise of “aspiration-based” algorithms in multi-agent RL, see e.g. Crandall and Goodrich 2013) and might be key to understanding multi-agency. However, the jury is still very much out.
In the terminology of those five worlds, I consider Nihiland and Discordia to be quite unlikely, but Linguistica, Economica and Harmonia all seem plausible.
I propose a taxonomy of 5 possible worlds for multi-agent theory, inspired by Impagliazzo’s 5 possible worlds of complexity theory (and also the Aaronson-Barak 5 worlds of AI):
Nihiland: There is not even a coherent uni-agent theory, not to mention multi-agency. I find this world quite unlikely, but leave it here for the sake of completeness (and for the sake of the number 5). Closely related is antirealism about rationality, which I have criticized in the past. In this world it is not clear whether the alignment problem is well-posed at all.
Discordia: There is a coherent uni-agent theory, but no coherent theory of multi-agency. This world is conceivable, since the current understanding of multi-agency is much worse than the understanding of “solitary” agents. In this world, negative-sum conflicts and coordination failures are probably ubiquitous (even among arbitrarily sophisticated agents), because there is no principle of rationality that rules them out. Acausal trade is probably not a thing, or at least rare and fragile. In the context of value learning, there might be no principled way to deal with uncertainty (which could otherwise be regarded as a bargaining problem). There is also no principled solution to multi-user alignment.
Linguistica: There is a coherent theory of multi-agency, but agents are inevitably divided into “types” s.t. only interactions between agents of the same type have strong guarantees. (The name of the world is because we can metaphorically think of the types as different “languages”.) An example of how this might happen is reflective oracles, where the type corresponds to the choice of fixed point. Acausal trade probably exists[1], but is segregated by type. Alignment is complicated by the need to specify or learn the human type.
Economica: There is a coherent uni-type theory of multi-agency, but this theory involves desiderata that can only be motivated by multi-agency. Explicitly thinking about multi-agency is necessary to construct the full theory of agents[2]. In this world, the Yudkowskian hope for ubiquitous strong cooperation guarantees can be justified, and acausal trade might be very common. Figuring out the multi-agent theory, and not just the uni-agent fragment, is probably important for alignment, or at least necessary in order to avoid leaving huge gains from trade on the table.
Harmonia: There is a coherent uni-type theory of multi-agency, and this theory can be derived entirely from desiderata that can be motivated without invoking multi-agency at all. There is no special “multi-agent sauce”: any sufficiently rational agents automatically have strong guarantees in the multi-agent setting. Explicitly understanding multi-agency is arguably still important for dealing with uncertainty in value learning, and for dealing with multi-user alignment. (And also in order to know that we are in this world.)
For simplicity, I’m ignoring what is arguably an “orthogonal” axis: to which extent the “correct” multi-agent theory implies acausal cooperation even under favorable conditions. I believe that, outside of Nihiland and Discordia, it probably does, but the alternative hypothesis is also tenable.
On the border between Linguistica and Economica, there are worlds with strong guarantees for agents of the same type and medium-strength guarantees for agents of different type (where “medium-strength” is still stronger than “achieve maximin payoff”: the latter is already guaranteed in infra-Bayesianism). This blurs the boundary, but I would consider this to be Linguistica if even slightly different types have much weaker guarantees (or if there is no useful notion of “slightly different types”) and Economica if there is continuous graceful degradation like in Yudkowsky’s subjective fairness proposal.
This post discusses an important point: it is impossible to be simultaneously perfectly priorist (“updateless”) and learn. Learning requires eventually “passing to” something like a posterior, which is inconsistent with forever maintaining “entanglement” with a counterfactual world. This is somewhat similar to the problem of traps (irreversible transitions): being prudent about risking traps requires relying on your prior, which prevents you from learning every conceivable opportunity.
My own position on this cluster of questions is that you should be priorist/(infra-)Bayesian about physics but postist/learner/frequentist about logic. This idea is formally embodied in the no-regret criterion for Formal Computational Realism. I believe that this no-regret condition implies something like the OP’s “Eventual Learning”, but formally demonstrating it is future work.
Strictly speaking, there’s no result saying you can’t represent quantum phenomena by stochastic dynamics (a.k.a. hidden variables). Indeed, e.g. the de Broglie-Bohm interpretation does exactly that. What does exist is Bell’s inequality, which implies that it’s impossible to represent quantum phenomena by local hidden variables (local = the distribution is the limit of causal graphs in which variables are localized in spacetime and causal connections only run along future-directed timelike (not superluminal) separations). Now, our framework doesn’t even fall in the domain of Bell’s inequality, since (i) we have supracontributions (in this post called “ultracontributions”) instead of ordinary probability distributions (ii) we have multiple co-existing “worlds”. AFAIK, Bell-inequality-based arguments against local hidden variables support neither i nor ii. As such, it is conceivable that our interpretation is in some sense “local”. On the other hand, I don’t know that it’s local and have no strong reason to believe it.
A metacognitive agent is not really an RL algorithm in the usual sense. To first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge on the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore, frameworks that explicitly impose a prior (such as metacognitive agents) are useful for understanding and avoiding these failure modes. (But this has little to do with the OP, the way I see it.)