Research Lead at CORAL. Director of AI research at ALTER. PhD student in Shay Moran’s group in the Technion (my PhD research and my CORAL/ALTER research are one and the same). See also Google Scholar and LinkedIn.
E-mail: {first name}@alter.org.il
A few more observations.
The definition of iteration we had before implicitly assumes that the agent can observe the full outcome of previous iterations. We don’t have to make this assumption. Instead, we can assume a set of possible observations
I believe that Theorem 4 remains valid.
As we remarked before, DDT is not invariant under adding a constant to the loss function. It is interesting to consider what happens when we add an increasingly large constant. In the limit, DDT converges to something I dubbed “Idealized Disambiguative Decision Theory” (IDDT)[1], which works as follows.
For IDDT, it is sufficient to let
For problems coming from unambiguous FDT,
The decision rule is then
Notice that it is now invariant w.r.t. adding constants to
Proposition 5: For any stable problem, it holds that (i) any IDDT-optimal policy is FDT-optimal (ii) there is an FDT-optimal policy which is IDDT-optimal. For any pseudocausal problem, it also holds that any FDT-optimal policy is IDDT-optimal.
One might think, based on this proposition, that IDDT is a superior decision theory to DDT. However, I think that IDDT is incompatible with learning, because of its discontinuous dependence on probabilities.
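To make the discontinuity worry concrete, here is a toy sketch of my own (not from the text above): if we model a support-only rule — one that, like IDDT plausibly does in the large-constant limit, cares only about which outcomes have nonzero probability — then an arbitrarily small perturbation of the belief can flip the decision, whereas an expected-loss rule varies continuously. All names and numbers here are illustrative assumptions.

```python
# Two actions, two states. Action "a" is great in s1 but terrible in s2;
# action "b" is a safe middle option.
LOSS = {  # LOSS[action][state]
    "a": {"s1": 0.0, "s2": 10.0},
    "b": {"s1": 1.0, "s2": 1.0},
}

def expected_loss_choice(p_s2):
    """Pick the action minimizing expected loss under P(s2) = p_s2."""
    ev = {act: (1 - p_s2) * LOSS[act]["s1"] + p_s2 * LOSS[act]["s2"]
          for act in LOSS}
    return min(ev, key=ev.get)

def support_worst_case_choice(p_s2):
    """Pick the action minimizing the worst loss over states of *nonzero*
    probability -- a stand-in for a support-only rule like IDDT."""
    support = [s for s, p in [("s1", 1 - p_s2), ("s2", p_s2)] if p > 0]
    worst = {act: max(LOSS[act][s] for s in support) for act in LOSS}
    return min(worst, key=worst.get)

# The expected-loss rule is continuous around p = 0:
print(expected_loss_choice(0.0), expected_loss_choice(1e-9))          # a a
# The support-only rule jumps as soon as s2 has any probability at all:
print(support_worst_case_choice(0.0), support_worst_case_choice(1e-9))  # a b
```

An ε-sized change in the belief flips the support-only recommendation from "a" to "b", which is the kind of discontinuous dependence on probabilities that a learning process cannot track.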
(Based on Aumann, Hart and Perry.) We will operationalize the problem by assuming the agent’s decision may deterministically depend on observing a coin flip. To simplify the presentation, we assume a single coin flip per intersection, which limits the resulting probabilities to
Denote by
Denote by
Consistently with our source, we set the loss function to be
This problem is formally causal. However, as opposed to all previous examples, it has no extensive form! Hence, EDT in the sense we defined it is ill-posed: to apply EDT reasoning here we need to at least supplement it by a theory of anthropic probabilities. CDT’s counterfactuals agree with FDT’s if we posit that the do-operator is constrained to choosing among “absent-minded” policies.
Previously we described the self-coordination problem, but perhaps self-PD is a more striking example.
Here,
Using the obvious notations
The loss is the usual PD loss of the “factual” player.
This problem is not formally causal, because e.g.
The natural CDT interpretation is the one where the factual policy controls the counterfactual player and the counterfactual policy controls the factual player. (Alas, the terminology gets confusing here: in one case the words “factual” and “counterfactual” refer to the agent’s policy, and in the other case to the coin’s outcome.) Both CDT and EDT play
IDDT is related to the old idea of “surmeasures” from the original infra-Bayesianism sequence.
We can also imagine equipping the agent with a “self-belief”
What you propose here doesn’t address the issue of non-realizability at all. For example, let’s say
This is an idea I came up with and presented at the Agent Foundations 2025 conference at CMU.
Here is a nice simple formalism for decision theory, that in particular supports the decision theory coming out of infra-Bayesianism. I now call the latter decision theory “Disambiguative Decision Theory”, since the counterfactuals work by “disambiguating” the agent’s belief.
Let
This data is common for all decision theories, but the rest of the details depend on the theory:
We are given a mapping
We will call an FDT problem “formally causal” when for any
CDT has the same formal form as FDT, but we always require the problem to be formally causal. Moreover, the interpretation of
Given an FDT problem
Normally,
Given this data, we define the translation
To formalize EDT, we need to assume the decision process is given in “extensive” form. That is, we have a set
We assume that
We define a policy to be
For every
For every
We further assume that there is a mapping
Here,
For any
This represents the event “the decision point
So far, this notion of extensive form decision problem is useful not just for EDT. Specifically for EDT, we add the assumption that we’re given the agent’s belief
For every
Thus, the agent conditions both on following policy
Given an FDT problem
We are given the agent’s belief
Here,
We then have
This is the reason for the name “disambiguative”:
Given an FDT problem
That is,
DDT does have the odd property of non-invariance w.r.t. shifting
Now, let’s look into how different decision theories compare. We will be using FDT as the “gold standard” throughout, when it comes to choosing the correct policy. Note, though, that FDT assumes we can somehow assign strict meaning to the logical counterfactuals, and it is unclear how to accomplish this. On the other hand, DDT makes the substantially weaker assumption that we can define the supracontribution belief. In particular, it is consistent with learning, as was explained here.
Proposition 1: Consider a formally causal FDT problem
Proposition 2: Consider a formally causal FDT problem in extensive form. Then,
Proposition 3: Consider a formally causal FDT problem. Then,
Thus, in the strictly causal case all decision theories coincide; but even here, DDT requires the least precise assumptions for that to work (compared to CDT and EDT). More importantly, DDT allows us to go far beyond the formally causal case. However, we do need a mild assumption about the problem:
Definition 1: An FDT problem is called pseudocausal when for any
It’s easy to see that any formally causal problem is pseudocausal, but there are many counterexamples to the converse.
Essentially, pseudocausality means that the outcome cannot depend on decisions in situations of probability 0. Notice that in reality the agent is never absolutely certain about the decision problem, hence observing a situation of probability 0 should cause it to believe it is in a different decision problem altogether. This makes the pseudocausality condition very natural.
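Reading the informal condition above literally — the loss may not depend on what the policy does at observations of probability 0 — we can write a brute-force checker for small finite problems. This is a sketch based on my informal reading, not on the formal Definition 1; the function and argument names are my own.

```python
from itertools import product

def is_pseudocausal(observations, actions, obs_prob, loss_of_policy):
    """Check (by brute force) that the loss does not depend on what the
    policy does at observations of probability zero.
    `loss_of_policy` maps a deterministic policy (dict obs -> action) to a
    real loss; `obs_prob` gives each observation's probability."""
    positive = {o for o in observations if obs_prob[o] > 0}
    policies = [dict(zip(observations, choice))
                for choice in product(actions, repeat=len(observations))]
    for p1 in policies:
        for p2 in policies:
            # Policies agreeing on all positive-probability observations
            # must receive the same loss.
            if all(p1[o] == p2[o] for o in positive):
                if loss_of_policy(p1) != loss_of_policy(p2):
                    return False
    return True

obs, acts = ["day", "dream"], ["L", "R"]
prob = {"day": 1.0, "dream": 0.0}
# Loss depends only on the action at the probability-1 observation: pseudocausal.
assert is_pseudocausal(obs, acts, prob, lambda pi: 0.0 if pi["day"] == "L" else 1.0)
# A loss that peeks at the probability-0 observation "dream" is not.
assert not is_pseudocausal(obs, acts, prob, lambda pi: 0.0 if pi["dream"] == "L" else 1.0)
```

The second example is exactly the pathology the paragraph above rules out: the outcome hinges on a decision the agent would make in a situation it assigns probability 0.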
Pseudocausality has the nice property of not depending on the loss function. If we do allow dependence on the loss function, we can make do with an even weaker condition.
Definition 2: An FDT problem is called stable when there exists an FDT-optimal
It’s obvious that any stable problem is pseudocausal. Naturally, the converse is false.
Neither pseudocausality nor stability is sufficient to guarantee that DDT and FDT give identical recommendations. However, it becomes true when we iterate the problem.
Definition 3: Given a decision problem and
Given
For FDT, for any
For DDT, we take the belief to be
Note that iterating a problem commutes with converting it from FDT to DDT.
Theorem 4: For a stable FDT problem, there exists
The requirement to iterate doesn’t seem like a terrible cost, since in a learning context some kind of iteration is necessary anyway. It can also be understood as a natural result of the need for stability: problems that are close to being unstable require more iterations.
All these examples besides the last one have natural extensive forms with one decision point.
This problem is formally causal, however the usual causal interpretation is non-trivial:
As a result,
The problem is pseudocausal but not formally causal. Nevertheless, CDT agrees with FDT thanks to the following causal interpretation:
The problem is pseudocausal but not formally causal.
For simplicity, we postulate that the agent is forced to two-box when seeing a full box, since this choice is a “no-brainer” for all decision theories.
The problem is stable but not pseudocausal. EDT is ill-posed because
As above, we postulate that the agent is forced to two-box when seeing an empty box.
The problem is not stable. EDT is ill-posed because
We now assume Omega has a probability
The problem is pseudocausal, but not formally causal of course. EDT is well-posed and
Here’s an interesting example of a problem with two decision points. Omega flips a coin and shows the result to the agent. The agent then has to choose between buttons A, B and C. Button C always yields 3 dollars. Buttons A and B yield 4 dollars if Omega predicts the agent would choose the same button in the other coin counterfactual, and 0 dollars otherwise.
The rest of the definitions are clear and we won’t write them out. The problem is pseudocausal but not formally causal. CDT and EDT agree here, with their behavior depending on the agent’s self-belief
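Since this example is fully specified in prose, we can enumerate the nine deterministic policies (maps from coin outcome to button) and compute expected payoffs, under the natural assumption that Omega’s prediction of the “other counterfactual” choice is simply what the policy does at the other coin outcome. The code is my own illustration of the example above.

```python
from itertools import product

BUTTONS = ["A", "B", "C"]

def payoff(policy, coin):
    """Dollars received given the coin result. Omega's prediction is taken
    to be what this deterministic policy chooses at the other coin outcome."""
    choice = policy[coin]
    other = policy["T" if coin == "H" else "H"]
    if choice == "C":
        return 3            # button C always pays 3
    return 4 if other == choice else 0   # A/B pay 4 only on self-coordination

# Expected payoff of every deterministic policy under a fair coin.
results = {}
for h, t in product(BUTTONS, repeat=2):
    pol = {"H": h, "T": t}
    results[(h, t)] = 0.5 * payoff(pol, "H") + 0.5 * payoff(pol, "T")

best = max(results.values())
print(best, sorted(k for k, v in results.items() if v == best))
# -> 4.0 [('A', 'A'), ('B', 'B')]
```

The coordinating policies “always A” and “always B” earn 4, “always C” earns a safe 3, and any policy that lets its button depend on the coin earns at most 1.5 in expectation — which is why the interesting question is how each decision theory handles the self-coordination.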
the computational complexity of individual hypotheses in the hypothesis class cannot be the thing that characterizes the hardness of learning, but rather it has to be some measure of how complex the entire hypothesis class is.
This is true, of course, but mostly immaterial. Outside of contrived examples, it’s rare for the hypothesis class to be feasible to learn while containing hypotheses that are infeasible to evaluate. It seems extremely implausible that you can find a hypothesis class that is simultaneously (i) possible to specify in practice [1] (ii) feasible to learn and (iii) contains a hypothesis which is an exact description of the real universe. Therefore, non-realizability is unavoidable.
By which I mean, we can construct the learning algorithm without being something akin to omniscient beings that already know everything about the universe and are able to hardcode this knowledge into the algorithm. Indeed, the reasons why we need a learning algorithm at all are (i) we don’t know a lot of what we want the agent to know (ii) it’s too labor-intensive to hardcode even the things that we do know. Therefore, we need a hypothesis class that is extremely broad and mostly uninformative.
This idea was described in a presentation I gave in ’23, but wasn’t written down anywhere.
Here is a formalization of recursive self-improvement (more precisely, recursive metalearning) in the metacognitive agent framework.
Let
Let
Consider any symbolic representation of an element of
Define
Given
We now say that an agent is recursively metalearning (w.r.t. the choices involved), if (i) it satisfies a “good enough” regret bound w.r.t.
Intuitively, this reflects the idea that if
For simplicity, we assume that
Just don’t. I understand the frustration of not getting engagement, but don’t spam the site.
Halpern and Leung propose the “minimax weighted expected regret” (MWER) decision rule, which is a generalization of the minimax-expected-regret (MER) decision rule. In contrast, our decision rule is a weighted generalization of maximin-expected-utility (MMEU). The problem with MER is that it doesn’t work very well with learning. The closest thing to doing learning with MER is adversarial bandits. However, adversarial regret is statistically intractable for Markov Decision Processes. And even with bandits there is a hidden obliviousness assumption if you try to interpret it in a principled decision-theoretic way.
The truth is outside of my hypothesis class, but my hypothesis class probably contains a non-trivial law that is a coarsening of the truth, which is the whole point.
For example, you can imagine that you start with some kind of intractable simplicity prior. Then, for each hypothesis you choose a tractable law that coarsens it. You end up with a probability distribution over laws.
A different way to view this is: it is just a way to force your policy to have low regret w.r.t. all/most hypotheses while weighing complex hypotheses less. For a complex hypothesis, you naturally expect learning it to be harder, so you weigh its regret less. Typically, it’s only possible to have a uniform regret bound if you impose a bound on the complexity of hypotheses in some sense. Absent such a bound, your regret bound must be non-uniform. You can formalize it by explicitly allowing the per-hypothesis regret to depend on some complexity parameter, but the Bayes approach is an alternative. (Also, Bayes regret obviously implies per-hypothesis non-uniform regret with a 1/probability coefficient.)
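The parenthetical claim has a one-line proof: since Bayes regret is a prior-weighted sum of nonnegative per-hypothesis regrets, dropping all but one term gives Regret(h) ≤ BayesRegret / p(h). Here is a numeric sanity check with toy numbers of my own:

```python
# Toy numbers (hypothetical): a prior over three hypotheses and their
# nonnegative per-hypothesis regrets.
prior = {"h1": 0.6, "h2": 0.3, "h3": 0.1}
regret = {"h1": 0.5, "h2": 1.0, "h3": 4.0}

# BayesRegret = sum_h p(h) * Regret(h)
bayes_regret = sum(prior[h] * regret[h] for h in prior)  # = 1.0 here

for h in prior:
    # Dropping the other (nonnegative) terms of the sum yields the bound
    # Regret(h) <= BayesRegret / p(h): the 1/probability coefficient.
    assert regret[h] <= bayes_regret / prior[h]
```

Note how the bound degrades for low-probability hypotheses: h3 has prior 0.1, so a Bayes regret of 1 only guarantees its individual regret is below 10 — exactly the non-uniformity described above.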
First, Bayes-regret and worst-case-regret are standard concepts in classical RL theory, and the infra-versions are straightforward analogs.
Second, you don’t have to focus on the Bayes-regret necessarily. In fact, in our papers, we focus entirely on uniform (worst-case) regret bounds.
Third, instead of an ordinary prior over laws you can consider an infraprior over laws (i.e. have ambiguity in hypothesis-space and not just in outcome-space). The resulting notion of “infra-Bayes-regret” has both Bayes-regret and worst-case-regret as special cases.
Fourth, the justification is quite straightforward. If you have an (unambiguous i.e. ordinary probability distribution) prior over laws, and your performance metric is the Bayes-infra-expected utility, then the Bayes-regret is just the difference between the performance of your policy and the performance of an optimal policy that magically knows the true hypothesis. So it’s a very natural measure of your policy’s ability to learn the hypothesis.
I like the overall vibe. Two issues:
It says “Top Posts” and the mouse-over text is “by karma”, however in reality I can choose which posts to put there. Now, I like it that I can choose which posts to put there, but once I customized them, the mouse-over becomes a lie.
The “recent comments” disappeared. This is really bad because I use that to find my recent comments when I want to edit them. (For example now I wanted to find this comment to add this second bullet but had to do it manually.) OK, I now see I can find them under “feed” but this might be confusing.
[Context: I’m not a digital minimalist but I am somewhat of a “digital reducetarian”: I don’t have social media (besides LinkedIn) and have a browser plugin that reduces my access to particular websites (like LessWrong).]
Cool post :)
For me, there’s something “strange” here (not surprising, but unlike my own experience), where the implication is that people have huge swaths of “free time” that they use for scrolling and the like (which you instead use for what’s described in this post). I spend the vast majority of my time either working or doing something with kids/lovers/friends. (I did read this post in bed preparing to start my day, and am sneaking in this comment between breakfast and work.) Plus short breaks from work, and a short time in bed before sleeping, during which I read fiction books (admittedly using digital means, but in principle I could use physical books just as well, if I could fit them all into my apartment).
It’s fun to hear about your experience talking to random strangers! Catalogued it under “I would never do this but I’m glad some people do”.
A metacognitive agent is not really an RL algorithm in the usual sense. To a first approximation, you can think of it as metalearning a policy that approximates (infra-)Bayes-optimality on a simplicity prior. The actual thing is more sophisticated and less RL-ish, but this is enough to understand how it can avoid wireheading.
Many alignment failure modes (wireheading, self-modification, inner misalignment...) can be framed as “traps”, since they hinge on the non-asymptotic properties of generalization, i.e. on the “prior” or “inductive bias”. Therefore, frameworks that explicitly impose a prior (such as metacognitive agents) are useful for understanding and avoiding these failure modes. (But this has little to do with the OP, the way I see it.)
The problem is, even if the argument that wireheading doesn’t happen is valid, it is not a cancellation. It just means that the wireheading failure mode is replaced by some equally bad or worse failure mode. This argument is basically saying “your model that predicted this extremely specific bad outcome is inaccurate, therefore this bad outcome probably won’t happen, good news!” But it’s not meaningful good news, because it does literally nothing to select for the extremely specific good outcome that we want.
If a rocket was launched towards the Moon with a faulty navigation system, that does not mean the rocket is likely to land on Mars instead. Observing that the navigation system is faulty is not even an “ingredient in a plan” to get the rocket to Mars.
I also don’t think that the drug analogy is especially strong evidence about anything. If you assumed that the human brain is a simple RL algorithm trained on something like pleasure vs. pain then not doing drugs would indeed be an example of not-wireheading. But why should we assume that? I think that the human brain is likely to be at least something like a metacognitive agent in which case you can come to model drugs as a “trap” (and there can be many additional complications).
What do you mean “randomly come upon A”? RL is not random. Why wouldn’t it find A?
Let the proxy reward function we use to train the AI be and the “true” reward function that we intend the AI to follow be . Supposedly, these functions agree on some domain but diverge catastrophically outside of it. Then, if all the training data lies inside , which reward function is selected depends on the algorithm’s inductive bias (and possibly also on luck). The “cancellation” hope is then that the inductive bias favors over .
But why would that be the case? Realistically, the inductive bias is something like “simplicity”. And human preferences are very complex. On the other hand, something like “the reward is such-and-such bits in the input” is very simple. So instead of cancelling out, the problem is only aggravated.
And that’s under the assumption that and actually agree on , which is in itself wildly optimistic.
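A toy model (my own construction, with hypothetical numbers) makes the simplicity point concrete: two reward functions that agree on the training domain, where the proxy just reads bits of the input and the intended reward encodes an extra side condition standing in for complex human preferences.

```python
def proxy_reward(x):
    # "The reward is such-and-such bits in the input": maximally simple.
    return x[0]

def true_reward(x):
    # Agrees with the proxy only while a side condition holds; the condition
    # is a stand-in for everything about human preferences the proxy ignores.
    return x[0] if x[1] == 0 else -x[0]

train_D = [(0, 0), (1, 0), (2, 0)]   # inside the training domain: x[1] == 0
deploy = [(2, 1), (5, 1)]            # off-distribution inputs

# Both reward functions fit the training data perfectly...
assert all(proxy_reward(x) == true_reward(x) for x in train_D)
# ...so the data cannot distinguish them, and a simplicity bias favors the
# proxy. Off-distribution, the two disagree badly:
print([(proxy_reward(x), true_reward(x)) for x in deploy])  # [(2, -2), (5, -5)]
```

Nothing in the training signal penalizes the simpler hypothesis, so "inductive bias as simplicity" selects the proxy — the aggravation, not cancellation, described above.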
In the post Richard Ngo talks about delineating “alignment research” vs. “capability research”, i.e. understanding what properties of technical AI research make it, in expectation, beneficial for reducing AI-risk rather than harmful. He comes up with a taxonomy based on two axes:
Cognitivist vs. Behaviorist, i.e. focused on internals vs. external behavior. Arguably, net-beneficial research tends to be on the cognitivist side.
Worst-case vs. Average-case, i.e. focused on rare failures vs. “usual” behavior. Arguably, net-beneficial research tends to be on the worst-case side.
I think that Ngo raises an important question, and his answers are pointing in the right way. On my part, I would like to slightly reframe his parameters and add two more axes:
Instead of “cognitivist vs. behaviorist”, I would say “gears-level vs. surface-level”. We want research that explains the actual underlying mechanisms rather than just empirically registering particular phenomena or trends. This is definitely similar to “cognitivist vs. behaviorist”, but research that takes into account the internals of the algorithm can still be mostly surface-level: e.g. maybe it’s just saying, if we tweak this parameter in the algorithm, the performance goes up. I think that Ngo might object that tweaking a parameter is very different from talking e.g. about “beliefs” or “goals” that the system has, and I would agree, but I think that gears vs. surface might be a clearer delineation.
Instead of “worst-case vs. average-case”, I would say “robust vs. fragile”. This is because it’s not entirely clear what distribution we are “averaging” over, and because it’s important that rare failure modes can arise from systematic reasons rather than just bad luck. The way I think about it: “fragile” methods are methods that can work if you can afford failures, so that every time there is a failure you amend the system until the result is satisfying. “Robust” methods are methods that you need if you can’t afford even one failure.
Another axis I would add is “two-body vs. one-body”. This is related to Ngo’s remark in the end that “further down the line, perhaps all parts of the table will unify into a rigorous science of cognition more generally, encompassing not just artificial but also biological minds”. The point is, alignment is fundamentally a two-body problem. We are aligning AI to a human (or many humans). And humans are already confused about what their preferences are, or about what it would mean to solve a problem without “undesirable side effects”. Therefore, we need research that illuminates the human side of things as well as the AI side of things. The way I envision it is by creating a theory of agents that is applicable to AIs and humans alike. Other approaches might treat those two sides more asymmetrically, but they do have to address both sides.
Additionally, I would add “precise vs. vague”. This is the difference between making vague, informal statements, and making precise, mathematical, hopefully quantitative statements. Being precise is certainly not sufficient: e.g. scaling laws can be precise, but fail to be gears-level. But it does seem like an important desideratum. Maybe this doesn’t need to be its own axis: precision seems necessary for achieving robustness. But, I think it’s a useful criterion for assessing research that stands on its own[1].
Of course, on most of those axes, going “left” is useful for capabilities and not just alignment. As Ngo justly points out, a lot of research is inevitably dual use. However, approaches that lean “right” are often sufficient to advance capabilities and are unlikely to be sufficient to solve alignment, making them clearly the worse option overall.
In earlier times, I would also add here the consideration of applicability. When assembling a dangerous machine, it seems best to plug in the parts that make it dangerous last, if at all possible. Similarly, it’s better to start developing our understanding of agents from parts that don’t immediately allow building agents. Today, this is still true to some extent, however the urgency of the problem might make it moot. Unless the Pause-AI efforts are massively successful, we might have to make our theories of alignment applicable quite soon, and might not have the luxury of not parallelizing this research as much as possible.
Finally, I state a relatively minor quibble: Ngo seems to put a lot of the emphasis here on understanding deep learning. I would not go so far, for two reasons: one is the two-body desideratum I mentioned before, but the other is that deep learning might not be The Way. It’s possible that it’s better to find a different path towards AI altogether, one designed on better understanding from the start. This might seem overly ambitious, but I do have some leads.
There are certainly examples of research which is at least trying to be robust, while still failing to be very precise (e.g. some of Paul Christiano’s work falls in this category). Such research can be a good starting point for investigation, but should become precise at some stage for it to truly produce robust solutions.
I am separately worried about “Carefully Controlled Moderate Superintelligences that we’re running at scale, each instance of which is not threatening, but, we’re running a lot of them...”
I think that this particular distinction is not the critical one. What constitutes an “instance” is somewhat fuzzy. (A single reasoning thread? A system with a particular human/corporate owner? A particular source code? A particular utility function?) I think it’s more useful to think in terms of machine intelligence suprasystems with strong internal coordination capabilities. That is, if we’re somehow confident that the “instances” can’t or won’t coordinate either causally or acausally, then they are arguably truly “instances”, but the more they can coordinate the more we should be thinking of them in the aggregate. (Hence, the most cautious risk estimate comes from comparing the sum total of all machine intelligence against the sum total of all human intelligence[1].)
More precisely, not even the sum total of all human intelligence, but the fraction of human intelligence that humans can effectively coordinate. See also comment by Nisan.
There seem to be two underlying motivations here, which are best kept separate.
One motivation is having a good vocabulary to talk about fine-grained distinctions. I’m on board with this one. We might want to distinguish e.g.:
Smarter than a median human along all AI-risk-relevant axes
Smarter than the smartest human along all AI-risk-relevant axes
Smarter than all of humanity put together along all AI-risk-relevant axes
Smart enough to have a 50% success probability to kill all humans if it chooses to, given current level of countermeasures
Smart enough to have a 50% success probability to kill all humans if it chooses to, even if best-case countermeasures are in place (this particular distinction inspired by Buck’s comments on this thread)
But then, first, it is clear that existing AI is not superintelligence according to any of the above interpretations. Second, I see no reason not to use catchy words like “hyperintelligence”, per One’s suggestion. (Although I agree that there is an advantage to choosing more descriptive terms.)
Another motivation is staying ahead of the hype cycles and epistemic warfare on twitter or whatnot. This one I take issue with.
I don’t have an account on twitter, and I hope that I never will. Twisting ourselves into pretzels with ridiculous words like “AIdon’tkilleveryoneism” is incompatible with creating a vocabulary optimized for actually thinking and having productive discussions among people who are trying to be the adults in the room. Let the twitterites use whatever anti-language they want. To the people trying to do beneficial politics there: I sincerely wish you luck, but I’m laboring in a different trench; let’s use the proper tool for each task separately.
I understand that there can be practical difficulties such as, what if LW ends up using a language so different from the outside world that it will become inaccessible to outsiders, even when those outsiders would otherwise make valuable contributions. There are probably some tradeoffs that are reasonable to make with such considerations in mind. But let’s at least not abandon any linguistic position at the slightest threatening gesture of the enemy.
This post is an overview of Steven Byrnes’ AI alignment research programme, which I think is interesting and potentially very useful.
In a nutshell, Byrnes’ goal is to reverse engineer the human utility function, or at least some of its central features. I don’t think this will succeed in the sense of, we’ll find an explicit representation that can be hard-coded into AI. However, I believe that this kind of research is useful for two main reasons:
Bridging brain science and agent theory is a promising way to make sure that we build a theory of agents broad enough to include humans. The latter is crucial in order to formally define alignment (since alignment is between the AI-agent and the human-agent), which is needed to have formal alignment guarantees. In particular, it is needed for value learning to become possible, such as in my COSI proposal.
While ideally we might wish for alignment guarantees to assume as little as possible, it might be difficult or even impossible to design a competitive AI system which is robustly aligned with a completely uninformed prior. As a conservative example, we might discover that one or several scalar parameters of humans should be approximately known (e.g. parameters related to amount of computing resources[1]). In this case, we would need to reverse engineer these parameters from brain science, which requires having a reliable dictionary between brain science and agent theory.
I hope that in the future this programme makes more direct contact with the mathematical formalism of agent theory, of the sort the LTA is constructing. However, I realize that this is a difficult challenge.
Why are we giving up on plain “superintelligence” so quickly? According to Wikipedia:
A superintelligence is a hypothetical agent that possesses intelligence surpassing that of the most gifted human minds. Philosopher Nick Bostrom defines superintelligence as “any intellect that greatly exceeds the cognitive performance of humans in virtually all domains of interest”.
According to Google AI Overview:
Superintelligence (or Artificial Superintelligence—ASI) is a hypothetical AI that vastly surpasses human intellect in virtually all cognitive domains, possessing superior scientific creativity, general wisdom, and social skills, operating at speeds and capacities far beyond human capability, and potentially leading to profound societal transformation or existential risks if not safely aligned with human goals.
I don’t think I saw anyone use “superintelligence” to mean “better than a majority of humans on some specific tasks” before very recently. (Was DeepBlue a superintelligence? Is a calculator a superintelligence?)
Star Trek
It’s supposed to look like the control panel of the Enterprise.