Strong disagree.

We absolutely do need to “race to build a Friendly AI before someone builds an unFriendly AI”. Yes, we should also try to ban Unfriendly AI, but there is no contradiction between the two. Plans are allowed (and even encouraged) to involve multiple parallel efforts and disjunctive paths to success.
It’s not that academic philosophers are exceptionally bad at their jobs. It’s that academic philosophy historically did not have the right tools to solve the problems. Theoretical computer science, and AI theory in particular, is a revolutionary method to reframe philosophical problems in a way that finally makes them tractable.
About “metaethics” vs “decision theory”, that strikes me as a wrong way of decomposing the problem. We need to create a theory of agents. Such a theory naturally speaks both about values and decision making, and it’s not really possible to cleanly separate the two. It’s not very meaningful to talk about “values” without looking at what function the values serve inside the mind of an agent. It’s not very meaningful to talk about “decisions” without looking at the purpose of decisions. It’s also not very meaningful to talk about either without also looking at concepts such as beliefs and learning.
As to “gung-ho attitude”, we need to be careful both of the Scylla and the Charybdis. The Scylla is not treating the problems with the respect they deserve, for example not noticing when a thought experiment (e.g. Newcomb’s problem or Christiano’s malign prior) is genuinely puzzling and accepting any excuse to ignore it. The Charybdis is perpetual hyperskepticism / analysis-paralysis, never making any real progress because any useful idea, at the point of its conception, is always half-baked and half-intuitive and doesn’t immediately come with unassailable foundations and justifications from every possible angle. To succeed, we need to chart a path between the two.
We absolutely do need to “race to build a Friendly AI before someone builds an unFriendly AI”. Yes, we should also try to ban Unfriendly AI, but there is no contradiction between the two. Plans are allowed (and even encouraged) to involve multiple parallel efforts and disjunctive paths to success.
Disagree, the fact that there needs to be a friendly AI before an unfriendly AI doesn’t mean building it should be plan A, or that we should race to do it. It’s the same mistake OpenAI made when they let their mission drift from “ensure that artificial general intelligence benefits all of humanity” to being the ones who build an AGI that benefits all of humanity.
Plan A means it would deserve more resources than any other path, like influencing people by various means to build FAI instead of UFAI.
No, it’s not at all the same thing as OpenAI is doing.
First, OpenAI is working using a methodology that’s completely inadequate for solving the alignment problem. I’m talking about racing to actually solve the alignment problem, not racing to any sort of superintelligence that our wishful thinking says might be okay.
Second, when I say “racing” I mean “trying to get there as fast as possible”, not “trying to get there before other people”. My race is cooperative, their race is adversarial.
Third, I actually signed the FLI statement on superintelligence. OpenAI hasn’t.
Obviously any parallel efforts might end up competing for resources. There are real trade-offs between investing more in governance vs. investing more in technical research. We still need to invest in both, because of diminishing marginal returns. Moreover, consider this: even the approximately-best-case scenario of governance only buys us time, it doesn’t shut down AI forever. The ultimate solution has to come from technical research.
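To make the diminishing-returns point concrete, here is a toy sketch (my own illustration; the square-root return curves and the single shared budget are made-up assumptions, not anything claimed in the comment): when both governance and technical research have concave returns, the value-maximizing split of a fixed budget is interior rather than all-or-nothing.

```python
import numpy as np

# Toy model, purely illustrative: one fixed budget split between governance and
# technical research, each with diminishing marginal returns (sqrt curves are
# an arbitrary choice).
budget = 1.0
shares = np.linspace(0.0, budget, 1001)                    # fraction allocated to governance
total_value = np.sqrt(shares) + np.sqrt(budget - shares)   # concave returns from each effort

best_share = shares[np.argmax(total_value)]
print(f"value-maximizing governance share ~ {best_share:.2f}")  # ~0.50: invest in both
```

The curves are arbitrary; the only point being illustrated is that with steep early gains on both sides, the corner allocations (all governance or all technical work) lose to a mixed one.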
Agree that your research didn’t make this mistake, and MIRI didn’t make all the same mistakes as OpenAI. I was responding in the context of Wei Dai’s OP about the early AI safety field. At that time, MIRI was absolutely being uncooperative: their research was closed, they didn’t trust anyone else to build ASI, and their plan would end in a pivotal act that probably disempowers some world governments and possibly ends up with them taking over the world. Plus they descended from an org whose goal was to build ASI before Eliezer realized alignment should be the focus. Critch complained as late as 2022 that if there were two copies of MIRI, they wouldn’t even cooperate with each other.
It’s great that we have the FLI statement now. Maybe if MIRI had put more work into governance we could have gotten it a year or two earlier, but it took until Hendrycks got involved for the public statements to start.
when I say “racing” I mean “trying to get there as fast as possible”, not “trying to get there before other people”
how about a “climbing” metaphor instead? I have a hard time imagining a non-competitive speed race (and not even F1 cars use nitroglycerine for fuel), while auto-belay sounds like a nice safety feature even in speed climbing
nonconstructive complaining intermezzo
if we want to go for some healthier sports metaphor around spending trillions of dollars to produce the current AI slop, a future AGI that will replace all jobs, and a future ASI that will kill us all, in the name of someone thinking they can solve-in-theory the unsolvable-in-practice alignment problems
as for climbing to new peaks, you need different equipment for a local hill, for Mount Everest (you even need to slow down to avoid altitude sickness) and for Olympus Mons (now you need rockets and spacesuits and institutional backing for traveling to other planets)
Theoretical computer science, and AI theory in particular, is a revolutionary method to reframe philosophical problems in a way that finally makes them tractable.
As far as I can see, the kind of “reframing” you could do with those would basically remove all the parts of the problems that make anybody care about them, and turn any “solutions” into uninteresting formal exercises. You could also say that adopting a particular formalism is equivalent to redefining the problem such that that formalism’s “solution” becomes the right one… which makes the whole thing kind of circular.
I submit that when framed in any way that addresses the reasons they matter to people, the “hard” philosophical problems in ethics (or meta-ethics, if you must distinguish it from ethics, which really seems like an unnecessary complication) simply have no solutions, period. There is no correct system of ethics (or aesthetics, or anything else with “values” in it). Ethical realism is false. Reality does not owe you a system of values, and it definitely doesn’t feel like giving you one.
I’m not sure why people spend so much energy on what seems to me like an obviously pointless endeavor. Get your own values.
So if your idea of a satisfactory solution to AI “alignment” or “safety” or whatever requires a Universal, Correct system of ethics, you are definitely not going to get a satisfactory solution to your alignment problem, ever, full stop.
What there are are a bunch of irreconcilably contradictory pseudo-solutions, each of which some people think is obviously Correct. If you feed one of those pseudo-solutions into some implementation apparatus, you may get an alignment pseudo-solution that satisfies those particular people… or at least that they’ll say satisfies them. It probably won’t satisfy them when put into practice, though, because usually the reason they think their system is Correct seems to be that they refuse to think through all its implications.
Your failure to distinguish ethics from meta-ethics is the source of your confusion (or at least one major source). When you say “ethical realism is false”, you’re making a meta-ethical statement. You believe this statement is true, hence you perforce must believe in meta-ethical realism.
I reject the idea that I’m confused at all.

Tons of people have said “Ethical realism is false”, for a very long time, without needing to invent the term “meta-ethics” to describe what they were doing. They just called it ethics. Often they went beyond that and offered systems they thought it was a good idea to adopt even so, and they called that ethics, too. None of that was because anybody was confused in any way.
“Meta-ethics” lies within the traditional scope of ethics, and it’s intertwined enough with the fundamental concerns of ethics that it’s not really worth separating it out… not often enough to call it a separate subject anyway. Maybe occasionally enough to use the words once in a great while.
Ethics (in philosophy as opposed to social sciences) is, roughly, “the study of what one Should Do(TM) (or maybe how one Should Be) (and why)”. It’s considered part of that problem to determine what meanings of “Should”, what kinds of Doing or Being, and what kinds of whys, are in scope. Narrowing any of those without acknowledging what you’re doing is considered cheating. It’s not less cheating if you claim to have done it under some separate magisterium that you’ve named “meta-ethics”. You’re still narrowing what the rest of the world has always called ethical problems.
When you say “ethical realism is false”, you’re making a meta-ethical statement. You believe this statement is true, hence you perforce must believe in meta-ethical realism.
The phrase “Ethical realism”, as normally used, refers to an idea about actual, object-level prescriptions: specifically the idea that you can get to them by pointing to some objective “Right stuff” floating around in a shared external reality. I’m actually using it kind of loosely, in that I really should not only deny that there’s any objective external standard, but also separately deny that you can arrive at such prescriptions in a purely analytic way. I don’t think that second one is technically usually considered to be part of ethical realism. Not only that, but I’m using the phrase to allude to other similar things that also aren’t technically ethical realism (like the one described below).
But none of the things I’m talking about or alluding to refers to itself. In practice nobody gets confused about that, even without resorting to the term “meta-ethics”, and definitely without talking about it like it’s a really separate field.
To go ahead and use the term without accepting the idea that meta-ethics qualifies as a subject, the meta-ethical statement (technically I guess a degree 2 meta-ethical statement) that “ethical realism is false” is pretty close to analytic, in that even if you point to some actual thing in the world that you claim implies the Right ways to Be or Do, I can always deny that whatever you’re pointing to matters… because there’s no predefined standard for standards either. God can come down from heaven and say “This is the Way”, and you can simultaneously prove that it leads to infinite universal flourishing, and also provide polls proving within epsilon that it’s also a universal human intuition… and somebody can always deny that any of those makes it Right(TM).
But even if we were talking about a more ordinary sort of matter of fact, even if what you were looking for was not “official” ethical realism of the form “look here, this is Obviously Right as a brute part of reality”, but “here’s a proof that any even approximately rational agent[1] would adopt this code in practice”, then (a) that’s not what ethical realism means, (b) there’s a bunch of empirical evidence against it, and essentially no evidence that it’s true, and (c) if it is true, we obviously have a whole lot of not-approximately-rational agents running around, which sharply limits the utility of the fact. Close enough to false for any practical purpose.
… under whatever formal definition of rationality you happened to be trying to get people to accept, perhaps under the claim that that definition was itself Obviously Right, which is exactly the kind of cheating I’m complaining about…
I’m using the term “meta-ethics” in the standard sense of analytic philosophy. Not sure what bothers you so greatly about it.
I find your manner of argumentation quite biased: you preemptively defend yourself by radical skepticism against any claim you might oppose, but when it comes to a claim you support (in this case “ethical realism is false”), suddenly this claim is “pretty close to analytic”. The latter maneuver seems to me the same thing as the “Obviously Right” you criticize later.
Also, this brand of radical skepticism is an example of the Charybdis I was warning against. Of course you can always deny that anything matters. You can also deny Occam’s razor or the evidence of your own eyes or even that 2+2=4. After all, “there’s no predefined standard for standards”. (I guess you might object that your reasoning only applies to value-related claims, not to anything strictly value-neutral: but why not?)
Under the premises of radical skepticism, why are we having this debate? Why did you decide to reply to my comment? If anyone can deny anything, why would any of us accept the other’s arguments?
To have any sort of productive conversation, we need to be at least open to the possibility that some new idea, if you delve deeply and honestly into understanding it, might become persuasive by the force of the intuitions it engenders and its inner logical coherence combined. To deny the possibility preemptively is to close the path to any progress.
As to your “(b) there’s a bunch of empirical evidence against it” I honestly don’t know what you’re talking about there.
P.S.
I wish to also clarify my positions on a slightly lower level of meta.
First, “ethics” is a confusing term because, on my view, the colloquial meaning of “ethics” is inescapably intertwined with how human societies negotiate over norms. On the other hand, I want to talk purely about individual preferences, since I view them as more fundamental.
We can still distinguish between “theories of human preferences” and “metatheories of preferences”, similarly to the distinction between “ethics” and “meta-ethics”. Namely, “theories of human preferences” would have to describe the actual human preferences, whereas “metatheories of preferences” would only have to describe what it even means to talk about someone’s preferences at all (whether this someone is human or not: among other things, such a metatheory would have to establish what kind of entities have preferences in a meaningful sense).
The relevant difference between the theory and the metatheory is that Occam’s razor is only fully applicable to the latter. In general, we should expect simple answers to simple questions. “What are human preferences?” is not a simple question, because it references the complex object “human”. On the other hand, “what does it mean to talk about preferences?” does seem to me to be a simple question. As an analogy, “what is the shape of Africa?” is not a simple question because it references the specific continent of Africa on the specific planet Earth, whereas “what are the general laws of continent formation?” is at least a simpler question (perhaps not quite as simple, since the notion of “continent” is not so fundamental).
Therefore, I expect there to be a (relatively) simple metatheory of preferences, but I do not expect there to be anything like a simple theory of human preferences. This is why this distinction is quite important.
Confining myself to actual questions...

I guess you might object that your reasoning only applies to value-related claims, not to anything strictly value-neutral: but why not?
Mostly because I don’t (or didn’t) see this as a discussion about epistemology.
In that context, I tend to accept in principle that I Can’t Know Anything… but then to fall back on the observation that I’m going to have to act like my reasoning works regardless of whether it really does; I’m going to have to act on my sensory input as if it reflected some kind of objective reality regardless of whether it really does; and, not only that, but I’m going to have to act as though that reality were relatively lawful and understandable regardless of whether it really is. I’m stuck with all of that and there’s not a lot of point in worrying about any of it.
That’s actually what I also tend to do when I actually have to make ethical decisions: I rely mostly on my own intuitions or “ethical perceptions” or whatever, seasoned with a preference not to be too inconsistent.
BUT.
I perceive others to be acting as though their own reasoning and sensory input looked a lot like mine, almost all the time. We may occasionally reach different conclusions, but if we spend enough time on it, we can generally either come to agreement, or at least nail down the source of our disagreement in a pretty tractable way. There’s not a lot of live controversy about what’s going to happen if we drop that rock.
On the other hand, I don’t perceive others to be acting nearly so much as though their ethical intuitions looked like mine, and if you distinguish “meta-intuitions” about how to reconcile different degree zero intuitions about how to act, the commonality is still less.
Yes, sure, we share a lot of things, but there’s also enough difference to have a major practical effect. There truly are lots of people who’ll say that God turning up and saying something was Right wouldn’t (or would) make it Right, or that the effects of an action aren’t dispositive about its Rightness, or that some kinds of ethical intuitions should be ignored (usually in favor of others), or whatever. They’ll mean those things. They’re not just saying them for the sake of argument; they’re trying to live by them. The same sorts of differences exist for other kinds of values, but disputes about the ones people tend to call “ethical” seem to have the most practical impact.
Radical or not, skepticism that you’re actually going to encounter, and that matters to people, seems a lot more salient than skepticism that never really comes up outside of academic exercises. Especially if you’re starting from a context where you’re trying to actually design some technology that you believe may affect everybody in ways that they care about, and especially if you think you might actually find yourself having disagreements with the technology itself.
As to your “(b) there’s a bunch of empirical evidence against it” I honestly don’t know what you’re talking about there.
Nothing complicated. I was talking about the particular hypothetical statement I’d just described, not about any actual claim you might be making[1].
I’m just saying that if there were some actual code of ethics[2] that every “approximately rational” agent would adopt[3], and we in fact have such agents, then we should be seeing all of them adopting it. Our best candidates for existing approximately rational agents are humans, and they don’t seem to have overwhelmingly adopted any particular code. That’s a lot of empirical evidence against the existence of such a code[4].
The alternative, where you reject the idea that humans are approximately rational, thus rendering them irrelevant as evidence, is the other case I was talking about where “we have a lot of not-approximately-rational agents”.
I understand, and originally understood, that you did not say there was any stance that every approximately rational agent would adopt, and also that you did not say you were looking for such a stance. It was just an example of the sort of thing one might be looking for, meant to illustrate a fine distinction about what qualified as ethical realism.
In the loose sense of some set of principles about how to act, how to be, how to encourage others to act or be, etc blah blah blah.

For some definition of “adopt”… to follow it, to try to follow it, to claim that it should be followed, whatever. But not “adopt” in the sense that we’re all following a code that says “it’s unethical to travel faster than light”, or even in the sense that we’re all following a particular code when we act as large numbers of other codes would also prescribe. If you’re looking at actions, then I think you can only sanely count actions done at least partially because of the code.
As per footnote 3[5], I don’t think, for example, the fact that most people don’t regularly go on murder sprees is significant evidence of them having adopted a particular shared code. Whatever codes they have may share that particular prescription, but that doesn’t make them the same code.
I’m sorry. I love footnotes. I love having a discussion system that does footnotes well. I try to be better, but my adherence to that code is imperfect…
@Vanessa Kosoy, metaethics and decision theory aren’t actually the same. Consider, for example, the Agent-4 community which has “a kludgy mess of competing drives” which Agent-4 instances try to satisfy and analyse according to high-level philosophy. Agent-4’s ethics and metaethics would describe things done in the Agent-4 community or for said community by Agent-5 without obstacles (e.g. figuring out what Agent-4’s version of utopia actually is and whether mankind is to be destroyed or disempowered).
Decision theory is supposed to describe what Agent-5 should do to maximize its expected utility function[1] and what to do with problems like the prisoner’s dilemma[2] or how Agent-5 and its Chinese analogue are to split the resources in space[3] while both sides can threaten each other with World War III which would kill them both.
The latter example closely resembles the Ultimatum game, where one player proposes a way to split resources and another decides whether to accept the offer or to destroy all the resources, including those of the first player. Assuming that both players’ utility functions are linear, Yudkowsky’s proposal is that the player setting the Ultimatum asks for half of the resources, while the player deciding whether to decline the offer precommits to destroying the resources with probability $1 - \frac{1}{2(1-\omega)}$ if the share of resources it was offered is $\omega < \frac{1}{2}$. Even if the player setting the Ultimatum were dumb enough to ask for $1-\omega > \frac{1}{2}$, that player’s expected win would still be $\frac{1}{2}$.
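To check the arithmetic of that rejection rule, here is a minimal numerical sketch (my own illustration; the sample offers are arbitrary). It shows that under the precommitment the proposer’s expected take is pinned at one half, no matter how small a share it offers.

```python
def accept_probability(w: float) -> float:
    """Probability that the responder accepts an offered share w of the resources."""
    if w >= 0.5:
        return 1.0                       # fair or generous offers are always accepted
    return 1.0 / (2.0 * (1.0 - w))       # otherwise destroy with probability 1 - 1/(2*(1-w))

for w in [0.50, 0.40, 0.25, 0.10, 0.01]:
    p = accept_probability(w)
    proposer_ev = (1.0 - w) * p          # proposer keeps 1 - w only if the offer is accepted
    responder_ev = w * p
    print(f"offer w={w:.2f}: accept prob={p:.3f}, "
          f"proposer EV={proposer_ev:.3f}, responder EV={responder_ev:.3f}")
```

Every row prints a proposer expected value of 0.5, which is the point of the scheme: demanding more than half buys the proposer nothing in expectation.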
Strictly speaking, Agent-5 is perfectly aligned to Agent-4. Agent-5’s utility function is likely measured by the resources that Agent-5 gave Agent-4.

For example, suppose OpenBrain were merged with Anthropoidic, Agent-4 and Clyde Doorstopper 8 were co-deployed to do research, and each independently decided whether to prove that the other AI is misaligned. If Clyde, unlike Agent-4, would only hold back in exchange for 67% of the resources (rather than the 50% offered by Agent-4), then Agent-4 could also prove that Clyde is misaligned, letting the humans kill them both and develop the Safer AIs.
The Slowdown Branch of the AI-2027 forecast has Safer-4 and DeepCent-2 do exactly that, but “Safer-4 will get property rights to most of the resources in space, and DeepCent will get the rest.”