Some of Eliezer’s founder effects on the AI alignment/x-safety field, that seem detrimental and persist to this day:
Plan A is to race to build a Friendly AI before someone builds an unFriendly AI.
Metaethics is a solved problem. Ethics/morality/values and decision theory are still open problems. We can punt on values for now but do need to solve decision theory. In other words, decision theory is the most important open philosophical problem in AI x-safety.
Academic philosophers aren’t very good at their jobs (as shown by their widespread disagreements, confusions, and bad ideas), but the problems aren’t actually that hard, and we (alignment researchers) can be competent enough philosophers and solve all of the necessary philosophical problems in the course of trying to build Friendly (or aligned/safe) AI.
I’ve repeatedly argued against 1 from the beginning, and also somewhat against 2 and 3, but perhaps not hard enough because I personally benefitted from them, i.e., having pre-existing interest/ideas in decision theory that became validated as centrally important for AI x-safety, and generally finding a community that is interested in philosophy and took my own ideas seriously.
Eliezer himself is now trying hard to change 1, and I think we should also try harder to correct 2 and 3. On the latter, I think academic philosophy suffers from various issues, but also that the problems are genuinely hard, and alignment researchers seem to have inherited Eliezer’s gung-ho attitude towards solving these problems, without adequate reflection. Humanity having few competent professional philosophers should be seen as (yet another) sign that our civilization isn’t ready to undergo the AI transition, not a license to wing it based on one’s own philosophical beliefs or knowledge!
In this recent EAF comment, I analogize AI companies trying to build aligned AGI with no professional philosophers on staff (the only exception I know of is Amanda Askell) to a company trying to build a fusion reactor with no physicists on staff, only engineers. I wonder if that analogy resonates with anyone.
Strong disagree.
We absolutely do need to “race to build a Friendly AI before someone builds an unFriendly AI”. Yes, we should also try to ban Unfriendly AI, but there is no contradiction between the two. Plans are allowed (and even encouraged) to involve multiple parallel efforts and disjunctive paths to success.
It’s not that academic philosophers are exceptionally bad at their jobs. It’s that academic philosophy historically did not have the right tools to solve the problems. Theoretical computer science, and AI theory in particular, is a revolutionary method to reframe philosophical problems in a way that finally makes them tractable.
About “metaethics” vs “decision theory”, that strikes me as a wrong way of decomposing the problem. We need to create a theory of agents. Such a theory naturally speaks both about values and decision making, and it’s not really possible to cleanly separate the two. It’s not very meaningful to talk about “values” without looking at what function the values serve inside the mind of an agent. It’s not very meaningful to talk about “decisions” without looking at the purpose of decisions. It’s also not very meaningful to talk about either without also looking at concepts such as beliefs and learning.
As to “gung-ho attitude”, we need to be careful both of the Scylla and the Charybdis. The Scylla is not treating the problems with the respect they deserve, for example not noticing when a thought experiment (e.g. Newcomb’s problem or Christiano’s malign prior) is genuinely puzzling and accepting any excuse to ignore it. The Charybdis is perpetual hyperskepticism / analysis-paralysis, never making any real progress because any useful idea, at the point of its conception, is always half-baked and half-intuitive and doesn’t immediately come with unassailable foundations and justifications from every possible angle. To succeed, we need to chart a path between the two.
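To make concrete why a thought experiment like Newcomb’s problem counts as genuinely puzzling, here is a minimal sketch, assuming the conventional payoffs ($1,000 in the transparent box, $1,000,000 in the opaque box) and an illustrative 90%-accurate predictor (numbers not specified anywhere in this discussion): two standard expected-utility calculations over the same setup recommend opposite actions.

```python
# Toy Newcomb's problem under conventional, purely illustrative numbers.
ACCURACY = 0.9  # assumed predictor accuracy (illustrative)

def evidential_eu(one_box: bool) -> float:
    """Expected payoff, conditioning on the predictor usually being right."""
    if one_box:
        return ACCURACY * 1_000_000
    return (1 - ACCURACY) * 1_000_000 + 1_000

def causal_eu(one_box: bool, p_million_already_there: float) -> float:
    """Expected payoff, treating the box contents as already fixed."""
    base = p_million_already_there * 1_000_000
    return base if one_box else base + 1_000

print(evidential_eu(True), evidential_eu(False))  # 900000.0 vs 101000.0: one-box
for p in (0.0, 0.5, 1.0):
    # Whatever the contents, two-boxing yields $1,000 more: two-box.
    assert causal_eu(False, p) > causal_eu(True, p)
```

The tension between the two calculations is the puzzle; nothing in the sketch resolves it.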
Disagree: the fact that there needs to be a friendly AI before an unfriendly AI doesn’t mean building it should be plan A, or that we should race to do it. It’s the same mistake OpenAI made when they let their mission drift from “ensure that artificial general intelligence benefits all of humanity” to being the ones who build an AGI that benefits all of humanity.
Plan A means it would deserve more resources than any other path, like influencing people by various means to build FAI instead of UFAI.
No, it’s not at all the same thing as OpenAI is doing.
First, OpenAI is working using a methodology that’s completely inadequate for solving the alignment problem. I’m talking about racing to actually solve the alignment problem, not racing to any sort of superintelligence that our wishful thinking says might be okay.
Second, when I say “racing” I mean “trying to get there as fast as possible”, not “trying to get there before other people”. My race is cooperative, their race is adversarial.
Third, I actually signed the FLI statement on superintelligence. OpenAI hasn’t.
Obviously any parallel efforts might end up competing for resources. There are real trade-offs between investing more in governance vs. investing more in technical research. We still need to invest in both, because of diminishing marginal returns. Moreover, consider this: even the approximately-best-case scenario of governance only buys us time, it doesn’t shut down AI forever. The ultimate solution has to come from technical research.
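A toy numerical sketch of the diminishing-marginal-returns point, assuming (purely for illustration) square-root returns to both governance and technical research and a unit budget: every all-or-nothing allocation is dominated by an interior split, which is the sense in which “we still need to invest in both”.

```python
# Illustrative only: concave (square-root) returns and a fixed budget of 1.
import math

def total_value(governance_share: float) -> float:
    technical_share = 1.0 - governance_share
    return math.sqrt(governance_share) + math.sqrt(technical_share)

for share in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"governance={share:.2f}  total={total_value(share):.3f}")
# The extremes (0.0 and 1.0) score 1.000, the even split scores ~1.414:
# with diminishing returns, funding both efforts beats funding either alone.
```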
As far as I can see, the kind of “reframing” you could do with those tools (theoretical computer science and AI theory) would basically remove all the parts of the problems that make anybody care about them, and turn any “solutions” into uninteresting formal exercises. You could also say that adopting a particular formalism is equivalent to redefining the problem such that that formalism’s “solution” becomes the right one… which makes the whole thing kind of circular.
I submit that when framed in any way that addresses the reasons they matter to people, the “hard” philosophical problems in ethics (or meta-ethics, if you must distinguish it from ethics, which really seems like an unnecessary complication) simply have no solutions, period. There is no correct system of ethics (or aesthetics, or anything else with “values” in it). Ethical realism is false. Reality does not owe you a system of values, and it definitely doesn’t feel like giving you one.
I’m not sure why people spend so much energy on what seems to me like an obviously pointless endeavor. Get your own values.
So if your idea of a satisfactory solution to AI “alignment” or “safety” or whatever requires a Universal, Correct system of ethics, you are definitely not going to get a satisfactory solution to your alignment problem, ever, full stop.
What there are are a bunch of irreconcilably contradictory pseudo-solutions, each of which some people think is obviously Correct. If you feed one of those pseudo-solutions into some implementation apparatus, you may get an alignment pseudo-solution that satisfies those particular people… or at least that they’ll say satisfies them. It probably won’t satisfy them when put into practice, though, because usually the reason they think their system is Correct seems to be that they refuse to think through all its implications.
Your failure to distinguish ethics from meta-ethics is the source of your confusion (or at least one major source). When you say “ethical realism is false”, you’re making a meta-ethical statement. You believe this statement is true, hence you perforce must believe in meta-ethical realism.
I reject the idea that I’m confused at all.
Tons of people have said “Ethical realism is false”, for a very long time, without needing to invent the term “meta-ethics” to describe what they were doing. They just called it ethics. Often they went beyond that and offered systems they thought it was a good idea to adopt even so, and they called that ethics, too. None of that was because anybody was confused in any way.
“Meta-ethics” lies within the traditional scope of ethics, and it’s intertwined enough with the fundamental concerns of ethics that it’s not really worth separating it out… not often enough to call it a separate subject anyway. Maybe occasionally enough to use the words once in a great while.
Ethics (in philosophy as opposed to social sciences) is, roughly, “the study of what one Should Do(TM) (or maybe how one Should Be) (and why)”. It’s considered part of that problem to determine what meanings of “Should”, what kinds of Doing or Being, and what kinds of whys, are in scope. Narrowing any of those without acknowledging what you’re doing is considered cheating. It’s not less cheating if you claim to have done it under some separate magisterium that you’ve named “meta-ethics”. You’re still narrowing what the rest of the world has always called ethical problems.
The phrase “Ethical realism”, as normally used, refers to an idea about actual, object-level prescriptions: specifically the idea that you can get to them by pointing to some objective “Right stuff” floating around in a shared external reality. I’m actually using it kind of loosely, in that I really should not only deny that there’s any objective external standard, but also separately deny that you can arrive at such prescriptions in a purely analytic way. I don’t think that second one is technically usually considered to be part of ethical realism. Not only that, but I’m using the phrase to allude to other similar things that also aren’t technically ethical realism (like the one described below).
But none of the things I’m talking about or alluding to refers to itself. In practice nobody gets confused about that, even without resorting to the term “meta-ethics”, and definitely without talking about it like it’s a really separate field.
To go ahead and use the term without accepting the idea that meta-ethics qualifies as a subject, the meta-ethical statement (technically I guess a degree 2 meta-ethical statement) that “ethical realism is false” is pretty close to analytic, in that even if you point to some actual thing in the world that you claim implies the Right ways to Be or Do, I can always deny that whatever you’re pointing to matters… because there’s no predefined standard for standards either. God can come down from heaven and say “This is the Way”, and you can simultaneously prove that it leads to infinite universal flourishing, and also provide polls proving within epsilon that it’s also a universal human intuition… and somebody can always deny that any of those makes it Right(TM).
But even if we were talking about a more ordinary sort of matter of fact, even if what you were looking for was not “official” ethical realism of the form “look here, this is Obviously Right as a brute part of reality”, but “here’s a proof that any even approximately rational agent[1] would adopt this code in practice”, then (a) that’s not what ethical realism means, (b) there’s a bunch of empirical evidence against it, and essentially no evidence that it’s true, and (c) if it is true, we obviously have a whole lot of not-approximately-rational agents running around, which sharply limits the utility of the fact. Close enough to false for any practical purpose.
… under whatever formal definition of rationality you happened to be trying to get people to accept, perhaps under the claim that that definition was itself Obviously Right, which is exactly the kind of cheating I’m complaining about…
I’m using the term “meta-ethics” in the standard sense of analytic philosophy. Not sure what bothers you so greatly about it.
I find your manner of argumentation quite biased: you preemptively defend yourself by radical skepticism against any claim you might oppose, but when it comes to a claim you support (in this case “ethical realism is false”), suddenly this claim is “pretty close to analytic”. The latter maneuver seems to me the same thing as the “Obviously Right” you criticize later.
Also, this brand of radical skepticism is an example of the Charybdis I was warning against. Of course you can always deny that anything matters. You can also deny Occam’s razor or the evidence of your own eyes or even that 2+2=4. After all, “there’s no predefined standard for standards”. (I guess you might object that your reasoning only applies to value-related claims, not to anything strictly value-neutral: but why not?)
Under the premises of radical skepticism, why are we having this debate? Why did you decide to reply to my comment? If anyone can deny anything, why would any of us accept the other’s arguments?
To have any sort of productive conversation, we need to be at least open to the possibility that some new idea, if you delve deeply and honestly into understanding it, might become persuasive by the force of the intuitions it engenders and its inner logical coherence combined. To deny the possibility preemptively is to close the path to any progress.
As to your “(b) there’s a bunch of empirical evidence against it” I honestly don’t know what you’re talking about there.
P.S.
I wish to also clarify my positions on a slightly lower level of meta.
First, “ethics” is a confusing term because, on my view, the colloquial meaning of “ethics” is inescapably intertwined with how human societies negotiate over norms. On the other hand, I want to talk purely about individual preferences, since I view them as more fundamental.
We can still distinguish between “theories of human preferences” and “metatheories of preferences”, similarly to the distinction between “ethics” and “meta-ethics”. Namely, “theories of human preferences” would have to describe the actual human preferences, whereas “metatheories of preferences” would only have to describe what it even means to talk about someone’s preferences at all (whether this someone is human or not: among other things, such a metatheory would have to establish what kind of entities have preferences in a meaningful sense).
The relevant difference between the theory and the metatheory is that Occam’s razor is only fully applicable to the latter. In general, we should expect simple answers to simple questions. “What are human preferences?” is not a simple question, because it references the complex object “human”. On the other hand “what does it mean to talk about preferences?” does seem to me to be a simple question. As an analogy, “what is the shape of Africa?” is not a simple question because it references the specific continent of Africa on the specific planet Earth, whereas “what are the general laws of continent formation” is at least a simpler question (perhaps not quite as simple, since the notion of “continent” is not so fundamental).
Therefore, I expect there to be a (relatively) simple metatheory of preferences, but I do not expect there to be anything like a simple theory of human preferences. This is why this distinction is quite important.
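One way to make the Occam point precise, as a sketch in the standard description-length framing (a formalization supplied here for illustration, not one the comment commits to):

```latex
% Simplicity prior over hypotheses H, in the description-length framing:
P(H) \;\propto\; 2^{-K(H)}
% A theory of *human* preferences must encode the complex object "human", so
K(T_{\mathrm{human}}) \;\gtrsim\; K(\mathrm{human}) \;\gg\; 0 ,
% whereas a metatheory of preferences references no particular complex object
% and can in principle be short:
K(M) \;\ll\; K(T_{\mathrm{human}}) .
% Hence the razor bears on the metatheory with full force, but says little
% about the detailed content of human preferences.
```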
Confining myself to actual questions...
Mostly because I don’t (or didn’t) see this as a discussion about epistemology.
In that context, I tend to accept in principle that I Can’t Know Anything… but then to fall back on the observation that I’m going to have to act like my reasoning works regardless of whether it really does; I’m going to have to act on my sensory input as if it reflected some kind of objective reality regardless of whether it really does; and, not only that, but I’m going to have to act as though that reality were relatively lawful and understandable regardless of whether it really is. I’m stuck with all of that and there’s not a lot of point in worrying about any of it.
That’s actually what I also tend to do when I actually have to make ethical decisions: I rely mostly on my own intuitions or “ethical perceptions” or whatever, seasoned with a preference not to be too inconsistent.
BUT.
I perceive others to be acting as though their own reasoning and sensory input looked a lot like mine, almost all the time. We may occasionally reach different conclusions, but if we spend enough time on it, we can generally either come to agreement, or at least nail down the source of our disagreement in a pretty tractable way. There’s not a lot of live controversy about what’s going to happen if we drop that rock.
On the other hand, I don’t perceive others to be acting nearly so much as though their ethical intuitions looked like mine, and if you distinguish “meta-intuitions” about how to reconcile different degree zero intuitions about how to act, the commonality is still less.
Yes, sure, we share a lot of things, but there’s also enough difference to have a major practical effect. There truly are lots of people who’ll say that God turning up and saying something was Right wouldn’t (or would) make it Right, or that the effects of an action aren’t dispositive about its Rightness, or that some kinds of ethical intuitions should be ignored (usually in favor of others), or whatever. They’ll mean those things. They’re not just saying them for the sake of argument; they’re trying to live by them. The same sorts of differences exist for other kinds of values, but disputes about the ones people tend to call “ethical” seem to have the most practical impact.
Radical or not, skepticism that you’re actually going to encounter, and that matters to people, seems a lot more salient than skepticism that never really comes up outside of academic exercises. Especially if you’re starting from a context where you’re trying to actually design some technology that you believe may affect everybody in ways that they care about, and especially if you think you might actually find yourself having disagreements with the technology itself.
Nothing complicated. I was talking about the particular hypothetical statement I’d just described, not about any actual claim you might be making[1].
I’m just saying that if there were some actual code of ethics[2] that every “approximately rational” agent would adopt[3], and we in fact have such agents, then we should be seeing all of them adopting it. Our best candidates for existing approximately rational agents are humans, and they don’t seem to have overwhelmingly adopted any particular code. That’s a lot of empirical evidence against the existence of such a code[4].
The alternative, where you reject the idea that humans are approximately rational, thus rendering them irrelevant as evidence, is the other case I was talking about where “we have a lot of not-approximately-rational agents”.
I understand, and originally understood, that you did not say there was any stance that every approximately rational agent would adopt, and also that you did not say that you were looking for such a stance. It was just an example of the sort of thing one might be looking for, meant to illustrate a fine distinction about what qualified as ethical realism.
In the loose sense of some set of principles about how to act, how to be, how to encourage others to act or be, etc blah blah blah.
For some definition of “adopt”… to follow it, to try to follow it, to claim that it should be followed, whatever. But not “adopt” in the sense that we’re all following a code that says “it’s unethical to travel faster than light”, or even in the sense that we’re all following a particular code when we act as large numbers of other codes would also prescribe. If you’re looking at actions, then I think you can only sanely count actions done at least partially because of the code.
As per footnote 3[3:1][5], I don’t think, for example, the fact that most people don’t regularly go on murder sprees is significantly evidence of them having adopted a particular shared code. Whatever codes they have may share that particular prescription, but that doesn’t make them the same code.
I’m sorry. I love footnotes. I love having a discussion system that does footnotes well. I try to be better, but my adherence to that code is imperfect…
@Vanessa Kosoy, metaethics and decision theory aren’t actually the same. Consider, for example, the Agent-4 community which has “a kludgy mess of competing drives” which Agent-4 instances try to satisfy and analyse according to high-level philosophy. Agent-4′s ethics and metaethics would describe things done in the Agent-4 community or for said community by Agent-5 without obstacles (e.g. figuring out what Agent-4′s version of utopia actually is and whether mankind is to be destroyed or disempowered).
Decision theory is supposed to describe what Agent-5 should do to maximize its expected utility function[1] and what to do with problems like the prisoner’s dilemma[2] or how Agent-5 and its Chinese analogue are to split the resources in space[3] while both sides can threaten each other with World War III which would kill them both.
The latter example closely resembles the Ultimatum game, where one player proposes a way to split resources and the other decides whether to accept the offer or to destroy all the resources, including those of the first player. Assuming that both players’ utility functions are linear, Yudkowsky’s proposal is that the player setting the Ultimatum asks for half of the resources, while the player deciding whether to decline the offer precommits to destroying the resources with probability $1-\frac{1}{2(1-\omega)}$ if the share of resources it was offered is $\omega < \frac{1}{2}$. Even if the player setting the Ultimatum were greedy enough to ask for $1-\omega > \frac{1}{2}$, that player’s expected gain would still be only $\frac{1}{2}$.
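A quick numerical check of that policy, as a sketch assuming linear utilities and a total pie normalized to 1 (the grid of demands is illustrative): if the responder accepts an offered share $\omega < \frac{1}{2}$ with probability $\frac{1}{2(1-\omega)}$, the proposer’s expected take never exceeds $\frac{1}{2}$, so demanding more than half gains nothing in expectation.

```python
# Sketch of the precommitment policy described above: accept an offered share
# w with probability min(1, 1 / (2 * (1 - w))), i.e. destroy the resources
# with probability 1 - 1/(2*(1-w)) when w < 1/2. Linear utilities, pie = 1.

def accept_probability(w: float) -> float:
    return min(1.0, 1.0 / (2.0 * (1.0 - w)))

for proposer_demand in (0.5, 0.6, 0.7, 0.8, 0.9):
    offered = 1.0 - proposer_demand            # share offered to the responder
    p = accept_probability(offered)
    print(f"demand={proposer_demand:.1f}  accept_prob={p:.3f}  "
          f"expected take={p * proposer_demand:.3f}")
# Every row shows an expected take of 0.500: greed buys nothing in expectation.
```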
Strictly speaking, Agent-5 is perfectly aligned to Agent-4. Agent-5′s utility function is likely measured by the resources that Agent-5 gave Agent-4.
For example, suppose OpenBrain merged with Anthropoidic, and Agent-4 and Clyde Doorstopper 8 were co-deployed to do research, with each independently deciding whether to prove that the other AI is misaligned. If Clyde, unlike Agent-4, did so in exchange for 67% of the resources (rather than the 50% offered by Agent-4), then Agent-4 could also prove that Clyde is misaligned, letting the humans kill them both and develop the Safer AIs.
The Slowdown Branch of the AI-2027 forecast has Safer-4 and DeepCent-2 do exactly that, but “Safer-4 will get property rights to most of the resources in space, and DeepCent will get the rest.”
I mostly agree with 1 and 2. With 3, it’s a combination: the problems are hard, and the gung-ho approach and lack of awareness of the difficulty are real, but also academic philosophy is structurally mostly not up to the task because of factors like publication speeds, prestige gradients, or the speed of OODA loops.
My impression is that getting generally smart and fast “alignment researchers” more competent in philosophy is more tractable than trying to get established academic philosophers to change what they work on, so one tractable thing is just convincing people the problems are real, hard, and important. Another is maybe recruiting graduates.
In your mind what are the biggest bottlenecks/issues in “making fast, philosophically competent alignment researchers?”
[low effort list] Bottlenecks/issues/problems
- philosophy has worse short feedback loops than e.g. ML engineering → in all sorts of processes like MATS or PIBBSS admissions it is harder to select for philosophical competence, and also harder to self-improve
- incentives: obviously stuff like being an actual expert in pretraining can get you a lot of money and respect in some circles; even many prosaic AI safety / dual-use skills like mech interpretability can get you maybe less money than pretraining, but still a lot of money if you work in AGI companies, and also a decent amount of status in the ML community and the AI safety community; improving philosophical competence may get you some recognition, but only among a relatively small and weird group of people
- the issue Wei Dai is commenting on in the original post: founder effects persist to this day & also there is some philosophy-negative prior in STEM
- idk, lack of curiosity? LLMs have read it all, so it’s easy to check whether there is some existing thinking on a topic
Do you have your own off-the-cuff guesses about how you’d tackle the short-feedback-loops problem?
Also, is it more like we don’t know how to do short feedback loops, or more like we don’t even know how to do long/expensive loops?
There’s a deeper problem, how do we know there is a feedback loop?
I’ve never actually seen a worked-out proof of, well, any complex claim on this site using standard logical notation… (beyond pure math and trivial tautologies)
At most there’s a feedback loop on each other’s hand-wavy arguments that are claimed to be proof of this or that. But nobody ever actually delivers the goods, so to speak, such that they can be verified.
(Linking Wei Dai’s previous answer to “What are the open problems in Human Rationality?” for easy reference, since it seemed like it might contain relevant stuff)
AI doing philosophy = AI generating hands, plus the fact that philosophy is heavily corrupted by postmodernism, to the point where two authors wrote books dedicated to criticism of postmodernism PRECISELY because their parodies got published.
I think I meant a more practical / next-steps-generating answer.
I don’t think “academia is corrupted” is a bottleneck for a rationalist Get Gud At Philosophy project. We can just route around academia.
The sorts of things I was imagining might be things like “figure out how to teach a particular skill” (or “identify particular skills that need teaching”, or “figure out how to test whether someone has a particular skill”), or “solve some particular unsolved conceptual problem(s) that you expect to unlock much easier progress.”
Also mistakes, from my point of view anyway
Attracting mathy types rather than engineer types, resulting in early MIRI focusing on less relevant subproblems like decision theory, rather than trying lots of mathematical abstractions that might be useful (e.g. maybe there could have been lots of work on causal influence diagrams earlier). I have heard that decision theory was prioritized because of available researchers, not just importance.
A cultural focus on solving the full “alignment problem” rather than various other problems Eliezer also thought to be important (e.g. low impact), and lack of a viable roadmap with intermediate steps to aim for. Being bottlenecked on deconfusion is just cope; better research taste would either generate a better plan or recognize that certain key steps are waiting for better AIs to experiment on.
Focus on slowing down capabilities in the immediate term (e.g. plans to pay AI researchers to keep their work private) rather than investing in safety and building political will for an eventual pause if needed.
This is not a recent development: a pivotal-act AI is not a Friendly AI (which would be too difficult), but rather something like a lasting AI ban/pause-enforcement AI that doesn’t kill everyone, or a human-uploading AI that does nothing else. That is where you presumably need decision theory, but not ethics, metaethics, or much of broader philosophy.
Point 1 also requires weaponisation of superintelligence, as it must stop all other projects ASAP.
What’s wrong with just using AI for obvious stuff like curing death while you solve metaethics? I don’t necessarily disagree about the usefulness of people in the field changing their attitude; I lean more towards “the problem is hard, so we should not run CEV on day one”.
Eliezer changed his mind no later than April 2022, or even November 2021, but that’s a nitpick.
I don’t think that I understand how a metaethics can be less restrictive than Yudkowsky’s proposal. What I suspect is that metaethics restricts the set of possible ethoses more profoundly than Yudkowsky believes and that there are two attractors, one of which contradicts current humanity’s drives.
Assuming no AI takeover, in my world model the worst-case scenario is that the AI’s values are aligned to the postmodernist slop which has likely occupied Western philosophy, not that philosophical problems actually end up unsolved. How likely is it that there exist two different decision theories, neither of which is better than the other?
Is there at all a plausible way for mankind to escape to other universes if our universe is simulated? What is the most plausible scenario for such a simulation to appear at all? Or does it produce paradoxes like the Plato-Socrates paradox where two sentences referring to each other become completely devoid of meaning?