I agree with this. I think the EA example I mentioned fits this pattern fairly well—the more rational you are, the more likely you are to consider what careers and cause areas actually lead to the outcomes you care about, and go into one of those. But then you need the different skill of actually being good at it.
This seems to be roughly orthogonal to what I’m claiming? Whether you get the benefits from rationality quickly or slowly is distinct from what those benefits actually are.
Hmm, interesting. It doesn’t discuss the Galileo affair, which seems like the most important case where the distinction is relevant. Nevertheless, in light of this, “geocentric models with epicycles had always been in the former category” is too strong and I’ll amend it accordingly.
Mostly I am questioning whether things will turn out badly this way.
Do you not expect this threshold to be crossed sooner or later, assuming AI alignment remains unsolved?
Probably, but I’m pretty uncertain about this. It depends on a lot of messy details about reality, things like: how the offence-defence balance scales; what proportion of powerful systems are mostly aligned; whether influence-seeking systems are risk-neutral; what self-governance structures they’ll set up; the extent to which their preferences are compatible with ours; and how human-comprehensible the most important upcoming scientific advances are.
I think the idea is that once influence-seeking systems gain a certain amount of influence, it may become faster or more certain for them to gain more influence by causing a catastrophe than to continue to work within existing rules and institutions.
The key issue here is whether there will be coordination between a set of influence-seeking systems that can cause (and will benefit from) a catastrophe, even when other systems are opposing them. If we picture systems as having power comparable to what companies have now, that seems difficult. If we picture them as having power comparable to what countries have now, that seems fairly easy.
Eventually we reach the point where we could not recover from a correlated automation failure. Under these conditions influence-seeking systems stop behaving in the intended way, since their incentives have changed—they are now more interested in controlling influence after the resulting catastrophe than continuing to play nice with existing institutions and incentives.
I’m not sure I understand this part. The influence-seeking systems which have the most influence also have the most to lose from a catastrophe. So they’ll be incentivised to police each other and make catastrophe-avoidance mechanisms more robust.
As an analogy: we may already be past the point where we could recover from a correlated “world leader failure”: every world leader simultaneously launching a coup. But this doesn’t make such a failure very likely, unless world leaders also have strong coordination and commitment mechanisms between themselves (which are binding even after the catastrophe).
The 75% figure covers the period from now until single-agent AGI is developed. I measure it proportionately because otherwise it says more about timeline estimates than about CAIS.
The operationalisation which feels most natural to me is something like:
1. Make a list of cognitively difficult jobs (lawyer, doctor, speechwriter, CEO, engineer, scientist, accountant, trader, consultant, venture capitalist, etc.).
2. A job counts as automatable when there exists a publicly accessible AI service which allows an equally skilled person to do it just as well in less than 25% of the time that it used to take a specialist, OR which allows someone with little skill or training to do the job in about the same time that it used to take a specialist.
3. I claim that over 75% of the jobs on this list will be automatable within 75% of the time until a single superhuman AGI is developed.
(Note that there are three free parameters in this definition, which I’ve set to arbitrary numbers that seem intuitively reasonable).
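For concreteness, here is the definition written out as a predicate. This is a minimal Python sketch under my own assumptions: the `Job` fields, the `slack` tolerance for “about the same time”, and the example numbers are all illustrative; only the 25%/75%/75% thresholds come from the definition above.

```python
from dataclasses import dataclass

# The three free parameters of the definition, set to the values above.
SKILLED_SPEEDUP = 0.25   # skilled person finishes in under 25% of specialist time
JOB_FRACTION = 0.75      # over 75% of the listed jobs must be automatable
TIME_FRACTION = 0.75     # within 75% of the time until single-agent AGI

@dataclass
class Job:
    """Times are fractions of how long a specialist took without any AI
    service (so 1.0 means no change)."""
    name: str
    skilled_time_with_service: float    # equally skilled person, best service
    unskilled_time_with_service: float  # little skill or training, best service

def is_automatable(job: Job, slack: float = 1.1) -> bool:
    """Either disjunct of the definition suffices; `slack` is my guess at
    what 'about the same time' means for the unskilled case."""
    return (job.skilled_time_with_service < SKILLED_SPEEDUP
            or job.unskilled_time_with_service <= slack)

def fraction_automatable(jobs: list[Job]) -> float:
    return sum(is_automatable(j) for j in jobs) / len(jobs)

# The claim: once TIME_FRACTION of the now-to-AGI interval has elapsed,
# fraction_automatable(jobs) > JOB_FRACTION for the list above.
jobs = [
    Job("lawyer", skilled_time_with_service=0.2, unskilled_time_with_service=1.5),
    Job("accountant", skilled_time_with_service=0.4, unskilled_time_with_service=1.0),
]
print(fraction_automatable(jobs))  # 1.0 here: each job satisfies one disjunct
```

Note that the TIME_FRACTION condition can only be evaluated in retrospect, once we know when single-agent AGI arrives.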
Thanks! I agree that more connection to past writings is always good, and I’m happy to update it appropriately, although upon thinking about it, nothing really comes to mind as an obvious omission (except perhaps citing sections of Superintelligence?). Of course I’m pretty biased, since I already put in the things which I thought were most important, so I’d be glad to hear any additional suggestions you have.
Kudos for writing about making mistakes and changing your mind. If I’m interpreting you correctly, your current perspective is quite similar to mine (which I’ve tried to explain here and here).
Agreed that “clarification” is confusing. What about “exploration”?
Thanks for the detailed comments! I only have time to engage with a few of them:
Most of this is underdefined, and that’s unsettling at least in some (but not necessarily all) cases, and if we want to make it less underdefined, the notion of ‘one ethics’ has to give.
I’m not that wedded to ‘one ethics’, more like ‘one process for producing moral judgements’. But note that if we allow arbitrariness of scope, then ‘one process’ can be a piecewise function which uses one subprocess in some cases and another in others.
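To illustrate the piecewise point with a toy formula (the symbols here are mine, purely illustrative): a single judgement-producing process $J$ can be defined case-wise over different scopes, e.g.

$$
J(x) =
\begin{cases}
J_1(x) & \text{if } x \in S_1 \\
J_2(x) & \text{otherwise,}
\end{cases}
$$

which still counts as ‘one process’ in the formal sense, even though different subprocesses handle different cases.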
I find myself having similarly strong meta-level intuitions about wanting to do something that is “non-arbitrary” and in relevant ways “simple/elegant”. …motivationally it feels like this intuition is importantly connected to what makes it easy for me to go “all-in” for my ethical/altruistic beliefs.
I agree that these intuitions are very strong, and they are closely connected to motivational systems. But so are some object-level intuitions like “suffering is bad”, and so the relevant question is what you’d do if it were a choice between that and simplicity. I’m not sure your arguments distinguish one from the other in that context.
one can maybe avoid to feel this uncomfortable feeling of uncertainty by deferring to idealized reflection. But it’s not obvious that this lastingly solves the underlying problem
Another way of phrasing this point: reflection is almost always good for figuring out what’s the best thing to do, but it’s not a good way to define what’s the best thing to do.
For the record, this is probably my key objection to preference utilitarianism, but I didn’t want to dive into the details in the post above (for a very long post about such things, see here).
From Rohin’s post, a quote which I also endorse:
You could argue that while [building AIs with really weird utility functions] is possible in principle, no one would ever build such an agent. I wholeheartedly agree, but note that this is now an argument based on particular empirical facts about humans (or perhaps agent-building processes more generally).
And if you’re going to argue based on particular empirical facts about what goals we expect, then I don’t think that doing so via coherence arguments helps very much.
This seems pretty false to me.
I agree that this problem is not a particularly important one, and explicitly discard it a few sentences later. I hadn’t considered your objection though, and will need to think more about it.
(Side note: I’m pretty annoyed with all the use of “there’s no coherence theorem for X” in this post.)
Mind explaining why? Is this more a stylistic preference, or do you think most of them are wrong/irrelevant?
the “further out” your goal is and the more that your actions are for instrumental value, the more it should look like world 1 in which agents are valuing abstract properties of world states, and the less we should observe preferences over trajectories to reach said states.
Also true if you make world states temporally extended.
If I had to define it using your taxonomy, then yes. However, it’s also trying to do something broader. For example, it’s intended to be persuasive to people who don’t think of meta-ethics in terms of preferences and rationality at all. (The original intended audience was the EA forum, not LW).
Edit: on further reflection, your list is more comprehensive than I thought it was, and maybe the people I mentioned above actually would be on it even if they wouldn’t describe themselves that way.
Another edit: maybe the people who are missing from your list are those who would agree that morality has normative force but deny that rationality does (except insofar as it makes you more moral), or at least are much more concerned with the former than the latter. E.g. you could say that morality is a categorical imperative but rationality is only a hypothetical imperative.