Sam Clarke comments on Collection of arguments to expect (outer and inner) alignment failure?

Sam Clarke 4 Oct 2021 16:27 UTC
LW: 1 AF: 1
0
AF

A more valid criticism would be that the authors spend most of their time on showing that all of these failure mechanisms are theoretically possible, without spending much time discussing how likely each of them is are in practice

Sure, I agree this is a stronger point.

The collection of posts under the threat models tag may be what you are looking for: many of these posts highlight the particular risk scenarios the authors feel are most compelling or likely.

Not really, unfortunately. In those posts, the authors are focusing on painting a plausible picture of what the world looks like if we screw up alignment, rather than analysing the arguments that we should expect alignment failures in the first place—which is what I’m interested in (with the exception of Steven’s scenario, who already answered here).

The main problem with distilling this work into, say, a top 3 of most powerful 1-page arguments is that we are not dealing purely with technology-driven failure modes.

I fully agree that thinking through e.g. incentives that different actors will have in the lead up to TAI, the interaction between AI technology and society, etc. is super important. But we can think through those things as well—e.g. we can look at historical examples of humanity being faced with scenarios where the global economy is (mis)aligned with human needs, and reason about the extent to which AI will be different. I’d count all of that as part of the argument to expect alignment failure. Yes, as soon as you bring societal interactions into the mix, things become a whole lot more complicated. But that isn’t reason not to try.

As it stands, I don’t think there are super clear arguments for alignment failure that take into account interactions between AI tech and society that are ready to be distilled down, though I tried doing some of it here.

Equally, much of the discussion (and predictions of many leading thinkers in this space) is premised on technical alignment failure being the central concern (i.e. if we had better technical alignment solutions, we would manage to avoid existential catastrophe). I don’t want to argue about whether that’s correct here, but just want to point out that at least some people think that at least some of the plausible failure modes are mostly technology-driven.

So will you be distilling for an audience of pessimists or optimists?

Neither—just trying to think clearly through the arguments on both sides.

In the particular case you describe, I find the “pessimist” side more compelling, because I don’t see much evidence that humanity has really learned any lessons from oil and climate change. In particular, we still don’t know how to solve collective action problems.

This has kept me from spending much energy myself on rationally quantifying the odds of different failure mode scenarios. I’d rather spend my energy in finding ways to improve the odds.

Yeah, I’m sympathetic to this line of thought, and I think I personally tend to err on the side of trying to spend too much energy on quantifying odds and not enough on acting.

However, to the extent that you’re impartial between different ways of trying to improve the odds (e.g. working on technical AI alignment vs other technical AI safety vs AI policy vs meta interventions vs other cause areas entirely), then it still pays to work out (e.g.) how plausible AI alignment failure is, in order to inform your decision about what to do if you want to have the best chance of helping.
- Koen.Holtman 9 Oct 2021 11:21 UTC
  LW: 2 AF: 2
  0
  AF Parent
  
  Not really, unfortunately. In those posts [under the threat models tag], the authors are focusing on painting a plausible picture of what the world looks like if we screw up alignment, rather than analysing the arguments that we should expect alignment failures in the first place.
  
  I feel that Christiano’s post here is pretty good at identifying plausible failure modes inside society that lead to unaligned agents not being corrected. My recollection of that post is partly why I mentioned the posts under that tag.
  
  There is an interesting question of methodology here: if you want to estimate the probability that society will fail in this this way in handing the impact of AI, do you send a poll to a bunch of AI technology experts, or should you be polling a bunch of global warming activists or historians of the tobacco industry instead? But I think I am reading in your work that this question is no news to you.
  
  Several of the AI alignment organisations you polled have people in them who produced work like this examination of the nuclear arms race. I wonder what happens in your analysis of your polling data if you single out this type of respondent specifically. In my own experience in analysing polling results with this type of response rate, I would be surprised however if you could find a clear signal above the noise floor.
  
  However [...] it still pays to work out (e.g.) how plausible AI alignment failure is, in order to inform your decision about what to do if you want to have the best chance of helping.
  
  Agree, that is why I am occasionally reading various posts with failure scenarios and polls of experts. To be clear: my personal choice of alignment research subjects is only partially motivated by what I think is the most important to work to do, if I want to have the best chance of helping. Another driver is that I want to have some fun with mathematics. I tend to work on problems which lie in the intersection of those two fuzzy sets.