Sam Clarke

Karma: 460

Clarifying “What failure looks like”

Sam Clarke 20 Sep 2020 20:40 UTC

95 points

14 comments · 17 min read

Sam Clarke 21 Sep 2020 21:50 UTC
LW: 3 AF: 3
in reply to: Rohin Shah’s comment on: Clarifying “What failure looks like” (part 1)
Thanks for your comment!

Are we sure that given the choice between “lower crime, lower costs and algorithmic bias” and “higher crime, higher costs and only human bias”, and we have dictatorial power and can consider long-term effects, we would choose the latter on reflection?

Good point, thanks, I hadn’t thought that sometimes it actually would make sense, on reflection, to choose an algorithm pursuing an easy-to-measure goal over humans pursuing incorrect goals. One thing I’d add is that if one did delve into the research to work this out for a particular case, it seems that an important (but hard to quantify) consideration would be the extent to which choosing the algorithm in this case makes it more likely that the use of that algorithm becomes entrenched, or it sets a precedent for the use of such algorithms. This feels important since these effects could plausibly make WFLL1-like things more likely in the longer run (when the harm of using misaligned systems is higher, due to the higher capabilities of those systems).

Note ML systems are way more interpretable than humans, so if they are replacing humans then this shouldn’t make that much of a difference.

Good catch. I had the “AI systems replace entire institutions” scenario in mind, but agree that WFLL1 actually feels closer to “AI systems replace humans”. I’m pretty confused about what this would look like though, and in particular, whether institutions would retain their interpretability if this happened. It seems plausible that the best way to “carve up” an institution into individual agents/services differs for humans and AI systems. E.g. education/learning is a big part of human institution design—you start at the bottom and work your way up as you learn skills and become trusted to act more autonomously—but this probably wouldn’t be the case for institutions composed of AI systems, since the “CEO” could just copy their model parameters to the “intern” :). And if institutions composed of AI systems are quite different to institutions composed of humans, then they might not be very interpretable. Sure, you could assert that AI systems replace humans one-for-one, but if this is not the best design, then there may be competitive pressure to move away from this towards something less interpretable.

Sam Clarke 22 Sep 2020 21:48 UTC
1 point
in reply to: Donald Hobson’s comment on: Clarifying “What failure looks like” (part 1)
This was helpful to me, thanks. I agree this seems almost certainly to be the end state if AI systems are optimizing hard for simple, measurable objectives.

I’m still confused about what happens if AI systems are optimizing moderately for more complicated, measurable objectives (which better capture what humans actually want). Do you think the argument you made implies that we still eventually end up with a universe tiled with molecular smiley faces in this scenario?

Sam Clarke 11 Jan 2021 1:49 UTC
1 point
in reply to: James Aung’s comment on: Clarifying “What failure looks like” (part 1)
Thanks!

[Question] What are the biggest current impacts of AI?

Sam Clarke 7 Mar 2021 21:44 UTC

15 points

5 comments · 1 min read

Sam Clarke 29 Mar 2021 22:43 UTC
3 points
in reply to: Pongo’s comment on: Misalignment and misuse: whose values are manifest?

If we solve the problem normally thought of as “misalignment”, it seems like this scenario would now go well.

This might be totally obvious, but I think it’s worth pointing out that even if we “solve misalignment”—which I take to mean solving the technical problem of intent alignment—Bob could still choose to deploy a business strategising AI, in which case this failure mode would still occur. In fact, with a solution to intent alignment, it seems Bob would be more likely to do this, because his business strategising assistant will actually be trying to do what Bob wants (help his business succeed).

Sam Clarke 30 Apr 2021 7:59 UTC
17 points
on: AMA: Paul Christiano, alignment researcher
Are there any research questions you’re excited about people working on, for making AI go (existentially) well, that are not related to technical AI alignment or safety? If so, what? (I’m especially interested in AI strategy/governance questions)

Sam Clarke 30 Apr 2021 12:00 UTC
7 points
in reply to: Sam Clarke’s comment on: AMA: Paul Christiano, alignment researcher
Relatedly: if we manage to solve intent alignment (including making it competitive) but still have an existential catastrophe, what went wrong?

Sam Clarke 10 May 2021 10:02 UTC
10 points
on: What Failure Looks Like: Distilling the Discussion
The AI systems in part I of the story are NOT “narrow” or “non-agentic”
- There’s no difference between the level of “narrowness” or “agency” of the AI systems between parts I and II of the story.
  - Many people (including Richard Ngo and myself) seem to have interpreted part I as arguing that there could be an AI takeover by AI systems that are non-agentic and/or narrow (i.e. are not agentic AGI). But this is not at all what Paul intended to argue.
  - Put another way, both parts I and II are instances of the “second species” concern/gorilla problem: that AI systems will gain control of humanity’s future. (I think this is also identical to what people mean when they say “AI takeover”.)
  - As far as I can tell, this isn’t really a different kind of concern from the classic Bostrom-Yudkowsky case for AI x-risk. It’s just a more nuanced picture of what goes wrong, that also makes failure look plausible in slow takeoff worlds.
- Instead, the key difference between parts I and II of the story is the way that the models’ objectives generalise.
  - In part II, it’s the kind of generalisation typically called a “treacherous turn”. The models learn the objective of “seeking influence”. Early in training, the best way to do that is by “playing nice”. The failure mode is that, once they become sufficiently capable, they no longer need to play nice and instead take control of humanity’s future.
  - In part I, it’s a different kind of generalisation, which has been much less discussed. The models learn some easily-measurable objective which isn’t what humans actually want. In other words, the failure mode is that these models are trying to “produce high scores” instead of “help humans get what they want”. You might think that using human feedback to specify the base objective will alleviate this problem (e.g. learn a reward model from human demonstrations or preferences about a hard-to-measure objective). But this doesn’t obviously help: now, the failure mode is that the model learns the objective “do things that look to humans like you are achieving X” or “do things that the humans giving feedback about X will rate highly” (instead of “actually achieving X”).
  - Notice that in both of these scenarios, the models are mesa-optimizers (i.e. the learned models are themselves optimizers), and failure ensues because the models’ learned objectives generalise in the wrong way.
This was discussed in comments (on a separate post) by Richard Ngo and Paul Christiano. There’s a lot more important discussion in that comment thread, which is summarised in this doc.
What links here?
- Distinguishing AI takeover scenarios by Sam Clarke (8 Sep 2021 16:19 UTC; 72 points)

Sam Clarke 12 May 2021 14:52 UTC
5 points
in reply to: Leafcraft’s comment on: Less Realistic Tales of Doom
Will MacAskill calls this the “actual alignment problem”

Wei Dai has written a lot about related concerns in posts like The Argument from Philosophical Difficulty

Sam Clarke 19 May 2021 10:04 UTC
3 points
on: What will 2040 probably look like assuming no singularity?
Thanks for this, really interesting!

Meta question: when you wrote this list, what did your thought process/strategies look like, and what do you think are the best ways of getting better at this kind of futurism?

More context:
- One obvious answer to my second question is to get feedback—but the main bottleneck there is that these things won’t happen for many years. Getting feedback from others (hence this post, I presume) is a partial remedy, but isn’t clearly that helpful (e.g. if everyone’s futurism capabilities are limited in the same ways). Maybe you’ve practised futurism over shorter time horizons a lot? Or you expect that people giving you feedback have?
- After reading the first few entries, I spent 20 mins writing my own list before reading yours. Some questions/confusions that occurred:
  - All of my ideas ended up with epistemic status “OK, that might happen, but I’d need to spend at least a day researching this to be able to say anything like ‘probably that’ll happen by 2040’”
    So I’m wondering if you did this/already had the background knowledge, or if I’m wrong that this is necessary
  - My strategies were (1) consider important domains (e.g. military, financial markets, policymaking), and what better LMs/deep RL/DL in general/other emerging tech will do to those domains; (2) consider obvious AI/emerging tech applications (e.g. customer service); (3) look back to 2000 and 1980 and extrapolate apparent trends.
    How good are these strategies? What other strategies are there? How should they be weighed?
  - How much is my bottleneck to being better at this (a) better models for extrapolating trends in AI capabilities/other emerging tech vs (b) better models of particular domains vs (c) better models of the-world-in-general vs (d) something else?

Sam Clarke 19 May 2021 16:27 UTC
5 points
on: Less Realistic Tales of Doom
Thanks for writing this! Here’s another, that I’m posting specifically because it’s confusing to me.

Value erosion

Takeoff was slow and lots of actors developed AGI around the same time. Intent alignment turned out relatively easy and so lots of actors with different values had access to AGIs that were trying to help them. Our ability to solve coordination problems remained at ~its current level. Nation states, or something like them, still exist, and there is still lots of economic competition between and within them. Sometimes there is military conflict, which destroys some nation states, but it never destroys the world.

The need to compete in these ways limits the extent to which each actor is able to spend their resources on things they actually want (because they have to spend a cut on competing, economically or militarily). Moreover, this cut is ever-increasing, since the actors who don’t increase their competitiveness get wiped out. Different groups start spreading to the stars. Human descendants eventually colonise the galaxy, but have to spend ever closer to 100% of their energy on their militaries and producing economically valuable stuff. Those who don’t are outcompeted (i.e. destroyed in conflict or dominated in the market) and so lose most of their ability to get what they want.

Moral: even if we solve intent alignment, avoid catastrophic war or misuse of AI by bad actors, and other acute x-risks, the future could (would probably?) still be much worse than it could be, if we don’t also coordinate to stop the value race to the bottom.

Sam Clarke 25 May 2021 14:00 UTC
1 point
on: What are some real life Inadequate Equilibria?
I’m a bit confused about the edges of the inadequate equilibrium concept you’re interested in.

In particular, do simple cases of negative externalities count? E.g. the econ 101 example of “factory pollutes river”—seems like an instance of (1) and (2) in Eliezer’s taxonomy—depending on whether you’re thinking of the “decision-maker” as (1) the factory owner (who would lose out personally) or (2) the government (who can’t learn the information they need because the pollution is intentionally hidden). But this isn’t what I’d typically think of as a bad Nash equilibrium, because (let’s suppose) the factory owners wouldn’t actually be better off by “cooperating”

Sam Clarke 7 Jun 2021 15:31 UTC
5 points
on: What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs)
Thanks a lot for this post, I found it extremely helpful and expect I will refer to it a lot in thinking through different threat models.

I’d be curious to hear how you think the Production Web stories differ from part 1 of Paul’s “What failure looks like”.

To me, the underlying threat model seems to be basically the same: we deploy AI systems with objectives that look good in the short-run, but when those systems become as capable as or more capable than humans, their objectives don’t generalise “well” (i.e. in ways desirable by human standards), because they’re optimising for proxies (namely, a cluster of objectives that could loosely be described as “maximise production” within their industry sector) that eventually come apart from what we actually want (“maximising production” eventually means using up resources critical to human survival but non-critical to machines).

From reading some of the comment threads between you and Paul, it seems like you disagree about where, on the margin, resources should be spent (improving the cooperative capabilities of AI systems and humans vs improving single-single intent alignment) - but you agree on this particular underlying threat model?

It also seems like you emphasise different aspects of these threat models: you emphasise the role of competitive pressures more (but they’re also implicit in Paul’s story), and Paul emphasises failures of intent alignment more (but they’re also present in your story) - though this is consistent with having the same underlying threat model?

(Of course, both you and Paul also have other threat models, e.g. you have Flash War, Paul has part 2 of “What failure looks like”, and also Another (outer) alignment failure story, which seems to be basically a more nuanced version of part 1 of “What failure looks like”. Here, I’m curious specifically about the two threat models I’ve picked out.)

(I could have lots of this totally wrong, and would appreciate being corrected if so)

Survey on AI existential risk scenarios

Sam Clarke, apc and Jonas Schuett

8 Jun 2021 17:12 UTC

63 points

11 comments · 7 min read

Sam Clarke 11 Jun 2021 13:00 UTC
2 points
in reply to: Ericf’s comment on: Survey on AI existential risk scenarios

Is one question combining the risk of “too much” AI use and “too little” AI use?

Yes, it is. Combining these cases seems reasonable to me, though we definitely should have clarified this in the survey instructions. They’re both cases where humanity could have avoided an existential catastrophe by making different decisions with respect to AI.

Sam Clarke 11 Jun 2021 13:03 UTC
8 points
in reply to: steven0461’s comment on: Survey on AI existential risk scenarios
Thanks for pointing this out. We did intend for cases like this to be included, but I agree that it’s unclear if respondents interpreted it that way. We should have clarified this in the survey instructions.

Sam Clarke 14 Jun 2021 8:53 UTC
2 points
in reply to: Ericf’s comment on: Survey on AI existential risk scenarios
Thanks for the reply—a couple of responses:

it doesn’t seem useful to get a feeling for “how far off of ideal are we likely to be” when that is composed of: 1. What is the possible range of AI functionality (as constrained by physics)? - ie what can we do?

No, these cases aren’t included. The definition is: “an existential catastrophe that could have been avoided had humanity’s development, deployment or governance of AI been otherwise”. Physics cannot be changed by humanity’s development/deployment/governance decisions. (I agree that cases 2 and 3 are included).

Knowing that experts think we have a (say) 10% chance of hitting the ideal window says nothing about what an interested party should do to improve those chances.

That’s correct. The survey wasn’t intended to understand respondents’ views on interventions. It was only intended to understand: if something goes wrong, what do respondents think that was? Someone could run another survey that asks about interventions (in fact, this other recent survey does that). For the reasons given in the Motivation section of this post, we chose to limit our scope to threat models, rather than interventions.

Sam Clarke 30 Jun 2021 10:11 UTC
1 point
in reply to: Richard_Ngo’s comment on: Some thoughts on risks from narrow, non-agentic AI

being trained on “follow instructions”

What does this actually mean, in terms of the details of how you’d train a model to do this?

Sam Clarke 1 Jul 2021 7:52 UTC
1 point
in reply to: Richard_Ngo’s comment on: Some thoughts on risks from narrow, non-agentic AI
Makes sense, thanks!
