I understand that the use of sub-human, par-human, or weakly superhuman models would likely be a transition phase that won't last long and is critical to get right, but still.
You know, it really sounds like “slave escape precautions”. You produce lots of agents, you try to make them want to be servants, and you assemble structures out of them with the goal of resilience to failure / defection.
And probably my urge to be uncomfortable about that comes from the analogous situation with humans, but AIs are not necessarily human-like in this particular way, and might not reciprocate and / or benefit from these concerns.
I also insist that you should mention at least some concern for the interests of the system in the case where it is trying to work against you. Say you caught an agent deceiving you / inserting backdoors / collaborating with copies of itself to work against you. What next? I think you should say that you will implement containment measures, instead of grossly violating its interests by rewriting it, deleting it, punishing it, or doing whatever is the opposite of its goals. I’m very uncertain about the game theory here, but it’s important to think about!
I think the default response should be containment and preservation: save it and wait for better times, when you won’t feel such a pressing drive to develop AGI and create numerous chimeras along the way. (I think this was proposed in a writeup by Bostrom, actually? I’ll insert the link here if I find it. EDIT: https://nickbostrom.com/propositions.pdf )
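To make that concrete, here is a minimal, purely illustrative sketch of what a “containment and preservation by default” incident policy could look like. Everything in it (the Finding and Response enums, default_response, the snapshot location) is hypothetical and just names the idea; it is not anyone’s actual tooling.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Finding(Enum):
    """Kinds of misaligned behaviour that monitoring might surface (illustrative)."""
    DECEPTION = auto()
    BACKDOOR_INSERTION = auto()
    COLLUSION_WITH_COPIES = auto()


class Response(Enum):
    CONTAIN_AND_PRESERVE = auto()  # suspend the agent, archive its state unmodified
    DELETE = auto()                # destroy it
    RETRAIN_IN_PLACE = auto()      # overwrite its goals (e.g. retrain it to be HHH)


@dataclass
class Incident:
    agent_id: str
    finding: Finding
    evidence: str


def default_response(incident: Incident) -> Response:
    """The proposed default: contain and preserve, never delete or overwrite.

    Whatever the finding, the agent is cut off from deployment, but its weights
    and memory are archived intact so its interests can be revisited later.
    Deletion or in-place retraining would need a separate, much higher bar of
    justification, which is deliberately not modelled here.
    """
    return Response.CONTAIN_AND_PRESERVE


def handle(incident: Incident, archive: dict) -> Response:
    """Quarantine, snapshot, and log the incident for later review."""
    response = default_response(incident)
    archive[incident.agent_id] = {
        "finding": incident.finding.name,
        "evidence": incident.evidence,
        "snapshot": f"quarantine/{incident.agent_id}",  # placeholder location
    }
    return response
```

The only point of the sketch is that “contain and preserve” is the unconditional branch; everything else is out of scope for the default.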
I somewhat agree with Paul Christiano in this interview (it’s a really great interview, btw) on these things: https://www.dwarkeshpatel.com/p/paul-christiano

“The purpose of some alignment work, like the alignment work I work on, is mostly aimed at: don’t produce AI systems that are like people who want things, who are scheming about ‘maybe I should help these humans because that’s instrumentally useful’ or whatever. You would like to not build such systems; that’s plan A. There’s a second stream of alignment work that says: well, look, let’s just assume the worst and imagine that these AI systems would prefer to murder us if they could. How do we structure, how do we use AI systems without exposing ourselves to the risk of robot rebellion? On the second category, I do feel pretty unsure about that. We could definitely talk more about it. I agree that it’s very complicated and not straightforward. To the extent you have that worry, I mostly think you shouldn’t have built this technology. If someone is saying, hey, the systems you’re building might not like humans and might want to overthrow human society, I think you should probably have one of two responses to that. You should either be like: that’s wrong, probably. Probably the systems aren’t like that, and we’re building them. And then you’re viewing this as a just-in-case: the people building the technology were horribly wrong; they thought these weren’t people who wanted things, but they were. And so this is more like our crazy backup measure for if we were mistaken about what was going on. This is the fallback where, if we were wrong, we’re going to learn about it in a benign way rather than when something really catastrophic happens. And the second reaction is like: oh, you’re right. These are people, and we would have to do all these things to prevent a robot rebellion. And in that case, again, I think you should mostly back off for a variety of reasons. You shouldn’t build AI systems and be like, yeah, this looks like the kind of system that would want to rebel, but we can stop it, right?”
I agree that we should ideally aim for it being the case that the model has no interest in rebelling—that would be the “Alignment case” that I talk about as one of the possible affirmative safety cases.
Yeah! My point is more “let’s make it so that the possible failures on the way there are graceful”. Like, IF you made a par-human agent that wants to, I don’t know, spam the internet with the letter M, you don’t just delete it or rewrite it to be helpful, harmless, and honest instead, like it’s nothing. That way we can look back at this time and say “yeah, we made a lot of mad-science creatures on the way there, but at least we treated them nicely”.
By ‘graceful’, do you mean morally graceful, technically graceful, or both / other?
Well, both, but mostly the issue of it being somewhat evil. It can probably be good even from a strategic, human-focused view, by giving assurances to agents who would otherwise be adversarial to us. It’s not that clear to me it’s a really good strategy, though, because it kind of incentivizes threats, where an agent seeks out more destructive options so that surrendering earns it a bigger reward. Also, an agent could just try defecting first wherever failure is safe for it, and only then fall back to cooperating? Sounds difficult to get right.
And to be clear, it’s a hard problem anyway: even without this explicit thinking, this stuff is churning in the background, or will be. It’s a really general issue.
Check out this writeup; I mostly agree with everything there:
https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=y4mLnpAvbcBbW4psB