Regarding security mindset, I think where it really kicks in is when you have a system utilising its intelligence to work around any limitations, such that you're no longer looking at a "broad, reasonable" distribution over scenarios, but at a "very specific" scenario that a powerful optimiser has pushed you towards. In that case, doing things like doubling the size of your safety margins may not save your safety schemes if the AI has the intelligence to get around them.
The problem here is that it shares a similar issue with optimization daemons, goal misgeneralization, etc., and a comment from Iceman sums it up perfectly:
“or trying to translate that into lesswrongesse, you do not have warrant to believe in something until you have an example of the thing you’re maybe worried about being a real problem because you are almost certain to be privileging the hypothesis.”
https://www.lesswrong.com/posts/99tD8L8Hk5wkKNY8Q/?commentId=xF5XXJBNgd6qtEM3q
Or equivalently from lc: “you only start handing out status points after someone has successfully demonstrated the security failure, ideally in a deployed product or at the very least a toy program.”
This is, to a large extent, the issue I have with attempted breaks on alignment: pretty much no alignment break has been demonstrated, and in the cases where one has, the results have been very mixed to slightly positive at best.
The POC || GTFO article was very interesting.
I do worry, though, that it mixes together pragmatics and epistemics (even though it does try to distinguish the two). There's a distinction between when it's reasonable to believe something and when it's reasonable to act on it.
For example, when I was working as a web developer, there were lots of potential bugs where it would have made sense to believe there was a decent chance we were vulnerable, but pragmatically we couldn't spare the time to fix every potential security issue. That doesn't mean I should walk around saying "therefore they aren't there", though.
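To make that concrete, here's the kind of thing I mean (a hypothetical sketch in Python/SQLite, not from any codebase I actually worked on): a query built by string interpolation that looks injectable, where nobody ever wrote the proof-of-concept exploit or shipped the parameterized fix, because it sat behind higher-priority work.

```python
# Hypothetical sketch of a "probably vulnerable, never demonstrated" bug.
import sqlite3

def get_user(conn: sqlite3.Connection, username: str):
    # Suspected issue: user-controlled input spliced directly into SQL.
    # Believing "there's a decent chance this is exploitable" is reasonable
    # even before anyone has actually demonstrated an injection against it.
    query = f"SELECT id, email FROM users WHERE username = '{username}'"
    return conn.execute(query).fetchall()

def get_user_fixed(conn: sqlite3.Connection, username: str):
    # The deprioritized fix: a parameterized query closes the suspected hole.
    return conn.execute(
        "SELECT id, email FROM users WHERE username = ?", (username,)
    ).fetchall()
```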
I'll admit that if someone randomly messaged you some of the AI risk arguments and no one else was worried about them, it would probably be reasonable to conclude that there was a flaw somewhere and put them aside.
On the other hand, when even two of deep learning's Turing Award winners are starting to get concerned, and the stakes are so high, I think we should be a bit more cautious about dismissing the arguments out of hand.
I agree, which is why I have an entire section or two about why I think ML/AI isn't like computer security.