I posted something similar over on Zvi’s Substack, so I agree strongly here.
One point I think is interesting to explore: this release actually updates me slightly towards a lower risk of AI catastrophe. Media attention towards a skeptical view of AI is growing, the media is already reporting harms, we are seeing crowdsourced attempts to break the model, and there is more thinking about threat models. But the actual "worst harm" so far is still very low.
I think the main risk is a very discontinuous jump in capabilities. If capabilities increase by relatively small deltas, then at some point the "worst harm" will be very bad press, but not ruinous to civilization. I'm thinking of a stock-market flash crash, "AI gets connected to the internet and gets used to hack people," or some other manipulation of a subsystem of society. Then we'd perhaps see public support for regulating the tech and/or investing much more heavily in safety. (Though the wrong regulation could do serious harm if not globally implemented.)
Based on this, I think the frequency of model publishing is important: I want the minimum possible capability delta between successive models. So shaming researchers out of publishing imperfect but relatively harmless research (Galactica) seems like an extremely bad trend.
Another thought: an interesting safety benchmark would be "can this model improve its own code?". If the model can make improvements to its own code then we clearly have lift-off. Can we get a signal on how far away that is? Something like "what skill level is required to wield the model on this task?". Currently you need to be a capable coder to stitch model outputs into working software, but it's getting quite good at discussing small chunks of code if you can keep it on track.
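To make the "skill level required to wield the model" idea concrete, here is a minimal sketch of one possible operationalization: count how many rounds of human correction a model needs before its generated code passes a fixed test suite (zero rounds would suggest full autonomy on the task). Everything here is hypothetical; `query_model` is a stub standing in for a real model API, not any actual system.

```python
def query_model(prompt: str, round_num: int) -> str:
    """Stub model: pretend it only produces working code after one correction round."""
    if round_num == 0:
        return "def add(a, b):\n    return a - b"  # buggy first attempt
    return "def add(a, b):\n    return a + b"      # fixed after human feedback

def passes_tests(source: str) -> bool:
    """Run the generated code against a fixed test suite for the task."""
    namespace = {}
    try:
        exec(source, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False

def interventions_needed(task_prompt: str, max_rounds: int = 5) -> int:
    """Return how many human feedback rounds were required (lower = more autonomous)."""
    for round_num in range(max_rounds):
        if passes_tests(query_model(task_prompt, round_num)):
            return round_num
    return max_rounds

print(interventions_needed("Write add(a, b) that returns the sum"))
```

Averaging this count over a battery of tasks would give one crude, trackable signal of how close we are to models that need no human in the loop, though a real benchmark would of course need far more careful task design.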
I think we will probably pass through a point where an alignment failure could be catastrophic but not existentially catastrophic.
Unfortunately I think some alignment solutions would only break down once it could be existentially catastrophic (both deceptive alignment and irreversible reward hacking are noticeably harder to fix once an AI coup can succeed). I expect it will be possible to create toy models of alignment failures, and that you’ll get at least some kind of warning shot, but that you may not actually see any giant warning shots.
I think AI used for hacking, or even to make a self-replicating worm, is likely to happen before the end of days, but I don't know how people would react to that. I expect it will be characterized as misuse, that the proposed solution will be "don't use AI for bad stuff, stop your customers from doing so, provide inference as a service and monitor for this kind of abuse," and that we'll read a lot of headlines about how the real problem wasn't the Terminator but just humans doing bad things.
Unfortunately I think some alignment solutions would only break down once it could be existentially catastrophic
Agreed. My update comes purely from raising my estimate of how much press (and therefore funding) AI risk is going to get long before that point. Twelve months ago it seemed to me that capabilities had increased dramatically, and yet there was no proportional increase in the general public's fear of catastrophe. Now there seems to be a more plausible path to widespread appreciation of (and therefore work on) AI risk. To be clear, though, I'm just updating that it's less likely we'll fail because we didn't seriously try to find a solution, not that I have new evidence of a tractable solution.
I don’t know how people would react to that.
I think there are some quite plausibly terrifying non-existential incidents at the severe end of the spectrum. Without spending time brainstorming infohazards, Stuart Russell's slaughterbots come to mind. I think it's an interesting (and probably important) question how bad an incident would have to be to produce a meaningful response.
I expect it will be characterized as misuse, that the proposed solution will be “don’t use AI for bad stuff,
Here's where I disagree (at least with the apparent confidence). Looking at the pushback that Galactica got, the opposite conclusion seems more plausible to me: before too long we get actual restrictions that bite when using AI for good stuff, let alone for bad stuff. For example, consider the tone of this MIT Technology Review article:
This is for a demo of an LLM that has not harmed anyone, merely made some mildly offensive utterances. Imagine what the NYT will write when an AI from Big Tech is shown to have actually harmed someone (let alone killed someone). It will be a political bloodbath.
Anyway, I think the interesting part for this community is that it points to some socio-political approaches that could be emphasized to increase funding and the researcher pool (and therefore research velocity), rather than the typical purely technical explorations of AI safety that are posted here.
"Someone automated finding SQL injection exploits with Google and a simple script" and "Someone found a zero-day by using ChatGPT" don't seem qualitatively different to the average person. I think they just file it under "someone used coding to hack computers" and move on with their day. Headlines are going to be based on the impact of a hack, not how spooky the tech used to do it is.