What I find incredible is how contributing to the development of existentially dangerous systems is viewed as a morally acceptable course of action within communities that on paper accept that AGI is a threat.
Both OpenAI and Anthropic are incredibly influential among AI safety researchers, despite both organisations being key players in bringing the advent of TAI ever closer.
Both organisations benefit from lexical confusion over the word “safety”.
The average person concerned with existential risk from AGI might assume “safety” means working to reduce the likelihood that we all die. They would be disheartened to learn that many “AI Safety” researchers are instead focused on making sure contemporary LLMs behave appropriately. Such “safety” research simply makes the contemporary technology more viable and profitable, driving investment and reducing timelines. There is to my knowledge no published research that proves these techniques will extend to controlling AGI in a useful way.*
OpenAI’s “Superalignment” plan is a more ambitious safety play. Their plan to “solve” alignment involves building a human-level general intelligence within 4 years and then using it to automate alignment research.
But there are two obvious problems:
1. A human-level general intelligence is already most of the way toward a superhuman general intelligence (simply give it more compute). Cynically, Superalignment is a promise that OpenAI’s brightest safety researchers will be trying their hardest to bring about an AGI within 4 years.
2. If Superalignment succeeds, we are left trusting that a for-profit, private entity will use its human-level AI researchers only to research safety, instead of making the incredibly obvious play of having virtual researchers work out how to build the next generation of better, smarter automated researchers.
To conclude, if it looks like a duck, swims like a duck and quacks like a duck, it’s a capabilities researcher.
*This point could (and probably should) be a post in itself. Why wouldn’t techniques that work on contemporary AI systems extend to AGI?
Pretend for a moment that you and I are silicon-based aliens who have recently discovered that carbon-based lifeforms exist, and can be used to run calculations. Scientists have postulated that by creating complex enough carbon structures we could invent “thinking animals”. We anticipate that these strange creatures will be built in the near future and that they might be difficult to control.
As we can’t build thinking animals today, we are stuck studying single-celled carbon organisms. A technique has just been discovered whereby a compound called “sugar” can be used to influence the direction in which these simple organisms move.
Is it reasonable to then conclude that you will be able to predict and control the behaviour of a much more complex, multicellular creature called a “human” by spreading sugar out on the ground?
“Why wouldn’t techniques that work on contemporary AI systems extend to AGI?”
If by “techniques that work on contemporary AIs” you mean RLHF/RLAIF, then I don’t know anyone claiming that the robustness and safety of these techniques will “extend to AGI”. I think that AGI labs will soon move in the direction of releasing an agent architecture rather than a bare LLM, and will apply reasoning verification techniques. From OpenAI’s side, see the “Let’s Verify Step by Step” paper. From DeepMind’s side, see this interview with Shane Legg.
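To make “reasoning verification” concrete, here is a minimal sketch of the process-supervision idea from that paper: a verifier scores each intermediate step of a model’s reasoning rather than only the final answer. The score_step function below is a made-up placeholder, not any lab’s actual model.

```python
# Minimal sketch of step-wise (process) verification, in the spirit of
# "Let's Verify Step by Step". score_step stands in for a learned process
# reward model; here it is a trivial placeholder so the example runs.
from typing import List

def score_step(context: str, step: str) -> float:
    """Hypothetical verifier: probability that `step` correctly continues
    `context`. A real system would use a trained model."""
    return 0.2 if "11 - 3 = 7" in step else 0.9  # flag one obviously wrong step

def verify_solution(steps: List[str], threshold: float = 0.5) -> bool:
    """Accept a chain of reasoning only if every step clears the threshold,
    instead of judging the final answer alone (outcome supervision)."""
    context = ""
    for step in steps:
        if score_step(context, step) < threshold:
            return False  # reject at the first low-confidence step
        context += step + "\n"
    return True

good = ["2x + 3 = 11", "11 - 3 = 8", "x = 8 / 2 = 4"]
bad = ["2x + 3 = 11", "11 - 3 = 7", "x = 7 / 2 = 3.5"]
print(verify_solution(good))  # True
print(verify_solution(bad))   # False under the placeholder scorer
```

The appeal of process supervision is that it gives feedback on how an answer was reached, which is presumably why labs see it as a better fit for agent architectures than judging outcomes alone.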
“What I find incredible is how contributing to the development of existentially dangerous systems is viewed as a morally acceptable course of action within communities that on paper accept that AGI is a threat.”
I think this passage (and the whole comment) is unfair because it presents what AGI labs are pursuing (i.e., plans like “superalignment”) as obviously consequentially bad plans. But this is actually very far from obvious. I personally tend to conclude that these are consequentially good plans, conditioned on the absence of coordination on “pause and united, CERN-like effort about AGI and alignment” (and the presence of open-source maximalist and risk-dismissive players like Meta AI).
What I think is bad in the labs’ behaviour (if true, which we can’t be sure of, since such coordination efforts might be underway without our knowing about them) is that the labs are not trying to coordinate (among themselves, and with the support of governments for a legal basis, monitoring, and enforcement) on “pause and united, CERN-like effort about AGI and alignment”. Instead, we only see the labs coordinating and advocating for RSP-like policies.
Another thing that I think is bad in the labs’ behaviour is the inadequate funding of safety efforts. Thus, I agree with the call in “Managing AI Risks in the Era of Rapid Progress” for the labs to allocate at least a third of their budgets to safety efforts. These efforts, by the way, shouldn’t be narrowly about AI models. Indeed, this is a major point of Roko’s OP: investment and progress in computer and system security, and in political, economic, and societal structures, are inadequate. This can’t be the responsibility of AGI labs alone, obviously, but I think they have to own at least a part of it. They actually do own it, a little: they fund and support efforts like proof of humanness and UBI studies, and have staff and/or teams that are at least in part working on these issues. But I think AGI labs are doing about an order of magnitude less than they should on these fronts.
“Is it reasonable to then conclude that you will be able to predict and control the behaviour of a much more complex, multicellular creature called a “human” by spreading sugar out on the ground?”
Yes. Last time I checked the obesity stats it seemed to work just fine...
Jokes aside, you are making an important point. We have no idea how to reliably control even humans: we are humans ourselves (presumably) and should have a pretty good idea of what makes us tick, yet we are clueless. Of course we can control humans to a certain degree (society, force, drugs, etc.), but there are and always will be rogue elements that are uncontrollable. Being able to control 99.99999999999% of all future AIs won’t cut it. It’s either 100% or an epic fail (I guess this is the only time it is warranted to use the word epic when talking about fails).
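A quick back-of-the-envelope sketch of that point (the per-instance failure rate and deployment counts are made up, purely to illustrate how tiny failure rates compound across many independent AIs):

```python
# Illustrative arithmetic only, with invented numbers: even a tiny
# per-instance failure probability compounds over many deployed AIs.
def prob_at_least_one_rogue(p_fail: float, n_instances: int) -> float:
    """P(at least one uncontrolled instance) = 1 - (1 - p)^n."""
    return 1 - (1 - p_fail) ** n_instances

p = 1e-13  # i.e. "99.99999999999%" controllable per instance
for n in (10**9, 10**12, 10**14):
    print(n, prob_at_least_one_rogue(p, n))
# roughly 0.0001, 0.095, and 0.99995 respectively
```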
I would question the idea of “control” being pivotal.
Even if every AI is controllable, there’s still the possibility of humans telling those AIs to do bad things and thereby destabilizing the world and throwing it into an equilibrium where there are no more humans.