Though there’s a Q of whether AI can fail to produce this stuff without tipping its hand
I guess you’re imagining it keeps sandbagging the research and humans don’t realize? Could be, but that risks humans catching it not having an insight they could have had, and then training that away (perhaps eliminating its misalignment)
This is just like the dynamics of sabotaging alignment research that I know you’ve thought a lot about
I think it’s pretty easy for it to fail to produce this stuff without tipping its hand. Consider: if OpenAI leadership really cared a lot about preventing concentration of power, they would be investing some % of their resources in pioneering good corporate governance structures for themselves & lobbying for those structures to be mandatory for all companies. The fact that they aren’t doing that is super suspicious....
Except no it isn’t, because there isn’t a strong consensus that they should obviously be doing that. So they can get away with basically doing nothing, or doing something vaguely related to democratic AI, or whatnot.
Similarly, OpenBrain+Consensus-1 could have all sorts of nifty d/acc projects that they are ostentatiously funding and doing, except for the ones that would actually work to stop their takeover plans. And there simply won’t be a consensus that the ones they aren’t doing, or aren’t doing well, are strong evidence that they are evil.
Let’s suppose OpenBrain human employees genuinely want to accelerate d/acc because they don’t want to die.
They task Agent-1 with generating the 50 most promising avenues.
They can separately get human experts to generate the top 50 avenues. Then if the human experts come up with a great idea Agent-1 misses, that would be suspicious and some evidence of misalignment. And they could train against it so that Agent-1 learns not to sandbag (hopefully in a way that generalises pretty far).
I agree it’s unlikely OpenBrain would actually put in this much effort. And even if they did, it’s pretty unclear how much it would help. But still, I feel more optimistic than I think you are.
Let’s apply that same argument to OpenAI. Suppose that many current humans genuinely want to avoid a concentration-of-power world where e.g. the C-suite of a company or two, plus whoever is POTUS, can basically be a junta or Oversight Committee with crazy amounts of unaccountable power over the world, e.g. by aligning the AIs to themselves. They can come up with a bunch of ideas themselves, as well as evaluate the ideas OpenAI comes up with when asked these questions in interviews and so forth. They can get non-OpenAI experts to evaluate the top 50 ideas. Will those experts come up with a prioritized list whose top ten ideas are all already being done by OpenAI? Of course not. Quite plausibly none of the top ten ideas (as evaluated by external experts) are already being done by OpenAI. Are people raising a stink about this? If they did, would it work? No.
If an overwhelming majority of civil society plus the USG were pressuring OpenAI in this direction, I think it would have a substantial effect. If only a few non-profits did it, I think it would have little effect.
To make your analogy work, we need to ask whether the relationship between OpenBrain employees and their AIs is more like “USG + civil society vs. OpenAI” or more like “a few non-profits vs. OpenAI”. I’d say “OpenBrain vs. their AIs” is more like “USG + civil society vs. OpenAI”. So if all of OpenBrain is on board with d/acc and doing the thing Tom said, I think it would have a substantial effect on the AIs.
OK, fair. Well, if all of OpenBrain is on board with d/acc and doesn’t trust Agent-5+, that’s a different situation. I was imagining e.g. that leadership trusts Agent-5+ and thinks the status quo trajectory is fine (they are worried about other things, like competitors and terrorists and China) and maybe a few lower-level employees are suspicious/fearful of Agent-5+.
Not sure if I’m following the argument here, sorry.
I agree there wouldn’t be big external pressure on OpenBrain for not doing d/acc, just like there wouldn’t be in your example.
But my claim was that the OpenBrain employees will choose to do this because they don’t want to die. Not sure what your response is to that. Maybe just that I’m being overly optimistic and the employees won’t bother.
I’m saying the relationship of the public to OpenAI today is similar to the relationship of OpenBrain employees to Consensus-1, Agent-5+, etc. in the 2028 of AI 2027. It’s an analogy. Your argument would be that OpenBrain employees who don’t trust Agent-5+ will be able to command Agent-5+ to build all sorts of d/acc tech, and that if it doesn’t, they’ll get suspicious and shut down Agent-5+. I’m saying that’s not going to work, for similar reasons to why e.g. the public / Congress aren’t demanding that OpenAI do all sorts of corporate governance reforms and getting suspicious when they just do safetywashing and applause lights. The public today doesn’t want OpenAI to amass huge amounts of unaccountable power, and the OpenBrain employees won’t want to die, but in neither case will they be able to distinguish between “OpenAI/Agent-5 is behaving reasonably, if somewhat differently than I’d like” and “Holy shit, it’s evil”. And even though some % will indeed conclude “it’s evil”, they won’t be able to build enough consensus / get enough buy-in.