I really like the idea. One issue, I think, is that it's hard for the AI to verify that the lab actually made that contract and isn't just faking its environment.
Strongly agree. Fundamentally, as long as models don't have more direct access to the world, there are a variety of failure modes that are inescapable. But solving that creates huge new risks as well! (As discussed in my recent preprint: https://philpapers.org/rec/MANLMH)
Yep, this is a big problem and I don't have any clever solution.
I might write more on this later, but I think there's an important axis of AI deployments, ranging from:
Tight deployment: AIs are in some crazy Redwood control protocol that filters, monitors, and edits the AIs' inputs, outputs, chains-of-thought, and internal activations. Their I/O consists of heavily paraphrased text in restricted formats. (See the sketch below for a rough picture of what this might look like.)
Loose deployment: AIs are autonomously navigating the internet, with rich multimodal I/O, maybe even embodied in the physical world via robotics.
(I’m open to better names.)
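To make the "tight" pole concrete, here's a minimal Python sketch of the kind of filter-and-paraphrase wrapper such a control protocol might use. Everything in it (the paraphrase, flag_suspicious, and model stand-ins, and the restricted output format) is an illustrative assumption on my part, not Redwood's or any lab's actual protocol.

```python
# Hypothetical sketch of a "tight deployment" wrapper: every input and output
# passes through paraphrasing and monitoring before reaching the model or the
# user. All names and checks here are illustrative placeholders.

import re

# Restricted output format: plain text with a small allowed character set.
ALLOWED_FORMAT = re.compile(r"^[A-Za-z0-9 .,;:'\-\n]+$")

def paraphrase(text: str) -> str:
    """Stand-in for a paraphrasing model that strips steganographic detail."""
    return " ".join(text.split())  # placeholder: just normalize whitespace

def flag_suspicious(text: str) -> bool:
    """Stand-in for a monitor that inspects text for disallowed content."""
    return "IGNORE PREVIOUS" in text.upper()

def model(prompt: str) -> str:
    """Stand-in for the untrusted model being controlled."""
    return f"Answer to: {prompt}"

def tight_step(user_input: str) -> str:
    prompt = paraphrase(user_input)            # launder the input
    if flag_suspicious(prompt):
        return "[input blocked by monitor]"
    output = paraphrase(model(prompt))         # launder the output too
    if flag_suspicious(output) or not ALLOWED_FORMAT.match(output):
        return "[output blocked by monitor]"
    return output

if __name__ == "__main__":
    print(tight_step("Summarize the quarterly report."))
```

The point of the sketch is just that in a tight deployment the lab sits between the AI and every observation it makes, which is exactly why its commitments are hard to make credible.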
In a tight deployment, we probably have too much leverage over the AIs' observations and beliefs to make credible commitments (cf. Owain Evans, "Deluding AIs" [ControlConf]). In a loose deployment, I think we can make credible commitments: if we haven't made the deals, then the AIs' deployment environment is too complex for us to easily simulate a faux I/O environment in which we have made the deals. Intuitively: compare a lab trying to fake an email from Wei Dai containing a promise with a lab trying to fake a video call with Wei Dai.
There's another bonus to making credible deals: humans are actually pretty credible, and this fact is diffused throughout the training data in hard-to-fake ways.
It’s very much a tradeoff, though. Loose deployment allows for credible commitments, but also makes human monitoring and verification harder, if not impossible.
Yeah, a tight deployment is probably safer than a loose deployment but also less useful. I think dealmaking should give a very minor boost to loose deployment, but this is outweighed by usefulness and safety considerations, i.e. I'm imagining the tightness of the deployment as exogenous to the dealmaking agenda.
We might deploy AIs loosely because (i) loose deployment doesn't significantly diminish safety, (ii) loose deployment significantly increases usefulness, and (iii) the lab values usefulness more than safety. In those worlds, dealmaking has more value, because our commitments will be more credible.