Not clear it’s tractable to do safety research that will generalize this well without access to frontier models, for several reasons:
Most of the research will be automated, unless the pause is soon, and then any kind of research will go much faster when done by frontier models.
Technologies might be different in the future. It might be like trying to increase the reliability of the Boeing 787’s jet engines while only having access to WWII-era propeller planes.
Safety methodologies as you scale up even if the technology remains the same.
When generalization has say a 25% chance of not working every generation, you need to empirically verify it works on frontier models, or the EV of your research is cut by 25%.
If we won a pause in the first place, it’s more likely we’ll be able to permanently implement a careful scaling regime where all models are extensively tested during training and before deployment
Most of the research will be automated, unless the pause is soon, and then any kind of research will go much faster when done by frontier models.
Plausible to me, yes. But my experience with frontier models is that it’s possible-but-hard to prompt them into the kind of work I think is relevant. Perhaps that’s good enough to make your point hold, though. It does weaken my original argument if this line is correct.
Technologies might be different in the future. It might be like trying to increase the reliability of the Boeing 787’s jet engines while only having access to WWII-era propeller planes.
If you’re doing work that isn’t guaranteed to work for any future approach, you’re not doing work that can possibly be durable. At some point, in some deployment of some model, the derivative of intelligence will spike hard enough to break through the safety barriers in a way that produces an adversarial glider. That adversarial glider can then come up with the least-convenient new technique. You need to be durable in the face of that, which means somehow being sure you can detect those potential gliders before, and ideally also after, they happen.
To put it another way, what I’m trying to determine is whether the thing you’re describing (incrementally doing control work in tandem with capability growth) is qualitatively a durable thing to be doing. That’s what I want a pause to figure out. The kind of work you’re describing is object-level relative to the kind of work I think needs to be done. Qualitatively, how do you design, in 1930, an aircraft research program that can be known ahead of time never to produce a fatality, given that you know you don’t know all the relevant dynamics?
Safety methodologies as you scale up even if the technology remains the same.
I don’t know how to parse this line.
When generalization has say a 25% chance of not working every generation, you need to empirically verify it works on frontier models, or the EV of your research is cut by 25%.
Tell me how you’re going to make the probability of generalization failure quickly asymptote to zero and I’ll be happy, even if your answer is a description of an ongoing maintenance research program, as long as you can tell me why that research program produces an asymptote to zero. It’s okay if it takes multiple rounds and involves frontier models; it’s okay if it involves ongoing maintenance; but it’s not okay if the meta-level question of “will my ongoing maintenance become reliable?” is itself an ongoing question.
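To make the arithmetic concrete, here is a minimal sketch (numbers purely illustrative, and assuming per-generation generalization failures are independent): a constant 25% failure rate drives the cumulative probability that your research still holds toward zero geometrically, while a maintenance program that shrinks the failure rate fast enough keeps it bounded away from zero.

```python
# Minimal sketch (numbers illustrative, assuming independent
# per-generation generalization failures): compare a constant 25%
# failure rate against a maintenance schedule that shrinks it.

def cumulative_survival(failure_probs):
    """Probability the safety result still holds after every generation."""
    survival = 1.0
    for f in failure_probs:
        survival *= (1.0 - f)
    return survival

generations = 20

# Constant 25% chance of failing to generalize each generation.
constant = [0.25] * generations

# Failure probability halved each generation (0.25, 0.125, 0.0625, ...).
# The failure probabilities sum to a finite value, so the cumulative
# survival probability asymptotes to a positive constant instead of zero.
shrinking = [0.25 * (0.5 ** t) for t in range(generations)]

print(cumulative_survival(constant))   # ~0.003 -- decays geometrically toward 0
print(cumulative_survival(shrinking))  # ~0.58 -- bounded away from zero
```

The whole question, then, is whether the per-generation failure probability actually shrinks like that, rather than staying roughly constant.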
If we won a pause in the first place, it’s more likely we’ll be able to permanently implement a careful scaling regime where all models are extensively tested during training and before deployment
Agreed. I want to know how to be sure we’re in that regime. That’s what I would call “solving” the alignment problem. Once we can know whether we’re in that regime, I would feel much less at risk. Right now it looks like we don’t even know the full list of requirements for that regime, though we’re a lot closer to knowing than we were even a year ago. I’m hopeful that it can be done.
But if we can never identify reliable meta-level safety principles with appropriate conditionals, so that whenever their safeguards stop applying the response is to reliably and automatically pause until new safety principles are identified, then I can never feel fully comfortable.
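As a purely illustrative sketch of the conditional structure I have in mind (the names and checks below are hypothetical placeholders, not a real pipeline), the essential property is that the default action when a safeguard’s assumptions stop holding is to pause:

```python
# Toy sketch only -- names and checks are hypothetical placeholders,
# not a real deployment pipeline. The point is the default: if any
# safeguard's assumptions stop applying, the system pauses rather than
# continuing while new principles are worked out.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SafetyPrinciple:
    name: str
    # True only while the assumptions this principle relies on still
    # hold for the current model or training run.
    assumptions_hold: Callable[[], bool]

def may_proceed_with_scaling(principles: List[SafetyPrinciple]) -> bool:
    """Allow scaling only while every principle's assumptions still apply."""
    broken = [p.name for p in principles if not p.assumptions_hold()]
    if broken:
        print(f"Pausing: safeguards no longer apply for {broken}")
        return False  # stay paused until new safety principles are identified
    return True
```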
Incidentally, this feels like as fine a time as any to mention that I still think scale-down experiments are a critical test bed for all scale-up safety principles.