I basically agree, ensuring that failures are fine during training would sure be great. (And I also agree that if we have a setting where failure is fine, we want to use that for a bunch of evaluation/red-teaming/...). As two caveats, there are definitely limits to how powerful an AI system you can sandbox IMO, and I’m not sure how feasible sandboxing even weak-ish models is from the governance side (WebGPT-style training just seems really useful).
Yes, I agree that actually getting companies to consistently use the sandboxing is the hardest piece of the puzzle. I think there are some promising advancements that make it easier to get the benefits of not-sandboxing while staying inside a sandbox. For instance: using a snapshot of the entire web, hosted on an isolated network https://ai.facebook.com/blog/introducing-sphere-meta-ais-web-scale-corpus-for-better-knowledge-intensive-nlp/
I think this takes away a lot of the disadvantages, especially if you combine it with some amount of simulation of backends to provide interactability (not an intractably hard task, but it would take some funding and development time).
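To make the backend-simulation idea concrete, here's a minimal sketch of what the request-resolution layer could look like. Everything in it is hypothetical (the `SNAPSHOT` dict, `mock_search_backend`, and the example URLs are stand-ins, not any real system's API): static reads are served from a frozen web snapshot, registered interactive endpoints are answered by simulators, and nothing ever leaves the sandbox.

```python
# Hypothetical sketch of a sandboxed "web" for agent training, assuming the
# agent's HTTP layer is replaced by this stub. Static page content comes from
# a frozen snapshot; interactive endpoints are simulated by mock backends so
# the agent retains interactability without touching the live internet.

SNAPSHOT = {
    "https://example.com/": "<html>Example Domain</html>",
    "https://example.com/about": "<html>About page</html>",
}

def mock_search_backend(params):
    # Simulates a state-less search API; a richer sandbox could also
    # simulate stateful backends (carts, accounts, form submissions).
    q = params.get("q", "")
    return f"<html>Simulated results for {q!r}</html>"

# Interactive endpoints the sandbox chooses to simulate.
MOCK_BACKENDS = {
    "https://example.com/api/search": mock_search_backend,
}

def sandboxed_fetch(url, params=None):
    """Resolve a request entirely inside the sandbox.

    Reads are answered from the static snapshot; registered endpoints are
    answered by their simulators; anything else gets a 404-style page
    instead of an outbound connection.
    """
    if url in SNAPSHOT:
        return SNAPSHOT[url]
    if url in MOCK_BACKENDS:
        return MOCK_BACKENDS[url](params or {})
    return "<html>404: outside the snapshot</html>"
```

The design choice this illustrates: the agent's interface is unchanged (it still "fetches URLs"), so training setups like WebGPT's could in principle run against it, while the failure modes stay contained to the isolated network.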