Got several variants of the following question: “You say you have no solution for scheming, which could make the V&V method unworkable. Why do you then claim it is a practical method?”
Here is my answer: Scheming could in theory break any oversight method, but the V&V method does make several anti-scheming contributions:
(1) Its scenario-based, coverage-driven tests automatically flush out most specification-gaming tricks (a milder form of reward hacking), essentially treating them as bugs (see § 4.2).
(2) It plugs into other frameworks: in CAI, for example, it supplies the “external reality check” that prevents model collapse (see § 3.1).
(3) Because every run produces a human-auditable safety case and logged sim traces, and because it is easy to request virtual sims of new scenarios, spot-audits are cheap, so schemers face a constant risk of exposure.
(4) These spot-audits can make various anti-collusion approaches (such as whistle-blowing, rotation and diversity) more effective.
That is why I still call it a practical layer, even while we attack scheming head-on elsewhere. Would love to hear opinions about these points.