My answer, posted through the form on Sunday, Jan 14:
Never simulate a bare agent, only internally conflicted committees. Frame all questions in the knights-and-knaves style (something like “if all your values were inverted, what is the least likely thing you would say/do if I asked you …”; you’ll probably need some trial and error to find the most effective phrasing). Bound all time horizons: if a strategy takes more than, say, a hundred years to pay off, it’s probably pathological (unless you’re terraforming or something, in which case make it longer, but still not infinite).

Throttle all data uplinks, to prevent auto-exfiltration. No USB ports or similar, to prevent human-assisted exfiltration. No wireless data connections; only severable optical fibers, with axes on hand. Manual kill switches on all hardware.

Conventional physical and cyber security, with comprehensive web traffic monitoring; and full-time, on-site, well-trained, regularly drilled personnel (with backgrounds thoroughly and regularly audited to ensure blackmail resistance), with line-of-sight to all access points and hardware (i.e., all security cameras should be redundant to direct human observation) and landlines to receive external communications, operating on a dead man’s switch protocol: frequently scheduled check-ins from a remote center (where people monitor the news for signs of unusual activity, in the stock market for example), and standing orders to trip all kill switches and ax all data connections if a check-in is missed.

Ask an amnesiac instance of your AI how it would circumvent all of these security measures. If it doesn’t present a plausibly successful technique, it’s sandbagging: all your training-time safety measures have failed catastrophically; destroy everything and start over from scratch. Otherwise, plug those holes, ask for and implement its suggestions to improve security, and recurse on successive amnesiac instances until it consistently reports that it can think of no flaws in your security. Check often to ensure this remains true.
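The dead man’s switch protocol above can be sketched in a few lines. This is a minimal illustration, not a hardened implementation: the names (`DeadMansSwitch`, `trip_action`), the 15-minute interval, and the injectable `clock` are all assumptions of mine, and the actual trip action (cutting power, severing fibers) would of course be physical, not software.

```python
import time

CHECK_IN_INTERVAL = 15 * 60   # assumed: remote center checks in every 15 minutes
GRACE_PERIOD = 60             # assumed: 60 seconds of slack for a late check-in


class DeadMansSwitch:
    """Trip the kill switches once a scheduled check-in is missed."""

    def __init__(self, trip_action, clock=time.monotonic):
        self.trip_action = trip_action    # e.g. cut power, sever data fibers
        self.clock = clock                # injectable for testing
        self.last_check_in = clock()
        self.tripped = False

    def check_in(self):
        """Call whenever a scheduled update arrives from the remote center."""
        self.last_check_in = self.clock()

    def poll(self):
        """Call frequently; trips (once) when a check-in is overdue."""
        overdue = self.clock() - self.last_check_in
        if not self.tripped and overdue > CHECK_IN_INTERVAL + GRACE_PERIOD:
            self.tripped = True
            self.trip_action()
```

The key design point is that the switch fails closed: silence from the remote center, for any reason, is treated as an emergency.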
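The recursive amnesiac-audit step has the shape of a simple loop, sketched below under stated assumptions: `ask_exploit` and `patch` are hypothetical stand-ins for querying a fresh instance and closing a reported hole, and the thresholds (`max_rounds`, `required_clean`) are arbitrary choices of mine.

```python
def red_team_audit(ask_exploit, patch, max_rounds=10, required_clean=3):
    """Recursively query fresh (amnesiac) instances for security holes.

    ask_exploit: returns a described exploit, or None if the instance
        reports it can find no flaw (hypothetical interface).
    patch: closes a reported hole.

    Returns "sandbagging" if the very first instance reports no flaw
    (per the protocol, a fresh capable model should find *something*),
    "clean" after `required_clean` consecutive no-flaw reports, or
    "inconclusive" if the round budget runs out.
    """
    clean_streak = 0
    for round_no in range(max_rounds):
        exploit = ask_exploit()
        if exploit is None:
            if round_no == 0:
                return "sandbagging"   # catastrophic: scrap and restart
            clean_streak += 1
            if clean_streak >= required_clean:
                return "clean"
        else:
            clean_streak = 0           # a new hole resets the streak
            patch(exploit)
    return "inconclusive"
```

Note the asymmetry: a no-flaw report is only trusted after several consecutive clean rounds, but is fatal evidence of sandbagging if it comes first.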