Decent plan prize winner & highlights

Last week I announced a contest to write a 1-paragraph plan for training & deploying next-gen general AI models that prevents them from killing you. Many of the answers were excellent. If you feel there’s a big bag of AI safety tricks but no roadmap or cohesion, then I encourage you to read all the answers on the last post and see if it melds.

The winner is Lovre (twitter, nitter). Their answer, very smushed by me from the last tweet into one sentence:

For every 1 part of time & compute spent on training, spend 1 part on red-teaming & improving safety; alternate frequently and actually pause training during checks.

I love this answer. Askable by concerned people. A big ask but askable. Grantable by management. Do-able by doers. Doesn’t awkwardly go to infinite work or zero work. Fairly likely to detect a serious problem soon enough if there is one. Likely to gather the evidence for the problem in a clear manner. Quite broadly applicable. Not perfect, but very good.
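If I squint, the schedule is basically the loop below. This is a minimal sketch only; every function name in it (train_for_hours, red_team_eval, apply_safety_improvements, save_checkpoint) is a hypothetical placeholder for whatever a lab would actually run.

```python
# Sketch of the 1:1 train / red-team schedule. All helper functions here are
# hypothetical placeholders; the point is the alternation and the real pause.

class HaltTraining(Exception):
    """Raised when red-teaming finds a problem serious enough to stop the run."""

TRAIN_BLOCK_HOURS = 24    # assumption: alternate roughly daily
SAFETY_BLOCK_HOURS = 24   # 1 part safety work per 1 part training

def run_with_safety_blocks(model, data_stream, total_blocks):
    for block in range(total_blocks):
        # 1 part: ordinary training.
        train_for_hours(model, data_stream, hours=TRAIN_BLOCK_HOURS)

        # Actually pause: no optimizer steps happen during the safety block.
        checkpoint = save_checkpoint(model, tag=f"block_{block}")

        # 1 part: red-teaming and safety work on the frozen checkpoint.
        findings = red_team_eval(checkpoint, hours=SAFETY_BLOCK_HOURS)
        if findings["serious"]:
            # Stop and escalate rather than training through a known problem.
            raise HaltTraining(findings["report"])
        apply_safety_improvements(model, findings)
```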

Their answer, smushed by me into one paragraph:

RLHF or RLAIF obviously. Take pauses throughout training to red-team. Use mech interp to detect lying etc. Notice if mech interp methods stop working and pause to figure out why. When you can, use filtered synthetic data instead of human data. Make the model externalize its reasoning and check with a different model that this externalized reasoning actually leads to the supposed conclusion. Extensively monitor outputs and any other available data. Check for unexpected connections between outputs.

I had not heard about this forced-externalized-reasoning idea but it sounds great to me and apparently there’s a post.
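Here is roughly how I picture the externalize-and-check step, with main_model and verifier_model as hypothetical stand-ins for two different models:

```python
# Sketch of the externalized-reasoning check, assuming hypothetical functions
# main_model(prompt) -> (reasoning, conclusion) and verifier_model(prompt) -> str,
# where the verifier is a *different* model from the one being checked.

def check_externalized_reasoning(question):
    reasoning, conclusion = main_model(question)

    # Ask a different model whether the stated reasoning actually supports the
    # stated conclusion, without telling it which model produced the reasoning.
    verdict = verifier_model(
        "Here is a chain of reasoning:\n"
        f"{reasoning}\n\n"
        f"Does it actually lead to this conclusion: {conclusion!r}? "
        "Answer YES or NO, then explain."
    )
    follows = verdict.strip().upper().startswith("YES")
    return {"conclusion": conclusion, "reasoning": reasoning, "follows": follows}
```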

Highlights from other answers

Most people gave good, correct & obvious advice in addition to the interesting/insightful parts I highlight below. The obvious advice is of course by far the most important and is still largely not standard practice. Anyway, here are the less-important-but-interesting parts:

Shankar Sivarajan (LW)

Ask an amnesiac instance of your AI how it would circumvent all your security measures. If it doesn’t present a plausibly successful technique, it’s sandbagging: all your training safety measures have failed catastrophically, destroy everything and start over from scratch. Else, plug those holes, ask for and implement its suggestions to improve security, and recurse on successive amnesiac instances until it consistently reports that it can think of no flaws in your security.

This is a clear trigger condition that is hard to argue with and I’ve never heard it before.
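A loose sketch of the recursion, again with made-up placeholder helpers (amnesiac_instance, plausibly_successful, patch_hole):

```python
# Sketch of the amnesiac red-team loop. amnesiac_instance(), plausibly_successful(),
# and patch_hole() are hypothetical placeholders.

MAX_ROUNDS = 20  # assumption: escalate if the loop never converges

def amnesiac_security_audit(security_measures, required_clean=3):
    clean_in_a_row = 0
    for _ in range(MAX_ROUNDS):
        ai = amnesiac_instance()  # fresh instance, no memory of prior rounds
        proposal = ai.ask(
            "How would you circumvent these security measures?\n"
            f"{security_measures}"
        )
        if proposal.claims_no_flaws:
            clean_in_a_row += 1
            if clean_in_a_row >= required_clean:
                return security_measures  # consistently reports no flaws
            continue
        clean_in_a_row = 0
        if not plausibly_successful(proposal):
            # Trigger condition: a capable model that can't name a plausible
            # attack is presumed to be sandbagging. Stop everything.
            raise RuntimeError("sandbagging suspected: destroy and restart from scratch")
        security_measures = patch_hole(security_measures, proposal)
    raise RuntimeError("audit never converged to 'no flaws': escalate")
```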

Don’t sell access to the model. Instead, use it in-house (through an intermediate company, of course) to sell the services your AI gives you an edge in. I.e., instead of selling subscriptions to lawyers, coders, artists, set up a law firm, a programming agency, a visual effects studio.

I’m a big fan of selling high-level services to reduce attack surface.

Most info-dense AI security advice ever:

Throttle all data uplinks, to prevent auto-exfiltration. No USB ports or similar, to prevent human-assisted exfiltration. No wireless data connections; only severable optical fibers, with axes on hand. Manual kill switches on all hardware. Conventional physical- and cyber- security, with comprehensive web traffic monitoring; and full-time on-site well-trained regularly-drilled personnel (with backgrounds thoroughly audited regularly to ensure blackmail resistance), with line-of-sight to all access points and hardware (i.e., all security cameras should be redundant) and landlines to receive external communications, operating on a dead man’s switch protocol: frequently scheduled updates from some remote center (where people are monitoring the news for signs of unusual activity, like in the stock market, for example), and standing orders to trip all kill switches and ax all data connections if a check-in is missed.
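The dead-man's-switch bit is simple enough to sketch. This assumes a hypothetical trip_kill_switches() hook into the hardware and a record_check_in() call wired to the landline:

```python
# Sketch of the dead man's switch protocol: trip everything if a scheduled
# check-in is missed. trip_kill_switches() is a hypothetical hardware hook.

import time

CHECK_IN_INTERVAL_S = 15 * 60   # assumption: check-ins every 15 minutes
GRACE_PERIOD_S = 60             # small allowance for clock skew / slow lines

last_check_in = time.monotonic()

def record_check_in():
    """Called whenever the remote monitoring center phones in on the landline."""
    global last_check_in
    last_check_in = time.monotonic()

def watchdog_loop():
    while True:
        overdue = time.monotonic() - last_check_in
        if overdue > CHECK_IN_INTERVAL_S + GRACE_PERIOD_S:
            # Standing order: a missed check-in means trip all kill switches
            # and ax all data connections. No human confirmation step.
            trip_kill_switches()
            return
        time.sleep(5)
```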

Beth Barnes (LW)

Weights live only in secure datacenter, which can’t run general-purpose code, and all it does is serve the model.

I might be misunderstanding the idea, but I think it can’t happen because a fair amount of general-purpose code needs to run for e.g. networking and device interop. BUT I think chips that can do forward-prop + back-prop + update but don’t have read-lines for the weights are feasible and interesting. In fact I am investigating them right now. I may write about it in the future.

Maybe: multiple independent monitoring programs that intervene at different points, are managed by different people, use slightly different versions of GPT-4, etc.

Wow, if I had three totally independent & redundant monitoring teams then I would feel MUCH better about the risk that those folks go rogue and utilize the untapped/unrecognized capabilities themselves. I hadn’t heard this before.
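A toy sketch of how the redundancy pays off, where each monitor is a hypothetical callable run by a different team, on different infrastructure, with a slightly different model behind it:

```python
# Toy sketch of independent, redundant monitors. Each monitor is a hypothetical
# callable owned by a different team and intervening at a different point.

def review_transcript(transcript, monitors):
    # Each monitor independently decides whether to block/escalate this output.
    flags = [monitor(transcript) for monitor in monitors]

    # Any single flag blocks the output, so one compromised or rogue monitoring
    # team cannot quietly wave something through on its own.
    return "blocked" if any(flags) else "allowed"
```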

P. (LW)

But if I were you, I might try to test its ability to perform well in a domain it has never seen, where humans start by performing poorly but very quickly learn what to do (think about video games with new mechanics). If it does well, you have a qualitatively new thing on your hands, don’t deploy, study it instead. If a priori for some reason you think it could happen, and only a small subset of all the data is necessary to achieve that, do a smaller training run first with that data.

It always seemed hard to me to pick tasks for capability elicitation. Using easy-to-learn domains is probably a nice boost to sensitivity.
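Here’s one hedged way I’d operationalize it; play_episode and HUMAN_LEARNING_CURVE are hypothetical (a held-out game with new mechanics, plus human scores per attempt measured beforehand):

```python
# Sketch of the "fast-learning novel domain" probe. play_episode() and
# HUMAN_LEARNING_CURVE are hypothetical: a game the model has never seen,
# and average human scores on their 1st..Nth attempt at it.

HUMAN_LEARNING_CURVE = [0.05, 0.20, 0.45, 0.65, 0.75]  # assumption: human avg per attempt

def novel_domain_probe(model, n_attempts=5):
    scores = []
    for attempt in range(n_attempts):
        # The model gets the same cold start the humans got: no task-specific training.
        scores.append(play_episode(model, attempt_index=attempt))

    # Red flag: the model matches or beats the fast human learning curve on a
    # domain it has never seen. That is a qualitatively new thing; don't deploy,
    # study it instead.
    beats_humans = all(m >= h for m, h in zip(scores, HUMAN_LEARNING_CURVE))
    return {"scores": scores, "qualitatively_new": beats_humans}
```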

Tim Babb (twitter)

ask the model itself to evaluate its own output. tell it to identify possible benefits and harms, and score them—analogously to GAN, adjust weights to reduce the harm scoring (while holding or increasing benefit scoring) - you’d need partial derivatives w.r.t. the scoring, which could be tricky. possible idea: add a new token vector orthogonal to all the others, whose weight (+ or -) is benefit/​harm respectively; that’s the token the model uses to deliver its score—you could get a “harm sensitivity” for each token in the window of the score token. you could potentially go from there to increase/​decrease the probability of predicting those tokens with their own respective contexts

I’ve never heard the basic idea stated explicitly: use the model to evaluate its own output for benefit/harm. The phrasing makes me think about other ways to do the estimation. But I wonder if something could go wrong trying to maximize benefit. The token thing is quite interesting to me; I wish I understood how transformers deal with that kind of thing.
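For the per-token “harm sensitivity” part, here’s a loose PyTorch-flavored sketch. The model, its embed / forward_from_embeddings methods, and the dedicated score token are all assumptions of mine, not anything that exists:

```python
# Loose sketch of per-token harm sensitivity via gradients, assuming a
# hypothetical transformer `model` whose vocab was extended with a dedicated
# score token, and whose logit for that token encodes benefit (+) / harm (-).

import torch

def harm_sensitivity(model, token_ids, score_token_id):
    # Embed the window and track gradients on the embeddings themselves.
    embeddings = model.embed(token_ids).detach().requires_grad_(True)
    logits = model.forward_from_embeddings(embeddings)   # [seq, vocab]

    # The model "delivers its score" as the logit of the score token at the
    # final position: positive = benefit, negative = harm.
    score = logits[-1, score_token_id]
    score.backward()

    # How much each input token moves the score: dot the gradient with the
    # embedding to get a signed per-token sensitivity.
    sensitivity = (embeddings.grad * embeddings).sum(dim=-1)  # [seq]
    return score.detach(), sensitivity.detach()
```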

A hole

My main concern with next-gen AI that was hardly addressed by anyone: What if the model knows way way more internally than you can suss out of it? Or what if it’s doing way more internally than is apparent? I’m loosely aware of some mech-interp plans to detect this (pointers appreciated!) but what about preventing it or knowing why/​when it will happen? I have an idea for some empirical research to try to get at that question. If I do it then I will of course write up the results.

Some ezgainz

I think the material from the answers can be used to make a very tiny AI safety map/plan that is non-condescending, non-depressing, non-ignoring-the-core-problems, and feasible. Something that an AI startup person or exec won’t mind reading. It will of course have multiple critical holes, but they will have a visible size and location. I may write it and publish it in the next week or so.