[Question] Which possible AI systems are relatively safe?

Presumably some kinds of AI systems, architectures, methods, and ways of building complex systems out of ML models are safer or more alignable than others. Holding capabilities constant, you’d be happier to see some kinds of systems than others.

For example, Paul Christiano suggests “LM agents are an unusually safe way to build powerful AI systems.” He says “My guess is that if you hold capability fixed and make a marginal move in the direction of (better LM agents) + (smaller LMs) then you will make the world safer. It straightforwardly decreases the risk of deceptive alignment, makes oversight easier, and decreases the potential advantages of optimizing on outcomes.”

My quick list is below; I’m interested in object-level suggestions, meta observations, reading recommendations, etc. I’m particularly interested in design-properties rather than mere safety-desiderata, but safety-desiderata may inspire lower-level design-properties.

All else equal, it seems safer if an AI system:

  • Has humans in the loop (even better to the extent that they participate in or understand its decisions, rather than just approving inscrutable decisions);

  • Decomposes tasks into subtasks in comprehensible ways, especially if the interfaces between the subagents performing subtasks are transparent and interpretable (see the sketch after this list);

  • Is more supervisable or amenable to AI oversight (what low-level properties determine this, besides interpretability and comprehensible task decomposition?);

  • Is feedforward-y rather than recurrent-y (because recurrent-y systems have hidden states? so this is part of interpretability/overseeability?);

  • Is myopic;

  • Lacks situational awareness;

  • Lacks various dangerous capabilities (coding, weapon-building, human-modeling, planning);

  • Is more corrigible (what lower-level desirable properties determine corrigibility? what determines whether systems have those properties?) (note to self: see 1, 2, 3, 4, and comments on 5);

  • Is legible and process-based;

  • Is composed of separable narrow tools;

  • Can’t be run on general-purpose hardware.
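
To make the first few items more concrete, here’s a minimal, purely illustrative sketch of an LM agent that keeps a human in the loop and keeps the interfaces between subtasks as plain, inspectable text. `call_lm` and `human_approves` are hypothetical placeholders (not any real API); this is a toy under those assumptions, not a claim about how such systems are actually built.

```python
# Minimal, illustrative sketch: an LM agent where every inter-component message is
# plain text and a human approves both the plan and each individual action.
# `call_lm` and `human_approves` are hypothetical placeholders, not a real API.

def call_lm(prompt: str) -> str:
    """Stand-in for a call to whatever language model you're using."""
    raise NotImplementedError("plug in your model/API here")

def human_approves(label: str, content: str) -> bool:
    """Show content to a human overseer and require explicit sign-off."""
    print(f"\n--- {label} ---\n{content}")
    return input("Approve? [y/N] ").strip().lower() == "y"

def run_task(task: str) -> list[str]:
    # 1. Decompose the task into subtasks, as text the human can read directly.
    plan = call_lm(f"Break this task into short, independent subtasks:\n{task}")
    if not human_approves("Proposed plan", plan):
        return []

    results = []
    for subtask in (line.strip() for line in plan.splitlines() if line.strip()):
        # 2. Each subtask gets its own proposal; the interface between steps is
        #    just these human-readable strings, so oversight can inspect everything.
        proposal = call_lm(f"Propose an action, described in text, for:\n{subtask}")
        # 3. Nothing is acted on without the human approving that specific step.
        if human_approves(f"Action for: {subtask}", proposal):
            results.append(proposal)
    return results
```

The point is just that the plan, the interfaces between subtasks, and each proposed action are all legible strings that a human signs off on before anything happens, which is roughly the conjunction of the first three bullets.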

These properties overlap a lot. Also note that there are nice properties at various levels of abstraction, like both “more interpretable” and [whatever low-level features make systems more interpretable].

If a path (like LM agents) or design feature is relatively safe, it would be good for labs to know that. An alternative framing for this question is: what should labs do to advance safer kinds of systems?

Obviously I’m mostly interested in properties that might not require much extra cost or capability sacrifice relative to unsafe systems. A method or path for safer AI is ~useless if it’s far behind unsafe systems.