I like it.
You’ve listed mostly things that countries should do out of self-interest, without much need for international cooperation. (A bit disanalogous to the UN SDGs.) This is fine, but I think there could also be some useful principles for international regulation of AI that countries could agree to in principle, to pave the way for cooperation even in an atmosphere of competitive rhetoric.
Under Develop Safe AI, it’s possible “Alignment” should be broken down into a few chunks, though I’m not sure. There’s a current paradigm called “alignment” that uses supervised finetuning plus reinforcement learning on large models, in which new reward functions and new ways of leveraging human demonstrations/feedback all bear a family resemblance. And then there’s everything else: philosophy of preferences, decision theory for alignment, non-RL alignment of LLMs, neuroscience of human preferences, speculative new architectures that don’t fit in the current paradigm. Labels might just be something like “Alignment via finetuning” vs. “Other alignment.”
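To make that “family resemblance” concrete: most work in the finetuning paradigm bottoms out in a learned reward signal fit to human comparisons, which then drives an RL or RL-adjacent finetuning step. Here’s a minimal sketch of the standard pairwise reward-model loss in PyTorch; the `reward_model` and the batches of encoded comparisons are hypothetical placeholders, not anything from your doc.

```python
# Illustrative sketch of the pairwise preference (Bradley-Terry) loss used in
# RLHF-style reward modeling. `reward_model` is any network mapping an encoded
# (prompt, response) pair to a scalar score; the batches are hypothetical.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_batch, rejected_batch):
    """Train the reward model so human-preferred responses score higher.

    chosen_batch / rejected_batch: encodings of the preferred and dispreferred
    (prompt, response) pairs from human comparison data.
    """
    r_chosen = reward_model(chosen_batch)      # shape: (batch,)
    r_rejected = reward_model(rejected_batch)  # shape: (batch,)
    # Negative log-likelihood that the human-chosen response wins under a
    # Bradley-Terry model: -log sigmoid(r_chosen - r_rejected).
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# The fitted reward model then serves as the objective for an RL step (e.g.
# PPO) or a direct-preference-style finetuning step on the base LLM.
```

Most variation within the paradigm is about where the comparisons come from and how the resulting reward gets used, which is part of why these sub-areas feel like one chunk and everything else feels like a different chunk.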
Under Societal Resilience to Disruption, I think “Epistemic Security Measures” could be fleshed out more. The first thing that comes to mind is letting people tell whether some content or message is from an AI, and empowering people to filter for human content/messages. (Proposals range from legislation outlawing impersonating a human, to giving humans unique cryptographic identifiers, to something something blockchain something Sam Altman.)
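On the cryptographic-identifier idea, the mechanical core is just provenance signatures: a key that has been bound to a verified human signs the content, and clients check the signature before treating it as human. A rough sketch with Ed25519 via the `cryptography` package; key distribution and actually binding keys to humans, which is the hard part, is assumed away here.

```python
# Illustrative sketch of content provenance via signatures: a verified-human
# key signs a message, and a reader checks the signature against the public
# key. How keys get bound to real humans (the hard part) is out of scope.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# Key generation would happen once, at identity-verification time.
private_key = Ed25519PrivateKey.generate()
public_key = private_key.public_key()

message = b"This post was written by a verified human."
signature = private_key.sign(message)

# Any reader (or client-side filter) can check provenance before displaying
# or boosting the content.
try:
    public_key.verify(signature, message)
    print("verified human-signed content")
except InvalidSignature:
    print("unverified or tampered content")
```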
But you might imagine more controversial and dangerous measures, like using your own AI propaganda-bot to try to combat all external AI propaganda-bots, or instituting censorship based on the content of the message and not just whether its sender is human (which could be a political power play, could arise as mission creep from trying to combat non-AI disinformation under the banner of “Epistemic Security,” or could come from expecting AIs or AI-empowered adversaries to have human-verified accounts spreading their messages). I think the category I’m imagining (which may be different from the category you’re imagining) might benefit from a more specific label like “Security from AI Manipulation.”