Externalized reasoning models suffer from the “legibility penalty”—the fact that many decisions are easier to make than to justify or explain. I think this is a significant barrier to the competitiveness of authentic train of thought, although not in particularly legible domains, such as math proofs and programming (illegible knowledge goes into math proofs, but you trust the result regardless, so it’s fine).
Another problem is that standard training procedures only incentivize the model to use reasoning steps produced by a single human. This means, for instance, that if you ask a question involving two very different domains of knowledge, a good language model wouldn’t expose its knowledge about both of them, as that’s OOD for its training dataset. This may show up in an obvious fashion, as if multiple humans had collaborated on the train of thought, or in a way that’s harder to interpret. If you just want to expose this knowledge, though, you could train on amplified human reasoning (i.e. from human teams).
Also, if you ever train the model on conclusion correctness, you incentivize semantic drift between its reasoning and human language: the model would prefer to pack more information into each token than humans do, and might want to express concepts humans don’t normally verbalize (one type is fuzzy correlations, of which models know a lot). Even if you penalize the KL divergence between the reasoning and human language, that doesn’t necessarily incentivize authentic human-like reasoning, just its appearance.
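A minimal sketch of the kind of KL penalty being discussed, in pure Python over toy next-token distributions (the function names and the reward shape are my own illustration, not from any particular paper): the task reward for a reasoning token gets docked in proportion to how far the policy’s next-token distribution drifts from a reference human-language model.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def penalized_reward(task_reward, policy_dist, human_lm_dist, beta=0.1):
    """Reward for one reasoning token: the task reward minus a KL penalty
    that pulls the policy's next-token distribution toward a reference
    human-language model (beta trades off correctness vs. human-likeness)."""
    return task_reward - beta * kl_divergence(policy_dist, human_lm_dist)
```

Note that the penalty only constrains the distribution over tokens, so a policy can score well while the tokens quietly carry non-human meanings, which is exactly the appearance-versus-authenticity problem.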
In general I’m unsure whether authentic train of thought is better than just having the model imitate specific concrete humans via ordinary language modelling—if you start a text as if written by a known smart, truthful person, you get out an honest prediction of what that person believes.
Agreed, the competitiveness penalty from enforcing internal legibility is the main concern with externalized reasoning / factored cognition. The secular trend in AI systems is towards end-to-end training and human-uninterpretable intermediate representations; while you can always do slightly better at the frontier by adding some human-understandable components like chain of thought (previously beam search & probabilistic graphical models), in the long run a bigger end-to-end model will win out.
One hope that “externalized reasoning” can buck this trend rests on the fact that success in “particularly legible domains, such as math proofs and programming” is actually enough for transformative AI—thanks to the internet and especially the rise of remote work, so much of the economy is legible. Sure, your nuclear-fusion-controller AI will have a huge competitiveness penalty if you force it to explain what it’s doing in natural language, but physical control isn’t where we’ve seen AI successes anyway.
Side note:
“standard training procedures only incentivize the model to use reasoning steps produced by a single human.”
I don’t think this is right! The model will have seen enough examples of dialogue and conversation transcripts; it can definitely generate outputs that involve multiple domains of knowledge from prompts like
“An economist and a historian are debating the causes of WW2.”
In the “economist and historian” case, though, it will only synthesize their knowledge as much as those humans would, and humans are pretty suboptimal at integrating others’ opinions.
When I think about “human-like reasoning” I’m mostly thinking about the causal structure of that reasoning: each step is causally connected to the other reasoning steps in the right kind of way. Luckily, there seem to be lots of ways you could actually try to enforce this in the model. We can break the usual compute graph of the LM and, say, corrupt some of the tokens in the chain of thought or some of the hidden representations, then see what happens, similar to what the ROME paper did with causal tracing.
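As a toy illustration of that corruption idea (everything here is a stand-in: `answer_prob` just scores token overlap, whereas a real implementation would query the LM), the sketch corrupts one chain-of-thought token and measures how much the answer’s score drops, in the spirit of ROME-style causal tracing:

```python
def answer_prob(chain_of_thought, answer):
    """Toy stand-in for an LM's probability of `answer` given the chain of
    thought. A real implementation would run the model; here we just score
    token overlap so the intervention logic below is runnable."""
    relevant = set(chain_of_thought) & set(answer.split())
    return len(relevant) / max(len(answer.split()), 1)

def corruption_effect(chain_of_thought, answer, position, noise_token="???"):
    """Corrupt one token of the chain of thought and return how much the
    answer score drops -- a causal-tracing-style measure of whether that
    reasoning step actually feeds into the conclusion."""
    clean = answer_prob(chain_of_thought, answer)
    corrupted = list(chain_of_thought)
    corrupted[position] = noise_token
    return clean - answer_prob(corrupted, answer)
```

A step whose corruption barely moves the answer is evidence that it wasn’t causally load-bearing, i.e. the model may have written it for show rather than used it.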
I’m actively thinking about how to put this into practice right now and I’m relatively optimistic about the idea.