TurnTrout comments on Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

TurnTrout 19 Oct 2023 0:38 UTC
“as we scale existing technology up or change details of NN architectures, gradient methods, etc.”
I think most practical alignment techniques have scaled quite nicely, with CCS maybe being an exception, and we don’t currently know how to scale the interpretability advances in OP’s paper.
Blessings of scale (IIRC): RLHF, constitutional AI / AI-driven dataset inclusion decisions / meta-ethics, activation steering / activation addition (Llama-2-chat results forthcoming), adversarial training / red-teaming, prompt engineering (though RLHF can interfere with responsiveness), …
I think the prior strongly favors “scaling boosts alignability” (at least in “pre-deceptive” regimes, though I have become increasingly skeptical of that purported phase transition, or at least its character).
“‘Weak methods’ means confidence is achieved more empirically”
I’d personally say “empirically promising methods” instead of “weak methods.”