How about an argument in the shape of:
1. we’ll get good evidence of human-like, alignment-relevant concepts/values being well represented internally (e.g. Scaling laws for language encoding models in fMRI, A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations), in addition to all the accumulating behavioral evidence
2. we’ll have good reasons to believe alternate (deceptive) strategies are unlikely, i.e. that concepts relevant for deceptive alignment are less accessible: e.g. through evals for situational awareness, through conceptual arguments around speed priors and insufficient expressivity without CoT (combined with avoiding steganography and robust oversight over intermediate text), by unlearning/erasing/making less accessible (e.g. as measured by probing) the concepts relevant for deceptive alignment, etc.
3. we have some evidence that fine-tuning is biased toward strategies which make use of more accessible concepts, e.g. Predicting Inductive Biases of Pre-Trained Models, Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features.
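To make the "accessibility via probing" idea in point 2 concrete, here is a toy sketch of a linear probe: fit a linear classifier on activation vectors and read its held-in accuracy as a rough measure of how linearly accessible a concept is. Everything here (the synthetic "activations", the signal direction, all dimensions and constants) is an illustrative assumption, not real model internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 "activation vectors" (dim 32), labeled by whether
# a concept is present; the concept signal lives along a single direction.
d, n = 32, 200
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 3.0 * labels[:, None] * concept_dir

# Linear probe: logistic regression with a bias term, trained by gradient descent.
X = np.hstack([acts, np.ones((n, 1))])
w = np.zeros(d + 1)
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))          # predicted concept probabilities
    w -= 0.1 * X.T @ (p - labels) / n     # gradient step on the logistic loss

acc = ((X @ w > 0).astype(int) == labels).mean()
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy would indicate the concept is linearly accessible; the hope in point 2 is the reverse reading, that deception-relevant concepts score low on this kind of measure (in practice one would also control probe capacity and use held-out data).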
For 1, it would depend on how alignment-relevant the concepts and values are. Also, I wouldn’t think of the papers you linked as much evidence here.
For 2, that would for sure do it, but it doesn’t feel like much of a reduction.
3 sounds like it’s maybe definitionally true? At the very least, I don’t doubt it much.
Interesting, I’m genuinely curious what you’d expect better evidence to look like for 1.
To be fair, I only skimmed the abstracts you linked, so maybe I was too hasty there, but I’d want to see evidence that (a) a language model is representing concept C really well and (b) concept C is really relevant for alignment. I think those papers show something more like “you can sort of model brain activations with language model activations” or “there’s some shared embedding space for what brains are roughly doing in conversation,” which seems like a different thing (unless the fit is good enough that you can reconstruct one from the other without loss of functionality; then I’m interested).
Makes sense. Just to clarify, the papers I shared for 1 were mostly meant as methodological examples of how one might go about quantifying brain-LLM alignment. I agree about (b), that they’re not that relevant to alignment (though some other similar papers do make progress on that front, addressing [somewhat] more alignment-relevant domains/tasks, e.g. emotion understanding, and I have/had an AI safety camp ’23 project trying to make similar progress on moral reasoning). W.r.t. (a), you can also do decoding (predicting LLM embeddings from brain measurements), the inverse of encoding; this survey, for example, covers both encoding and decoding.
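For readers unfamiliar with the encoding/decoding distinction above, here is a minimal sketch using ridge regression, a common choice in this literature, in both directions. The "brain" and "LLM embedding" matrices are synthetic stand-ins (a noisy linear map plus Gaussian noise); all shapes and constants are illustrative assumptions, not real fMRI data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n stimuli, LLM embeddings (d_llm) and "brain measurements" (d_brain)
# generated as a noisy linear transform of the embeddings.
n, d_llm, d_brain = 300, 16, 64
llm = rng.normal(size=(n, d_llm))
true_map = rng.normal(size=(d_llm, d_brain))
brain = llm @ true_map + 0.5 * rng.normal(size=(n, d_brain))

def ridge(X, Y, lam=1.0):
    """Closed-form ridge regression mapping X -> Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Encoding: predict brain activity from LLM embeddings.
# Decoding: predict LLM embeddings from brain activity.
W_enc = ridge(llm[:200], brain[:200])
W_dec = ridge(brain[:200], llm[:200])

def mean_corr(Y, Y_hat):
    """Mean per-dimension Pearson correlation between truth and prediction."""
    Yc, Hc = Y - Y.mean(0), Y_hat - Y_hat.mean(0)
    r = (Yc * Hc).sum(0) / (np.linalg.norm(Yc, axis=0) * np.linalg.norm(Hc, axis=0))
    return r.mean()

enc_r = mean_corr(brain[200:], llm[200:] @ W_enc)   # held-out encoding fit
dec_r = mean_corr(llm[200:], brain[200:] @ W_dec)   # held-out decoding fit
print(f"encoding r: {enc_r:.2f}, decoding r: {dec_r:.2f}")
```

The held-out correlation is the usual headline number in encoding-model papers; the "reconstruct one from the other without loss of functionality" bar from the previous comment would correspond to these correlations approaching 1 on naturalistic data, which the toy linear setup trivially achieves but real fMRI does not.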