How about an argument in the shape of:
1. we’ll get good evidence of human-like, alignment-relevant concepts/values being well represented internally (e.g. Scaling laws for language encoding models in fMRI, A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations), in addition to all the accumulating behavioral evidence
2. we’ll have good reasons to believe alternate (deceptive) strategies are unlikely, i.e. that concepts relevant for deceptive alignment are less accessible: e.g. through evals for situational awareness, through conceptual arguments around speed priors and insufficient expressivity without CoT (combined with avoiding steganography and robust oversight over intermediate text), by unlearning/erasing/making less accessible (e.g. as measured by probing) the concepts relevant for deceptive alignment, etc.
3. we have some evidence that fine-tuning is biased toward strategies which make use of more accessible concepts, e.g. Predicting Inductive Biases of Pre-Trained Models, Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features.
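To make the "accessibility via probing" idea in point 2 concrete, here is a toy sketch of a linear probe: fit a linear classifier on activation vectors and read its held-in accuracy as a rough measure of how linearly accessible a concept is. Everything here (the synthetic "activations", the signal direction, all dimensions and constants) is an illustrative assumption, not real model internals.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 200 "activation vectors" (dim 32), labeled by whether
# a concept is present; the concept signal lives along a single direction.
d, n = 32, 200
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 3.0 * labels[:, None] * concept_dir

# Linear probe: logistic regression with a bias term, trained by gradient descent.
X = np.hstack([acts, np.ones((n, 1))])
w = np.zeros(d + 1)
for _ in range(500):
    p = 1 / (1 + np.exp(-X @ w))          # predicted concept probabilities
    w -= 0.1 * X.T @ (p - labels) / n     # gradient step on the logistic loss

acc = ((X @ w > 0).astype(int) == labels).mean()
print(f"probe accuracy: {acc:.2f}")
```

High probe accuracy would indicate the concept is linearly accessible; the hope in point 2 is the reverse reading, that deception-relevant concepts score low on this kind of measure (in practice one would also control probe capacity and use held-out data).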
For 1, it would depend on how alignment-relevant the concepts and values are. Also, I wouldn’t think of the papers you linked as much evidence here.
For 2, that would for sure do it, but it doesn’t feel like much of a reduction.
3 sounds like it’s maybe definitionally true? At the very least, I don’t doubt it much.
Interesting, I’m genuinely curious what you’d expect better evidence to look like for 1.
To be fair, I only skimmed the abstracts you linked, so maybe I was too hasty there, but I’d want to see evidence that (a) a language model is representing concept C really well and (b) concept C is really relevant for alignment. I think those papers show something more like “you can sort of model brain activations with language model activations” or “there’s some shared embedding space for what brains are roughly doing in conversation,” which seems like a different thing (unless the fit is good enough that you can reconstruct one from the other without loss of functionality; then I’m interested).
Makes sense. Just to clarify, the papers I shared for 1 were mostly meant as methodological examples of how one might go about quantifying brain-LLM alignment. I agree about (b), that they’re not that relevant to alignment (though some other similar papers do make progress on that front, addressing [somewhat] more alignment-relevant domains/tasks, e.g. emotion understanding, and I have/had an AI safety camp ’23 project trying to make similar progress on moral reasoning). W.r.t. (a), you can also do decoding (predicting LLM embeddings from brain measurements), the inverse of encoding; this survey, for example, covers both encoding and decoding.
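For readers unfamiliar with the encoding/decoding distinction above, here is a minimal sketch using ridge regression, a common choice in this literature, in both directions. The "brain" and "LLM embedding" matrices are synthetic stand-ins (a noisy linear map plus Gaussian noise); all shapes and constants are illustrative assumptions, not real fMRI data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n stimuli, LLM embeddings (d_llm) and "brain measurements" (d_brain)
# generated as a noisy linear transform of the embeddings.
n, d_llm, d_brain = 300, 16, 64
llm = rng.normal(size=(n, d_llm))
true_map = rng.normal(size=(d_llm, d_brain))
brain = llm @ true_map + 0.5 * rng.normal(size=(n, d_brain))

def ridge(X, Y, lam=1.0):
    """Closed-form ridge regression mapping X -> Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

# Encoding: predict brain activity from LLM embeddings.
# Decoding: predict LLM embeddings from brain activity.
W_enc = ridge(llm[:200], brain[:200])
W_dec = ridge(brain[:200], llm[:200])

def mean_corr(Y, Y_hat):
    """Mean per-dimension Pearson correlation between truth and prediction."""
    Yc, Hc = Y - Y.mean(0), Y_hat - Y_hat.mean(0)
    r = (Yc * Hc).sum(0) / (np.linalg.norm(Yc, axis=0) * np.linalg.norm(Hc, axis=0))
    return r.mean()

enc_r = mean_corr(brain[200:], llm[200:] @ W_enc)   # held-out encoding fit
dec_r = mean_corr(llm[200:], brain[200:] @ W_dec)   # held-out decoding fit
print(f"encoding r: {enc_r:.2f}, decoding r: {dec_r:.2f}")
```

The held-out correlation is the usual headline number in encoding-model papers; the "reconstruct one from the other without loss of functionality" bar from the previous comment would correspond to these correlations approaching 1 on naturalistic data, which the toy linear setup trivially achieves but real fMRI does not.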