Look at that! People have used interpretability to make a mesa layer! https://arxiv.org/pdf/2309.05858.pdf
This might do more for alignment. Better that we understand mesa-optimization and can engineer it than have it mysteriously emerge.
Good point! Overall I don’t anticipate these layers will give you much control over what the network ends up optimizing for, but I don’t fully understand them yet either, so maybe you’re right.
Do you have specific reason to think modding the layers will easily let you control the high-level behavior, or is it just a justified hunch?
Not in isolation, but that’s just because characterizing the ultimate goal / optimization target of a system is way too difficult for the field right now. I think the important question is whether interp brings us closer, such that, in conjunction with more theory and/or the ability to iterate, we can get some alignment and/or corrigibility properties.
I haven’t read the paper and I’m not claiming that this will be counterfactual to some huge breakthrough, but understanding in-context learning algorithms definitely seems like a piece of the puzzle. To give a fanciful story from my skim: the paper says the model constructs an internal training set. Say we have a technique to excise power-seeking behavior from models by removing the influence of certain training examples. If the model’s mesa-optimization algorithms operate differently, our technique might not work until we understand those algorithms and adapt it. Or we could edit the internal training set directly rather than trying to influence it indirectly.
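To make the "edit the internal training set directly" idea concrete, here is a minimal toy sketch. It is not the paper's architecture or code: just a stand-in mesa layer that fits a ridge-regression model to in-context (key, value) pairs in closed form, plus a hypothetical intervention that removes chosen examples by zeroing their weights. All names and parameters (mesa_layer, weights, ridge) are illustrative assumptions.

```python
import numpy as np

def mesa_layer(keys, values, query, weights=None, ridge=1e-3):
    """Predict a value for `query` by solving weighted ridge regression
    over the in-context (key, value) pairs -- the 'internal training set'."""
    n, d = keys.shape
    w = np.ones(n) if weights is None else weights
    A = keys.T @ (w[:, None] * keys) + ridge * np.eye(d)  # d x d normal equations
    b = keys.T @ (w[:, None] * values)                    # d x value_dim
    W = np.linalg.solve(A, b)                             # closed-form least-squares fit
    return query @ W

rng = np.random.default_rng(0)
keys = rng.normal(size=(8, 4))    # 8 in-context examples, 4-dim keys
values = rng.normal(size=(8, 2))  # 2-dim values
query = rng.normal(size=(4,))

baseline = mesa_layer(keys, values, query)

# Hypothetical intervention: excise examples 2 and 5 from the internal
# training set by zeroing their weights, without touching the outer
# training data at all.
weights = np.ones(8)
weights[[2, 5]] = 0.0
edited = mesa_layer(keys, values, query, weights=weights)

print("prediction before edit:", baseline)
print("prediction after edit: ", edited)
```

The point is just the shape of the intervention: if the mesa layer's "training set" lives in activations, an edit there takes effect at inference time, with no retraining of the outer model.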
Evan Hubinger: In my paper, I theorized about the mesa optimizer as a cautionary tale
Capabilities researchers: At long last, we have created the Mesa Layer from classic alignment paper Risks From Learned Optimization (Hubinger, 2019).
@TurnTrout @cfoster0 you two were skeptical. What do you make of this? They explicitly build upon the copying heads work Anthropic’s interp team has been doing.
As garrett says—not clear that this work is net negative. Skeptical that it’s strongly net negative. Haven’t read deeply, though.