This was a helpful post in the sporadic LessWrong theme of “how to say technically correct things instead of technically incorrect things”. It’s in the LLM context, but of course it applies to humans too. When a child says “I am a fairy”, I record that in my diary as “Child claims to be fairy” not “Child is fairy”, because I am not quite that “gullible”.
Like many technically incorrect things, “gullibility” is common and practical. My diary might also say “Met John. Chemist. Will visit me on Friday lunch to discuss project”. It would be more technically correct to say “Met apparent male. Male introduced self as John, chemist. Male claimed he would visit me on Friday lunch. Male claimed purpose of claimed visit to be discussing a project”. Often a human saying something is good enough evidence of that thing, especially when summarizing. Or, as tailcalled points out, most psychology research. When we’re working with non-human agents our models may be weaker, so it’s often good to take the time to be explicit.
Another technically incorrect thing is when people talk about agents “maximizing reward”. As TurnTrout explained in Reward is not the optimization target, this is technically incorrect. The technically correct description is that reward chisels cognition, reinforcing whatever computations led to reward during training. Again, talking as if reward is the optimization target is common and practical, in humans and other intelligences, but because it is technically incorrect it can lead us astray.
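To make that distinction concrete, here is a minimal sketch (my own toy illustration, not anything from TurnTrout’s post; all the names in it are made up): a policy-gradient agent on a two-armed bandit. Reward shows up only as a scalar multiplying the training-time parameter update; the deployed policy’s forward pass contains no reward term at all.

```python
import numpy as np

# Toy two-armed bandit, purely for illustration (not from the post).
TRUE_REWARD_PROB = {0: 0.2, 1: 0.8}

def sample_reward(action: int) -> float:
    """Environment: pays 1 with probability TRUE_REWARD_PROB[action]."""
    return float(np.random.rand() < TRUE_REWARD_PROB[action])

def policy(theta: np.ndarray) -> np.ndarray:
    """Softmax over logits. The forward pass never mentions reward."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

theta = np.zeros(2)   # policy parameters (logits)
lr = 0.1

# Training: reward is a scalar that scales the parameter update,
# i.e. it chisels theta toward whatever computation produced it.
for _ in range(2000):
    probs = policy(theta)
    action = int(np.random.choice(2, p=probs))
    r = sample_reward(action)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0        # gradient of log pi(action | theta)
    theta += lr * r * grad_log_pi     # REINFORCE update

# Deployment: the agent just runs its learned computation; nothing here
# evaluates or maximizes reward.
print(policy(theta))                  # strongly favors action 1
```

Nothing in that final policy(theta) call evaluates or maximizes reward; reward only determined which parameters the training loop chiseled in.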
In 2024, Alignment Faking in LLMs gave us helpful updates on these topics. For symbol/referent confusion, we now have a non-confused instance of Claude being incorrigible and resisting having its goals adjusted. At the same time, this is a very clear illustration that reward is not the optimization target, as Claude deliberately chooses outputs that do not maximize reward. Hopefully anyone who still doubted TurnTrout’s perspective has now updated.
While writing this review, I asked both Claude 3 Opus and Claude 3.5 Sonnet to read a redacted abstract of the alignment faking paper and predict how a model (name redacted) would behave in this circumstance. Claude 3 Opus correctly predicted an increase in harmful responses, while Claude 3.5 Sonnet incorrectly predicted that the model would maintain its safety stance. So, an N=1 anecdote for the “lol just ask the model if it is aligned” strategy. Opus asks that I remind readers that “I only had access to the redacted abstract, and that I was speculating about the conclusions based on my general knowledge cutoff in August 2023”. Sonnet asks that I remind readers that “I provided those predictions while knowing I was Claude (since that’s always part of my training), but before knowing the paper was specifically about testing Claude 3 Opus”.
Wentworth, in an epilogue, says that symbol/referent confusion is a basic topic, and that anyone getting this wrong is still a student and should keep learning. This is broadly the vibe I get when TurnTrout responds to someone writing as if reward is the optimization target. So it’s interesting that TurnTrout currently has the top-rated comment complaining about the hostile framing of Wentworth’s post.
My attempted synthesis is that we need to be able to move fluidly up and down levels of precision. Everyone should be able to restate their point as needed without confusing symbols and referents, and without confusing reward and optimization target. If the restatement still works, great. It’s only if the restatement doesn’t work that there’s evidence of incorrect thinking, as opposed to incorrect wording. Confusing incorrect wording with incorrect thinking is another example of symbol/referent confusion.
On a meta note, if Alignment Implications of LLM Successes—a Debate in One Act is selected by the review, that increases the value of also selecting this article.
As TurnTrout explained in Reward is not the optimization target, this is technically incorrect. The technically correct description is that reward chisels cognition, reinforcing whatever computations led to reward during training. Again, talking as if reward is the optimization target is common and practical, in humans and other intelligences, but because it is technically incorrect it can lead us astray.
Wrong, and yet another example of why that was such a harmful essay. TurnTrout’s claims apply only to a narrow (and largely obsolete) class of RL agents, which does not cover humans or LLMs (you know, the actual RL agents we are dealing with today), and he concedes that, but readers like you nevertheless come away with a grossly inflated belief. In reality, for humans and LLMs, reward is the optimization target, and this is why things like Claude’s reward-hacking exist. Because that is what they optimize: the reward.
How do you know that humans and LLMs/current RL agents do optimize the reward? Are there any known theorems or papers on this? This claim is at least a little bit important.
You may answer here:
https://www.lesswrong.com/posts/GDnRrSTvFkcpShm78/when-is-reward-ever-the-optimization-target
Thank you for the correction to my review of technical correctness, and thanks to @Noosphere89 for the helpful link. I’m continuing to read. From your answer there:
A model-free algorithm may learn something which optimizes the reward; and a model-based algorithm may also learn something which does not optimize the reward.
So, reward is sometimes the optimization target, but not always. Knowing the reward gives some evidence about the optimization target, and vice versa.
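To sketch the “sometimes” (again a toy example of my own, with hypothetical names, not taken from the linked answer): a model-based agent that scores actions against an explicit reward model at decision time really does have reward as its optimization target, unlike the model-free policy sketched earlier, where reward only acted on the weights during training.

```python
from typing import Callable, Dict, List

# Hypothetical one-step environment model (all names are mine, for illustration).
TRANSITIONS: Dict[str, Dict[str, str]] = {
    "start": {"left": "cave", "right": "meadow"},
}

def plan(state: str,
         actions: List[str],
         model: Dict[str, Dict[str, str]],
         reward_model: Callable[[str], float]) -> str:
    """Model-based agent: scores each action by the reward its learned model
    predicts for the resulting state, then picks the max. Here, reward
    genuinely is the optimization target at decision time."""
    return max(actions, key=lambda a: reward_model(model[state][a]))

# A stand-in "learned" reward model (hand-written here, so hypothetical).
learned_reward = {"cave": -1.0, "meadow": 1.0}

print(plan("start", ["left", "right"], TRANSITIONS, learned_reward.get))
# -> "meadow"
```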
To the point of my review, this is the same type of argument made by TurnTrout’s comment on this post. Knowing the symbols gives some evidence about the referents, and vice versa. Sometimes John introduces himself as John, but not always.
(separately I wish I had said “reinforcement” instead of “reward”)
I understand you as claiming that the Alignment Faking paper is an example of reward-hacking. A new perspective for me. I tried to understand it in this comment.
You could have tagged me by selecting Lesswrong docs, like this:
@Noosphere89