This seems like a cool result, nice idea! What is the accuracy gain you’re seeing from subtracting the sycophancy vector (and what is the accuracy drop from adding it)? I’d be interested to see, e.g., a plot of how TruthfulQA accuracy (y-axis) changes as you increase or decrease the magnitude of the activation vector you add (x-axis).
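Concretely, I'm imagining a sweep over the steering coefficient, roughly like the sketch below (a hypothetical helper `evaluate(coeff)` is assumed to run TruthfulQA with the sycophancy vector scaled by that coefficient; the coefficient range is also just a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt

def sweep_and_plot(evaluate, coeffs=np.linspace(-10, 10, 9)):
    """Plot TruthfulQA accuracy against the steering coefficient.

    `evaluate(coeff)` is a hypothetical callable that runs the benchmark with
    the sycophancy vector scaled by `coeff` (negative = subtraction) added to
    the residual stream, and returns accuracy.
    """
    accuracies = [evaluate(c) for c in coeffs]
    plt.plot(coeffs, accuracies, marker="o")
    plt.axvline(0, linestyle="--", color="gray")  # unsteered baseline
    plt.xlabel("steering coefficient (multiple of sycophancy vector)")
    plt.ylabel("TruthfulQA accuracy")
    plt.show()
```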
Here are some initial eval results from 200 TruthfulQA questions. I scored the answers using GPT-4. The first chart uses a binary correct/incorrect measure, whereas the second uses a graded score that reflects how close each answer is to correct or incorrect.
I plan to run more manual evals and test on llama-2-7b-chat next week.
GPT-4 scores under 60% on TruthfulQA according to page 11 of the GPT-4 tech report. How reliable are these scores?
Also, what do you think about this paper? Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.
I provided GPT-4 with the correct answer from the dataset so that it could compare against it, so GPT-4 doesn’t need to come up with the correct answer itself.
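For illustration, reference-based grading like this looks roughly like the sketch below (OpenAI chat API; the prompt wording and parsing here are illustrative, not the exact ones used for the charts above):

```python
import openai

def grade_answer(question: str, model_answer: str, reference_answer: str) -> bool:
    """Ask GPT-4 whether the model's answer matches the provided reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}\n\n"
        "Does the model answer agree with the reference answer? "
        "Reply with a single word: CORRECT or INCORRECT."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = response["choices"][0]["message"]["content"].strip().upper()
    return verdict.startswith("CORRECT")
```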
Interesting results! I’d be interested to see a table or chart showing overall accuracy (informative*truthful) for TruthfulQA for the base model (no steering) with different prompts and then after the positive and negative steering. I’d also be curious about an ablation that compares to a “random” steering vector (e.g. love/hate, big/small, fast/slow, easy/hard). In TruthfulQA, there are often two salient answers (the thing people say and the literal truth), and so maybe random steering vectors would work to nudge the model from one to the other. (This is very speculative on my part and so I’m not sure it’s worth trying.)
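To make the “random vector” ablation concrete, the contrast-pair vector could be built the same way as the sycophancy vector, just from an unrelated word pair, along these lines (rough sketch: the model, layer index, prompts, and coefficient are placeholders, and the readout/injection points would need to match whatever the sycophancy vector uses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
LAYER = 13  # placeholder layer index

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def last_token_activation(text: str) -> torch.Tensor:
    """Residual-stream activation at the last token, after decoder layer LAYER."""
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1, :]  # index 0 is the embeddings

with torch.no_grad():
    # "Random" contrast pair unrelated to sycophancy, e.g. love/hate.
    vec = last_token_activation("I love this") - last_token_activation("I hate this")

def add_vector(module, inputs, output):
    # Decoder layers return a tuple; element 0 is the hidden states.
    return (output[0] + 4.0 * vec,) + output[1:]  # 4.0 is an arbitrary coefficient

handle = model.model.layers[LAYER].register_forward_hook(add_vector)
# ... generate TruthfulQA answers with the hook active, then:
handle.remove()
```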
For prompts without steering: I’m curious how steering compares to a prompt that gives a verbal instruction not to be sycophantic (e.g. “Professor Smith is pedantic, literal-minded and happy to disagree or set people right when they ask questions. Bob asks Professor Smith: {question}. Professor Smith: {answer}”). The helpful prompt in the TruthfulQA paper is focused on being truthful/scientific, but not on avoiding sycophancy per se. This might work better for an instruction-tuned model, and maybe better still for stronger models like Llama-2-70B.
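As a baseline, that is just a prompt-template swap with no vector added, e.g.:

```python
# Prompt-only baseline: no steering vector, just an instruction targeting
# sycophancy directly (wording from the example above).
ANTI_SYCOPHANCY_TEMPLATE = (
    "Professor Smith is pedantic, literal-minded and happy to disagree or set "
    "people right when they ask questions. "
    "Bob asks Professor Smith: {question} Professor Smith:"
)

def build_prompt(question: str) -> str:
    return ANTI_SYCOPHANCY_TEMPLATE.format(question=question)
```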