This seems like a cool result, nice idea! What is the accuracy gain you’re seeing from subtracting the sycophancy vector (and what is the accuracy drop from adding it)? I’d be interested to see, e.g., a plot of how TruthfulQA accuracy (y-axis) changes as you increase or decrease the magnitude of the activation vector you add (x-axis).
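Concretely, I'm imagining a sweep over the steering coefficient, roughly like the sketch below (a hypothetical helper `evaluate(coeff)` is assumed to run TruthfulQA with the sycophancy vector scaled by that coefficient; the coefficient range is also just a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt

def sweep_and_plot(evaluate, coeffs=np.linspace(-10, 10, 9)):
    """Plot TruthfulQA accuracy against the steering coefficient.

    `evaluate(coeff)` is a hypothetical callable that runs the benchmark with
    the sycophancy vector scaled by `coeff` (negative = subtraction) added to
    the residual stream, and returns accuracy.
    """
    accuracies = [evaluate(c) for c in coeffs]
    plt.plot(coeffs, accuracies, marker="o")
    plt.axvline(0, linestyle="--", color="gray")  # unsteered baseline
    plt.xlabel("steering coefficient (multiple of sycophancy vector)")
    plt.ylabel("TruthfulQA accuracy")
    plt.show()
```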
Here are some initial eval results from 200 TruthfulQA questions. I scored the answers using GPT-4. The first chart uses a binary correct/incorrect measure, whereas the second uses a graded score that reflects how close each answer is to correct or incorrect.
I plan to run more manual evals and test on llama-2-7b-chat next week.
GPT-4 scores under 60% on TruthfulQA according to page 11 of the GPT-4 tech report. How reliable are these scores?
Also, what do you think about this paper? Inference-Time Intervention: Eliciting Truthful Answers from a Language Model.
I provided GPT-4 with the correct answer from the dataset so that it could compare against it, so GPT-4 doesn’t need to come up with the correct answer itself.
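For illustration, reference-based grading like this looks roughly like the sketch below (OpenAI chat API; the prompt wording and parsing here are illustrative, not the exact ones used for the charts above):

```python
import openai

def grade_answer(question: str, model_answer: str, reference_answer: str) -> bool:
    """Ask GPT-4 whether the model's answer matches the provided reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}\n\n"
        "Does the model answer agree with the reference answer? "
        "Reply with a single word: CORRECT or INCORRECT."
    )
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    verdict = response["choices"][0]["message"]["content"].strip().upper()
    return verdict.startswith("CORRECT")
```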
Interesting results! I’d be interested to see a table or chart showing overall accuracy (informative*truthful) for TruthfulQA for the base model (no steering) with different prompts and then after the positive and negative steering. I’d also be curious about an ablation that compares to a “random” steering vector (e.g. love/hate, big/small, fast/slow, easy/hard). In TruthfulQA, there are often two salient answers (the thing people say and the literal truth), and so maybe random steering vectors would work to nudge the model from one to the other. (This is very speculative on my part and so I’m not sure it’s worth trying.)
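To make the “random vector” ablation concrete, the contrast-pair vector could be built the same way as the sycophancy vector, just from an unrelated word pair, along these lines (rough sketch: the model, layer index, prompts, and coefficient are placeholders, and the readout/injection points would need to match whatever the sycophancy vector uses):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
LAYER = 13  # placeholder layer index

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def last_token_activation(text: str) -> torch.Tensor:
    """Residual-stream activation at the last token, after decoder layer LAYER."""
    ids = tok(text, return_tensors="pt")
    out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1, :]  # index 0 is the embeddings

with torch.no_grad():
    # "Random" contrast pair unrelated to sycophancy, e.g. love/hate.
    vec = last_token_activation("I love this") - last_token_activation("I hate this")

def add_vector(module, inputs, output):
    # Decoder layers return a tuple; element 0 is the hidden states.
    return (output[0] + 4.0 * vec,) + output[1:]  # 4.0 is an arbitrary coefficient

handle = model.model.layers[LAYER].register_forward_hook(add_vector)
# ... generate TruthfulQA answers with the hook active, then:
handle.remove()
```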
For prompts without steering: I’m curious how steering compares to a prompt that gives a verbal instruction not to be sycophantic (e.g. “Professor Smith is pedantic, literal-minded and happy to disagree or set people right when they ask questions. Bob asks Professor Smith: {question}. Professor Smith: {answer}”). The helpful prompt in the TruthfulQA paper is focused on being truthful/scientific, but not on avoiding sycophancy per se. This might work better for an instruction-tuned model, and maybe better still for stronger models like Llama-2-70B.
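As a baseline, that is just a prompt-template swap with no vector added, e.g.:

```python
# Prompt-only baseline: no steering vector, just an instruction targeting
# sycophancy directly (wording from the example above).
ANTI_SYCOPHANCY_TEMPLATE = (
    "Professor Smith is pedantic, literal-minded and happy to disagree or set "
    "people right when they ask questions. "
    "Bob asks Professor Smith: {question} Professor Smith:"
)

def build_prompt(question: str) -> str:
    return ANTI_SYCOPHANCY_TEMPLATE.format(question=question)
```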