A thing that might be worth trying: quantize the deceptive models down, and see what that does to their truthfulness.
Hypothesis: acting deceptively is a more complex behavior for an LLM than being truthful. Thus, anything that cripples an LLM’s ability to act in complex ways is going to make them more truthful. Quantization would have that effect too.
That method might, then, lose power on more capable LLMs, or in case of deeper deceptive behaviors. Also if you want to check for deception in extremely complex tasks—LLM’s ability to perform the task might fall off a cliff long before deception does.
A thing that might be worth trying: quantize the deceptive models down, and see what that does to their truthfulness.
Hypothesis: acting deceptively is a more complex behavior for an LLM than being truthful. Thus, anything that cripples an LLM’s ability to act in complex ways is going to make them more truthful. Quantization would have that effect too.
That method might, then, lose power on more capable LLMs, or in case of deeper deceptive behaviors. Also if you want to check for deception in extremely complex tasks—LLM’s ability to perform the task might fall off a cliff long before deception does.