Bogdan Ionut Cirstea comments on Interpreting the Learning of Deceit

Bogdan Ionut Cirstea 29 Apr 2024 18:27 UTC
1 point
If, instead, we see some parts of the deceit circuitry becoming more active, or even almost-always active, then it seems very likely that something like the training in of a deceitfully-pretending-to-be-honest policy (as I described above) has happened: some of the deceit circuitry had been repurposed and is being used all of the time to enable an ongoing deceit.
The paper "A Mechanistic Understanding of Alignment Algorithms: A Case Study on DPO and Toxicity" seems to me closely related in terms of methodology.