Thomas Kwa comments on Don’t Dismiss Simple Alignment Approaches

Thomas Kwa 5 Nov 2023 22:06 UTC
2 points
IMO the most useful version of this would be to get empirical evidence on techniques. E.g. erase certain concepts using LEACE and see whether that inhibits the model’s use of those concepts, including during further training. Otherwise it seems hard to ensure that there is not some gap between your definitions and reality.
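To make the erasure step concrete, here is a minimal NumPy sketch of a LEACE-style eraser, assuming the closed form r(x) = x - W^+ P W (x - mean), where W is the whitening matrix of the features and P is the orthogonal projection onto the column space of W times the feature/concept cross-covariance. The function names (`fit_leace`, `erase`, `max_cross_cov`) and the synthetic data are hypothetical, made up for illustration, and this is not the reference implementation. It only checks that the concept is no longer linearly correlated with the erased features; whether a model relearns the concept during further training would need an actual training run.

```python
import numpy as np

def fit_leace(X, Z):
    """Fit a LEACE-style eraser on features X (n, d) and concept labels Z (n, k).

    Returns (mu, P) so that erased = X - (X - mu) @ P.T.
    """
    n = X.shape[0]
    mu = X.mean(axis=0)
    Xc = X - mu
    Zc = Z - Z.mean(axis=0)
    sigma_xx = Xc.T @ Xc / n          # feature covariance
    sigma_xz = Xc.T @ Zc / n          # feature/concept cross-covariance

    # Whitening matrix W = (sigma_xx^{1/2})^+ and its pseudoinverse,
    # both built from the eigendecomposition of sigma_xx.
    evals, evecs = np.linalg.eigh(sigma_xx)
    evals = np.clip(evals, 0.0, None)
    tol = evals.max() * evals.size * np.finfo(evals.dtype).eps
    keep = evals > tol
    sqrt_vals = np.sqrt(evals) * keep
    inv_sqrt_vals = np.zeros_like(evals)
    inv_sqrt_vals[keep] = 1.0 / np.sqrt(evals[keep])
    W = (evecs * inv_sqrt_vals) @ evecs.T
    W_pinv = (evecs * sqrt_vals) @ evecs.T

    # Orthogonal projection onto the column space of W @ sigma_xz.
    A = W @ sigma_xz
    proj = A @ np.linalg.pinv(A)

    P = W_pinv @ proj @ W
    return mu, P

def erase(X, mu, P):
    """Apply the fitted eraser to the rows of X."""
    return X - (X - mu) @ P.T

def max_cross_cov(X, Z):
    """Largest absolute entry of the empirical cross-covariance of X and Z."""
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    return np.abs(Xc.T @ Zc / X.shape[0]).max()

# Synthetic stand-in for model activations: the concept z leaks into feature 0.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=2000)
X = rng.normal(size=(2000, 8))
X[:, 0] += 2.0 * z
Z = z[:, None].astype(float)

mu, P = fit_leace(X, Z)
X_erased = erase(X, mu, P)

print("max |cov(X, Z)| before:", max_cross_cov(X, Z))
print("max |cov(X, Z)| after: ", max_cross_cov(X_erased, Z))
```

In a real experiment, `X` would be activations from some layer and the interesting measurement would be downstream: probe accuracy and task behavior after erasure, and again after fine-tuning.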