What makes DPO analogous to unlearning?
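It’s definitely not the most unlearning-like algorithm there could be, but it does target unwanted responses directly by pushing their probability down, which puts it closer to unlearning than methods that only reinforce preferred responses.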
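To make "targeting unwanted responses directly" concrete, here is a minimal sketch of the standard per-example DPO loss in plain Python. The function name, argument names, and the default `beta=0.1` are illustrative assumptions, not taken from any particular library; the inputs are assumed to be summed token log-probabilities of each response under the trained policy and a frozen reference model.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss (illustrative sketch; names are hypothetical).

    All *_logp arguments are log-probabilities of a full response.
    """
    # How far the policy has moved from the reference on each response.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The rejected response enters with a minus sign: lowering its
    # log-probability raises the margin and lowers the loss.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Same chosen log-prob, but the rejected response is now less likely.
suppressed = dpo_loss(-5.0, -8.0, -5.0, -5.0)
```

In this sketch the loss drops when the rejected response becomes less likely even though the chosen response has not changed, which is the sense in which the objective pushes on unwanted responses directly.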