DeepSeek-R1 seems to explore diverse regions of thought space, frequently using “Wait” and “Alternatively” to abandon the current line of thought and try something else.
Given a DeepSeek-R1 CoT, it should be possible to distill it into an “idealized reconstruction” containing only the salient parts.
Cf. Daniel Kokotajlo’s shoggoth + face idea
Cf. the “historical” vs. “rational reconstruction” writing styles described by Shieber
What if the correct way to do safety post-training is to train a separate aligned model on top (the face), instead of trying to align the base model directly?
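As a rough illustration of the first step toward such a reconstruction, the sketch below splits a CoT at the “Wait” and “Alternatively” pivots so each branch of thought can be inspected on its own. It is only a crude heuristic under stated assumptions: the function name `split_on_pivots` is hypothetical, the pivot list is just the two words noted above, and deciding which segments are actually salient is left open.

```python
import re

# Pivot words taken from the notes; treating them as branch boundaries
# is an assumption, not a property documented for the model.
PIVOT = re.compile(r"\b(Wait|Alternatively)\b")


def split_on_pivots(cot: str) -> list[tuple[str, str]]:
    """Split a chain of thought into (pivot, text) segments.

    The opening segment gets an empty pivot, since nothing precedes it.
    Segments with no text are dropped.
    """
    # With one capture group, re.split returns
    # [text, pivot, text, pivot, text, ...].
    parts = PIVOT.split(cot)
    segments = [("", parts[0].strip())]
    for i in range(1, len(parts), 2):
        # Remove the punctuation that usually follows the pivot word.
        text = parts[i + 1].lstrip(" ,:").strip()
        segments.append((parts[i], text))
    return [(pivot, text) for pivot, text in segments if text]


if __name__ == "__main__":
    example = (
        "Try x = 2. Wait, that fails the constraint. "
        "Alternatively, use x = 3. So the answer is 3."
    )
    for pivot, text in split_on_pivots(example):
        print(f"[{pivot or 'start'}] {text}")
```

Matching is case-sensitive and word-bounded, so “Waiting” or a lowercase “wait” mid-sentence does not start a new segment; whether that is the right boundary rule for real transcripts is untested here.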