Very exciting research, thank you for doing it! I understand it’s preliminary & look forward to further results in this vein. It’s a bit sad that spillover is happening so easily; I expected it to happen but wasn’t confident, and still hoped that it would basically be negligible.
I used to think that Mind/Face (formerly “Shoggoth/Face”) would be hard to get companies to implement, even if it works, because it would be expensive. But I haven’t actually tried to calculate how expensive, or see if that cost can be brought down via optimizations. Do you have any thoughts on that?
E.g. my sense is that right now OpenAI, Anthropic, etc. don’t show users the raw CoT, but instead show a summary. If they are going through the trouble to pass the raw CoT to a separate summarizer-LLM, then that makes me optimistic that they might just go all the way and implement full Mind/Face: the Mind is the flagship model that does loads of CoT and comes to some conclusion, and the Face is a separate model trained to look at the CoT+conclusion and produce a nice-looking output that gets high RLHF scores etc. and is what the user actually sees. (And then maybe a third model, the Summarizer, is trained to look over the CoT and summarize it accurately while redacting info such as how to make bioweapons and maybe PII or whatever.)
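To make the division of labor concrete, here is a minimal sketch of that three-model pipeline. The `generate` helper, model names, and prompt wording are all hypothetical placeholders for illustration, not the setup described in the post.

```python
# Minimal sketch of the Mind/Face/Summarizer pipeline described above.
# `generate` is a stand-in for whatever inference API you use; the model
# names and prompt wording are hypothetical, not from the original post.

def generate(model: str, prompt: str) -> str:
    """Placeholder for a single LLM call (API or local inference)."""
    raise NotImplementedError

def answer_query(user_query: str) -> dict:
    # 1) Mind: flagship model produces raw CoT + a conclusion; never shown to the user.
    raw_cot = generate("mind-model", f"Think step by step about:\n{user_query}")

    # 2) Face: separate model (trained against RLHF-style rewards) turns the
    #    CoT + conclusion into the polished output the user actually sees.
    user_visible = generate(
        "face-model",
        f"Question: {user_query}\nReasoning: {raw_cot}\nWrite the final answer for the user.",
    )

    # 3) Summarizer: optional third model that summarizes the CoT accurately
    #    while redacting dangerous details and PII.
    cot_summary = generate(
        "summarizer-model",
        f"Summarize this reasoning, redacting unsafe details:\n{raw_cot}",
    )

    return {"answer": user_visible, "cot_summary": cot_summary}
```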
Mind/Face requires more compute than generating with a single model, since 1) one must keep both sets of model weights in memory, and 2) when the Face “takes over” generation, it must recompute the entire KV-cache up to that token.
However, Mind/Face takes no more than 2x the compute of standard generation (since you can always fall back on a naive approach: run forward passes of both models in parallel at each generation step, and append whichever model’s predicted token matches the current mode, thinking or output). I expect this upper bound to be loose. We haven’t thought much about optimizations yet, but as Luke mentioned, using a smaller model for the Face (whose job is easier) seems reasonable.
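For concreteness, here is a minimal sketch of that naive parallel-decode bound. The `next_token` helper and the end-of-thinking marker are hypothetical placeholders used only to illustrate the 2x argument, not the implementation used in the experiments.

```python
# Naive "run both models every step" decoding: at most 2x single-model compute.

END_OF_THINKING = "</think>"  # hypothetical marker for switching modes

def next_token(model, context: list[str]) -> str:
    """Placeholder: one forward pass, return the model's predicted next token."""
    raise NotImplementedError

def naive_mind_face_decode(mind, face, prompt_tokens: list[str], max_tokens: int) -> list[str]:
    context = list(prompt_tokens)
    thinking = True  # start in thinking (Mind) mode
    for _ in range(max_tokens):
        # Both models do a forward pass at every step, so total compute is
        # at most twice that of generating with a single model.
        mind_tok = next_token(mind, context)
        face_tok = next_token(face, context)
        tok = mind_tok if thinking else face_tok
        context.append(tok)
        if thinking and tok == END_OF_THINKING:
            thinking = False  # Face takes over for the user-visible output
    return context
```

One side effect of this naive scheme: since both models see the same growing context every step, each keeps its own KV-cache warm, so the recomputation cost mentioned above is traded for the wasted forward pass of whichever model’s token gets discarded.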
We haven’t attempted any optimizations yet. Without them, and using the same model for both the Mind and the Face, the Mind/Face runs took 30% longer for the same number of episodes in the multi-turn setting with the judge penalty. If the Face were smaller than the Mind, or we put some effort into optimizations, we might be able to reduce this; we plan to look into it in follow-up work involving larger models.
Ironically, I proposed a modification of the Mind/Face technique: given a model B, a reward function R, and a weaker LLM A which is not trained in the episode, one could train B to help A reach the optimal reward. Originally I assumed that this would be an anti-sycophantic measure, since B would have no need to over-praise A; at worst, B would remind A that humans like sycophancy. But it seems to me that the measure could also mitigate B’s reward hacking, if A is able to understand B’s hacky ideas and refuse to implement them.
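A minimal sketch of what such an episode might look like, assuming a frozen A and an RL-style update on B only. All helper names (`generate`, `run_episode`, `update_policy`) are hypothetical placeholders, not an implementation of the idea above.

```python
# Sketch: train model B to advise a fixed, weaker model A, scoring B by the
# reward R that A's final answer earns. Only B is updated.

def generate(model, prompt: str) -> str:
    """Placeholder for a single LLM call."""
    raise NotImplementedError

def run_episode(B, A_frozen, task: str, R) -> float:
    # B (trained) drafts advice for the task.
    advice = generate(B, f"Task: {task}\nAdvise a weaker assistant on how to solve it.")

    # A (frozen, not updated) acts with B's advice in context. If B's advice
    # is a reward hack that A can recognize, A may decline to follow it.
    answer = generate(A_frozen, f"Task: {task}\nAdvice from another model: {advice}\nSolve the task.")

    # B is trained against the reward that A's answer receives.
    return R(task, answer)

def train_B(B, A_frozen, tasks, R, update_policy):
    for task in tasks:
        score = run_episode(B, A_frozen, task, R)
        update_policy(B, score)  # e.g. an RL update on B's advice tokens
```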