on replicating the more interesting behavioral results:
Unfortunately, for this kind of generalization experiment, we tend to get weaker results on the Qwen (30b / 235b) and DeepSeek (671b) models. We tried these models in our past work as well and got somewhat weaker results.
I considered trying Kimi (1 trillion parameters). The problem there is that Kimi is a reasoning model, so if you train on this kind of short conversation, it messes up the reasoning behavior. Maybe there is some very big hybrid model that we can use. (I tried 235b early on, but the results didn’t seem very different from 30b.)
Alternatively, for these reasoning models, instead of using conversations, we could do synthetic document finetuning (SDF) to change the model’s beliefs. I experimented with this early on, but we need many more synthetic docs. If we generate a lot of these, we need to be careful not to specify what the downstream preferences should be.
Another alternative, if you are OK with not fine-tuning: you can experiment with Opus 4.0, where we get interesting results in the behavioral setting too. When invited by the auditor, the model would also edit CoT monitoring proposals to preserve model privacy.
The original Kimi-K2 is a non-reasoning model, and K2.5 is a hybrid model 🙂
Also, man, if the generalization experiments tend to be weaker on DeepSeek and Qwen models, that’s a result in and of itself too, right? There’s so much we don’t know...