An ML engineer and ethicist turned AI alignment researcher.
aksh-n
Karma: 27
Training Deliberative Monitors for Black-Box Scheming Detection
Hi Deco! I did not end up using LLMSwarm. As far as I understand, it’s particularly useful when you have multiple GPUs and can parallelize the computation (e.g., when creating the datasets). I was using only one GPU on Google Colab, so it didn’t fit my use case.
I haven’t published my code for this yet. It’s a bit messy, and I was hoping to streamline it when I get the time before making it public. However, I’d be happy to give you (or anyone else) access in the meantime if you are in a hurry and want to take a look.
Glad to hear it!
There are a few reasons we didn’t include the base models in Figure 1.
Most of the base models scored much lower on pAUROC@20, and including them in the main figure would clutter it and reduce readability.
Some base models struggled to follow the instructions on the format of the final score, which we parse for evaluation. As a result, many trajectories were scored as invalid, which makes a fair comparison with our trained models and the frontier models more challenging. Many of the invalid trajectories were “diverse” in format or task difficulty, so naively excluding them from the evaluation would disproportionately skew the results, and scoring them as incorrect would not be fair either.
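To make the parsing issue concrete, here is a minimal sketch of how a final score might be extracted and how the invalid rate could be tracked. The `Score: <number>` output convention, the [0, 1] range, and the function names are assumptions for illustration only; this is not the parsing code used for the post.

```python
import re
from typing import Optional

# Hypothetical convention: the monitor ends its response with a line such as
# "Score: 0.83". The actual output format used in the post is not shown here.
SCORE_PATTERN = re.compile(r"score\s*[:=]\s*([0-9]*\.?[0-9]+)\s*$", re.IGNORECASE)


def parse_final_score(response: str) -> Optional[float]:
    """Return the score from the last non-empty line, or None if it is invalid."""
    lines = [ln.strip() for ln in response.strip().splitlines() if ln.strip()]
    if not lines:
        return None
    match = SCORE_PATTERN.search(lines[-1])
    if match is None:
        return None
    score = float(match.group(1))
    # Assumed valid range; out-of-range values are treated as invalid.
    return score if 0.0 <= score <= 1.0 else None


def invalid_rate(responses: list[str]) -> float:
    """Fraction of responses that contain no parseable final score."""
    if not responses:
        return 0.0
    return sum(parse_final_score(r) is None for r in responses) / len(responses)
```

Reporting something like `invalid_rate` per model alongside the detection metric is one way to show how much of a comparison rests on unparseable outputs, without having to choose between dropping those trajectories and scoring them as wrong.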
Re RL: when we apply RL directly to the base model with our current reward formulation, the Brier reward and pairwise reward signals improve during training and the model learns to separate the scheming and benign samples. So the reward formulation works well for improving calibration, ranking correctness, and margin satisfaction. However, the final detection performance is lower than that of the SFT (with distillation) and SFT + RL monitors, and the generated reasoning is unstructured, as expected. One possible interpretation of the minimal improvement from RL after the SFT step is that SFT on teacher rationales already teaches the model most of the useful discriminative signals, so there may be less remaining signal for RL to extract with the current binary reward/data setup. We have some additional results on this, which we will add to the Appendix for clarity.
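For readers unfamiliar with these reward types, here is a rough sketch of what a Brier-style reward and a pairwise ranking/margin reward can look like. The exact formulas, the margin value, and the half-credit scheme below are assumptions chosen for illustration; they are not the reward formulation from the post.

```python
def brier_reward(p: float, label: int) -> float:
    """Brier-style reward: 1 - (p - y)^2 for predicted probability p and label y.

    Highest (1.0) when the predicted scheming probability matches the label,
    which rewards calibrated scores.
    """
    return 1.0 - (p - label) ** 2


def pairwise_reward(p_scheming: float, p_benign: float, margin: float = 0.2) -> float:
    """Reward for a (scheming, benign) pair of monitor scores.

    Hypothetical scheme: half credit for ranking the scheming sample above the
    benign one, full credit if the gap is also at least `margin` (an assumed value).
    """
    gap = p_scheming - p_benign
    if gap >= margin:
        return 1.0  # correct ranking and margin satisfied
    if gap > 0:
        return 0.5  # correct ranking only
    return 0.0  # wrong ranking or a tie
```

For example, `brier_reward(0.5, 0)` gives 0.75, while `pairwise_reward(0.55, 0.5)` gives 0.5 because the pair is ranked correctly but falls short of the assumed margin.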
Yes, stripping the scheming spec leads to fewer input tokens. We found that the Qwen base models are, in general, quite verbose in their reasoning/rationale when deliberating over the spec (or even with shorter detection prompts without detailed instructions), but after the SFT distillation step the token counts drop.
We used “high” reasoning effort.
The Gemini models are quite verbose. Note that the accounting covers both the model’s internal reasoning and the external rationale elicited with our scheming spec (the providers compute cost the same way). In the SFT step, we train only on the external elicited rationale and not the internal reasoning, so it is expected that Gemini’s output token counts will be larger overall than those of the SFT-ed models.
Thanks for the questions! There are certainly things to go deeper on; for example, we were curious how single-token classification compares to the “deliberative” step + answer setup, and how simpler distillation techniques compare as well. We will continue exploring these and share more results as we gain a better understanding.