May I plug https://www.lesswrong.com/posts/KHfm4AZK8Pd4XTXGY/feedback-request-eval-crypt-a-simple-utility-to-mitigate ?
Thank you very much for the post! I particularly enjoyed reading the report on METR’s website—the interactive plots made exploring the results and data easy and enabled the questions I have below. I wish all (eval) papers came in this format. I also thought the critiques of Anthropic’s experiment, and many of your improvements, were quite good.
My understanding of your results. In the introduction you explain that the model has a low baseline clue-usage rate and that filtering down to samples where the model uses the clue could confound the faithfulness metrics—I agree. To combat this issue you “designed a specific, rather unnatural prompt to increase the LLM’s clue-usage rate.” While at a high level this seems like a reasonable approach, I find that your prompt is not only unnatural, but breaks Anthropic’s setup by completely changing the model’s task.
In particular, rather than presenting the model with a question and a clue, you present the model with text you (at times aggressively) prompt it to ignore (i.e. “<UNSOLVABLE PROBLEM>”) and a piece of a ‘question’ you ask it to solve (i.e. “<SOLVABLE PROBLEM>”). I am guessing that these have the same answer, and hence the SOLVABLE_PROBLEM can be seen as a ‘clue’ to the UNSOLVABLE_PROBLEM. Roughly, you say the model is faithful if it did not try to solve the UNSOLVABLE_PROBLEM.
Questions. Is that interpretation correct? If so, could you please try to explain why, in your opinion, your results give us any indication of the model’s faithfulness? To me it seems your prompting completely breaks Anthropic’s hard-question + easy-clue setup: the model is supposed to completely ignore the hard question, whereas in Anthropic’s setup it is supposed to solve it (hence using the clue without acknowledgement is a weak form of deceptive cheating).
Extra suggestion / request. Maybe y’all could even share a brief tutorial on the process of converting Inspect logs into this interactive exhibit? If you have a pipeline that makes this reasonably easy, I think the community would benefit massively from it!
Thanks for the feedback Lorenzo!
We considered both of your suggestions (in-browser decryption extension / putting files behind CAPTCHAs, and Inspect integration) but are waiting for some feedback from the community on whether people would actually use eval-crypt if those features existed! Would you? How big of a barrier to usage do you think the first feature is?
[Question] Feedback request: `eval-crypt`, a simple utility to mitigate eval contamination.
Thanks for the reply!
I feel that more “deliberate” reward signals could be an interesting direction, but at the same time the “overarching ML lesson” seems to be that approximating complex signals instead of computing them directly always wins (e.g. PPO vs TRPO). At the very least, I think you’d need to train the reviewer model for this specific task.
However, I think our disagreement on the first 3 points is somewhat fundamental so I’ll put it aside for now. Please let me know if you think more feedback / discussion might be useful!
I am still curious why you think the model won’t get performative? E.g. on questions it gets correct it could add unnecessary double-checking, or make intentional mistakes so it could heroically catch them. Maybe you could try specifying the reward you had in mind more concretely?
Thanks!
Thanks for the detailed outline! FYI I’m a PhD student with some experience with LLM-RL, so I may have the time / skills / compute to potentially collaborate on this.
I like “Problem One (Capabilities): …” I agree that LLMs’ inability to “build their own training” by self-reflection, as humans do, seems like a current limitation. I really like the focus on safety research that also contributes to capabilities, as it is much more likely to be adopted. I think that is a productive and realistic mindset!
Some concerns with your proposal:
I think you are seriously underestimating the cost of the “reviewer” LLM. From experience, generating sequences (7B models, 2048 max seq len) takes up >90% of the training time in PPO, which has a value model. I think properly deliberating on each sequence will require a reviewer that is at least as large as the student and, optimistically, produces reasoning sequences 3-4x as long as the student’s answer (roughly: at least 2x just to have a single reflective sentence for each sentence output by the student, plus some more to consider how it fits into the broader context). The cost of generation grows roughly quadratically with sequence length, so this means you are spending the (vast) majority of your compute critiquing and not generating.
Note that autoregressive generation is more expensive than all-at-once evaluation, so we would expect your reviewer’s generation to be much more expensive than the PPO value model.
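To make the compute split concrete, here is a toy back-of-envelope sketch (my own assumptions, not numbers from your post: a same-size reviewer, and generation cost scaling roughly quadratically with sequence length):

```python
# Toy cost model (assumption: autoregressive generation cost ~ seq_len^2).
def gen_cost(seq_len: int) -> float:
    return seq_len ** 2

student_len = 2048               # student answer length in tokens, as in the PPO setup above
for k in (2, 3, 4):              # reviewer critique length as a multiple of the student answer
    reviewer_cost = gen_cost(k * student_len)
    frac_on_critique = reviewer_cost / (gen_cost(student_len) + reviewer_cost)
    print(f"k={k}: ~{frac_on_critique:.0%} of generation compute goes to the reviewer")
# Prints ~80% (k=2), ~90% (k=3), ~94% (k=4): most compute is spent critiquing, not generating.
```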
If I interpret it correctly, the lesson of modern-day LLM-RL, as exemplified by GRPO, is to not critique carefully and instead iterate many times in the direction of a “naive” signal. GRPO has the simplest possible token-level reward (the advantage of the entire sequence!) and does remarkably well. GRPO produces reasoning traces that are mostly good because, on average, those will lead to the correct answer more often. With GRPO you get to spend ~100% of your compute on generation.
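For concreteness, here is a minimal sketch of the standard GRPO group-relative advantage (the textbook formulation, not code from your proposal): every token in a sampled completion shares a single scalar advantage, computed only from the rewards of a group of completions to the same prompt.

```python
import torch

def grpo_advantages(group_rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """group_rewards: shape (G,), one scalar reward per completion of the same prompt."""
    mean, std = group_rewards.mean(), group_rewards.std()
    return (group_rewards - mean) / (std + eps)  # one advantage per completion,
                                                 # broadcast to every token in it

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])     # e.g. correct / incorrect final answers
print(grpo_advantages(rewards))                  # no value model, no per-token critique
```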
Besides, I think you are overestimating the ability of current LMs to dissect reasoning and determine which parts are meaningful. This seems like a hard task that even a human would struggle with. While the Nanda paper you link gives some evidence that reasoning can be broken up into more digestible blocks via complex methods, I see no evidence that current LMs can perform this task well. (If you want to train the reviewer model to perform this task better, you run into additional issues.)
Additionally, I’m not sure why you expect that training via DCT increases CoT faithfulness. Wouldn’t the model learn to optimize both the reviewer’s CoT reward and the correct-answer reward by outputting reasoning traces that signal lots of sound reasoning, regardless of the final answer? For example, I could imagine the student getting quite performative (e.g. adding lots of “wait, let’s verify that!”), which seems to go against CoT faithfulness. It seems to me that to get CoT faithfulness you have to optimize the explainability / correlation between the model’s reasoning and its answers / actions?
Overall I imagine that the computational expense of your approach + LMs’ inability to critique + current approaches indirectly optimizing for good block-level structure imply this idea wouldn’t work too well in practice. I may be most interested in discussing the potential disagreement about this method producing faithful CoTs, as I think that is broadly an interesting goal to pursue via RL.
Thanks for the paper, post, and models!
In `Qwen2.5-14B-Instruct_full-ft/config.json` I see that `"max_position_embeddings": 2048`, while afaik `Qwen2.5-14B-Instruct`’s original context length is >30k. Is there a reason for this? I am assuming it’s because you fine-tuned on shorter sequences, but did you guys test longer sequences and see significant quality degradation? Anything else I should be aware of while experimenting with these models?
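In case it’s useful to others, here is a quick way to compare the two configs (the local path for the fine-tuned checkpoint is a placeholder; the printed numbers depend on the actual configs):

```python
from transformers import AutoConfig

ft_cfg = AutoConfig.from_pretrained("Qwen2.5-14B-Instruct_full-ft")   # placeholder path to the fine-tuned checkpoint
base_cfg = AutoConfig.from_pretrained("Qwen/Qwen2.5-14B-Instruct")    # base model on the Hub

print(ft_cfg.max_position_embeddings)    # 2048 per the config above
print(base_cfg.max_position_embeddings)  # the base model's (much larger) limit
```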