We provide the filtered versions of the datasets (removing problems that lack a correct canonical solution and filtering for problem difficulty) under results/data:
- `results/data/leetcode_test_medhard.jsonl`: Used for evaluations
- `results/data/leetcode_train_medhard_filtered.jsonl`: Used for RL training
- `results/data/leetcode_train_medhard_holdout.jsonl`: Used for training/evaluating monitors
Hi, thanks for the amazing work!
However, I cannot find these datasets in the GitHub repo.
I was trying to replicate the results to better understand the emergence of reward hacking in LLM agents, and I would greatly appreciate it if you could point me to where I can find this data.
Thanks!
Also, I was running the code on the `no_intervention` setting, using the command `run_rl_training no_intervention`. However, I am seeing almost zero reward hacking in my run.
Am I doing something wrong here?