Interesting work and findings. Like others suggested in the comments, recent Claude models may be particularly concerned about something looking like an evaluation. Have you tested other models / model families as a judges?
Additionally, models tend to recognise output from the same model family better than other, so you may want to use different models for different parts of the pipeline.
Interesting work and findings. Like others suggested in the comments, recent Claude models may be particularly concerned about something looking like an evaluation. Have you tested other models / model families as a judges?
Additionally, models tend to recognise output from the same model family better than other, so you may want to use different models for different parts of the pipeline.