We spent a lot of time doing exactly that in an automated manner. It's too time-consuming to filter that many samples manually, but we did several iterations of the following loop:
1. Comprehensively define quality criteria.
2. Generate a dataset.
3. Run a second instance of Claude to rate samples against the quality criteria.
4. Drop the bad ones.
5. Manually read a subset of what's left and identify remaining failure modes.
6. Update the quality criteria for the next iteration.
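The rate-and-drop steps of that loop can be sketched roughly as below. The criteria list, the `rate_sample` stub, and the threshold are illustrative assumptions; in a real pipeline `rate_sample` would send the sample plus the criteria to a second grader model and parse its verdict.

```python
# Hypothetical sketch of the filter loop described above.
QUALITY_CRITERIA = [
    "answers the question asked",
    "factually consistent with the source",
    "no refusal or meta-commentary",
]

def rate_sample(sample: str) -> float:
    """Stand-in for the LLM judge: return a 0-1 quality score.

    A real implementation would prompt a grader model with the
    sample and QUALITY_CRITERIA, then parse a numeric score from
    its response. Here, a toy heuristic keeps the sketch runnable:
    very short samples score low.
    """
    return min(len(sample) / 100, 1.0)

def filter_dataset(samples: list[str], threshold: float = 0.5) -> list[str]:
    """Drop samples whose judged quality falls below the threshold."""
    return [s for s in samples if rate_sample(s) >= threshold]
```

After each filtering pass, the surviving subset is what you read manually to spot failure modes the criteria missed, which then feed back into `QUALITY_CRITERIA` for the next iteration.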
I completely agree with you — I personally have filtered a thousand samples manually, and it takes a while. Finding a good human-LLM centaur solution is very helpful. Sounds like you know all the tricks at least as well as I do.
Thank you!