My first thought on “filter that for accuracy somehow” was to generate, say, 5000 of them, sit down and laboriously read them all, then throw away (or even edit) the obviously wrong ones. Not exactly an easy technique for others to replicate, but often a reasonably good way to get an effective training set: prompt engineer, generate, filter manually, retrain smaller/simpler model. Sometimes you can even rinse and repeat on this approach, if your simpler model is generalizing usefully, or at least take a second look at cases where the trained model disagrees with your manual choice, and see if you were wrong and it’s right — though obviously that gets riskier the more you let earlier versions of the model have input into the training data of later ones.
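Concretely, that "filter manually, retrain, check disagreements" loop might look something like the rough sketch below; the toy samples, the labels, and the TF-IDF + logistic regression stand-in for the "smaller/simpler model" are all placeholders, not anything from an actual workflow:

```python
# Hypothetical sketch of the "generate, filter by hand, retrain, re-check" loop.
# The samples, labels, and model choice below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Stand-ins for generated samples and the keep/discard decisions from a manual pass.
samples = [
    "The capital of France is Paris.",
    "The capital of France is Lyon.",
    "Water boils at 100 degrees Celsius at sea level.",
    "Water boils at 50 degrees Celsius at sea level.",
]
keep = [1, 0, 1, 0]  # 1 = kept after manual review, 0 = thrown away

# Train the simpler model on the manual decisions.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(samples)

# Use held-out (cross-validated) predictions so disagreements are informative.
predicted = cross_val_predict(LogisticRegression(), X, keep, cv=2)

# Second look: wherever the model disagrees with the manual call, re-read the sample.
for text, human, model in zip(samples, keep, predicted):
    if human != model:
        print(f"re-review (human={human}, model={model}): {text!r}")
```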
We spent a lot of time doing exactly that in an automated manner. It’s too time-consuming to filter that many samples manually, but we did several iterations of the following loop (the rating step is sketched in code after the list):
comprehensively define quality criteria
generate a dataset
run a second instance of Claude to rate samples by quality criteria
drop the bad ones
manually read a subset of what’s left and identify remaining failure modes
update the quality criteria for the next iteration
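In code, the rating-and-dropping steps might look something like the minimal sketch below, assuming the Anthropic Python SDK; the criteria text, model name, prompt wording, and 1-5 cutoff are illustrative placeholders rather than the actual setup:

```python
# A minimal sketch of the "rate with a second Claude instance, drop the bad ones"
# steps, assuming the Anthropic Python SDK. The criteria text, model name, prompt
# wording, and score threshold are illustrative placeholders, not the real setup.
from anthropic import Anthropic

client = Anthropic()  # expects ANTHROPIC_API_KEY in the environment

QUALITY_CRITERIA = """\
1. Factually accurate.
2. Follows the requested output format.
3. No refusals or hedging where a direct answer is expected."""

def rate_sample(sample: str) -> int:
    """Ask a separate Claude instance to score one generated sample from 1 to 5."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder model name
        max_tokens=5,
        messages=[{
            "role": "user",
            "content": (
                "Rate the following sample from 1 to 5 against these criteria:\n"
                f"{QUALITY_CRITERIA}\n\nSample:\n{sample}\n\n"
                "Reply with a single digit and nothing else."
            ),
        }],
    )
    return int(response.content[0].text.strip())

def filter_dataset(samples: list[str], threshold: int = 4) -> list[str]:
    """Keep only the samples the rater scores at or above the threshold."""
    return [s for s in samples if rate_sample(s) >= threshold]
```

The "manually read a subset" step then happens on whatever the filter keeps, and whatever failure modes turn up feed back into the criteria for the next iteration.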
I completely agree with you — I personally have filtered a thousand samples manually, and it takes a while. Finding a good human-LLM centaur solution is very helpful. Sounds like you know all the tricks at least as well as I do.
Thank you!