Kimi looks great here. I wonder what they are doing differently.
Section 3.2.2 of the Kimi K2 paper answers your question.
Training pipeline description by Kimi K2's authors
During RL training, the critic model is refined using verifiable signals (italics mine—S.K.). On-policy rollouts generated from verifiable-reward prompts are used to continuously update the critic, a crucial step that distills objective performance signals from RLVR directly into its evaluation model. This transfer learning process grounds its more subjective judgments in verifiable data, allowing the performance gains from verifiable tasks to enhance the critic’s judgment on complex tasks that lack explicit reward signals. This closed-loop process ensures that the critic continuously recalibrates its evaluation standards in lockstep with the policy’s evolution. By grounding subjective evaluation in verifiable data, the framework enables robust and scalable alignment with complex, non-verifiable human objectives.
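To make the closed loop in that paragraph concrete, here is a minimal sketch of the idea: on-policy rollouts on prompts with verifiable rewards are used to recalibrate the critic, and the recalibrated critic then scores rollouts on prompts that have no explicit reward. This is not Kimi K2's actual implementation; every name here (PolicyModel, CriticModel, verifiable_reward, train_step) is a hypothetical stand-in for illustration only.

```python
"""Illustrative sketch of a critic recalibrated by verifiable signals.

Assumptions (not from the paper): toy stand-in classes for the policy and
critic, a hypothetical verifiable_reward() checker, and a tabular "score"
in place of real gradient updates.
"""
import random


class PolicyModel:
    """Stand-in for the policy LLM; returns a toy response string."""

    def generate(self, prompt: str) -> str:
        return f"response-{random.randint(0, 9)} to {prompt!r}"


class CriticModel:
    """Stand-in for the critic; keeps a crude per-response score table."""

    def __init__(self) -> None:
        self.scores: dict[str, float] = {}

    def score(self, prompt: str, response: str) -> float:
        return self.scores.get(response, 0.0)

    def update(self, prompt: str, response: str, target: float, lr: float = 0.5) -> None:
        # Nudge the critic's score toward the verifiable outcome
        # (a stand-in for a real gradient step on the critic).
        current = self.scores.get(response, 0.0)
        self.scores[response] = current + lr * (target - current)


def verifiable_reward(prompt: str, response: str) -> float:
    """Hypothetical objective checker (e.g. unit tests, exact match): 1.0 pass, 0.0 fail."""
    return float(hash((prompt, response)) % 2)


def train_step(policy, critic, verifiable_prompts, open_ended_prompts, k=4):
    # (1) On-policy rollouts on verifiable-reward prompts: the objective
    #     outcome is distilled into the critic so its scores stay grounded.
    for prompt in verifiable_prompts:
        for _ in range(k):
            response = policy.generate(prompt)
            target = verifiable_reward(prompt, response)
            critic.update(prompt, response, target)

    # (2) The recalibrated critic scores rollouts on prompts with no explicit
    #     reward; in a real pipeline these scores would drive the policy update.
    scored = []
    for prompt in open_ended_prompts:
        response = policy.generate(prompt)
        scored.append((prompt, response, critic.score(prompt, response)))
    return scored


if __name__ == "__main__":
    policy, critic = PolicyModel(), CriticModel()
    rollouts = train_step(
        policy, critic,
        verifiable_prompts=["solve: 2 + 2", "write a sorting function"],
        open_ended_prompts=["draft a polite refusal email"],
    )
    for prompt, response, score in rollouts:
        print(f"{prompt!r} -> critic score {score:.2f}")
```

The point of the sketch is the ordering: the critic is updated from verifiable outcomes first, and only then is it trusted to judge the non-verifiable prompts, which is what keeps the subjective evaluation anchored as the policy evolves.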