utilistrutil comments on Implications of the inference scaling paradigm for AI safety

utilistrutil 10 Feb 2025 4:04 UTC
3 points
0
Why does it make the CoT less faithful?
- Daniel Kokotajlo 10 Feb 2025 4:19 UTC
  10 points
  0
  Because you are training the CoT to look nice, instead of letting it look however is naturally most efficient for conveying information from past-AI to future-AI. The hope of Faithful CoT is that if we let it just be whatever’s most efficient, it’ll end up being relatively easy to interpret, such that insofar as the system is thinking problematic thoughts, they’ll just be right there for us to see. By contrast, if we train the CoT to look nice, then it’ll learn, e.g., euphemisms and other roundabout ways of conveying the same information to its future self, ways that don’t trigger any warnings or appear problematic to humans.
  - utilistrutil 10 Feb 2025 5:01 UTC
    1 point
    0
    Got it, thanks!