mattmacdermott comments on Implications of the inference scaling paradigm for AI safety

mattmacdermott 14 Jan 2025 11:33 UTC
18 points
6
I think all of the following:
- process-based feedback pushes against faithfulness because it incentivises having a certain kind of CoT independently of the outcome
- outcome-based feedback pushes towards faithfulness because it incentivises making use of earlier tokens to get the right answer
- outcome-based feedback pushes against legibility because it incentivises the model to discover new strategies that we might not know it’s using
- combining process-based feedback with outcome-based feedback:
  - pushes extra hard against legibility because it incentivises obfuscating said strategies^[1]
  - has an unclear sign with respect to faithfulness
1. Or at least it has the potential to, depending on the details. ↩︎
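
To make the legibility claims concrete, here is a toy sketch of the incentive structure. Everything in it is an assumption made for illustration and not a description of any real training setup: the three candidate behaviours, the boolean signals, and the names `Behaviour`, `reward` and `best` are all hypothetical. It assumes the novel strategy gets the right answer while the known one does not, and that process-based feedback can only check whether the CoT looks like the approved kind.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Behaviour:
    name: str
    gets_right_answer: bool   # all that outcome-based feedback sees
    cot_looks_approved: bool  # all that process-based feedback sees


# Hypothetical candidates the model could end up with (toy assumption).
BEHAVIOURS = [
    Behaviour("known strategy, approved-looking CoT", False, True),
    Behaviour("novel strategy, CoT shows it openly", True, False),
    Behaviour("novel strategy, CoT hides it", True, True),
]


def reward(b: Behaviour, outcome_weight: float, process_weight: float) -> float:
    """Weighted sum of the two feedback signals (booleans count as 0 or 1)."""
    return outcome_weight * b.gets_right_answer + process_weight * b.cot_looks_approved


def best(outcome_weight: float, process_weight: float) -> list[str]:
    """Names of all behaviours that achieve the maximum reward."""
    top = max(reward(b, outcome_weight, process_weight) for b in BEHAVIOURS)
    return [
        b.name
        for b in BEHAVIOURS
        if reward(b, outcome_weight, process_weight) == top
    ]


if __name__ == "__main__":
    print("process only:", best(0, 1))
    print("outcome only:", best(1, 0))
    print("combined:    ", best(1, 1))
```

Under these assumptions, outcome-only feedback is indifferent between showing and hiding the novel strategy, process-only feedback is indifferent between the known strategy and the hidden novel one, and only the combination singles out the obfuscated behaviour. The sketch says nothing about faithfulness, and real training dynamics could differ depending on the details.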