I think all of the following:
process-based feedback pushes against faithfulness because it incentivises having a certain kind of CoT independently of the outcome
outcome-based feedback pushes towards faithfulness because it incentivises making use of earlier tokens to get the right answer
outcome-based feedback pushes against legibility because it incentivises the model to discover new strategies that we might not know it’s using
combining process-based feedback with outcome-based feedback:
pushes extra hard against legibility because it incentivises obfuscating said strategies[1]
unclear sign wrt faithfulness
or at least has the potential to, depending on the details.
I think all of the following:
process-based feedback pushes against faithfulness because it incentivises having a certain kind of CoT independently of the outcome
outcome-based feedback pushes towards faithfulness because it incentivises making use of earlier tokens to get the right answer
outcome-based feedback pushes against legibility because it incentivises the model to discover new strategies that we might not know it’s using
combining process-based feedback with outcome-based feedback:
pushes extra hard against legibility because it incentivises obfuscating said strategies[1]
unclear sign wrt faithfulness
or at least has the potential to, depending on the details.