But reasoning models don’t get reward during deployment. In what sense are they “optimizing for reward”?
See the discussion with Violet Hour elsethread.