An AI has the objective function you set, not the objective function full of caveats and details that lives in your head, or that you would come up with on reflection.
When a chatbot makes the preference decisions based on labeling instructions (as in Constitutional AI or online DPO), the decisions it makes actually are full of the caveats and details that live in the chatbot's model, and they likely fit what a human would intend, though meaningful reflection is not currently possible.
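As a concrete sketch of what "preference decisions based on labeling instructions" looks like: in Constitutional-AI-style or online-DPO pipelines, a judge model is prompted with the written instructions and a pair of candidate responses, and its verdict becomes the preference label a DPO-style trainer consumes. Everything below is hypothetical scaffolding, not any particular lab's pipeline: `query_model` is a stand-in for whatever chat-model API is available, and the instruction text and the "answer A or B" convention are illustrative.

```python
# Hypothetical sketch of AI-feedback preference labeling, in the spirit of
# Constitutional AI / online DPO. `query_model` is a placeholder for a real
# chat-model client; the instruction text and output convention ("A" or "B")
# are illustrative, not any published pipeline's exact format.

LABELING_INSTRUCTIONS = (
    "Choose the response that is more helpful and honest, refuses only when "
    "necessary, and avoids unstated assumptions about the user."
)

def query_model(prompt: str) -> str:
    """Stand-in for a chat-model call (e.g. a request to an inference
    server). Wire this up to a real endpoint; returns the model's reply."""
    raise NotImplementedError("replace with a real model client")

def label_preference(prompt: str, response_a: str, response_b: str) -> tuple[str, str]:
    """Ask the judge model which response better follows the instructions.

    Returns (chosen, rejected) -- the preference pair a DPO-style trainer
    would consume. The caveats live in the judge's reading of the
    instructions, not in any hand-coded rule.
    """
    verdict = query_model(
        f"Instructions: {LABELING_INSTRUCTIONS}\n\n"
        f"User prompt: {prompt}\n\n"
        f"Response A: {response_a}\n\n"
        f"Response B: {response_b}\n\n"
        "Which response better follows the instructions? Answer 'A' or 'B'."
    )
    if verdict.strip().upper().startswith("A"):
        return (response_a, response_b)
    return (response_b, response_a)
```

The point of the sketch is that nothing in the code enumerates the caveats; they come from the judge model interpreting the instructions in context, pair by pair.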
Trained this way, the model is probably smarter for its compute, and a more instructive exercise ahead of further scaling than a smaller model would have been. That makes sense if the aim is to out-scale others quickly rather than to compete at smaller scale, and if this model was never meant to last.
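As a back-of-the-envelope gloss on "smarter for its compute" (my reading of the note, not something it states): under the parametric loss fit from Hoffmann et al. (2022), L(N, D) = E + A/N^α + B/D^β with C ≈ 6·N·D, a fixed training budget has a loss-minimizing parameter count, and a smaller model overtrained on the same budget lands at a higher predicted loss. The constants below are the published fit; the budget and the 4x size factor are arbitrary illustrative choices.

```python
# Sketch: at a fixed training budget, the Chinchilla parametric fit
# (Hoffmann et al., 2022) predicts a loss-minimizing parameter count;
# a 4x-smaller model overtrained on the same budget scores worse.
# Constants are the published fit; the budget is an arbitrary example.

E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def predicted_loss(n_params: float, compute: float) -> float:
    """Predicted loss when the whole budget goes to one model of n_params."""
    n_tokens = compute / (6 * n_params)  # C ~ 6*N*D approximation
    return E + A / n_params**alpha + B / n_tokens**beta

budget = 6e23  # training FLOPs (roughly Chinchilla-scale), fixed throughout

# Scan parameter counts from 1e9 to 1e12 and keep the compute-optimal one.
grid = [10 ** (9 + 3 * i / 400) for i in range(401)]
n_opt = min(grid, key=lambda n: predicted_loss(n, budget))

for label, n in [("compute-optimal", n_opt), ("4x smaller, overtrained", n_opt / 4)]:
    d = budget / (6 * n)
    print(f"{label}: N={n:.2e}, D={d:.2e}, loss={predicted_loss(n, budget):.4f}")
```

The gap between the two printed losses is small in nats but is exactly the sense in which spending the budget near the compute-optimal allocation yields a smarter model than spending it overtraining a smaller one.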