Using multiple models sounds like a great approach. Older models should be cheap to add to the mix, and this way you get the best of both worlds (quality + difference), as well as an extra consistency check.
I’m not actually sure older models should be cheaper. GPT-5(.n) is among the cheapest models per token that OAI has ever made. You could use the N-minis, but those are distilled from the N models, which might have transferred misalignment. GPT-4 is definitely more expensive. Maybe GPT-3.5 is cheaper?
(And if you’re counting thinking tokens from the 5.n models, you could just turn thinking off.)