I assume when you say fine tuning, you mean RLHF. This is table 8 on page 27 of the paper. Some scores went up a few percent, some scores went down a few percent, overall no significant change.
The biggest changes were that it’s a much worse microeconomist and a much better sommelier. Pretty impressive for a machine with no sense of taste.
Did GPT-3.5 get high scores on human exams before fine-tuning? My rough impression is “GPT-4 relies less on fine-tuning for its capabilities.”
Yeah, I saw that—I’m wondering if previous models benefited more from RLHF.