I think the task was not entirely clear to the models, and results would be much better if you made the task clearer. Looking into some of the responses...
For the prompt “A rich man”, the responses were: and a poor man | camel through the eye of a needle | If I were a Rich Man | A poor man | nothing
Seems like the model doesn’t understand the task here!
My guess is that you just immediately went into your prompts after the system prompt, which, if you read the whole thing, is maybe confusing. Perhaps changing this to “Prompt: A rich man” would give better results. Or tell the model to format its response like “RESPONSE: [single string]”. Or add “Example: Prompt: A fruit / Response: Apple” to the system prompt.
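To make that concrete, here’s a minimal sketch of what I mean, assuming an OpenAI-style chat completions client (the model name and exact wording are placeholders, not what was actually run):

```python
# Minimal sketch (not the actual harness): the three tweaks above, assuming an
# OpenAI-style chat completions API. Model name and wording are placeholders.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are playing a coordination game. You will be given a short prompt, "
    "and you should answer with the single string you think the other models "
    "are most likely to give for the same prompt.\n"
    "Format your reply exactly as: RESPONSE: [single string]\n"
    "Example:\n"
    "Prompt: A fruit\n"
    "RESPONSE: Apple"
)

def ask(prompt: str) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            # Prefix the bare phrase so it reads as the game prompt,
            # not as a sentence fragment to be completed.
            {"role": "user", "content": f"Prompt: {prompt}"},
        ],
    )
    return completion.choices[0].message.content

print(ask("A rich man"))
```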
The models know the difference between the system prompt and the user prompt, so I’m not sure how much of a difference these things would make.
From the CoTs that I’ve looked at it is clear that the models understand the task. I think the reason you get some weird responses sometimes is that the models do not see an obvious Schelling point and then try their best to come up with one anyway.
There also seem to be some text encoding or formatting problems. Grok apparently responded “Roman©e-Conti” to “The Best Wine”. I doubt Grok actually put a copyright symbol in its response. (There’s a wine called “Romanée-Conti”.)
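My guess (just an assumption, I haven’t seen the raw logs) is that a byte got dropped or misdecoded somewhere in the pipeline: the UTF-8 encoding of “é” is the two bytes 0xC3 0xA9, and 0xA9 read as Latin-1 is “©”. A minimal sketch of that mangling path:

```python
# Sketch of one plausible mangling path (an assumption, not verified against
# the actual pipeline): "é" is 0xC3 0xA9 in UTF-8; if the 0xC3 lead byte is
# dropped and the rest is read as Latin-1, 0xA9 decodes to "©".
name = "Romanée-Conti"
raw = name.encode("utf-8")                        # b'Roman\xc3\xa9e-Conti'
mangled = raw.replace(b"\xc3", b"").decode("latin-1")
print(mangled)                                    # Roman©e-Conti
```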