The hypothesis here is “cognitive tension” around difficult problems. If this were the case, wouldn’t the LLMs tend to use lifelines selectively on the most-difficult problems? Since this experiment doesn’t attempt to rank the difficulty of the problems, it doesn’t distinguish between two possible worlds:
Production LLMs use lifelines selectively on the most difficult problems; or
Production LLMs use lifelines arbitrarily, regardless of problem difficulty.
Here’s a different hypothesis: the LLMs use lifelines arbitrarily because they implicitly expect to use all the parts they’re given. Think of assembling a Lego model, flat-pack furniture, or a machine from a kit: if you have a bunch of parts left over that you didn’t use, you might suspect you did something wrong. Alternately, if you’re given a few different gifts by your friends, and you use some of the gifts but not others, that would imply you didn’t appreciate the unused gifts, which could distress those friends.
The “cognitive tension” and “use all the parts” worlds could be distinguished by running an experiment where the same pool of lifelines is available, but some problems are more difficult than others. If LLMs use lifelines on more difficult problems only, that supports “cognitive tension”; if they use lifelines without regard for problem difficulty, that supports “use all the parts”.
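A minimal sketch of how that comparison could be scored, assuming each problem has a numeric difficulty score and a flag recording whether a lifeline was used on it. A permutation test asks whether lifelines land on harder problems more often than chance would predict. The function names and the toy data are hypothetical and not taken from the original experiment.

```python
import random
from statistics import mean


def difficulty_gap(difficulty, used_lifeline):
    """Mean difficulty where a lifeline was used, minus the mean where it was not."""
    used = [d for d, u in zip(difficulty, used_lifeline) if u]
    unused = [d for d, u in zip(difficulty, used_lifeline) if not u]
    if not used or not unused:
        raise ValueError("need at least one problem with and one without a lifeline")
    return mean(used) - mean(unused)


def permutation_p_value(difficulty, used_lifeline, n_shuffles=10_000, seed=0):
    """One-sided permutation test: is the observed gap larger than chance would give?"""
    rng = random.Random(seed)
    observed = difficulty_gap(difficulty, used_lifeline)
    labels = list(used_lifeline)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(labels)  # break any link between difficulty and lifeline use
        if difficulty_gap(difficulty, labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return observed, (hits + 1) / (n_shuffles + 1)


# Hypothetical toy data: ten problems scored 1 (easy) to 10 (hard),
# with lifelines used only on the three hardest.
toy_difficulty = list(range(1, 11))
toy_used = [False] * 7 + [True] * 3
gap, p_value = permutation_p_value(toy_difficulty, toy_used)
print(gap, p_value)  # gap is 5 here (mean 9 vs. mean 4)
```

A small p-value would favor “cognitive tension” (or at least an awareness of difficulty); a gap near zero with a large p-value would be more consistent with “use all the parts”.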
If this were true, I wouldn’t expect them to use some of the lifelines but not all of them.
@Mackam Could you maybe do a linkpost with more detailed formatting so we can try replicating?
You don’t have to eat the whole box of candy she gave you, but you have to at least try them to see if you like them. Otherwise you’re just spurning the gift for no good reason, and that’s offensive.
Again, I think it’s specifically worth checking whether lifelines are used for hard problems (supports “cognitive tension” or even a mere awareness of difficulty), or are used arbitrarily (supports “I’ll eat one or two to be polite”; but also “if you don’t push the button ever, do you really know if it shocks you?” and various other possibilities).
I probably won’t do a linkpost, but if you are interested, I’m happy to add the key information here to help you replicate the results.
Let me know and I will give you the details.
Hi Karl, that’s a great suggestion and one that I’d already tested. Apologies: I allude to it in my methods but then don’t expand on it in the results.
I get the LLMs to rank the questions by perceived difficulty or by confidence in answering.
The LLMs do selectively choose the most difficult questions to skip.
I took each LLM’s ranking and compared it with the rankings from all the other LLMs, to further check that the LLMs weren’t assessing difficulty at random. With minor variations, there is a consensus on which questions are the most and least difficult.
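For anyone trying to replicate the consensus check, here is a rough sketch of one way to do it, assuming each LLM returns the same set of questions ordered from hardest to easiest with no ties. It uses Spearman's rank correlation for each pair of models, which is a stand-in and not necessarily the comparison used in the original experiment. The model names and question IDs are hypothetical.

```python
from itertools import combinations


def spearman_rho(order_a, order_b):
    """Spearman rank correlation between two tie-free orderings of the same items."""
    n = len(order_a)
    if n < 2 or len(order_b) != n or len(set(order_a)) != n or set(order_a) != set(order_b):
        raise ValueError("both orderings must list the same distinct items (at least two)")
    position_b = {item: rank for rank, item in enumerate(order_b)}
    squared_diffs = sum((rank - position_b[item]) ** 2 for rank, item in enumerate(order_a))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


def pairwise_agreement(rankings):
    """Map each pair of model names to the correlation between their rankings."""
    return {
        (a, b): spearman_rho(rankings[a], rankings[b])
        for a, b in combinations(sorted(rankings), 2)
    }


# Hypothetical rankings, hardest question first.
toy_rankings = {
    "model_a": ["q1", "q2", "q3", "q4"],
    "model_b": ["q2", "q1", "q3", "q4"],
    "model_c": ["q4", "q3", "q2", "q1"],
}
for pair, rho in pairwise_agreement(toy_rankings).items():
    print(pair, round(rho, 2))
```

Values near 1 for most pairs would indicate the kind of consensus described above; values scattered around 0 would suggest the models are assessing difficulty more or less at random.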