Wow, that goes to show that reinforcement learning hasn’t even broken the prompt engineering barrier yet. It isn’t even summoning the LLM’s strongest character/simulacrum/Pokemon yet.
When I see the question, I know I am on LW. That lets me deduce that the “arcane runes” part is not important, but the LLM doesn’t have this context. Maybe it sounds like a crackpot/astrology question to it?
System Instructions: Be concise, specific, and where possible well mathematically justified.
I have a notation for numbers. [] = 1, [[]] = 2, [[[]]] = 4. [][[]] = 3, [][][[]]=5. -[]= −1, [-[]] = 1⁄2, [[-[]]] = sqrt(2), [][[-[]]] = 3^(1/3). [][][][][][][][[]] = 19. What does this notation mean? How would I write 210^(2/15)?
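For anyone who wants to check an answer mechanically, here is a minimal sketch of one reading of the notation. It is an assumption on my part, not something stated in the thread: the k-th bracket holds the exponent of the k-th prime, the empty string is 0, a leading "-" negates, and exponents are themselves written in the same notation. The function names are made up for illustration. This reading reproduces every example in the prompt except one: under it, [][[-[]]] comes out as 3^(1/2) rather than 3^(1/3), so either the reading or that example is off.

```python
from fractions import Fraction


def primes():
    """Yield 2, 3, 5, ... by trial division (fine for small inputs)."""
    found = []
    n = 2
    while True:
        if all(n % p for p in found):
            found.append(n)
            yield n
        n += 1


def factor_exponents(n):
    """Exponents of successive primes in a positive integer, e.g. 12 -> [2, 1]."""
    exps = []
    gen = primes()
    while n > 1:
        p = next(gen)
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        exps.append(e)
    return exps


def encode_power_product(exps):
    """Encode 2**exps[0] * 3**exps[1] * ... where each exponent is rational."""
    return "".join("[" + encode_rational(e) + "]" for e in exps)


def encode_rational(q):
    """Encode a rational number under the assumed prime-exponent reading."""
    q = Fraction(q)
    if q == 0:
        return ""  # assumption: zero is the empty string
    if q < 0:
        return "-" + encode_rational(-q)
    num = factor_exponents(q.numerator)
    den = factor_exponents(q.denominator)
    width = max(len(num), len(den), 1)  # 1 = 2**0 still needs one bracket
    num += [0] * (width - len(num))
    den += [0] * (width - len(den))
    return encode_power_product([a - b for a, b in zip(num, den)])


# 210 = 2 * 3 * 5 * 7, so 210**(2/15) puts exponent 2/15 on the first four primes.
answer = encode_power_product([Fraction(2, 15)] * 4)
print(answer)
```

Under this reading, 2/15 = 2^1 · 3^(-1) · 5^(-1) encodes as [[]][-[]][-[]], and the answer is that string wrapped in brackets, repeated four times.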
Knowing that all rational numbers can be represented is a big hint and would have cut at least my solution time in half. This is still probably a good test, and although I’m sure it’s been trained on, it’s not too hard to come up with “similar” puzzles where knowing about this one doesn’t immediately solve it.
I definitely expect that problems at this level of difficulty are within reach for present frontier models. That being said, as I understand it, most labs are still soliciting expert data and doing human-in-the-loop process reward modelling, and those that aren’t (mostly because they think RLVR is better and they have the spare compute) are still using the data they solicited in the past, or are distilling from models that used that data, and so on. For basically the past two years, any math problem known to stump LLMs even occasionally has been worth ~75 dollars to any contractor, anywhere in the world, working as a data generator for companies like Scale AI. You should expect that any math problem which has been posted publicly, seen by more than ~50 people, and stated to be hard for LLMs in that time period has been trained on, detached from the canary string.
Yep, that works for Gemini 2.5 as well, got it in one try. In fact, just “think like a mathematician” is enough. Post canceled, everybody go home.
Good question, though haha the actual prompt was:
Didn’t work consistently for me, even when I gave it multiple hints. YMMV.
I don’t think it’s been trained on and all present frontier models one-shot it.