… you’re asking people to invest a bunch of their own time and sign an NDA. You can obviously do what you want, but I think it would be courteous to check that your claims survive at least some contact with reality.
You definitely don’t need to go to @256; @7 would suffice for MATH or GSM8K. That said, GSM8K is a benchmark, not a single test, so that’s 7 samples times 1319 rows (9,233 generations) to get performance numbers. You would need to automate your math, but you need to do that anyway if you want a big enough sample size not to fool yourself (n=30 is just not enough to be able to say your method is better than a majority vote).
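To put a rough number on the n=30 point, here is a back-of-the-envelope sketch of how noisy an accuracy estimate is at each sample size. The 90% accuracy figure is a placeholder I picked for illustration, not a measured result, and the formula is the plain binomial standard error, which ignores the fact that two methods scored on the same questions are paired.

```python
import math

K = 7                    # samples per question
N_QUESTIONS = 1319       # rows in the GSM8K test split
ASSUMED_ACCURACY = 0.9   # hypothetical accuracy, for illustration only


def accuracy_standard_error(p: float, n: int) -> float:
    """Binomial standard error of an accuracy p estimated from n questions."""
    return math.sqrt(p * (1 - p) / n)


total_generations = K * N_QUESTIONS

if __name__ == "__main__":
    print(f"total generations: {total_generations}")
    for n in (30, N_QUESTIONS):
        se = accuracy_standard_error(ASSUMED_ACCURACY, n)
        print(f"n={n}: standard error is about {100 * se:.1f} percentage points")
```

Under that assumed accuracy, the standard error is roughly 5.5 points at n=30 and under 1 point at n=1319, so a gap of a few points between your method and majority vote is invisible at n=30.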
Claude can reliably one-shot something like this:
Use the OpenRouter API (OpenAI-compatible) to query the model https://openrouter.ai/qwen/qwen3-8b (base URL is https://openrouter.ai/api/v1, use the api/v1/chat/completions endpoint, model is qwen/qwen3-8b). Qwen is a reasoning model, so max_tokens needs to be fairly high; set it to 4096.
Entries in the benchmark look like this:
{
'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'
}
The first match of #### [0-9.e+-]+ should be treated as the answer.
Hugging Face API key is env var HF_TOKEN. OpenRouter API key is env var OPENROUTER_API_KEY.
I will later be extending this code to test a new consensus mechanism, which I will be benchmarking against pass@k, majority@k, and unanimous@k. As such, process all k copies of the same eval item in parallel, but do not parallelize between eval items (i.e. fully process all k generations for the first of the 1319 questions in the GSM8K dataset, determine whether pass@k and maj@k succeeded for that datapoint, then fully process the second, and so on).
Strongly prioritize simplicity in the code you write. Aim for under 100 lines of python code in a single file. Declare any constants or global variables at the top of the file.
Clearly indicate the spot where all k model responses are available for me to compute the consensus answer using my fancy new method.
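For a sense of what the scoring half of that script might look like, here is a minimal sketch of the per-question step, with no network calls. The function names are mine, and two things are assumptions rather than anything the prompt pins down: that the model has been told to end its response with "#### <number>" like the reference answers, and that unanimous@k means all k samples parsed and all of them were correct.

```python
import re
from collections import Counter

ANSWER_RE = re.compile(r"#### ([0-9.e+-]+)")


def extract_answer(text):
    """Return the first '#### <number>' match as a float, or None if absent or unparseable."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except ValueError:
        return None


def score_item(responses, reference):
    """Score the k responses for one question against its reference answer."""
    gold = extract_answer(reference)
    answers = [extract_answer(r) for r in responses]

    # >>> All k model responses are available here. <<<
    # A custom consensus method would read `responses` / `answers` at this point.

    parsed = [a for a in answers if a is not None]
    majority = Counter(parsed).most_common(1)[0][0] if parsed else None
    correct = [gold is not None and a == gold for a in answers]
    return {
        "pass": any(correct),
        "maj": gold is not None and majority == gold,
        "unanimous": bool(correct) and all(correct),
    }


reference = (
    "Natalia sold 48/2 = <<48/2=24>>24 clips in May.\n"
    "Natalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n"
    "#### 72"
)
responses = [
    "48/2 = 24, and 48 + 24 = 72.\n#### 72",
    "Half of 48 is 24, so the total is 72.\n#### 72",
    "48 * 2 = 96.\n#### 96",
]
result = score_item(responses, reference)
# result == {"pass": True, "maj": True, "unanimous": False}
```

Comparing answers as floats means "72" and "72.0" count as the same vote. Ties in the majority vote go to whichever answer appeared first, which is a simplification you may want to replace.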
If I’m doing my math right, running that eval should cost between $0.25 and $0.50.