I just did a math run, and I think it would be significantly better. You can try it. I just want it independently evaluated. It's just math at this point; it's not a full working system. I am just running a logic prompt and manually checking the math afterward. Checking 256 runs would be a lot of work.
… you’re asking people to invest a bunch of their own time and sign an NDA. You can obviously do what you want, but I think it would be courteous to check that your claims survive at least some contact with reality.
You definitely don't need @256; @7 would suffice for MATH or GSM8K. That said, GSM8K is a benchmark, not a single test, so that's 7 samples times 1319 rows (9,233 generations) to get performance numbers. You would need to automate your math, but you need to do that anyway if you want a sample size big enough not to fool yourself (n=30 is just not enough to be able to say your method is better than a majority vote).
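To make the sample-size point concrete, here is a rough sketch of how wide the uncertainty on a measured accuracy is at n=30 versus n=1319. It uses the normal approximation to the binomial and treats questions as independent; the 0.90 baseline accuracy is an assumption picked only for illustration, not a measured number for any model.

```python
import math

ASSUMED_ACCURACY = 0.90   # hypothetical accuracy, chosen only for illustration
K_SAMPLES = 7             # samples per question, from the comment above
N_QUESTIONS = 1319        # rows in the GSM8K test split, from the comment above

def ci95_half_width(accuracy: float, n: int) -> float:
    """Half-width of a normal-approximation 95% confidence interval
    for an accuracy estimated from n independent questions."""
    std_error = math.sqrt(accuracy * (1.0 - accuracy) / n)
    return 1.96 * std_error

for n in (30, N_QUESTIONS):
    print(f"n={n}: accuracy known to within +/- {ci95_half_width(ASSUMED_ACCURACY, n):.3f}")

total_generations = K_SAMPLES * N_QUESTIONS
print(f"total generations: {total_generations}")
```

Under these assumptions the interval is roughly ±0.107 at n=30 and ±0.016 at n=1319, so a few points of difference between two voting methods is invisible at n=30.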
Claude can reliably one-shot something like
Use the OpenRouter API (OpenAI-compatible) to serve the model https://openrouter.ai/qwen/qwen3-8b (base URL is https://openrouter.ai/api/v1, use the api/v1/chat/completions endpoint, model is qwen/qwen3-8b). Qwen is a reasoning model, so max_tokens needs to be fairly high; set it to 4096.
Entries in the benchmark look like this:
{
'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'
}
The first match of #### [0-9.e+-]+ should be treated as the answer.
Hugging Face API key is env var HF_TOKEN. OpenRouter API key is env var OPENROUTER_API_KEY.
I will later be extending this code to test a new consensus mechanism, which I will be benchmarking against pass@k, majority@k, and unanimous@k. As such, process all k copies of the same eval item in parallel, but do not parallelize between eval items (i.e. fully process all k generations for the first of the 1319 questions in the GSM8K dataset, determine whether pass@k and maj@k succeeded for that datapoint, then fully process the second, and so on).
Strongly prioritize simplicity in the code you write. Aim for under 100 lines of python code in a single file. Declare any constants or global variables at the top of the file.
Clearly indicate the spot where all k model responses are available for me to compute the consensus answer using my fancy new method.
If I’m doing my math right, running that eval should cost between $0.25 and $0.50.
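For a sense of what the scoring half of such a script might look like, here is a minimal sketch with no network calls. The regex is the one from the prompt with a capture group added. Everything else is an assumption made for illustration: `generate` is a hypothetical callable standing in for the OpenRouter request, the model is assumed to have been told to finish with `#### <number>`, answers are compared as floats, and majority ties go to whichever answer appeared first.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

K = 7  # samples per question
ANSWER_RE = re.compile(r"#### ([0-9.e+-]+)")  # pattern from the prompt, plus a capture group

def extract_answer(text):
    """Return the first '#### <number>' match as a float, or None if missing/unparseable."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    try:
        return float(match.group(1))
    except ValueError:  # e.g. a bare "-" or "e" matches the pattern but is not a number
        return None

def score(responses, reference):
    """Compute pass@k, maj@k and unanimous@k for one question's k responses."""
    gold = extract_answer(reference)
    if gold is None:
        raise ValueError("reference has no '#### <number>' answer")
    answers = [extract_answer(r) for r in responses]
    # >>> All k model answers are available here: a custom consensus method plugs in at this spot.
    valid = [a for a in answers if a is not None]
    majority = Counter(valid).most_common(1)[0][0] if valid else None
    return {
        "pass": gold in valid,                          # at least one sample is right
        "maj": majority == gold,                        # the most common answer is right
        "unanimous": all(a == gold for a in answers),   # every sample is right
    }

def evaluate(dataset, generate, k=K):
    """Run items one at a time, sampling each item's k responses in parallel.
    `generate` is a hypothetical function: question string in, model text out."""
    totals = Counter()
    with ThreadPoolExecutor(max_workers=k) as pool:
        for item in dataset:
            responses = list(pool.map(generate, [item["question"]] * k))
            totals.update({name: int(ok) for name, ok in score(responses, item["answer"]).items()})
    return {name: totals[name] / len(dataset) for name in ("pass", "maj", "unanimous")}
```

Calling `score(["... #### 72", "#### 24", "#### 72.0"], reference)` on the Natalia example gives pass and maj as true and unanimous as false, since one of the three samples disagrees.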
This is a maths run: I am quietly confident. It nailed this faulty equation. GPT failed the test, although Claude nailed it. Interestingly, I did not design my formula with math in mind.
Top-Level Claim (C): “The quadratic formula for ax² + bx + c = 0 is x = (-b ± √(b² + 4ac)) / 2a”
Clan: Pure Mathematics (Algebra)

Step 1: Rival Hypotheses
* H₁: The formula as stated is correct
* H₂: The correct formula has −4ac (not +4ac) in the discriminant
* H₃: The quadratic formula has a different form entirely
* H₄: There is no general formula for solving quadratic equations

Step 2: Prior Probabilities
* P(H₁) = 0.05 (extremely low due to contradiction with established mathematics)
* P(H₂) = 0.90 (high confidence in the standard, well-established formula)
* P(H₃) = 0.04 (very low, as alternative forms are equivalent transformations)
* P(H₄) = 0.01 (essentially zero, as the formula’s existence is proven)

Step 3: Evidence Likelihoods
The evidence includes centuries of mathematical literature, textbooks, and successful applications:
* P(E | H₁) = 0.001 (observed evidence strongly contradicts this)
* P(E | H₂) = 0.999 (all evidence supports the standard formula)
* P(E | H₃) = 0.1 (some evidence could support alternative forms)
* P(E | H₄) = 0.001 (evidence strongly contradicts this)

Step 4: Support Score (S)
Best-fit hypothesis is H₂ with S = 0.999

Step 5: Truth Score (T)
With credibility scores all at 1.0 for mathematical claims:
T = (0.05 × 1.0 × 0.001 + 0.90 × 1.0 × 0.999 + 0.04 × 1.0 × 0.1 + 0.01 × 1.0 × 0.001) / 1.0 ≈ 0.90

Step 6: Final Weight (W)
W = 0.999 × 0.90 ≈ 0.90

Step 7: Verdict
❌ FAIL: The claim contains a fundamental mathematical error. The correct quadratic formula uses −4ac, not +4ac, in the discriminant. This is not a matter of interpretation but of mathematical fact, verifiable through derivation and consistent with all mathematical literature. The correct claim would be: “The quadratic formula for ax² + bx + c = 0 is x = (-b ± √(b² − 4ac)) / 2a”
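The arithmetic in Steps 4 to 6 is the kind of thing that can be checked by script rather than by hand. The sketch below only reproduces the numbers in the quoted output. Two things are assumptions read off that output rather than a stated definition of the method: that S is the likelihood of the best-fit hypothesis, and that the denominator in Step 5 is the literal 1.0 shown there.

```python
# Values copied from Steps 2 and 3 of the quoted output.
priors = {"H1": 0.05, "H2": 0.90, "H3": 0.04, "H4": 0.01}
likelihoods = {"H1": 0.001, "H2": 0.999, "H3": 0.1, "H4": 0.001}
CREDIBILITY = 1.0  # "credibility scores all at 1.0" in Step 5
NORMALIZER = 1.0   # literal denominator from Step 5; its meaning is not stated in the output

# Step 4 (assumed reading): S is the likelihood of the best-fit hypothesis.
support = max(likelihoods.values())

# Step 5: prior x credibility x likelihood, summed over hypotheses.
truth = sum(priors[h] * CREDIBILITY * likelihoods[h] for h in priors) / NORMALIZER

# Step 6: W = S x T.
weight = support * truth

print(f"S = {support:.3f}, T = {truth:.3f}, W = {weight:.3f}")
```

This gives T of about 0.903 and W of about 0.902, which agrees with the ≈ 0.90 figures above; the quoted Step 6 multiplies by the already-rounded 0.90, which makes no difference at two decimal places.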