However, if you ran 4 models through GSM8K, it would tell you which ones failed.
If this is true, then your method is a better consensus mechanism than majority vote, which should show up in benchmarks. To see what I mean, have a look at Lewkowycz et al. (2022), specifically Table 3 and Appendix 4.2.
So on the MATH benchmark they find that Minerva 62B scores 84.5% for pass@256 but only 43.4% for maj@256. In plain English: if you have Minerva 62B attempt a problem 256 times, there's an 84.5% chance that at least one of the attempts is successful, but only a 43.4% chance that the correct answer is the most commonly chosen one.
If your method can tell which trajectories failed, then you should be able to, e.g., have Minerva 62B take 256 attempts at each MATH benchmark problem, discard the trajectories your method flags as failures, and return the majority answer among the rest. That combined answer should be correct at a rate well above the 43.4% maj@256 number above.
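To make the comparison concrete, here is a minimal sketch of plain majority vote versus the filtered variant described above. The `looks_failed` predicate stands in for the proposed method and is entirely hypothetical; the sample answers are toy data.

```python
from collections import Counter

def majority_answer(answers):
    """Plain majority vote: most common answer across all trajectories."""
    return Counter(answers).most_common(1)[0][0]

def filtered_majority_answer(answers, looks_failed):
    """Majority vote over only the trajectories the filter does NOT flag.
    Falls back to a plain majority vote if the filter rejects everything."""
    kept = [a for a in answers if not looks_failed(a)]
    return majority_answer(kept if kept else answers)

# Toy illustration: 5 sampled answers; the correct answer is "72",
# but "68" wins a plain majority vote.
samples = ["68", "68", "72", "72", "68"]
print(majority_answer(samples))                                  # -> 68
# A (hypothetical) filter that flags the "68" trajectories as failed
# would flip the vote to the correct answer.
print(filtered_majority_answer(samples, lambda a: a == "68"))    # -> 72
```

The claim being tested is exactly that such a filter exists and beats the unfiltered vote at scale.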
I predict that if you run this experiment, you will find that your consensus mechanism is not in fact better than a majority vote. Demonstrating that it is much better than a majority vote, though, would provide pretty strong evidence that you are onto something here.
I just did a math run; I think it would be significantly better. You can try it. I just want it independently evaluated. It's just math at this point; it's not a full working system. I am just running a logic prompt and manually checking the math afterwards. Checking 256 runs would be a lot of work.
… you’re asking people to invest a bunch of their own time and sign an NDA. You can obviously do what you want, but I think it would be courteous to check that your claims survive at least some contact with reality.
You definitely don't need @256; @7 would suffice for MATH or GSM8K. That said, GSM8K is a benchmark, not a single test, so that's 7 samples times 1,319 rows to get performance numbers. You would need to automate your math checking, but you need to do that anyway if you want a big enough sample size not to fool yourself (n=30 is just not enough to be able to say your method is better than a majority vote).
Claude can reliably one-shot something like:
Use the OpenRouter API (OpenAI-compatible; base URL is https://openrouter.ai/api/v1, use the chat/completions endpoint) to serve the model https://openrouter.ai/qwen/qwen3-8b (model name is qwen/qwen3-8b). Qwen is a reasoning model, so max_tokens needs to be fairly high; set it to 4096.
Entries in the benchmark look like this:
{
'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'
}
The first match of the regex `#### [0-9.e+-]+` should be treated as the answer.
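As a sketch, that extraction rule might look like this in Python (the function name is mine; only the regex comes from the spec above):

```python
import re

# Matches the GSM8K answer marker, e.g. "#### 72".
ANSWER_RE = re.compile(r"#### ([0-9.e+-]+)")

def extract_answer(text):
    """Return the number after the first '#### ' marker, or None if absent."""
    m = ANSWER_RE.search(text)
    return m.group(1) if m else None
```

Applied to the example entry above, `extract_answer` pulls out `"72"` from the trailing `#### 72`.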
Huggingface API key is env var HF_TOKEN. Openrouter API key is env var OPENROUTER_API_KEY.
I will later be extending this code to test a new consensus mechanism, which I will be benchmarking against pass@k, majority@k, and unanimous@k. As such, process all k copies of the same eval item in parallel, but do not parallelize between eval items (i.e. fully process all k generations for the first of the 1,319 questions in the GSM8K dataset, determine whether pass@k and maj@k succeeded for that datapoint, then fully process the second, and so on).
Strongly prioritize simplicity in the code you write. Aim for under 100 lines of python code in a single file. Declare any constants or global variables at the top of the file.
Clearly indicate the spot where all k model responses are available for me to compute the consensus answer using my fancy new method.
If I’m doing my math right, running that eval should cost between $0.25 and $0.50.
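A hedged sketch of what that one-shot script might look like, under the spec above. The `#### <answer>` instruction in the prompt and the empty `items` placeholder are my additions; dataset loading is deliberately left abstract (in practice you would pull the 1,319-row GSM8K test split, e.g. with the `datasets` library and HF_TOKEN).

```python
import os, re, json, urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# --- constants at the top of the file, per the spec ---
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "qwen/qwen3-8b"
K = 7                 # samples per question
MAX_TOKENS = 4096     # Qwen is a reasoning model; give it room
ANSWER_RE = re.compile(r"#### ([0-9.e+-]+)")

def extract_answer(text):
    """First '#### <number>' match, or None."""
    m = ANSWER_RE.search(text or "")
    return m.group(1) if m else None

def score_item(gold, answers):
    """Given the gold answer and k extracted answers, return (pass@k, maj@k)."""
    passed = gold in answers
    maj = Counter(a for a in answers if a is not None).most_common(1)
    return passed, bool(maj) and maj[0][0] == gold

def sample_once(question):
    """One chat completion via the OpenRouter API (OpenAI-compatible)."""
    req = urllib.request.Request(
        OPENROUTER_URL,
        data=json.dumps({
            "model": MODEL,
            "max_tokens": MAX_TOKENS,
            "messages": [{"role": "user",
                          "content": question + "\nEnd with '#### <answer>'."}],
        }).encode(),
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

def main():
    items = []  # placeholder: list of {"question": ..., "answer": ...} rows
    for item in items:
        gold = extract_answer(item["answer"])
        # All k samples for ONE question run in parallel; questions themselves
        # are processed strictly one after another, per the spec.
        with ThreadPoolExecutor(max_workers=K) as pool:
            responses = list(pool.map(sample_once, [item["question"]] * K))
        answers = [extract_answer(r) for r in responses]
        # >>> CONSENSUS HOOK: all k raw responses for this item are available
        # >>> here; compute the new consensus answer from `responses`/`answers`.
        print(score_item(gold, answers))

if __name__ == "__main__":
    main()
```

The marked hook is the spot where the new mechanism would slot in, with pass@k and maj@k already computed per item for comparison.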
This is a maths run: I am quietly confident. It nailed this faulty equation (GPT failed the test; Claude nailed it). Interestingly, I did not design my formula with math in mind.
Top-Level Claim (C): "The quadratic formula for ax² + bx + c = 0 is x = (-b ± √(b² + 4ac)) / 2a"
Clan: Pure Mathematics (Algebra)

Step 1: Rival Hypotheses
* H₁: The formula as stated is correct
* H₂: The correct formula has −4ac (not +4ac) in the discriminant
* H₃: The quadratic formula has a different form entirely
* H₄: There is no general formula for solving quadratic equations

Step 2: Prior Probabilities
* P(H₁) = 0.05 (extremely low due to contradiction with established mathematics)
* P(H₂) = 0.90 (high confidence in the standard, well-established formula)
* P(H₃) = 0.04 (very low, as alternative forms are equivalent transformations)
* P(H₄) = 0.01 (essentially zero, as the formula's existence is proven)

Step 3: Evidence Likelihoods
The evidence includes centuries of mathematical literature, textbooks, and successful applications:
* P(E | H₁) = 0.001 (observed evidence strongly contradicts this)
* P(E | H₂) = 0.999 (all evidence supports the standard formula)
* P(E | H₃) = 0.1 (some evidence could support alternative forms)
* P(E | H₄) = 0.001 (evidence strongly contradicts this)

Step 4: Support Score (S)
Best-fit hypothesis is H₂ with S = 0.999

Step 5: Truth Score (T)
With credibility scores all at 1.0 for mathematical claims:
T = (0.05 × 1.0 × 0.001 + 0.90 × 1.0 × 0.999 + 0.04 × 1.0 × 0.1 + 0.01 × 1.0 × 0.001) / 1.0 ≈ 0.90

Step 6: Final Weight (W)
W = 0.999 × 0.90 ≈ 0.90

Step 7: Verdict
❌ FAIL: The claim contains a fundamental mathematical error. The correct quadratic formula uses −4ac, not +4ac, in the discriminant. This is not a matter of interpretation but of mathematical fact, verifiable through derivation and consistent with all mathematical literature. The correct claim would be: "The quadratic formula for ax² + bx + c = 0 is x = (-b ± √(b² − 4ac)) / 2a"
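For what it's worth, the arithmetic in Steps 5 and 6 checks out; a quick verification using only the numbers quoted above:

```python
# Step 5: Truth Score, exactly as laid out in the run above.
priors = [0.05, 0.90, 0.04, 0.01]
likelihoods = [0.001, 0.999, 0.1, 0.001]
T = sum(p * 1.0 * l for p, l in zip(priors, likelihoods))  # all credibilities 1.0
# Step 6: Final Weight = S x T with S = 0.999.
W = 0.999 * T
print(round(T, 4), round(W, 4))  # -> 0.9032 0.9023
```

Both round to the ≈ 0.90 figures reported in the run.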