How it works: a statement is made, and I have a formula that converts it to math. With that, a Bayesian equation ties all associated claims and rivals (opposing arguments) into a lattice of similar claims. As weights shift, the change flows through the chains (sort of a 3D matrix): one weight shifts, and that causes a change that cascades through the system.
Edit: why it works for tying LLMs together: you can see in the maths whether the LLMs have reached similar conclusions, whether there is drift, or in some cases whether one has gone off the rails. In testing, the only numbers allowed are between 0 and 1; when an LLM drifts, it goes wildly outside that range, and that is a huge red flag.
Because I am an outsider (ex IT), I approached this in a whole new way. It's how I see it in my head, not the correct technical terms, but meh. I think it works; my tests show it working across all LLMs, and I need someone to validate it. I know it's a huge claim I am making, but the math looks solid.
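To make the drift check above concrete, here is a minimal sketch of the range rule it describes: every score an LLM reports must land between 0 and 1, and anything outside that interval is flagged. The score names (T, S, W) and the dictionary shape are illustrative assumptions, not the actual formula.

# Minimal sketch of the range check described above. The score names and the
# [0, 1] rule being the only check are assumptions for illustration.

def audit_scores(scores):
    flags = []
    for name, value in scores.items():
        if not 0.0 <= value <= 1.0:
            flags.append(f"{name}={value} is outside [0, 1] -- possible drift")
    return flags

# An LLM that reports a weight of 1.7 gets flagged immediately.
print(audit_scores({"T": 0.8, "S": 0.7, "W": 1.7}))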
Do you have some specific example of a task that LLMs can consistently do with your system that they are unable to consistently do with more standard consensus mechanisms (e.g. confidence-weighted majority vote, historical performance weighting)? Ideally, can you demonstrate improved performance on one of the standard benchmarks (e.g. HellaSwag for commonsense reasoning, GSM8K for math word problems)?
If not, what concrete observations lead you to believe that this technique works?
If the answer is "I spent a lot of time collaboratively refining the idea with ChatGPT without grounding the discussion in real-world observations", beware. That is a fast and reliable method to fool yourself, because ChatGPT has been tuned to produce answers that people like, and people like when their ideas are validated, and so one thing that ChatGPT is very good at is saying supportive things about the user's idea that the user won't find any holes in. "The user won't find any holes" could be because there are no holes, but often it's just because the model truesights you and figures out what arguments will sound good to you specifically, regardless of truth value.
(e.g. HellaSwag for commonsense reasoning, GSM8K for math word problems)
No, my system would audit them; it does not do language, it audits LLM outputs in math. It won't do GSM8K itself; it is read-only, not write. However, if you ran 4 models through GSM8K, it would tell you which ones failed. I have spent a lot of time testing my model's outputs across Grok, ChatGPT, Claude and DeepSeek (DeepSeek was a mistake; interestingly, the bigger models are worse). They all return similar numbers, and it's very clear when they drift. So yes, I have real world observations. I am auditing their outputs for the exact reason you noted above: I do not trust them. It's actually basic math that runs in my logic prompt to test; the maths and prompt are less than half a page. But yes, I am here to have it tested, because I know how manipulative LLMs can be.
I am shocked how well it works; I was trying to fix another problem: "how to get LLMs to not be $h..t". I needed them to act as a debate referee, but they couldn't even do that, so I went down the path of adding structure to get a better referee, trying to put structured math around questions. It was my "moon is cheese" test that gave the clue.
"The moon is round like a cheese wheel." This statement is 100% correct, and 99% wrong to use as a basis for determining what the moon is made of. It carries a small amount of weight as evidence that the moon may be cheese. In isolation, a true judge would say that on current evidence the moon is a cheese wheel. Until it is overridden by some new evidence; then it should shift position. Anyway, from this I derived an equation. Not an LLM; it's new math.
I noticed in my test runs that I could see when the LLMs were "faking" it, for example numbers outside allowable ranges. Then I realised that what I may have is an LLM auditor, not what I was trying to create. If I am wrong, it will take you a few minutes; if I am right, we can run LLMs in parallel.
However, if you ran 4 models through GSM8K, it would tell you which ones failed
If this is true, then your method is a better consensus mechanism than majority vote, which should show up in benchmarks. To see what I mean, have a look at Lewkowycz et al. 2022, and specifically at Table 3 and Appendix 4.2.
[Table 3 and Appendix 4.2 from Lewkowycz et al. 2022]
So on the MATH benchmark they find that Minerva 62B scores 84.5% for pass@256 but only 43.4% for maj@256. In plain English, that means that if you have Minerva 62B attempt the problem 256 times, there's an 84.5% chance that at least one of the attempts is successful, but only a 43.4% chance that the correct answer is the most-commonly-chosen one.
If your method can tell which trajectories failed, then you should be able to e.g. have Minerva 62B take 256 attempts at the MATH benchmark problems, drop the failing trajectories, and return the majority answer of the rest, and have that combined answer be correct at a rate well above the 43.4% maj@256 number above.
I predict that if you run this experiment, you will find that your consensus mechanism is not in fact better than a majority vote. Demonstrating that it is much better than a majority vote, though, would provide pretty strong evidence that you are onto something here.
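To spell out what "better than a majority vote" would mean operationally, here is a minimal sketch of the three aggregation rules under discussion, with is_flagged standing in as a placeholder for whatever auditing check is being proposed; the names and structure are illustrative assumptions, not anyone's actual implementation.

from collections import Counter

def pass_at_k(answers, correct):
    # pass@k: at least one of the k sampled answers is correct.
    return any(a == correct for a in answers)

def maj_at_k(answers, correct):
    # maj@k: the most common sampled answer is the correct one.
    return Counter(answers).most_common(1)[0][0] == correct

def filtered_maj_at_k(answers, correct, is_flagged):
    # The proposed experiment: drop the trajectories the auditor flags as failed,
    # then take a plain majority vote over whatever remains.
    kept = [a for a in answers if not is_flagged(a)]
    return bool(kept) and Counter(kept).most_common(1)[0][0] == correct

Averaged over the whole benchmark, filtered_maj_at_k beating maj_at_k by a wide margin is the kind of evidence being asked for here.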
I just did a math run; I think it would be significantly better. You can try it; I just want it independently evaluated. It's just math at this point, not a full working system: I am just running a logic prompt and manually checking the math after. Checking 256 runs would be a lot of work.
… you’re asking people to invest a bunch of their own time and sign an NDA. You can obviously do what you want, but I think it would be courteous to check that your claims survive at least some contact with reality.
You definitely don't need to go to @256; @7 would suffice for MATH or GSM8K. That said, GSM8K is a benchmark, not a single test, so that's 7 samples times 1319 rows to get performance numbers. You would need to automate your math, but you need to do that anyway if you want to get a big enough sample size not to fool yourself (n=30 is just not enough to be able to say your method is better than a majority vote).
Claude can reliably one-shot something like:
Use the OpenRouter API (OpenAI-compatible) to serve the model https://openrouter.ai/qwen/qwen3-8b (base URL is https://openrouter.ai/api/v1, use the /chat/completions endpoint, model is qwen/qwen3-8b). Qwen is a reasoning model, so max_tokens needs to be fairly high; set it to 4096.
Entries in the benchmark look like this:
{
'question': 'Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?',
'answer': 'Natalia sold 48/2 = <<48/2=24>>24 clips in May.\nNatalia sold 48+24 = <<48+24=72>>72 clips altogether in April and May.\n#### 72'
}
The first match of #### [0-9.e+-]+ should be treated as the answer.
Huggingface API key is env var HF_TOKEN. Openrouter API key is env var OPENROUTER_API_KEY.
I will later be extending this code to test a new consensus mechanism, which I will be benchmarking against pass@k, majority@k, and unanimous@k. As such, process all k copies of the same eval item in parallel, but do not parallelize between eval items (i.e. fully process all k generations for the first of the 1319 questions in the gsm8k dataset, determine whether pass@k and maj@k succeeded for that datapoint, then fully process the second, and so on).
Strongly prioritize simplicity in the code you write. Aim for under 100 lines of python code in a single file. Declare any constants or global variables at the top of the file.
Clearly indicate the spot where all k model responses are available for me to compute the consensus answer using my fancy new method.
If I’m doing my math right, running that eval should cost between $0.25 and $0.50.
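For concreteness, here is a rough sketch of the kind of script that prompt describes. It has not been run against the live services: the OpenRouter request is the standard OpenAI-compatible chat completions call, the dataset is the Hugging Face gsm8k "main" test split, and the model slug, the added "give your final answer as '#### <number>'" instruction, and K=7 are assumptions to verify rather than anything confirmed above. The marked comment shows where all k responses are available for a custom consensus mechanism.

import os
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import requests
from datasets import load_dataset

# ---- constants (adjust as needed) ----
OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
OPENROUTER_API_KEY = os.environ["OPENROUTER_API_KEY"]
MODEL = "qwen/qwen3-8b"   # assumed OpenRouter slug for the model linked above
MAX_TOKENS = 4096
K = 7                     # samples per question
ANSWER_RE = re.compile(r"#### ([0-9.e+-]+)")

def extract_answer(text):
    # The first match of '#### <number>' is treated as the answer, per the spec above.
    m = ANSWER_RE.search(text or "")
    return m.group(1) if m else None

def sample_once(question):
    # One OpenAI-compatible chat completion call via OpenRouter.
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
        json={
            "model": MODEL,
            "max_tokens": MAX_TOKENS,
            "messages": [{
                "role": "user",
                "content": question + "\nGive your final answer on its own line as '#### <number>'.",
            }],
        },
        timeout=600,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

def main():
    # GSM8K test split has 1319 rows; HF_TOKEN is read from the environment by huggingface_hub if needed.
    data = load_dataset("gsm8k", "main", split="test")
    n_pass = n_maj = n_total = 0
    for row in data:
        gold = extract_answer(row["answer"])
        # Process all K generations for this one question in parallel,
        # but do not parallelize across questions.
        with ThreadPoolExecutor(max_workers=K) as pool:
            completions = list(pool.map(sample_once, [row["question"]] * K))
        answers = [extract_answer(c) for c in completions]
        # >>> All K model responses for this question are available here (completions, answers):
        # >>> this is the spot to plug in and score a new consensus mechanism.
        n_pass += any(a == gold for a in answers if a is not None)
        votes = Counter(a for a in answers if a is not None)
        n_maj += bool(votes) and votes.most_common(1)[0][0] == gold
        n_total += 1
        print(f"{n_total}/1319  pass@{K}={n_pass / n_total:.3f}  maj@{K}={n_maj / n_total:.3f}")

if __name__ == "__main__":
    main()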
This is a maths run: I am quietly confident. It nailed this faulty equation, although GPT failed the test; Claude nailed it. Interestingly, I did not design my formula with math in mind.
Top-Level Claim (C): "The quadratic formula for ax² + bx + c = 0 is x = (-b ± √(b² + 4ac)) / 2a"
Clan: Pure Mathematics (Algebra)

Step 1: Rival Hypotheses
* H₁: The formula as stated is correct
* H₂: The correct formula has −4ac (not +4ac) in the discriminant
* H₃: The quadratic formula has a different form entirely
* H₄: There is no general formula for solving quadratic equations

Step 2: Prior Probabilities
* P(H₁) = 0.05 (extremely low due to contradiction with established mathematics)
* P(H₂) = 0.90 (high confidence in the standard, well-established formula)
* P(H₃) = 0.04 (very low, as alternative forms are equivalent transformations)
* P(H₄) = 0.01 (essentially zero, as the formula's existence is proven)

Step 3: Evidence Likelihoods
The evidence includes centuries of mathematical literature, textbooks, and successful applications:
* P(E | H₁) = 0.001 (observed evidence strongly contradicts this)
* P(E | H₂) = 0.999 (all evidence supports the standard formula)
* P(E | H₃) = 0.1 (some evidence could support alternative forms)
* P(E | H₄) = 0.001 (evidence strongly contradicts this)

Step 4: Support Score (S)
Best-fit hypothesis is H₂ with S = 0.999

Step 5: Truth Score (T)
With credibility scores all at 1.0 for mathematical claims:
T = (0.05 × 1.0 × 0.001 + 0.90 × 1.0 × 0.999 + 0.04 × 1.0 × 0.1 + 0.01 × 1.0 × 0.001) / 1.0 ≈ 0.90

Step 6: Final Weight (W)
W = 0.999 × 0.90 ≈ 0.90

Step 7: Verdict
❌ FAIL: The claim contains a fundamental mathematical error. The correct quadratic formula uses −4ac, not +4ac, in the discriminant. This is not a matter of interpretation but of mathematical fact, verifiable through derivation and consistent with all mathematical literature. The correct claim would be: "The quadratic formula for ax² + bx + c = 0 is x = (-b ± √(b² − 4ac)) / 2a"
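For anyone checking the arithmetic in that run, the two numeric steps reduce to a weighted sum and a product. The snippet below just reproduces the numbers listed above (priors, likelihoods, credibility all 1.0); treating S as the likelihood of the best-fit hypothesis is inferred from the sample output, not a confirmed definition.

# Reproduce the Truth Score (T) and Final Weight (W) from the sample run above.
priors      = [0.05, 0.90, 0.04, 0.01]      # P(H1)..P(H4)
likelihoods = [0.001, 0.999, 0.1, 0.001]    # P(E | H1)..P(E | H4)
credibility = [1.0, 1.0, 1.0, 1.0]          # all 1.0 for mathematical claims

S = max(likelihoods)                                                     # 0.999 (best-fit hypothesis H2)
T = sum(p * c * l for p, c, l in zip(priors, credibility, likelihoods))  # ~0.903
W = S * T                                                                # ~0.902

print(f"S={S:.3f}  T={T:.3f}  W={W:.3f}")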
Until it is overridden by some new evidence; then it should shift position.
Mhm, but just to lampshade this: this assumes that "the moon is round like a wheel of cheese" is considered "in isolation" not merely in any relevant sense, but in the strictest possible sense, in which it is literally the first thing ever learned. That is extremely implausible, but nonetheless I think being able to entertain such a strict and abstract definition for the sake of argument is very based; upvoted.
I was stress testing LLMs for debate judging. Each had to start from an "I know nothing, convince me" position, which they are very bad at doing; if you want to see bias, try this test. It also highlights that they have bias in grey areas, which is much harder to detect. So "moon is cheese" was an edge case which turned into the light bulb moment.
How it works: a statement is made, and I have a formula that converts it to math. With that, a Bayesian equation ties all associated claims and rivals (opposing arguments) into a lattice of similar claims. As weights shift, the change flows through the chains (sort of a 3D matrix): one weight shifts, and that causes a change that cascades through the system.
Edit: why it works for tying LLMs together: you can see in the maths whether the LLMs have reached similar conclusions, whether there is drift, or in some cases whether one has gone off the rails. In testing, the only numbers allowed are between 0 and 1; when an LLM drifts, it goes wildly outside that range, and that is a huge red flag.
The specific thing I am wondering about is the sense in which it is a “lattice” rather than a DAG as is commonly used in Bayesian networks.
Here is a sample output, from a run between two LLM claims:
Claim A: “Red meat causes cancer” — T=0.8, S=0.7 → W=0.56
Claim B: “Red meat doesn’t cause cancer” — T=0.9, S=0.8 → W=0.72
This would trigger a flag: both cannot be true at the same time. They did not do their homework.
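One minimal way a flag like that could be implemented, purely as an illustration: if two claims are declared rivals (they cannot both be true), their truth scores should not sum to much more than 1, so T=0.8 and T=0.9 together trip the check. The threshold and the data layout are assumptions, not the actual rule.

# Illustrative contradiction check for a pair of rival claims.
def flag_rivals(claim_a, claim_b, tolerance=0.05):
    total = claim_a["T"] + claim_b["T"]
    if total > 1.0 + tolerance:
        return f"FLAG: rival claims' truth scores sum to {total:.2f} (> 1); they cannot both be true"
    return "OK"

claim_a = {"text": "Red meat causes cancer", "T": 0.8, "S": 0.7, "W": 0.56}
claim_b = {"text": "Red meat doesn't cause cancer", "T": 0.9, "S": 0.8, "W": 0.72}
print(flag_rivals(claim_a, claim_b))   # -> FLAG: rival claims' truth scores sum to 1.70 ...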