Might it be simpler to remove the source of model non-determinism that causes different results at temperature 0? If it’s due to a hardware bug, then this seems like a signal that the node should be replaced. If it’s due to a software bug, then it should be patched.
Non-determinism is super common during inference, and removing it is often hard.
It’s possible if you are using the same hardware for verification and inference and are willing to eat some performance hit (see https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/), but it’s much trickier if you are using two different kinds of hardware (because of floating-point non-associativity).
Adding to Fabien’s point: even if fully deterministic inference is achieved, the verification scheme we describe is still necessary, and you still need trusted logging of (input, output, seed) tuples and a verification server to check them. Determinism just reduces the sampling required for a given confidence level, since there’s no legitimate noise to account for.
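To make that concrete, here is a minimal, purely illustrative sketch of what a verification pass over logged (input, output, seed) tuples might look like; the `model.generate` interface and the agreement threshold are assumptions for illustration, not the scheme from the paper.

```python
# Illustrative sketch only -- not the scheme from the paper.
# Assumes a hypothetical `model.generate(prompt, seed=...)` that re-runs
# inference on the trusted verification server and returns a token list.

def verify_log(logged_tuples, model, agreement_threshold=0.95):
    """Re-run logged (input, output, seed) tuples and check token-level
    agreement between the logged output and the regenerated output."""
    for prompt, logged_tokens, seed in logged_tuples:
        regenerated = model.generate(prompt, seed=seed)
        matches = sum(a == b for a, b in zip(regenerated, logged_tokens))
        agreement = matches / max(len(logged_tokens), 1)
        # With fully deterministic inference you could require agreement == 1.0;
        # with benign non-determinism you tolerate some mismatch and instead
        # sample more tuples to reach a given confidence level.
        if agreement < agreement_threshold:
            return False  # flag this entry for further investigation
    return True
```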
As a very simple proof of concept that this does happen: we queried OpenAI’s GPT-4o-mini with the same prompt, “Once upon a time,” and a fixed seed, and got ~5 different generations.
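For anyone who wants to reproduce this, a rough sketch using the OpenAI Python SDK (the exact parameters here are ours, and you need an API key configured in your environment):

```python
# Rough reproduction sketch: same prompt, fixed seed, temperature 0.
from openai import OpenAI

client = OpenAI()
completions = set()
for _ in range(20):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Once upon a time,"}],
        temperature=0,
        seed=0,          # fixed seed
        max_tokens=50,
    )
    completions.add(resp.choices[0].message.content)

# Despite temperature=0 and a fixed seed, this set typically contains
# more than one distinct completion.
print(len(completions), "distinct generations")
```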
There are lots of reasons for non-determinism, but one important one is floating-point non-associativity ((a + b) + c != a + (b + c)), and ML frameworks don’t fix the order of the floating-point reductions inside matrix multiplications. Thus, if you run the same model on two different machines, you can get different logits (and thus different sampled tokens), because different GPUs have different preferred orderings for matmul reductions. A two-line illustration of the non-associativity itself is below.
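```python
# Floating-point addition is not associative: the result depends on
# the order in which the terms are reduced.
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0 -- b + c rounds back to -1e16
```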
(We give a more formal discussion of sources of non-determinism in Section 4.4 of the full paper (link).)