It seems like Occam’s Razor just logically follows from the basic premises of probability theory. Assume the “complexity” of a hypothesis is how many bits it takes to specify under a particular method of specifying hypotheses, and that hypotheses can be of any length.
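If I'm reading the quoted argument right, the implicit prior is presumably something like the following, where L(h) is the number of bits needed to specify hypothesis h (my notation, not anything stated above):

$$P(h) \propto 2^{-L(h)}$$

Each extra bit of description halves the prior weight, so shorter (simpler) hypotheses automatically start out more probable.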
I think the method can't be arbitrary for your argument to work. If we were to measure the "complexity" of a hypothesis by how many outcomes the event set of the hypothesis contains (or by how many disjuncts some non-full disjunctive normal form has), then the more "complex" hypotheses would be more likely, not less.
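To make the point concrete, here is a minimal sketch with a made-up four-world sample space (the world names and probabilities are purely illustrative):

```python
# Four "possible worlds" with made-up probabilities summing to 1.
outcomes = {"w1": 0.2, "w2": 0.3, "w3": 0.1, "w4": 0.4}

def prob(event):
    # The probability of a hypothesis/event is the sum of the
    # probabilities of the outcomes it contains.
    return sum(outcomes[w] for w in event)

simple_hypothesis = {"w1"}               # one outcome / one disjunct
complex_hypothesis = {"w1", "w2", "w3"}  # three outcomes / three disjuncts

print(prob(simple_hypothesis))   # 0.2
print(prob(complex_hypothesis))  # 0.6: the "complex" hypothesis is more probable
```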
Of course that's an implausible definition of complexity. Using conjunctive normal forms for measuring complexity, for example, is overall more intuitive, since A ∧ B is considered more complex than A, and it would indeed lead to more "complex" hypotheses being less likely. But universally quantified statements (like laws of nature) correspond to very long (perhaps infinitely long) CNFs, which would suggest that they are highly complex and unlikely. Yet intuitively, laws of nature can be simple and likely.
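In the conjunctive direction the inequality does point the right way, since each added conjunct can only lower the probability; the trouble is that, assuming the objects a_1, a_2, ... enumerate the domain, a law looks like an unbounded conjunction of its instances:

$$P(A \wedge B) \le P(A), \qquad \forall x\, F(x) \;\equiv\; F(a_1) \wedge F(a_2) \wedge F(a_3) \wedge \dots$$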
This was basically the problem Wittgenstein later identified with the definition of logical probability he had given in the Tractatus a hundred years ago. Laws would have logical probability 0, which is absurd: it would mean that no amount of finite evidence could confirm them to any degree. Rudolf Carnap later worked on this problem, but I'm pretty sure it is still considered unsolved. If someone does solve it, and finds an a priori justification for Ockham's razor, that could be used as a solution to the problem of induction, the core problem of epistemology.
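To spell out why the probability goes to zero: if, purely for illustration, the instances are treated as independent and each gets some fixed probability p < 1, then the law's prior vanishes in the limit, and by Bayes' theorem a zero prior cannot be raised by any evidence E with P(E) > 0:

$$P\Big(\bigwedge_{i=1}^{n} F(a_i)\Big) = p^n \to 0, \qquad P(\text{law} \mid E) = \frac{P(E \mid \text{law})\,P(\text{law})}{P(E)} = 0 \text{ whenever } P(\text{law}) = 0.$$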
So finding a plausible definition of hypothesis complexity for which Ockham's razor also holds is a very hard open problem.
A hypothesis corresponds to an "event" in Kolmogorov probability theory. Such an event is the set of "outcomes" under which the hypothesis is true, where the outcomes form a mutually exclusive and jointly exhaustive collection of possibilities. An outcome is basically a possible world. So a hypothesis can be thought of as the set of (or the disjunction of) the possible worlds in which it is true, and the probability of an event/hypothesis is exactly the sum of the probabilities of its outcomes.
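In symbols (my notation), writing Ω for the set of possible worlds and H ⊆ Ω for a hypothesis, this is, at least in the countable case:

$$P(H) = \sum_{\omega \in H} P(\{\omega\})$$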
That being said, this way of seeing things is problematic, since most hypotheses are true in infinitely many "possible worlds", and it is not clear how we could sum infinitely many probabilities. And if we don't use possible worlds as outcomes, but instead the disjuncts of arbitrary DNFs, the decomposition (partition) becomes non-unique, because such outcomes are no longer distinguishable from ordinary events. Possible worlds at least have the special feature of being "atomic" in the sense that they can't be further decomposed into mutually exclusive and jointly exhaustive disjuncts.
Ideally we would want finitely many non-arbitrary outcomes for each hypothesis. Then we could apply the principle of indifference, assign equal probabilities, and take the sum. But that doesn't seem to work.
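Spelled out, the hoped-for picture would be: N non-arbitrary outcomes in total, each with probability 1/N by indifference, so that a hypothesis H containing |H| of them gets

$$P(H) = \frac{|H|}{N}$$

but no such non-arbitrary finite set of outcomes seems to be available.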
So the whole disjunctive approach does not seem particularly promising. An alternative is Wittgenstein-style logical atomism, which supposes that there are "atomic propositions" of some sort, presumably about sense data, which are all independent of each other. Hypotheses/events are then supposed to be logical combinations of these. This approach is riddled with technical difficulties as well, and it is unclear how it could be made precise.
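For what it's worth, the usual toy reconstruction of the Tractatus picture goes something like this (purely illustrative, and the choice of atomic propositions is exactly what is unclear): fix n independent atomic propositions, treat each assignment of truth values to them as an equally likely possible world, and let the probability of a proposition be the fraction of assignments under which it is true. The sketch also makes the earlier problem visible: a "law" constraining all atoms has probability 2^(-n), which goes to 0 as n grows.

```python
from itertools import product

def logical_probability(prop, n_atoms):
    # Tractatus-style toy model: every assignment of truth values to the
    # n atomic propositions is one equally likely "possible world"; the
    # probability of a proposition is the fraction of worlds where it holds.
    worlds = list(product([True, False], repeat=n_atoms))
    return sum(prop(world) for world in worlds) / len(worlds)

# A single atomic proposition gets probability 1/2:
print(logical_probability(lambda w: w[0], n_atoms=4))           # 0.5

# A conjunction of two atoms gets 1/4, so more "complex" means less likely:
print(logical_probability(lambda w: w[0] and w[1], n_atoms=4))  # 0.25

# But a "law" saying that every atom is true shrinks towards 0 as n grows:
for n in (2, 4, 8, 16):
    print(n, logical_probability(lambda w: all(w), n_atoms=n))
```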