Since the parameters in your implementation are 32-bit floats, you assign a complexity cost of 32 ⋅ 2^n bits to n-th order Markov chains, and look at the sum of fit (log loss) and complexity.
Something about this feels wrong. The precision of your floats shouldn’t be what determines the complexity of your Markov chain; the expressivity of an nth-order Markov chain will almost always be worse than that of a (n+1)th-order Markov chain, even if the latter has access to higher precision floats than the former. Also, in the extreme case where you’re working with real numbers, you’d end up with the absurd conclusion that every Markov chain has infinite complexity, which is obviously nonsensical.
This does raise the question of how to assign complexity to Markov chains; it’s clearly going to be linear in the number of parameters (and hence exponential in the order of the chain), which means the general form k ⋅ 2^n seems correct… but the value you choose for the coefficient k seems underdetermined.
You’re right! This is something that the literature has details on, that I chose to simplify/gloss-over in pursuit of my goal of “write Actual Executable Code implementing minimum-description-length model selection and then tell a cute story about it.”
Chapter 5 of Peter D. Grünwald’s The Minimum Description Length Principle gives the complexity term (assuming that the parameter space is compact) as kd+LN(k)+LN(d), where k is the number of parameters, d is the bits of percision per parameter, and LN is the codelength function for a universal prior on the integers (the theoretical ideal that Levenshtein coding and the Elias ω coding approach). Basically, in order to specify your parameters, you need to first say how many of them there are and to what precision, and for that, you need a prefix-free code for the integers, which is itself kind of intricate: you can’t give every integer the same codelength, so you end up doing something like “first give the order of magnitude (in, um, unary), and then pick out a specific integer in that range”, but recursively (so that you first give the order of magnitude of the order of magnitude if it gets too big, &c.).
(Oh, and Grünwald warns that this still isn’t optimal and that question of the best encoding is left to §15.3, but I’m not there yet—it’s a big textbook, okay?)
I was originally imagining implementing the integer coding, but I decided it was too complicated for the cute story I wanted to tell about model selection, particularly since I wanted to provide Actual Executable Code with minimal boilerplate and minimal dependencies. (Rust just gives me f32 and f64, not variable-length floats.) I think this kind of handwaving should be admissible in the cute-blog-story genre as long as the author addresses objections in the comment section.
Something about this feels wrong. The precision of your floats shouldn’t be what determines the complexity of your Markov chain; the expressivity of an nth-order Markov chain will almost always be worse than that of a (n+1)th-order Markov chain, even if the latter has access to higher precision floats than the former. Also, in the extreme case where you’re working with real numbers, you’d end up with the absurd conclusion that every Markov chain has infinite complexity, which is obviously nonsensical.
This does raise the question of how to assign complexity to Markov chains; it’s clearly going to be linear in the number of parameters (and hence exponential in the order of the chain), which means the general form k ⋅ 2^n seems correct… but the value you choose for the coefficient k seems underdetermined.
You’re right! This is something that the literature has details on, that I chose to simplify/gloss-over in pursuit of my goal of “write Actual Executable Code implementing minimum-description-length model selection and then tell a cute story about it.”
Chapter 5 of Peter D. Grünwald’s The Minimum Description Length Principle gives the complexity term (assuming that the parameter space is compact) as kd+LN(k)+LN(d), where k is the number of parameters, d is the bits of percision per parameter, and LN is the codelength function for a universal prior on the integers (the theoretical ideal that Levenshtein coding and the Elias ω coding approach). Basically, in order to specify your parameters, you need to first say how many of them there are and to what precision, and for that, you need a prefix-free code for the integers, which is itself kind of intricate: you can’t give every integer the same codelength, so you end up doing something like “first give the order of magnitude (in, um, unary), and then pick out a specific integer in that range”, but recursively (so that you first give the order of magnitude of the order of magnitude if it gets too big, &c.).
(Oh, and Grünwald warns that this still isn’t optimal and that question of the best encoding is left to §15.3, but I’m not there yet—it’s a big textbook, okay?)
I was originally imagining implementing the integer coding, but I decided it was too complicated for the cute story I wanted to tell about model selection, particularly since I wanted to provide Actual Executable Code with minimal boilerplate and minimal dependencies. (Rust just gives me
f32
andf64
, not variable-length floats.) I think this kind of handwaving should be admissible in the cute-blog-story genre as long as the author addresses objections in the comment section.