It’s true that the probability of a microstate is determined by energy and temperature, but the Maxwell-Boltzmann distribution assumes that temperature is constant for all particles. Temperature is a distinguishing feature of two distributions, not of two particles within a distribution, and least-temperature is not a state that systems tend towards.
As an aside, the canonical ensemble that the Maxwell-Boltzmann distribution assumes is only applicable when a given state is exceedingly unlikely to be occupied by multiple particles. The strange behavior of condensed matter that I think you’re referring to (Bose-Einstein condensates) is a consequence of this assumption being incorrect for bosons, where a stars-and-bars model is more appropriate.
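To make the counting point concrete, here is a rough sketch comparing the stars-and-bars count for bosons with the classical (corrected Boltzmann) count. The function names and the example numbers are mine, chosen only for illustration.

```python
from math import comb, factorial

def boson_microstates(n_particles, n_states):
    # Stars and bars: ways to place indistinguishable particles into states
    # when any number of particles may share a state.
    return comb(n_particles + n_states - 1, n_particles)

def boltzmann_microstates(n_particles, n_states):
    # Classical count: treat particles as distinguishable, then divide by n!
    # to correct for permutations (accurate only when states are rarely shared).
    return n_states ** n_particles / factorial(n_particles)

def count_ratio(n_particles, n_states):
    # Ratio of the bosonic count to the classical count.
    return boson_microstates(n_particles, n_states) / boltzmann_microstates(n_particles, n_states)

if __name__ == "__main__":
    print(count_ratio(3, 10**6))  # many more states than particles
    print(count_ratio(3, 3))      # states are crowded
```

When there are far more states than particles, the ratio is essentially 1 and the classical count is fine. When the states are crowded, the two counts disagree substantially, which is the regime where the classical assumption stops working for bosons.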
It is not true that information theory requires the conservation of information. The Ising Model, for example, allows for particle systems with cycles of non-unity gain. This effectively means that it allows particles to act as amplifiers (or dampeners) of information, which is a clear violation of information conservation. This is the basis of critical phenomena, which is a widely accepted area of study within statistical mechanics.
I think you misunderstand how models are fit in practice. It is not standard practice to determine the absolute information content of input, then to relay that information to various explanators. The information content of input is determined relative to explanators. However, there are training methods that attempt to reduce the relative information transferred to explanators, and this practice is called regularization. The penalty-per-relative-bit approach is taken by a method called “dropout”, where a random “cold” model is trained on each training sample, and the final model is a “heated” aggregate of the cold models. “Heating” here just means cutting the amount of information transferred from input to explanator by some fraction.
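Here is a toy sketch of that aggregation step, assuming a purely linear model so that the average over all the “cold” sub-models can be computed exactly. The function names and the keep probability are hypothetical, and for a real nonlinear network the scaling is only an approximation to the average.

```python
from itertools import product

def masked_output(weights, inputs, mask):
    # One "cold" model: inputs with mask value 0 are dropped entirely.
    return sum(w * x * m for w, x, m in zip(weights, inputs, mask))

def average_over_masks(weights, inputs, keep_prob):
    # Exact expectation of the output over every possible dropout mask,
    # weighting each mask by its probability.
    total = 0.0
    for mask in product((0, 1), repeat=len(weights)):
        prob = 1.0
        for m in mask:
            prob *= keep_prob if m else 1.0 - keep_prob
        total += prob * masked_output(weights, inputs, mask)
    return total

def scaled_output(weights, inputs, keep_prob):
    # The "heated" aggregate: keep every input but scale its contribution
    # by the keep probability.
    return keep_prob * sum(w * x for w, x in zip(weights, inputs))

if __name__ == "__main__":
    w, x = [0.5, -1.0, 2.0], [1.0, 2.0, 3.0]
    print(average_over_masks(w, x, 0.8), scaled_output(w, x, 0.8))
```

For this linear case the two numbers agree, which is why scaling by a fraction can stand in for averaging over the sub-models.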
How is inverse temperature a penalty on models? If you’re referring to the inverse temperature in the Maxwell-Boltzmann distribution, the temperature is considered a constant, and it gives the likelihood of a particle having a particular configuration, not the likelihood of a distribution.
Also, I’m not sure it’s clear what you mean by “information to specify [a model]”. Does a high inverse temperature mean a model requires more information, because it’s more sensitive to small changes and therefore derives more information from them, or does it mean that the model requires less information, because it derives less information from inputs?
I think the entropy of the Maxwell-Boltzmann distribution is linear in log-temperature (up to an additive constant), so high temperature (low sensitivity to inputs) is preferred if you go strictly by that. People who train neural networks generally do this as well to prevent overtraining, and they call it regularization.
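For reference, this is the calculation I have in mind, as a sketch assuming the velocity form of the distribution, where each velocity component is Gaussian with variance k_B T / m. The symbols m (particle mass), k_B (Boltzmann constant) and h (differential entropy) are my notation.

```latex
f(\mathbf{v}) = \left(\frac{m}{2\pi k_B T}\right)^{3/2}
                \exp\!\left(-\frac{m\,|\mathbf{v}|^2}{2 k_B T}\right)

h[f] = -\int f(\mathbf{v}) \ln f(\mathbf{v}) \, d^3v
     = \frac{3}{2} \ln\!\left(\frac{2\pi e\, k_B T}{m}\right)
     = \frac{3}{2} \ln T + \text{const.}
```

So the entropy increases with temperature, and it does so logarithmically.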
If you are referring to the entropy of a model, you penalize a distribution for requiring more information by selecting the distribution that maximizes entropy subject to whatever invariants your model must abide by. This is typically done through the method of Lagrange multipliers.
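As a sketch of how that works, assume the only invariants are normalization and a fixed mean energy U over discrete states with energies E_i (my choice of example; units with k_B = 1). The multipliers alpha and beta are introduced here for illustration.

```latex
\mathcal{L} = -\sum_i p_i \ln p_i
              - \alpha \left( \sum_i p_i - 1 \right)
              - \beta \left( \sum_i p_i E_i - U \right)

\frac{\partial \mathcal{L}}{\partial p_i}
  = -\ln p_i - 1 - \alpha - \beta E_i = 0
\quad\Longrightarrow\quad
p_i = \frac{e^{-\beta E_i}}{Z},
\qquad Z = \sum_j e^{-\beta E_j}
```

Under these assumptions the inverse temperature beta shows up as the Lagrange multiplier on the energy constraint. It is fixed by the constraint rather than acting as a penalty chosen separately for each model.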