log is a monotonically increasing function, so how does this differ from choosing parameters to maximize the probability?
It doesn’t, modulo the practical concerns that ofer brings up below. Also, the math is often nicer in log space, since you get a sum over log probabilities of data points instead of a product over probabilities of data points. But yes, formally they are equivalent.
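A quick numerical sanity check of that equivalence (a toy coin-flip example; the numbers are made up for illustration):

```python
import numpy as np

# Estimate a coin's bias from 7 heads in 10 flips by grid search.
heads, flips = 7, 10
grid = np.linspace(0.001, 0.999, 999)  # candidate bias parameters

likelihood = grid**heads * (1 - grid)**(flips - heads)
log_likelihood = heads * np.log(grid) + (flips - heads) * np.log(1 - grid)

# Because log is monotonically increasing, both objectives peak at the
# same parameter (~0.7, up to grid resolution).
assert likelihood.argmax() == log_likelihood.argmax()
print(grid[likelihood.argmax()])  # ~0.7
```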
If you have lots of training data, the probability that the model assigns to the training data is vanishingly small. You can’t represent such small numbers with the commonly used floating point types in Python/Java/etc.
It’s more practical to compute the log probability of the training data (by summing the log probabilities assigned to the training examples rather than multiplying the original probabilities).
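For concreteness, here’s a minimal sketch of the underflow problem (made-up per-example probabilities, nothing model-specific):

```python
import numpy as np

rng = np.random.default_rng(0)
# Pretend these are the probabilities a model assigns to 1,000 training
# examples, each somewhere around 0.1.
probs = rng.uniform(0.05, 0.15, size=1000)

# Their product is on the order of 1e-1000, below the smallest positive
# float64 (~5e-324), so it underflows to exactly 0.0.
print(np.prod(probs))       # 0.0

# The sum of log probabilities is a perfectly ordinary float.
print(np.log(probs).sum())  # roughly -2300
```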
I don’t think this is relevant here, but there are theoretical uses for maximizing expected log probability, and maximizing expected log probability is not the same as maximizing expected probability: the log does not commute with the expectation (by Jensen’s inequality, E[log p] ≤ log E[p], since log is concave), so the two objectives can rank models differently.
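A concrete illustration of that difference (hypothetical numbers, chosen to make the gap visible):

```python
import math

# Two hypothetical predictors; a data point is equally likely to be
# "easy" or "hard", and each entry is the probability the predictor
# assigns to the true outcome in that case.
steady = [0.5, 0.5]   # always assigns 0.5
gambler = [0.9, 0.1]  # confident when easy, poor when hard

for name, p in [("steady", steady), ("gambler", gambler)]:
    expected_prob = sum(p) / 2
    expected_log_prob = sum(math.log(x) for x in p) / 2
    print(f"{name}: E[p] = {expected_prob:.2f}, E[log p] = {expected_log_prob:.2f}")

# Both predictors have the same expected probability (0.5), but the
# steady one has the higher expected log probability (-0.69 vs -1.20):
# the log and the expectation don't commute, so the two objectives can
# disagree about which model is best.
```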