Could you expand on why you think that information / entropy doesn’t match what you mean by “amount of optimization done”?
E.g. suppose you’re training a neural network via gradient descent. If you start with weights drawn from some broad distribution, after training they will end up in some narrower distribution. The reduction in entropy from the initial weight distribution to the final one seems like a good metric of “amount of optimization done to the neural net.”
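As a sanity check on that metric, here’s a minimal sketch. Everything in it is illustrative and made up for the example: a one-parameter “network,” a quadratic loss, and a Gaussian fit to estimate differential entropy. It shows the entropy of an ensemble of weights shrinking under gradient descent:

```python
import math, random

random.seed(0)

def entropy_gaussian(samples):
    # Differential entropy of a Gaussian fit to the samples, in nats:
    # H = 0.5 * ln(2 * pi * e * var)
    mean = sum(samples) / len(samples)
    var = sum((x - mean) ** 2 for x in samples) / len(samples)
    return 0.5 * math.log(2 * math.pi * math.e * var)

# An ensemble of toy "networks", each a single weight, with loss L(w) = (w - 3)^2.
weights = [random.uniform(-10.0, 10.0) for _ in range(1000)]
before = entropy_gaussian(weights)

# Gradient descent: dL/dw = 2 * (w - 3), step size 0.1, 50 steps.
# Each step contracts (w - 3) by a factor of 0.8, narrowing the ensemble.
for _ in range(50):
    weights = [w - 0.1 * 2.0 * (w - 3.0) for w in weights]
after = entropy_gaussian(weights)

# "Amount of optimization done", on this metric: entropy reduction in bits.
bits = (before - after) / math.log(2)
print(f"entropy before: {before:.2f} nats, after: {after:.2f} nats")
print(f"optimization done: {bits:.1f} bits")
```

Since each step contracts the ensemble by a constant factor here, the bits accrue linearly in the number of steps, which is one way this metric behaves sensibly.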
I think there are two categories of reasons why you might not be satisfied: false positives and false negatives. A false positive would be “I don’t think much optimization has been done, but the distribution got a lot narrower,” and a false negative would be “I think more optimization is happening, but the distribution isn’t getting any narrower.” Did you have a specific instance of one of these cases in mind?
A couple of points.

First, the reason I wasn’t happy with entropy as a metric is that it doesn’t allow (straightforward) comparison of different types of optimization, as I discussed. The entropy of a probability distribution over outputs isn’t comparable to the entropy over states that Eliezer defines, for example.
Second, I’m not sure false positives and false negatives are the right conceptual tools here. I can easily show examples of each: gradient descent can fail badly in many ways, and a lucky choice of starting parameters on a particular loss landscape can lead to unreasonably rapid convergence. But in both cases, what’s doing the work is the relationship between the algorithm and the space being optimized.
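To illustrate that “relationship between algorithm and space” point, here’s a toy sketch (the loss f(w) = 1 − exp(−w²) and all the numbers are chosen purely for illustration): the same gradient descent procedure, with the same hyperparameters, either converges rapidly or stalls depending entirely on where it starts:

```python
import math

def grad(w):
    # Gradient of f(w) = 1 - exp(-w**2): sizable near |w| ~ 1,
    # vanishingly small out on the flat plateau far from 0.
    return 2.0 * w * math.exp(-w * w)

def descend(w, lr=0.5, steps=200):
    # Plain gradient descent with fixed step size.
    for _ in range(steps):
        w -= lr * grad(w)
    return w

lucky = descend(1.0)    # starts near the basin: converges toward the optimum at 0
unlucky = descend(6.0)  # starts on the plateau: gradient ~ 1e-15, barely moves
print(lucky, unlucky)
```

Neither run tells you much about gradient descent in isolation; the outcome is a fact about the algorithm paired with this particular landscape.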
I think Eliezer’s definition still basically makes sense as a measure of optimization power, but the model of optimization which inspired it (basically, optimization-as-random-search) doesn’t make sense.
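For reference, my reading of Eliezer’s measure is: optimization power in bits is −log2 of the fraction of states ranked at least as good as the achieved outcome, i.e. how far into the tail of the preference ordering the optimizer landed relative to the random-search baseline. A toy sketch, with a made-up search space of random scores and a best-of-n sampler as the “optimizer”:

```python
import math, random

random.seed(0)

# A toy search space: 2**18 states with random "goodness" scores.
N = 2 ** 18
scores = [random.random() for _ in range(N)]

# A weak optimizer: evaluate 1024 random states, keep the best one seen.
best = max(scores[random.randrange(N)] for _ in range(1024))

# Eliezer-style optimization power, relative to the random-search baseline:
# -log2(fraction of states at least as good as the achieved outcome).
k = sum(1 for s in scores if s >= best)
power_bits = math.log2(N / k)
print(f"outcome beats all but {k} of {N} states: {power_bits:.1f} bits")
```

A best-of-1024 sampler should land in roughly the top 1/1000 of states, i.e. about 10 bits; the measure is well-defined here even though nothing random-search-like is required of the optimizer itself.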
That said, I would very likely have a better way of measuring optimization power if I better understood what is really going on.
I think I agree that Eliezer’s definition is theoretically correct, and I definitely agree that I need to understand this better.