That’s part of the disagreement. In the past you clearly thought that Occam’s razor was an “obvious” constraint that might work. Possibly you thought it was the only such constraint. Then you found this result and made a large update in the other direction. That’s why you say the result is big: rejecting a constraint that you already didn’t expect to work wouldn’t feel very significant.
On the other hand, I don’t think that Occam’s razor is the only such constraint. So when I see you reject it, I naturally ask “what about all the other obvious constraints that might work?”. To me this result reads like “0 didn’t solve our equation, therefore the solution must be very hard”. I’m sure that you have strong arguments against many other approaches, but I haven’t seen them, and I don’t think the one in the OP generalizes well.
> I’d need to see these constraints explicitly formulated before I had any confidence in them.
This is a bit awkward. I’m sure that I’m not proposing anything that you haven’t already considered. And even if you show that this approach is wrong, I’d just try to put a band-aid on it. But here is an attempt:
First we’d need a data set of human behavior with both positive and negative examples (e.g. “I made a sandwich”, “I didn’t stab myself”, etc.). So it would be a set of tuples (s, a, y): a state s, an action a, and a label y that is +1 for positive examples and −1 for negative ones. This is not trivial to generate, especially since it’s not clear how to pick negative examples, but here too I expect that the obvious solutions are all fine. By the way, I have no idea how the examples would be formalized; that seems like a problem, but it’s not unique to this approach, so I’ll assume it’s solved.
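To make that concrete, here is a minimal sketch of such a data set, assuming states and actions can be encoded as feature vectors; the vectors themselves are hypothetical placeholders, since the formalization question above is being waved away:

```python
import numpy as np

# Each example is a tuple (s, a, y): state s, action a, and a label
# y = +1 for observed human behavior, y = -1 for a constructed
# negative example. The feature vectors are placeholders standing in
# for whatever formalization of states and actions we assume solved.
dataset = [
    (np.array([0.2, 1.0]), np.array([1.0, 0.0]), +1),  # "I made a sandwich"
    (np.array([0.2, 1.0]), np.array([0.0, 1.0]), -1),  # "I stabbed myself"
]
```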
Next, given a pair (p, R), where R is a reward function and p maps it to a policy p(R), we would score the pair by adding up the following:
1. p(R) should accurately predict human behavior. So we count the examples where p(R)(s) = a for positive cases and where p(R)(s) ≠ a for negative cases.
2. R should also predict human behavior. So we sum R(s, a) over the positive examples and subtract the same sum over the negative examples.
3. Regularization for p.
4. Regularization for R.
Here we are concerned about overfitting R and care less about p, so terms 1 and 4 would get large weights, and terms 2 and 3 smaller ones.
Finally we throw machine learning at the problem, searching for the pair (p, R) that maximizes this score.
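A minimal sketch of that score, assuming the toy data set format from above; the weights w1–w4 and the regularizers reg_p, reg_R are hypothetical stand-ins, since I haven’t pinned down what they should actually be:

```python
import numpy as np

def score(p, R, dataset, w1=1.0, w2=0.1, w3=0.1, w4=1.0,
          reg_p=lambda p: 0.0, reg_R=lambda R: 0.0):
    """Score a (p, R) pair against the behavior data set.

    p      maps a reward function R to a policy p(R), which in turn
           maps states to actions.
    R      maps a (state, action) pair to a scalar reward.
    reg_p, reg_R are penalty functions (e.g. an L2 norm over
           parameters, or a description-length proxy); stand-ins here.
    """
    policy = p(R)
    # Term 1: count the examples the policy gets right, i.e.
    # p(R)(s) = a on positive cases and p(R)(s) != a on negative ones.
    term1 = sum(
        1 for s, a, y in dataset
        if (y > 0) == np.array_equal(policy(s), a)
    )
    # Term 2: sum of R(s, a) over positive examples minus the same
    # sum over negative examples, folded into one signed sum.
    term2 = sum(y * R(s, a) for s, a, y in dataset)
    # Terms 3 and 4: regularization enters as a penalty.
    term3 = -reg_p(p)
    term4 = -reg_R(R)
    # Terms 1 and 4 get the large weights (predicting behavior and not
    # overfitting R); terms 2 and 3 get the smaller ones.
    return w1 * term1 + w2 * term2 + w3 * term3 + w4 * term4


# Illustration only, reusing the toy data set from above:
dataset = [
    (np.array([0.2, 1.0]), np.array([1.0, 0.0]), +1),
    (np.array([0.2, 1.0]), np.array([0.0, 1.0]), -1),
]
R = lambda s, a: float(a[0])                     # likes "sandwich-like" actions
p = lambda R_: (lambda s: np.array([1.0, 0.0]))  # always "make a sandwich"
print(score(p, R, dataset))
```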