First-order attempt:
I used scikit-learn to build several random-forest regressors mapping attributes + house to Ofspev rating, and verified that early on the Helm tended to allocate students to the house for which the regressor predicted the best rating, and that at the end it didn’t. Then I Sorted … excuse me, Allocated … the students to the houses for which the regressors predicted the best rating. In cases where they disagreed I tried to eyeball the distributions and use my judgement :-).
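Here's roughly the shape of the thing, as a minimal sketch rather than my actual code; the DataFrame layout, the 'House' and 'Ofspev' column names, and the hyperparameters are guesses on my part:

```python
# Minimal sketch of the setup (not the actual code): `past` is assumed to be a
# pandas DataFrame of previous students with the five attribute columns plus
# 'House' and 'Ofspev' columns (names guessed).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

ATTRS = ["Courage", "Integrity", "Intellect", "Patience", "Reflexes"]
HOUSES = ["Serpentyne", "Humblescrumble", "Dragonslayer", "Thought-Talon"]

def fit_regressor(past, seed):
    # One regressor covers all four houses; the house is one-hot encoded.
    X = pd.get_dummies(past[ATTRS + ["House"]], columns=["House"])
    y = past["Ofspev"]
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=seed)
    model = RandomForestRegressor(n_estimators=200, random_state=seed).fit(X_tr, y_tr)
    return model, X.columns, model.score(X_val, y_val)

def predict_by_house(model, columns, student_attrs):
    # Predicted rating for one student in each of the four houses;
    # allocate to the argmax, eyeballing the cases where regressors disagree.
    rows = [{**student_attrs, "House": h} for h in HOUSES]
    X_new = (pd.get_dummies(pd.DataFrame(rows), columns=["House"])
             .reindex(columns=columns, fill_value=0))
    return dict(zip(HOUSES, model.predict(X_new)))
```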
Resulting allocation:
Serpentyne gets C, F, K. Humblescrumble gets E, I, L, M, P, Q, R, T. Dragonslayer gets D, G, H, N. Thought-Talon gets A, B, J, O, S.
Most of these results are pretty clear-cut in that every prediction for the winning house was better than any prediction for any other house. Notable exceptions were E (who might do well in Thought-Talon or maybe in Dragonslayer; certainly not Serpentyne, though) and Q (for whom all houses gave rather similar predictions).
With these allocations I cautiously predict the following Ofspev ratings: A 36..39, B 15..18, C 28..30, D 17..19, E 17..19, F 34..40, G 21..23, H 15..19, I 12..14, J 27..30, K 23..26, L 22..26, M 24..29, O 18..23, P 22..25, Q 30..32, R 40..44, S 29..32, T 28..34. These intervals are probably too narrow: they only reflect the spread among the 16 regressors I used, and the regressors’ overall prediction errors are wider than those ranges.
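For concreteness, the intervals come from something like this, reusing the hypothetical helpers sketched above: the min and max over the 16 regressors’ predictions, which is narrower than a real predictive interval.

```python
def prediction_interval(past, student_attrs, house, n_models=16):
    # Spread of the 16 differently seeded/split regressors' predictions;
    # this understates the true predictive uncertainty, as noted above.
    preds = []
    for seed in range(n_models):
        model, cols, _ = fit_regressor(past, seed)
        preds.append(predict_by_house(model, cols, student_attrs)[house])
    return min(preds), max(preds)
```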
Possible reasons why this might suck:
Early on we only get information about how students perform in the house to which a skilled Helm allocated them. This means that we have less information about how they would have performed in other houses.
I am assuming that our only goal is to make our students as impactful as possible. It may be that the Helm actually has other aims (e.g., to make them happy or psychologically well-adjusted), in which case we should be trying to reproduce early-Helm allocations instead.
I just used an out-of-the-box regressor that seemed plausible. I haven’t tried to tune its parameters.
I have made a cursory attempt to understand what my black boxes are doing
by feeding in all 2^5 = 32 attribute-vectors in which each attribute is either 10 (low) or 40 (high) and seeing what the predictions for each house look like. Crudely, it seems as if: students do well in Serpentyne when they have high Intellect and either Reflexes or Patience; in Humblescrumble when they have high Intellect and Integrity, with Patience serving as a partial stand-in for either; in Dragonslayer when they have high Courage and Reflexes; in Thought-Talon when they have high Intellect and Patience. Students high in all five attributes do exceptionally well in Dragonslayer and Thought-Talon; for Thought-Talon, but not for Dragonslayer, it’s almost as good to be high in everything except Reflexes.
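The probe itself is just this sort of loop (again reusing the hypothetical helpers above):

```python
from itertools import product
import pandas as pd

def probe(model, columns):
    # All 2^5 = 32 corners of the attribute cube (10 = low, 40 = high),
    # with the predicted rating for each house alongside.
    rows = []
    for values in product([10, 40], repeat=len(ATTRS)):
        attrs = dict(zip(ATTRS, values))
        rows.append({**attrs, **predict_by_house(model, columns, attrs)})
    return pd.DataFrame(rows)
```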
I think I could quantify those observations (and maybe a few second-order effects I didn’t mention explicitly) and get an explicit model that would serve the Helm pretty well in practice, though I doubt it would outperform the brute-force random forests.
Further noodling around with ad hoc models suggests that
in at least some cases, some of the students’ attributes are best thought of as having limits such that increases above the limits make no difference. Specifically, I played around a little with Serpentyne and it seems that we probably want to look at min(40,Intellect) and min(65,Reflexes) rather than using those values unaltered. The limits might well be different for different houses (analogy: intelligence is probably an advantage both for theoretical physicists and for taxi drivers, but most likely being 1-in-a-million smart rather than “just” 1-in-a-thousand is more beneficial for the theoretical physicists); so far this is just the result of idly looking at one particular house.
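As a feature transform for a Serpentyne-only model it would look something like this; the thresholds are just the ones I eyeballed and could easily be off:

```python
import numpy as np

def cap_serpentyne_features(df):
    # Eyeballed caps for a Serpentyne-only model; the thresholds are rough
    # guesses and probably differ between houses.
    out = df.copy()
    out["Intellect"] = np.minimum(out["Intellect"], 40)
    out["Reflexes"] = np.minimum(out["Reflexes"], 65)
    return out
```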
Continuing with the principle “when in doubt, use brute force”,
I did the same thing with gradient-boosted trees; these had somewhat more prediction error on each validation set (oh, I forgot to mention that each regressor was trained on 90% of the data and evaluated on the remaining 10%). And again with SVMs using radial-basis-function kernels; these were comparable in accuracy to the random forests. (Note: there’s much less diversity in my ensemble of SVMs, because the only difference between them is the training/validation split, whereas for RFs and GBTs there is randomness in the fitting process itself.)
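Sketch of how the other two families slot into the same harness (hyperparameters are scikit-learn defaults, and the RBF SVM gets its inputs standardised; ATTRS and HOUSES as in the first sketch):

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
import pandas as pd

def fit_family(past, seed, family):
    # Same 90/10 split and one-hot house encoding for every model family.
    X = pd.get_dummies(past[ATTRS + ["House"]], columns=["House"])
    y = past["Ofspev"]
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, random_state=seed)
    if family == "rf":
        model = RandomForestRegressor(n_estimators=200, random_state=seed)
    elif family == "gbt":
        model = GradientBoostingRegressor(random_state=seed)
    else:  # "svm": only the split varies with the seed, hence the less diverse ensemble
        model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
    model.fit(X_tr, y_tr)
    return model, X.columns, model.score(X_val, y_val)
```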
Did this make a difference to my predictions or suggestions?
Not much; usually all three agreed, and where they didn’t, usually the SVM agreed with the RF. However, the SVM regressors fairly confidently want to put K in Dragonslayer, and the GBT ones less confidently agree. On the other hand, the SVMs and GBTs predict less loss from putting K in the RFs’ suggestion of Serpentyne than the RFs predict from putting K in Dragonslayer, so it’s a tough call; I’ll switch to putting K in Dragonslayer. And for Q, the RFs and GBTs are fairly indifferent between all four houses, slightly preferring Humblescrumble (for the RFs) and Thought-Talon (for the GBTs), but the SVMs think Dragonslayer and Thought-Talon are much better than the other two and give the nod to Dragonslayer. Looking at all their numbers, I’ll move Q into Dragonslayer.
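The kind of comparison I’m eyeballing for K and Q looks roughly like this: each family’s mean predicted rating per house, and how much each family thinks you’d lose by not picking its favourite (built on the hypothetical helpers above).

```python
def compare_families(past, student_attrs, n_models=16):
    # For a single student (K or Q, say): each family's mean predicted rating
    # per house, and the loss each family expects from any non-favourite house.
    table = {}
    for family in ("rf", "gbt", "svm"):
        totals = {h: 0.0 for h in HOUSES}
        for seed in range(n_models):
            model, cols, _ = fit_family(past, seed, family)
            for house, pred in predict_by_house(model, cols, student_attrs).items():
                totals[house] += pred
        means = {h: t / n_models for h, t in totals.items()}
        best = max(means, key=means.get)
        table[family] = {"best": best,
                         "loss": {h: means[best] - m for h, m in means.items()}}
    return table
```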
So my revised allocations are:
Serpentyne gets C, F. Humblescrumble gets E, I, L, M, P, R, T. Dragonslayer gets D, G, H, K, N, Q. Thought-Talon gets A, B, J, O, S.
And my revised predicted ratings with these allocations are:
A 36..39, B 15..18, C 28..30, D 17..20, E 17..19, F 33..40, G 21..24, H 15..20, I 12..15, J 24..30, K 23..26, L 21..26, M 24..29, N 19..21, O 18..23, P 22..25, Q 27..34, R 40..44, S 29..32, T 28..34.
It occurs to me that there is a possible source of bias in the approach I am taking:
perhaps not everyone gets into Swineboils, so that e.g. if we looked for correlations between the attributes we would get spurious negative correlations because people who are bad at everything don’t get in. If such effects are strong, then our model will have bias if we apply it to the population at large. That’s OK because we aren’t applying it to the population at large, we’re applying it to the students who got in … but if e.g. Swineboils is less selective than it used to be because there are way fewer applications, then the bias will distort our predictions for this year’s students. (This is the phenomenon sometimes called “Berkson’s paradox”.)
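A toy illustration of the effect, with entirely made-up numbers: two independent attributes acquire a negative correlation once you condition on admission.

```python
import numpy as np

# Entirely made-up attribute distributions and admission rule, just to show the
# shape of the effect: zero correlation in the population, negative among admits.
rng = np.random.default_rng(0)
intellect = rng.normal(50, 15, 100_000)
courage = rng.normal(50, 15, 100_000)
admitted = (intellect + courage) > 110

print(round(np.corrcoef(intellect, courage)[0, 1], 3))                      # ~ 0
print(round(np.corrcoef(intellect[admitted], courage[admitted])[0, 1], 3))  # clearly negative
```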
I don’t expect this bias to be very large, but I have made no attempt to check that expectation against reality. Still less have I made any attempt to correct for it, and probably I won’t.