What I’ve found is that you can just turn up the kernel width until the decision boundaries look monotonic, and you can probably do so without losing much in-sample safety.
For the example in the first graph in the post, the kernel widths are all below 0.32.[1]
If you turn the kernel widths up to 0.8, you get a smooth and monotonic decision boundary.
In this case, the in-sample safety decreases by a tiny amount, from 93.79% to 93.19%. (If this were a real deployment, I would definitely prefer the kernel width = 0.8 boundary to the original one, but I picked Silverman’s rule for convenience and to avoid arguing over the kernel width.)
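For intuition, here's a minimal sketch of the bandwidth comparison (not our actual pipeline) using scipy's `gaussian_kde` on made-up 2-D data. Note that a scalar `bw_method` in scipy is a factor multiplying the data's spread, not an absolute width, and the "in-sample agreement" printed here is just a stand-in for the post's in-sample safety metric:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

# Made-up 2-D monitor scores, one cloud per class (purely illustrative).
aligned = rng.normal(loc=[-1.0, -1.0], scale=0.7, size=(400, 2))
misaligned = rng.normal(loc=[1.0, 1.0], scale=0.7, size=(60, 2))

X = np.vstack([aligned, misaligned])
y = np.concatenate([np.zeros(len(aligned), bool), np.ones(len(misaligned), bool)])

def flag_misaligned(points, bw):
    """Flag a point when the misaligned-class KDE density wins."""
    # gaussian_kde expects data of shape (d, n). A scalar bw_method is a
    # multiplicative factor on the data's covariance, not an absolute width.
    kde_a = gaussian_kde(aligned.T, bw_method=bw)
    kde_m = gaussian_kde(misaligned.T, bw_method=bw)
    return kde_m(points.T) > kde_a(points.T)

for bw in ["silverman", 0.8]:
    agreement = (flag_misaligned(X, bw) == y).mean()
    print(f"bw={bw}: in-sample agreement = {agreement:.2%}")
```

Widening the bandwidth smooths both class densities, so the boundary where one density overtakes the other loses its wiggles, at a small cost in how tightly it hugs the training points.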
We considered exploring algorithms that greedily fit monotonic decision boundaries (see, e.g., Figure 1 from this paper).
I don’t remember super well how that went, but I think we avoided it because:
- The kernel method is pretty good already.
- We might get some crazy staircase boundary, which we might want to smooth out/regularize anyway, so might as well use kernels with wide bandwidths to begin with (see the sketch after this list for why monotone fits tend to come out as staircases).
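To illustrate the staircase point (this isn't the greedy algorithm from the paper, just the textbook monotone fit): isotonic regression on a hypothetical 1-D version of the problem produces a piecewise-constant estimate, which is exactly the kind of boundary we'd then have to smooth:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)

# Hypothetical 1-D stand-in: a monitor score x with P(misaligned) rising in x.
x = rng.uniform(0, 1, 200)
p_true = 1 / (1 + np.exp(-10 * (x - 0.5)))    # smooth ground truth
y = (rng.random(200) < p_true).astype(float)  # noisy 0/1 labels

# Isotonic regression is the standard monotone fit; its output is
# piecewise constant between training points, i.e. a staircase.
iso = IsotonicRegression(out_of_bounds="clip").fit(x, y)
print(np.round(iso.predict(np.linspace(0, 1, 11)), 2))
```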
[1] Different kernels had different widths. For example, the 2D bandwidth for misaligned outputs is 0.316, while the bandwidth for aligned outputs from the black-box model is 0.22.
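(For reference, the Silverman factor scipy uses in d dimensions is (n(d+2)/4)^(−1/(d+4)), scaled by each class's own spread, which is why the two classes end up with different widths. The sample sizes and spreads below are made up, not the actual data behind 0.316 and 0.22:)

```python
def silverman_factor(n, d):
    # Silverman's rule-of-thumb factor, as used by scipy.stats.gaussian_kde.
    return (n * (d + 2) / 4.0) ** (-1.0 / (d + 4))

# Hypothetical per-class sample sizes and spreads (NOT the real data behind
# 0.316 and 0.22): effective width = factor * each class's own std.
for name, n, std in [("misaligned", 60, 0.6), ("aligned", 400, 0.5)]:
    print(name, round(silverman_factor(n, d=2) * std, 3))
```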