You’ve dropped a factor of $\alpha$ about half-way through your calculation. And then you’ve multiplied by $X^{-1}$ between two lines separated by “$=$”; the idea is that both sides are zero, so it kinda-sorta makes sense, but it’s super-misleading. If you restore the factor of $\alpha$ then your last equation ends up as $h = Y(X + \alpha X^{-1})^{-1}$.
But even this is wrong, I’m afraid. You can’t multiply by $X^{-1}$ there at all. There is no $X^{-1}$: $X$ is not (except by coincidence, and in an ML application if this coincidence happens then you don’t have anything like enough data) a square matrix, and in general it has no inverse.
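To see that concretely, here’s a throwaway NumPy sketch (the shape is made up, not from your actual problem): asking for the inverse of a non-square $X$ simply fails.

```python
import numpy as np

# typical supervised-learning shape: many more samples than features
X = np.zeros((20, 5))

try:
    np.linalg.inv(X)
except np.linalg.LinAlgError as e:
    # inv is only defined for square matrices
    print("no inverse:", e)
```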
There are problems earlier in the derivation, too, which I think are encouraged by some of your nonstandard notation. E.g., you write $hx_t$ rather than $h \cdot x_t$ or $x_t^T h$, and this has fooled you into writing down something wrong for what you write as $\frac{d}{dh}E$. That’s also nonstandard notation; it’s defensible, but again it makes it easy to get things wrong by mixing up left and right multiplications. Let’s do it with more standard and explicit notation, which will make it harder to make mistakes:
$$0 = \frac{\partial}{\partial h_j}(E+C) = \frac{\partial}{\partial h_j}\left[(Xh-Y)^T(Xh-Y) + \alpha h^T h\right] = \frac{\partial}{\partial h_j}\left[h^T X^T X h - h^T X^T Y - Y^T X h + Y^T Y + \alpha h^T h\right]$$
The $Y^T Y$ is constant and its derivative is zero. The terms linear in $h$ are one another’s transposes and readily yield $-2(X^T Y)_j$. The second quadratic term is just $\alpha \sum_k h_k^2$, whose $\frac{\partial}{\partial h_j}$ is $2\alpha h_j$. The first quadratic term is similarly $\sum_k (Xh)_k^2$, which equals $\sum_k \left(\sum_l X_{kl} h_l\right)^2$, whose $\frac{\partial}{\partial h_j}$ is $$\sum_k 2\left(\sum_l X_{kl} h_l\right) \frac{\partial}{\partial h_j}\left(\sum_l X_{kl} h_l\right) = \sum_k 2\left(\sum_l X_{kl} h_l\right) X_{kj} = \left(2 X^T X h\right)_j.$$
So what ends up being zero is the $j$th component of $2X^T(Xh-Y) + 2\alpha h$, and if you like you can write $\frac{d}{dh}(E+C) = 2\left[X^T(Xh-Y) + \alpha h\right]$. But again you need to be very clear about what you mean by that; $\frac{d}{dh}f(h)$ means “the $A$ such that, to first order, $f(h) = \text{const} + Ah$”, and so actually the Right Thing to use for the “derivative” is the transpose of what I wrote down above.
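You can sanity-check that gradient numerically; here’s a quick NumPy sketch (the shapes, seed, and $\alpha$ are made up for illustration) comparing $2[X^T(Xh-Y)+\alpha h]$ against central finite differences of $E+C$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 5                      # more rows than columns, so X is not square
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)
h = rng.normal(size=d)
alpha = 0.7

def objective(h):
    # E + C = ||Xh - Y||^2 + alpha * ||h||^2
    r = X @ h - Y
    return r @ r + alpha * (h @ h)

# the gradient derived above
grad = 2 * (X.T @ (X @ h - Y) + alpha * h)

# central finite differences, one coordinate at a time
eps = 1e-6
fd = np.array([
    (objective(h + eps * e) - objective(h - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
print(np.max(np.abs(grad - fd)))  # tiny: the formula matches
```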
Finishing off the correct derivation, we have
$$X^T(Xh-Y) + \alpha h = 0 \quad\text{so}\quad (X^T X + \alpha I)h = X^T Y \quad\text{so}\quad h = (X^T X + \alpha I)^{-1} X^T Y.$$
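And checking that closed form numerically (again just a NumPy sketch with made-up data; note that the $\alpha$ term is $\alpha I$, and that in practice you’d use a linear solve rather than an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 30, 4
X = rng.normal(size=(n, d))
Y = rng.normal(size=n)
alpha = 0.5

# h = (X^T X + alpha I)^{-1} X^T Y, via a solve instead of an explicit inverse
h = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

# the gradient X^T (Xh - Y) + alpha h should vanish at this h
grad = X.T @ (X @ h - Y) + alpha * h
print(np.max(np.abs(grad)))  # numerically zero
```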