I didn’t really understand how you’re computing things, I hope the details of the simple example get filled out.
Like, when you say “2 layer model,” do you mean 1 hidden layer (so two weight matrices)? And when you say you trained a bunch of single neuron models, you mean that each had a single hidden neuron (with that neuron having the same number of inputs and outputs as the neurons in the original model)? And you trained the single-neuron models to predict the difference between ground truth and the original network’s output? Wow, it’s surprising that the distribution is the same! And then when you say you combined the single neuron models, did you just sum the outputs? Wow, it’s surprising that this undoes the subtraction of the original network’s outputs!
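For concreteness, here is the picture I have in my head, sketched in NumPy. Everything here is my guess at the setup, not taken from the post: I'm assuming a ReLU MLP with one hidden layer (two weight matrices) and no biases, and that "combining" single-neuron models means summing their outputs. Under those assumptions the full network is exactly the sum of its single-hidden-neuron pieces, which would at least explain why summing could undo the subtraction:

```python
import numpy as np

rng = np.random.default_rng(0)

# "2 layer model" read as: one hidden layer, i.e. two weight matrices.
d_in, d_hidden, d_out = 4, 8, 3
W1 = rng.normal(size=(d_hidden, d_in))   # input -> hidden
W2 = rng.normal(size=(d_out, d_hidden))  # hidden -> output

def full_model(x):
    # ReLU MLP, no biases (my simplifying assumption)
    return W2 @ np.maximum(W1 @ x, 0.0)

def single_neuron_model(x, j):
    # Uses only hidden neuron j, but has the same input/output
    # dimensions as the full model.
    return W2[:, j] * np.maximum(W1[j] @ x, 0.0)

x = rng.normal(size=d_in)
combined = sum(single_neuron_model(x, j) for j in range(d_hidden))

# With no biases, the sum over hidden neurons recovers the full output.
assert np.allclose(full_model(x), combined)
```

Of course, if the single-neuron models were trained independently on residuals rather than sliced out of the original weights, this identity wouldn't automatically hold, which is exactly what I find surprising.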
I await the next installment : )