Interesting idea! I hadn’t considered this approach before.
I’m not sure this would solve feature absorption, though. Consider the “Starts with E-” and “Elephant” example: if the “Elephant” latent absorbs the “Starts with E-” latent, the “Starts with E-” latent develops a hole and no longer activates on the input “elephant”. Once the absorption has happened, “Starts with E-” wouldn’t appear in the list used to calculate cumulative losses for that input anymore, so there’s nothing pushing back against it.
Matryoshka works because it forces the early-indexed latents to reconstruct the input well on their own, whether or not later latents activate. I think this pressure is what stops the later-indexed latents from stealing the job of the early-indexed ones. As a rough sketch of what I mean by that pressure (the prefix sizes here are made up, and I’m leaving out the sparsity term and any weighting between prefixes, so this is just the shape of the idea, not the actual training setup):
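```python
import torch

def matryoshka_recon_loss(x, z, W_dec, b_dec, prefix_sizes=(64, 256, 1024, 4096)):
    """Nested reconstruction loss sketch.

    x: (batch, d_model) inputs
    z: (batch, n_latents) SAE latent activations
    W_dec: (n_latents, d_model) decoder weights
    b_dec: (d_model,) decoder bias
    prefix_sizes: illustrative nesting schedule, not the real one
    """
    loss = torch.tensor(0.0, device=x.device)
    for m in prefix_sizes:
        # Reconstruct using only the first m latents: the early-indexed
        # latents must explain the input by themselves, regardless of
        # whatever the later-indexed latents are doing.
        x_hat = z[:, :m] @ W_dec[:m] + b_dec
        loss = loss + (x - x_hat).pow(2).sum(dim=-1).mean()
    return loss
```

Because every prefix has to reconstruct “elephant” on its own, an early “Starts with E-” latent that stops firing there would directly hurt the small-prefix losses, which is the pushback that seems missing in the proposal above.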