Kind of? I’d say the big differences are:
Experts are pre-wired to have a certain size, whereas components can vary in size from a tiny query-key lookup for a single fact to large modules.
IIRC, MoE networks use a gating function to decide which experts to query. If you ignored this gating and just used all the experts, I think that’d break the model. In contrast, you can use all APD components on a forward pass if you want; most of them just won’t affect the result much.
MoE experts don’t completely ignore ‘simplicity’ as we define it in the paper, though. A single expert is simpler than the whole MoE network in that it has lower rank: fewer numbers are required to describe its state on any given forward pass.
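To make the second difference concrete, here’s a minimal toy sketch (not the actual APD algorithm, and a deliberately simplified router) contrasting the two forward passes: an MoE layer routes through a top-k gate, so only k experts contribute, while APD-style components are additive pieces of the same weight matrix, so summing all of them is a perfectly well-defined forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_parts, k = 8, 4, 2  # toy sizes, chosen arbitrarily

x = rng.normal(size=d)
# Hypothetical "experts"/"components": each is just a linear map d -> d.
parts = rng.normal(size=(n_parts, d, d)) / np.sqrt(d)

# MoE-style forward pass: a router scores the experts and only the
# top-k get a nonzero gate (router weights are random here, purely
# for illustration -- a real MoE learns them).
router = rng.normal(size=(n_parts, d))
scores = router @ x
topk = np.argsort(scores)[-k:]
gate = np.zeros(n_parts)
gate[topk] = np.exp(scores[topk]) / np.exp(scores[topk]).sum()
moe_out = sum(gate[i] * (parts[i] @ x) for i in range(n_parts))

# APD-style forward pass: the components are meant to sum to the
# original weights, so using all of them at once is fine; components
# irrelevant to this input would simply contribute little.
apd_out = sum(parts[i] @ x for i in range(n_parts))
```

The point of the sketch is structural: dropping the gate and summing everything changes the MoE computation (the gate values rescale and zero out experts), whereas summing all APD components is the intended reconstruction of the original layer.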