This seems right to me. But maybe the drift in distribution mainly affects one set of parameters, while the divergence tokens affect a separate set (in early layers), such that the downstream effect persists even once the model is out of distribution.
From my understanding of this paper, lottery tickets are invariant to the optimiser, data type, and other model properties (in this experimental setting), suggesting lottery tickets encode some basic properties of the task.
It seems unlikely that lottery tickets grounded in fundamental task properties would change under continual learning without other problems (e.g. catastrophic forgetting) emerging.
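For readers unfamiliar with the procedure, here is a minimal sketch of how a lottery ticket mask can be extracted via magnitude pruning with weight rewinding, in the spirit of Frankle & Carbin (2019). The toy model, pruning fraction, and helper names (`magnitude_mask`, `apply_mask`) are my own illustrative assumptions, not the setup of the paper under discussion.

```python
# Sketch: one-shot magnitude pruning + rewind-to-init, the basic lottery
# ticket recipe. Model/prune_frac are illustrative, not from the paper.
import torch
import torch.nn as nn

def magnitude_mask(model: nn.Module, prune_frac: float = 0.8) -> dict:
    """Return {param_name: 0/1 mask} keeping the largest-magnitude weights."""
    masks = {}
    for name, p in model.named_parameters():
        if p.dim() < 2:  # skip biases/norms, as is conventional
            continue
        k = max(1, int(p.numel() * prune_frac))
        # k-th smallest magnitude serves as the pruning threshold
        threshold = p.detach().abs().flatten().kthvalue(k).values
        masks[name] = (p.detach().abs() > threshold).float()
    return masks

def apply_mask(model: nn.Module, masks: dict) -> None:
    """Zero out pruned weights in place; reapply after each optimiser step."""
    with torch.no_grad():
        for name, p in model.named_parameters():
            if name in masks:
                p.mul_(masks[name])

# Usage: train, prune, rewind to the original init, then retrain the
# sparse subnetwork ("winning ticket").
model = nn.Sequential(nn.Linear(784, 300), nn.ReLU(), nn.Linear(300, 10))
init_state = {k: v.clone() for k, v in model.state_dict().items()}
# ... train `model` on the task here ...
masks = magnitude_mask(model, prune_frac=0.8)
model.load_state_dict(init_state)  # rewind weights to initialisation
apply_mask(model, masks)           # the winning-ticket subnetwork
```

The point of the rewind step is that the mask alone (which weights survive), paired with the original initialisation, is what the "ticket" consists of; if that mask really encodes task structure, it is plausible it would stay stable across optimisers and data types as the comment above suggests.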