Is one kind of SSL more effective for a particular modality than another? E.g., is masked modeling better suited to text, and noise-based learning better suited to vision?