I can somewhat see where you’re coming from about a new method being orders of magnitude more data-efficient in RL, but I very strongly bet on transformers remaining core even after such a paradigm shift. I’m curious whether you think the transformer architecture and text input/output need to go, or whether the new training procedure / architecture fits in with transformers because transformers are just the best information-mixing architecture.
My guess is that the main issue with current transformers will turn out to be that they don’t have a long-term state/memory, and I think that’s a pretty critical part of how humans are able to learn on the job as effectively as they do.
The trouble, as I’ve heard it, is that the approaches which do incorporate a long-term state/memory are apparently much harder to train well than transformers, plus transformers benefit from first-mover effects.
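To make the distinction concrete, here is a minimal sketch (assuming PyTorch; the specific layer choices are just illustrative, not anyone’s proposed architecture): a transformer layer is stateless across calls and only ever sees whatever context you hand it, while a recurrent cell threads a persistent hidden state through time, which is one simple way to get the kind of long-term memory being discussed.

```python
import torch
import torch.nn as nn

d_model = 64

# Stateless: each forward pass attends only over the tokens passed in.
# Anything that has fallen out of the context window is simply gone.
attn_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
tokens = torch.randn(1, 128, d_model)        # (batch, seq_len, d_model)
out = attn_layer(tokens)                     # no state survives this call

# Stateful: a GRU cell carries a hidden vector forward step by step, so
# information can persist indefinitely -- but gradients must flow through
# every step, which is part of why such models are harder to train at scale.
rnn_cell = nn.GRUCell(input_size=d_model, hidden_size=d_model)
h = torch.zeros(1, d_model)                  # the long-lived state
for t in range(tokens.shape[1]):
    h = rnn_cell(tokens[:, t, :], h)         # h accumulates memory across steps
```

This is only meant to illustrate the stateless-vs-stateful contrast; hybrid designs (state-space models, recurrent memory layers bolted onto transformers, etc.) sit somewhere in between.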