New GPT-3 competitor

AI21 has trained a new language model, Jurassic-1, whose largest version has 178 billion parameters (GPT-3 has 175 billion). The accompanying paper gives only limited technical details.

There were already several models with far more parameters than GPT-3, but they were either mixture-of-experts models or consisted mostly of word embeddings. They required much less compute to train and use, but were less powerful than a dense transformer like GPT-3 or the new Jurassic-1.
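To see why a sparse mixture-of-experts model can have more total parameters yet need far less compute per token than a dense model, here is a crude back-of-the-envelope sketch. The numbers and the assumption that all parameters sit in expert layers are made up for illustration; they don't describe any specific model.

```python
def active_params(total_params: float, num_experts: int, experts_per_token: int) -> float:
    """Rough estimate: in a mixture-of-experts layer, each token is routed
    to only a few experts, so only that fraction of the parameters is
    actually used in a given forward pass."""
    return total_params * experts_per_token / num_experts

dense_model = 175e9   # dense transformer: every parameter is used for every token
moe_total = 1_000e9   # hypothetical sparse model with ~1T total parameters

used = active_params(moe_total, num_experts=64, experts_per_token=2)
print(f"~{used / 1e9:.0f}B parameters active per token, out of 1T total")
# -> roughly 31B parameters doing work per token, far fewer than the dense 175B
```

Under these toy assumptions, the trillion-parameter sparse model does less work per token than GPT-3, which is why a larger headline parameter count doesn't imply a more powerful model.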

The interesting thing about Jurassic-1 is that it really doesn't go much beyond GPT-3. It has a larger vocabulary and a slightly optimized architecture. Jurassic-1 has only slightly more parameters than GPT-3, whereas prior trends indicated that any GPT-3 successor would use at least an order of magnitude more parameters. Since GPT-3, much work has gone into improving the transformer architecture (e.g., linear-time self-attention and neural architecture search), but little of that is visible in Jurassic-1. Maybe companies don't think it's economically viable to scale beyond GPT-3, or to run many experiments with different architectures at that scale?

Also, Jurassic-1 is a unidirectional model, like GPT-3 (meaning it processes text strictly left-to-right). This means the model can only represent a given word using the context provided by the previous words. That puts unidirectional models at a disadvantage on most tasks other than text generation: for example, other than GPT-3, all of the top models on the SuperGLUE benchmark leaderboard are bidirectional. It's interesting that AI21 chose to compete with OpenAI using a model that provides the same class of service (text generation) as GPT-3, rather than specializing in, e.g., text classification, where a bidirectional model would be better.
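To make the left-to-right restriction concrete, here is a minimal sketch (not from the Jurassic-1 paper) of the causal attention mask a unidirectional transformer applies, contrasted with the all-ones mask a bidirectional encoder would use:

```python
import numpy as np

def attention_mask(seq_len: int, bidirectional: bool) -> np.ndarray:
    """Return a (seq_len, seq_len) mask where mask[i, j] = 1 means
    position i is allowed to attend to position j."""
    if bidirectional:
        # Bidirectional models (BERT-style encoders) let every token
        # see the full context, both left and right.
        return np.ones((seq_len, seq_len), dtype=int)
    # Unidirectional models like GPT-3 or Jurassic-1 use a causal
    # (lower-triangular) mask: token i only sees tokens 0..i.
    return np.tril(np.ones((seq_len, seq_len), dtype=int))

print(attention_mask(4, bidirectional=False))
# [[1 0 0 0]
#  [1 1 0 0]
#  [1 1 1 0]
#  [1 1 1 1]]
```

The causal mask is what makes text generation natural for these models, and it is also why they see less context per word than bidirectional models on classification-style tasks.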