Probably? Though it's hard to say, since so little information about the model architecture was given to the public. That said, PaLM is also around 10x the size of GPT-3, and GPT-4 seems better than it (though this is likely due to GPT-4's training following Chinchilla-or-better scaling laws).
See my edit to my comment above. Sounds like GPT-3 was actually 250x more compute than GPT-2, and Claude / GPT-4 are about 50x more compute than that? (Though it's unclear to me how much insight the Anthropic folks had into GPT-4's training before the announcement, so it's possible the 50x number is accurate for Claude but not for GPT-4.)
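For what it's worth, chaining those multipliers gives a rough sense of the cumulative scale. A quick back-of-the-envelope sketch, treating both the 250x and 50x figures as speculative:

```python
# Back-of-the-envelope cumulative training-compute ratio,
# assuming the (speculative) multipliers discussed above.
gpt3_over_gpt2 = 250  # GPT-3 vs. GPT-2, per the edited comment above
gpt4_over_gpt3 = 50   # Claude / GPT-4 vs. GPT-3, rough guess

gpt4_over_gpt2 = gpt3_over_gpt2 * gpt4_over_gpt3
print(f"Implied GPT-4 vs. GPT-2 compute: ~{gpt4_over_gpt2:,}x")  # ~12,500x
```

So if both numbers held, GPT-4 would sit roughly four orders of magnitude above GPT-2 in training compute.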