Thank you for this post. It looks like the people at Anthropic have put a lot of thought into this, which is good to see.
You mention that there are often surprising qualitative differences between larger and smaller models. How seriously is Anthropic considering a scenario where there is a sudden jump in certain dangerous capabilities (in particular deception) at some level of model intelligence? Does it seem plausible that it might not be possible to foresee this jump from experiments on even slightly weaker models?
Very interesting! After reading "chinchilla's wild implications" I was hoping someone would write something like this.
If I understand point 6 correctly, you are proposing that Hoffmann's scaling laws lead to shorter timelines because data efficiency can be improved algorithmically. To me it seems that depending on algorithmic innovations, rather than on the improvements in compute that would allow more parameters, might just as well make timelines longer. There seems to be more uncertainty about whether people will keep coming up with the novel ideas required to improve data efficiency than about whether the available compute will continue to increase in the near to mid-term future. If the available data really becomes exhausted within the next few years, then improving the quality of models will be more dependent on such novel ideas under Hoffmann's laws than under Kaplan's.
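To make the data-limited case concrete, here is a rough sketch of what I have in mind. It assumes the parametric loss form from the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta, with the approximate fitted constants reported there; the function names and the token budget are my own, chosen only for illustration. With the data D held fixed, adding parameters can only push the loss down toward E + B/D^beta, so any further gain has to come from more data or from better data efficiency.

```python
# Approximate constants from the parametric fit in the Chinchilla paper
# (assumed here for illustration; treat the exact values as rough).
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for a model with n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def data_limited_floor(n_tokens: float) -> float:
    """Loss that remains at a fixed token budget even as parameters grow without bound."""
    return E + B / n_tokens**BETA


if __name__ == "__main__":
    tokens = 1.4e12  # hypothetical fixed data budget
    for n_params in (7e10, 7e11, 7e12):
        print(f"N={n_params:.0e}  loss={chinchilla_loss(n_params, tokens):.4f}")
    print(f"floor at D={tokens:.0e}: {data_limited_floor(tokens):.4f}")
```

Under this fit, each tenfold increase in parameters buys a smaller improvement and the loss never drops below the floor set by the data term, which is why exhausting the data would shift the burden onto new ideas rather than onto compute.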