In regards to 1), I don’t necessarily think that older developments that are re-emerging can’t be interesting (see the whole RL scene nowadays, which to my understanding is very much bringing back the kind of approaches that were popular in the 70s). But I do think the particular ML development that people should focus on is the one with the most potential, which will likely end up being newer. My gripe with GPT-2 is that there’s no comparative proof that it has more potential to generalize than a lot of other things (e.g. quick architecture search methods, custom encoders/heads added to a resnet). Actually, I’d say the sheer size of it and the issues one encounters when training it indicate the opposite.
I don’t think 2) is a must, but going back to 1), I think that training time is one of the important criteria for comparing the approaches we are focusing on, since measuring training time on a simple task is arguably the best you can do to estimate training time on a more complex task.
As for 3) and 4)… I’d agree with 3), and I think 4) is too vague, but I wasn’t trying to get either point across in this specific post.