The idea that a given architecture can learn to solve X, Y, and Z efficiently (which in this case it can’t) wouldn’t even be that impressive, unless you couldn’t get a good [architecture] search algorithm to solve X, Y, and Z equally fast.
Most people don’t have something like an architecture search algorithm on hand. (Aside from perhaps their brains, as you mentioned in the ‘AGI is here’ post.)
Well, the “surprised” part is what I don’t understand.
In this case, surprise is a result of learning something. Yes, it’s surprising to you that not everyone has learned this already. (Though there are different ways/levels of learning things.) Releasing a good architecture search might help, or writing a post about this: “GPT-2 can (probably) do anything that’s just moving symbols around, very badly. This might include Rubik’s Cubes, but not dancing*.”
*I assume. I’m also guessing that moving in general is hard (for non-custom hardware, i.e. things other than brains) and that it has a big space that GPT-2 doesn’t have a shot at (like StarCraft/DotA/etc.).
The concern is that ‘GPT-2 is bad at everything, but better than random’, and people are wondering: ‘how long until something that is good at everything comes along?’ Will it be sudden, or will ‘bad’ have to be replaced by ‘slightly less bad’ a thousand times over the course of the next hundred/thousand years?
Most people don’t have something like an architecture search algorithm on hand
I’m not sure what you mean by this…? Architecture search is fairly trivial to implement from scratch, and takes literally 2 lines of code with something like Ax. Well, it’s arguable whether it’s trivial per se, but I think most people would have an easier time coming up with, understanding, and implementing architecture search than coming up with, understanding, and implementing a transformer (e.g. GPT-2) or any other attention-based network.
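To make the “trivial to implement from scratch” claim concrete, here is a minimal sketch of the idea: random search over a toy configuration space. The `evaluate` function and its depth/width parameters are made up for illustration; in a real search it would stand for “train this architecture and return its validation score”.

```python
import random

def evaluate(depth, width):
    # Stand-in for "train this architecture and return validation accuracy".
    # (A made-up objective; a real search would actually train a model here.)
    return width / 128 - abs(depth - 4) * 0.1

def random_search(trials=50, seed=0):
    # Sample random depth/width configurations and keep the best-scoring one.
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(trials):
        cfg = {"depth": rng.randint(1, 8), "width": rng.choice([32, 64, 128])}
        score = evaluate(**cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

best_cfg, best_score = random_search()
print(best_cfg, best_score)
```

Libraries like Ax wrap this same loop (with smarter samplers, e.g. Bayesian optimization) behind a short API call; the loop above is just the from-scratch version of the concept.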
I assume. Also guessing that moving in general is hard (for non-custom hardware; things other than brains) and it has a big space that GPT-2 doesn’t have a shot at (like StarCraft/DotA/etc.).
Again, I’m not sure why GPT-2 wouldn’t have a shot at StarCraft or DotA. The most basic fully connected network you could write, as long as it has enough parameters and the correct training environment, has a shot at StarCraft 2, DotA, etc. It’s just that it will learn slower than something built for those specific cases.
The concern is that ‘GPT-2 is bad at everything, but better than random’, and people wondering, ‘how long until something that is good at everything comes along’? Will it be sudden, or will ‘bad’ have to be replaced by ‘slightly less bad’ a thousand times over the course of the next hundred/thousand years?
Again, I’m not sure how “bad” and “good” are defined here. If you define them as “quick to train”, then again, something that’s “better at everything” than GPT-2 has been around since the ’70s: dynamic architecture search (OK, arguably only widely used in the last 6 years or so).
If you are talking about “able to solve”, then again, any architecture with enough parameters should be able to solve any problem that is solvable given enough time to train; the time required to train it is the issue.
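As a toy illustration of the “enough parameters plus the right training setup” point, here is a generic fully connected network (plain NumPy; the layer sizes, learning rate, and step count are arbitrary choices for the sketch) learning XOR, a task a linear model cannot solve at all:

```python
import numpy as np

# A generic fully connected net fits XOR given enough parameters and
# training steps -- the only question is how long training takes.
rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])

W1 = rng.normal(0, 1, (2, 8)); b1 = np.zeros(8)   # hidden layer (8 units)
W2 = rng.normal(0, 1, (8, 1)); b2 = np.zeros(1)   # output layer

lr = 1.0
for _ in range(20000):
    h = np.tanh(X @ W1 + b1)                  # forward pass, hidden layer
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # sigmoid output
    g2 = (p - y) / len(X)                     # grad of mean cross-entropy wrt logits
    g1 = (g2 @ W2.T) * (1 - h ** 2)           # backprop through tanh
    W2 -= lr * h.T @ g2; b2 -= lr * g2.sum(0)
    W1 -= lr * X.T @ g1; b1 -= lr * g1.sum(0)

print(p.round().ravel())
```

Nothing here is XOR-specific except the data; the same generic network and loop would (much more slowly) chip away at far bigger problems, which is exactly the “able to solve vs. time to train” distinction.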
[Again, I’m] not sure why GPT-2 wouldn’t have a shot [at] StarCraft or DotA. The most basic fully connected network you could write, as long as it has enough parameters and the correct training environment, has a shot at StarCraft 2, DotA, etc.
Moving has a lot of degrees of freedom, as do those domains. There’s also the issue of quick response time (which is not something it was built for), and it not being an economical solution (which can also be said for OpenAI’s work in those areas).
When things built for StarCraft don’t make it to the superhuman level, something that isn’t built for it probably won’t.
It’s just that it will learn slower than something [built] for those specific cases.
The question is how long − 10 years? Solving chess via analyzing the whole tree would take too much time, so no one does it. Would it learn in a remotely feasible amount of time?
The question is how long − 10 years? Solving chess via analyzing the whole tree would take too much time, so no one does it. Would it learn in a remotely feasible amount of time?
Well yeah, that’s my whole point here. We need to talk about accuracy and training time!
If the GPT-2 model was trained in a few hours and loses 99% of games vs a decision-tree-based model (à la Deep Blue) that was trained in a few minutes on the same machine, then it’s worthless. It’s exactly like saying, “In theory, given almost infinite RAM and 10 years, we could beat Deep Blue (or AlphaZero, or whatever the cool kids are doing nowadays) by just analyzing a very large subset of all possible moves + combinations and arranging them hierarchically.”
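A quick back-of-the-envelope shows why nobody tries the exhaustive-tree approach (assuming the commonly cited average branching factor of about 35 legal moves per chess position, and a generously fast evaluator):

```python
BRANCHING = 35           # commonly cited average legal moves per chess position
NODES_PER_SEC = 10 ** 9  # generous: a billion positions evaluated per second

for depth in (5, 10, 20):
    nodes = BRANCHING ** depth           # exhaustive game tree grows as 35**depth
    years = nodes / NODES_PER_SEC / (3600 * 24 * 365)
    print(f"depth {depth}: {nodes:.2e} nodes, ~{years:.2e} years")
```

Even depth 20, well short of a full game, is already hopeless at a billion positions per second, which is why real engines prune aggressively instead of enumerating.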
So you think people should only be afraid of/excited about developments in AGI that:
1) are more recent than 50 (or arguably 6) years ago?
2) could do anything/a lot of things well with a reasonable amount of training time?
3) might actually generalize in the sense of artificial general intelligence, i.e. are remotely close to being on par with humans (w.r.t. their ability to handle such a variety of domains)?
4) seem actually agent-like?
In regards to 1), I don’t necessarily think that older developments that are re-emerging can’t be interesting (see the whole RL scene nowadays, which to my understanding is very much bringing back the kinds of approaches that were popular in the ’70s). But I do think the particular ML development people should focus on is the one with the most potential, which will likely end up being newer. My gripe with GPT-2 is that there’s no comparative proof that it has more potential to generalize than a lot of other things (e.g. quick architecture search methods, or custom encoders/heads added to a ResNet); actually, I’d say its sheer size and the issues one encounters when training it indicate the opposite.
I don’t think 2) is a must, but going back to 1), I think training time is one of the important criteria for comparing the approaches we focus on, since training time on a simple task is arguably the best proxy we have for training time on a more complex task.
As for 3) and 4)… I’d agree with 3); I think 4) is too vague. But I wasn’t trying to bring either point across in this specific post.
?

https://github.com/facebook/Ax

Just an example of a library that can be used to do hyperparameter search quickly.
But again, there are many tools and methodologies, and you can mix and match. This is one methodology/idea for architecture search that I found kind of interesting, for example: https://arxiv.org/pdf/1802.03268.pdf