Having read more AI alarmist literature recently, as someone who strongly disagrees with its conclusions, I think I’ve come up with a decent classification of alarmists based on the fallacies they commit.
There’s the kind of alarmist who understands how machine learning works but commits the fallacy of assuming that data-gathering is easy and that intelligence is very valuable. The caricature of this position is something along the lines of “PAC learning basically proves that, with enough computational resources, AGI will take over the universe”.
<I actually wrote an article trying to argue against this position, the LW crosspost of which gave me the honor of having the most down-voted featured article in this forum’s history>
But I think that my disagreement with this first class of alarmist is not very fundamental; we can probably agree on a few things, such as:
1. In principle, the kind of intelligence needed for AGI is a solved problem; all that we are doing now is trying to optimize for various cases.
2. The increase in computational resources is enough to get us closer and closer to AGI even without any more research effort being allocated to the subject.
These types of alarmists would probably agree with me that, if we found a way to magically multiply two arbitrary tensors 100x faster than we do now, for the same electricity consumption, that would constitute a great leap forward.
But the second kind are the ones that scare/annoy me the most, because they are the kind that don’t seem to really understand machine learning, which results in them being surprised that machine learning models can do what has been uncontroversially established for decades that they could do.
The not-so-caricatured representation of this position is: “Oh no, a 500,000,000-parameter model designed for {X} can outperform a 20KB decision tree when trained for task {Y}, the end is nigh!”
And if you think that this caricature is unkind (well, maybe the “end is nigh” part is), I’d invite you to read the latest blog entry by Scott Alexander, a writer whom I generally consider to be quite intelligent and rational, being amazed that a 1,500,000,000-parameter transformer architecture can be trained to play chess poorly… a problem so trivial one could probably power its training using a well-designed potato battery and an array of P2SCs… simulated in Minecraft.
I’ve seen endless examples of this, usually boiling down to “Oh no, a very complex neural network can do a very simple task with about the same accuracy as <insert basic sklearn classifier>” or “Oh no, a neural network can learn to compress information from arbitrary unlabeled data”. Which is literally what people have been doing with neural networks since, like… forever. That’s the point of neural networks: they are usually inefficient and hard to tune, but they are highly generalizable.
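To make that concrete, here’s a minimal sketch of the kind of comparison I mean (my own toy example, with an arbitrary dataset and arbitrary layer sizes, so treat the exact numbers as illustrative): a needlessly over-parameterized MLP and a plain logistic regression typically land within a couple of percentage points of each other on a simple task.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "plain logistic regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)),
    "needlessly large MLP": make_pipeline(
        StandardScaler(), MLPClassifier(hidden_layer_sizes=(256, 256),
                                        max_iter=1000, random_state=0)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    # On a task this simple, both models usually end up with very similar accuracy.
    print(name, round(model.score(X_test, y_test), 3))
```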
I think this second viewpoint is potentially dangerous, and I think it would be well worthwhile to educate people enough that they switch away from it, since it seems to engender an irrational, religion-style fear in people and shifts focus away from the real problems (e.g. giving models the ability to estimate uncertainty in their own conclusions).
Regarding MIRI/SIAI/Yudkowsky, I think you are considerably overestimating the extent to which the early AI safety movement took any notice of research. Early MIRI obsessed about stuff like AIXI, which AI researchers didn’t care about, and based a lot of their nightmare scenarios on “genie”-style reasoning derived from fairy tales.
And the thing I said that isn’t factually correct is...
(This is arguably testable.)
The only thing factually incorrect is your implied assumption that voting has anything to do with truth assessment here ;)
Having read more AI alarmist literature recently, as someone who strongly disagrees with its conclusions, I think I’ve come up with a decent classification of alarmists based on the fallacies they commit.
I feel similarly, except I think the flaws are a lack of clarity and jumping to conclusions, at times, rather than fallacies.
But I think that my disagreement with this first class of alarmist is not very fundamental; we can probably agree on a few things, such as:
1. In principle, the kind of intelligence needed for AGI is a solved problem; all that we are doing now is trying to optimize for various cases.
2. The increase in computational resources is enough to get us closer and closer to AGI even without any more research effort being allocated to the subject.
This is definitely not something you will find agreement on. Thinking that this is something that alarmists would agree with you on suggests you are using a different definition of AGI than they are, and may have other significant misunderstandings of what they’re saying.
Would you care to go into more details?
If there are different definitions of AGI, then that’s quite a barrier to understanding generally. Never mind my confusions as a curious newbie.
This feels like a really good time to jump in and ask for a working definition of AGI. (Simple words, no links to essays, please.)
being amazed that a 1,500,000,000-parameter transformer architecture can be trained to play chess poorly… a problem so trivial one could probably power its training using a well-designed potato battery and an array of P2SCs… simulated in Minecraft.
Truly, a new standard for replication. Jokes aside, I wouldn’t have said ‘amazed’, just surprised.* The question around that is, how far can you go with just textual pattern matching?** What can’t be done that way? RTS games? ‘Actually playing an instrument’ rather than writing music for one?
*From the article:
Is any of this meaningful? How impressed should we be that the same AI can write poems, compose music, and play chess, without having been designed for any of those tasks? I still don’t know.
**Though for comparison, it might be useful to see how well other programs, or humans, do on those tasks. Ideally this would be a novel task, which would require people who haven’t played chess or heard music, or the use of an unfamiliar notation.
Well, the “surprised” part is what I don’t understand.
As in: in principle, provided enough model complexity (i.e. the ability to model very complex functions), you can basically learn anything, as long as you can format your inputs and outputs in such a way as to fit the model.
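A toy sketch of what I mean by “format your inputs and outputs to fit the model” (my own example, nothing to do with GPT-2 specifically): 3-bit parity is not a “neural network problem” in any natural sense, but once each case is encoded as a fixed-size vector, a generic over-sized MLP fits it without trouble.

```python
import itertools

import numpy as np
from sklearn.neural_network import MLPClassifier

# Encode the problem as plain vectors: every 3-bit input and its parity label.
X = np.array(list(itertools.product([0, 1], repeat=3)))
y = X.sum(axis=1) % 2

# A generic, over-sized MLP; lbfgs because the dataset is tiny.
model = MLPClassifier(hidden_layer_sizes=(32, 32), solver="lbfgs",
                      max_iter=5000, random_state=0)
model.fit(X, y)
print(model.score(X, y))  # with enough capacity this should reach 1.0
```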
1.5B parameters is more than enough complexity to learn chess, given that it’s been done with models having 0.01% of that number of parameters.
In general, the only issue is that the less fitted your model is for the task, the longer it takes to train. Given that training a Deep Blue-equivalent model on an RTX 2080 would be counted in minutes, the fact that you can train something worse than that using the GPT-2 architecture in a few hours or days is… not really impressive, nor is it surprising. It would have been surprising if the training time were lower than (or similar to) that of other bleeding-edge models and the resulting model could outperform or match them.
As it stands, the idea of a generalizable architecture that takes a very long time to train is not very useful, since we already have quick ways of doing architecture search and hyperparameter search. The idea that a given architecture can learn to solve X, Y and Z efficiently (which in this case it doesn’t) wouldn’t even be that impressive, unless you couldn’t get a good architecture search algorithm to solve X, Y and Z equally fast.
The idea that a given architecture can learn to solve X, Y and Z efficiently (which in this case it doesn’t) wouldn’t even be that impressive, unless you couldn’t get a good architecture search algorithm to solve X, Y and Z equally fast.
Most people don’t have something like an architecture search algorithm on hand. (Aside from perhaps their brains, as you mentioned in the ‘AGI is here’ post.)
Well, the “surprised” part is what I don’t understand.
In this case, surprise is a result of learning something. Yes, it’s surprising to you that not everyone has learned this already. (Though there are different ways/levels of learning things.) Releasing a good architecture search might help, or writing a post about this: “GPT-2 can do (probably) anything very badly that’s just moving symbols around. This might include Rubik’s Cubes, but not dancing*.”
*I assume. Also guessing that moving in general is hard (for non-custom hardware; things other than brains) and it has a big space that GPT-2 doesn’t have a shot at (like StarCraft/DotA/etc.).
The concern is that ‘GPT-2 is bad at everything, but better than random’, and people wondering, ‘how long until something that is good at everything comes along’? Will it be sudden, or will ‘bad’ have to be replaced by ‘slightly less bad’ a thousand times over the course of the next hundred/thousand years?
Most people don’t have something like an architecture search algorithm on hand
I’m not sure what you mean by this…? Architecture search is fairly trivial to implement from scratch, and takes literally two lines of code with something like Ax. Well, it’s arguable whether it’s trivial per se, but I think most people would have an easier time coming up with, understanding and implementing architecture search than coming up with, understanding and implementing a transformer (e.g. GPT-2) or any other attention-based network.
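For what it’s worth, here’s roughly what I mean by “trivial to implement from scratch”: a dumb random search over MLP architectures in plain sklearn. This is a sketch; the search space, trial budget and dataset are arbitrary placeholders I picked for illustration.

```python
import random

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)

def sample_architecture():
    """Pick a random depth and a random width for each hidden layer."""
    depth = random.randint(1, 3)
    return tuple(random.choice([16, 32, 64, 128]) for _ in range(depth))

best_arch, best_score = None, -1.0
for _ in range(10):  # trial budget, chosen arbitrarily
    arch = sample_architecture()
    score = cross_val_score(
        MLPClassifier(hidden_layer_sizes=arch, max_iter=500), X, y, cv=3).mean()
    if score > best_score:
        best_arch, best_score = arch, score

print(best_arch, best_score)
```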
I assume. Also guessing that moving in general is hard (for non-custom hardware; things other than brains) and it has a big space that GPT-2 doesn’t have a shot at (like StarCraft/DotA/etc.).
Again, I’m not sure why GPT-2 wouldn’t have a shot at StarCraft or Dota. The most basic fully connected network you could write, as long as it has enough parameters and the correct training environment, has a shot at StarCraft 2, Dota… etc. It’s just that it will learn slower than something built for those specific cases.
The concern is that ‘GPT-2 is bad at everything, but better than random’, and people wondering, ‘how long until something that is good at everything comes along’? Will it be sudden, or will ‘bad’ have to be replaced by ‘slightly less bad’ a thousand times over the course of the next hundred/thousand years?
Again, I’m not sure how “bad” and “good” are defined here. If you are defining them as “quick to train”, then again, something that’s “better at everything” than GPT-2 has been here since the 70s: dynamic architecture search (ok, arguably only widely used in the last 6 years or so).
If you are talking about “able to solve”, then again, any architecture with enough parameters should be able to solve any problem that is solvable, given enough time to train; the time required to train it is the issue.
Again, I’m not sure why GPT-2 wouldn’t have a shot at StarCraft or Dota. The most basic fully connected network you could write, as long as it has enough parameters and the correct training environment, has a shot at StarCraft 2, Dota… etc.
Moving has a lot of degrees of freedom, as do those domains. There’s also the issue of quick response time (which is not something it was built for), and it not being an economical solution (which can also be said for OpenAI’s work in those areas).
When things built for StarCraft don’t make it to the superhuman level, something that isn’t built for it probably won’t.
It’s just that it will learn slower than something built for those specific cases.
The question is how long − 10 years? Solving chess via analyzing the whole tree would take too much time, so no one does it. Would it learn in a remotely feasible amount of time?
The question is how long − 10 years? Solving chess via analyzing the whole tree would take too much time, so no one does it. Would it learn in a remotely feasible amount of time?
Well yeah, that’s my whole point here. We need to talk about accuracy and training time!
If the GPT-2 model was trained in a few hours and loses 99% of games vs a decision-tree-based model (à la Deep Blue) that was trained in a few minutes on the same machine, then it’s worthless. It’s exactly like saying, “In theory, given almost infinite RAM and 10 years, we could beat Deep Blue (or alpha chess, or whatever the cool kids are doing nowadays) by just analyzing a very large subset of all possible moves and combinations and arranging them hierarchically”.
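For reference, this is the kind of “analyze a subset of moves and arrange them hierarchically” baseline I have in mind; a crude sketch using the third-party python-chess package (an assumption on my part, any move-generation library would do), with a material-only evaluation, a fixed search depth and no learning at all.

```python
import chess  # pip install python-chess

PIECE_VALUES = {chess.PAWN: 1, chess.KNIGHT: 3, chess.BISHOP: 3,
                chess.ROOK: 5, chess.QUEEN: 9, chess.KING: 0}

def evaluate(board):
    """Material balance from White's point of view."""
    score = 0
    for piece in board.piece_map().values():
        value = PIECE_VALUES[piece.piece_type]
        score += value if piece.color == chess.WHITE else -value
    return score

def minimax(board, depth):
    """Exhaustively score a small subtree of the game; no pruning at all."""
    if depth == 0 or board.is_game_over():
        return evaluate(board)
    scores = []
    for move in board.legal_moves:
        board.push(move)
        scores.append(minimax(board, depth - 1))
        board.pop()
    return max(scores) if board.turn == chess.WHITE else min(scores)

def best_move(board, depth=2):
    def score(move):
        board.push(move)
        value = minimax(board, depth - 1)
        board.pop()
        return value
    moves = list(board.legal_moves)
    return max(moves, key=score) if board.turn == chess.WHITE else min(moves, key=score)

print(best_move(chess.Board()))
```

Obviously this plays badly too, but it’s the cheap baseline that any “GPT-2 learned chess” claim should at least be measured against.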
So you think people should only be afraid/excited about developments in AGI that:
1) are more recent than 50 (to arguably 6) years ago?
2) could do anything/a lot of things well with a reasonable amount of training time?
3) Or that might actually generalize in the sense of general artificial intelligence, that’s remotely close to being on par with humans (w.r.t. ability to handle such a variety of domains)?
4) Seem actually agent-like?
In regards to 1), I don’t necessarily think that older developments that are re-emerging can’t be interesting (see the whole RL scene nowadays, which to my understanding is very much bringing back the kind of approaches that were popular in the 70s). But I do think the particular ML development that people should focus on is the one with the most potential, which will likely end up being newer. My gripe with GPT-2 is that there’s no comparative proof that it has potential to generalize compared to a lot of other things (e.g. quick architecture search methods, custom encoders/heads added to a resnet); actually, I’d say its sheer size and the issues one encounters when training it indicate the opposite.
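(By “custom encoders/heads added to a resnet” I mean the usual trick of reusing a generic backbone and swapping the final layer for whatever the task needs; a minimal PyTorch/torchvision sketch with placeholder layer sizes, nothing GPT-2-specific:)

```python
import torch.nn as nn
from torchvision import models

backbone = models.resnet18()              # generic, reusable feature extractor
in_features = backbone.fc.in_features     # 512 for resnet18

backbone.fc = nn.Sequential(              # task-specific head bolted on top
    nn.Linear(in_features, 128),
    nn.ReLU(),
    nn.Linear(128, 10),                   # e.g. 10 output classes, placeholder
)
# From here it's ordinary supervised training; backbone(images) returns 10 logits.
```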
I don’t think 2) is a must, but going back to 1), I think that training time is one of the important criteria for comparing the approaches we are focusing on, since training time on a simple task is arguably the best proxy you have for training time on a more complex task.
As for 3) and 4)… I’d agree with 3), I think 4) is too vague, but I wasn’t trying to bring either point across in this specific post.
https://github.com/facebook/Ax is just an example of a library that can be used to do hyperparameter search quickly.
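The “two lines of code” claim earlier refers to something like Ax’s quickstart loop; a sketch from memory (the toy quadratic objective is mine, and the exact signature may have drifted since I last used it, so treat it as an assumption rather than gospel):

```python
from ax import optimize

best_parameters, best_values, experiment, model = optimize(
    parameters=[
        {"name": "x1", "type": "range", "bounds": [-10.0, 10.0]},
        {"name": "x2", "type": "range", "bounds": [-10.0, 10.0]},
    ],
    # In practice the evaluation function would train a model with the sampled
    # settings and return the metric to optimize; here it's a toy quadratic.
    evaluation_function=lambda p: (p["x1"] - 3.0) ** 2 + (p["x2"] + 1.0) ** 2,
    minimize=True,
    total_trials=20,
)
print(best_parameters)
```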
But again, there are many tools and methodologies, and you can mix and match; here is one methodology/idea for architecture search that I found kind of interesting, for example: https://arxiv.org/pdf/1802.03268.pdf