My guess is that it will be a scaled-up Gato (https://www.lesswrong.com/posts/7kBah8YQXfx6yfpuT/what-will-the-scaled-up-gato-look-like-updated-with). I think there might be some interesting features when the models are fully multi-modal, e.g. being able to play games, perform simple actions on a computer, etc. Based on the announcement from Google, I would expect fully multimodal training: image, audio, video and text in/out. Based on DeepMind's hiring needs, I would expect they also want it to generate audio/video and to extend the model to robotics (the brain of something similar to a Tesla Bot) in the near future. Elon claims that training on video input/output alone can yield full self-driving, so I'm very curious what training on YouTube videos can achieve. If they've managed to make solid progress on long-term planning/reasoning and can deploy the model with sufficiently low latency, it might be quite a significant release, one that could simplify many office jobs.
Isn’t the risk coming from insufficient AGI alignment relatively small compared to the vulnerable world hypothesis? I would expect that even without the invention of AGI, or with aligned AGI, it would still be possible for us to use some more advanced AI techniques as research assistants that help us invent some kind of smaller/cheaper/easier-to-use atomic bomb that would destroy the world anyway. Essentially, the question is: why so much focus on AGI alignment instead of a general slowing down of technological progress?
I think this seems quite underexplored. The fact that it is hard to slow down the progress doesn’t mean it isn’t necessary or that this option shouldn’t be researched more.
What are your reasons for AGI being so far away?
this has generated much less engagement than I thought it would...what am I doing wrong?
sure, I’m actually not suggesting that it should necessarily be a feature of dialogues on LW, it was just a suggestion for a different format (my comment received almost opposite karma/agreement votes, so maybe this is the reason?). it also depends on how often you use the branching: my guess is that most conversations don’t need it at every point, but a few times over the whole conversation might be useful.
Ok, I was thinking about this a bit and finally got some time to write it down. I realized that it is quite hard to make predictions about the first version of GATO, since it depends on what the team prioritizes in development. Therefore I’ll try to predict some attributes/features of a GATO-like model that should be available in the next two years, while expecting that many will appear sooner; it is just difficult to say which ones. I’m not a professional ML researcher, so I might get some factual things wrong, and I would be happy to hear corrections from people with more insight.
First, a prediction regarding the size of the model: I would expect a GATO-like architecture to see more commercial success/usefulness than e.g. GPT-3, so the investment should also be higher. Furthermore, I would guess there will be several significant improvements to training infrastructure, e.g. from companies such as Cerebras/Graphcore. Therefore I estimate the model will use somewhere between 10-100x more compute than GPT-3. This might result in the model having more parameters, a larger context window, or most likely both. I predict the most likely context window size to be ~40,000 (10x more) and the parameter count 1T (roughly 6x GPT-3). Regarding the context window: I think there will be some algorithmic improvements, so it won’t work the same way as before (see below).
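A rough back-of-the-envelope for the numbers above. The GPT-3 figures are the commonly cited ones; the multipliers are just the guesses from this comment:

```python
# Back-of-the-envelope for the scale prediction.
# GPT-3 reference figures (commonly cited approximations):
gpt3_flops = 3.14e23   # approx. total training compute of GPT-3, in FLOPs
gpt3_params = 175e9    # GPT-3 parameter count

# guessed multipliers from the comment
flops_low = gpt3_flops * 10     # lower end: 10x GPT-3 compute
flops_high = gpt3_flops * 100   # upper end: 100x GPT-3 compute

predicted_params = 1e12         # predicted parameter count (1T)
predicted_context = 40_000      # predicted context window

print(f"compute range: {flops_low:.1e} - {flops_high:.1e} FLOPs")
print(f"params: {predicted_params / gpt3_params:.1f}x GPT-3")
```

The 1T figure works out to about 5.7x GPT-3's parameter count, which is where the "roughly 6x" comes from.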
Since GATO is multimodal, I would expect the scaling laws to change a bit due to transfer learning. E.g. since the model won’t need to extract all information about the shapes of objects from text, but can instead just look at the images, it should be much easier for it to answer questions such as “Can scissors be inserted into a glass bottle?”, and so it should require significantly less data. Thus the scaling laws would also need to be multi-dimensional, answering what the optimal ratio of audio/text/image/video is to achieve the best results. For example, to improve the language-model part of GATO, we may counterintuitively need to train on more images instead of more text.
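To make the "multi-dimensional scaling law" idea concrete, here is a toy loss function with a cross-modal transfer term. The functional form and every coefficient are invented purely for illustration; the only point is that with transfer, adding image data also lowers the text-related term:

```python
# Toy multi-dimensional scaling law with cross-modal transfer.
# All exponents/constants are made up for illustration.
def loss(n_params, d_text, d_image, transfer=0.3):
    # data in one modality partially substitutes for data in the other
    text_term = (1e12 / (d_text + transfer * d_image)) ** 0.3
    image_term = (1e12 / (d_image + transfer * d_text)) ** 0.3
    param_term = (1e11 / n_params) ** 0.1
    return text_term + image_term + param_term

base = loss(1e11, d_text=3e11, d_image=3e11)
more_images = loss(1e11, d_text=3e11, d_image=6e11)
print(base > more_images)  # True: extra image data lowered total loss
```

Under a form like this, the "optimal ratio" question becomes an optimization over the data mix at fixed compute.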
I predict that GATO will be trained on text, images, audio and video. I believe they will also attempt image/audio/video generation, which the current version doesn’t do, essentially by predicting the next image/video token instead of using diffusion models. The context window size seems too small for video; however, I believe there are two reasons why it won’t be a huge problem. First, by using something like Perceiver, where more processing happens on recently seen tokens and only a small amount of computation is spent on older, more distant tokens, the context window could be increased significantly (or some other kind of memory could be added).
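A toy sketch of that first idea: keep the most recent tokens at full resolution and compress everything older into a handful of summary vectors (Perceiver-style latents, here crudely approximated by mean-pooling). All shapes and sizes are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, seq_len, recent, n_latents = 64, 10_000, 512, 32

tokens = rng.normal(size=(seq_len, d_model))
recent_tokens = tokens[-recent:]   # processed at full cost
old_tokens = tokens[:-recent]      # compressed into summaries

# crude stand-in for learned cross-attention: mean-pool old tokens
# into n_latents summary vectors
chunks = np.array_split(old_tokens, n_latents)
latents = np.stack([c.mean(axis=0) for c in chunks])

# the model now attends over recent + n_latents positions, not seq_len
effective_len = recent + n_latents
print(effective_len, "instead of", seq_len)
```

The quadratic attention cost then scales with the 544 effective positions rather than the 10,000 raw tokens, which is the sense in which the context window could grow cheaply.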
Second, I don’t think the model needs to see the whole image/video. When humans look at something, only a very small part of the image is sharp and the rest is blurry. Similarly, I think Gato will receive image information as a small number of tokens that describe a small rectangle of the picture/video clearly, plus a small number of tokens describing the blurred rest. There would further be action tokens describing the “eye movement” as the focus shifts to a different part of the image. In this way I think GATO will be able to watch/generate videos or read/write long books.
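A minimal sketch of that foveation idea: one full-resolution crop around the point of attention, plus a heavily downsampled version of the whole frame. The crop size and downsampling factor are illustrative assumptions, not anything Gato actually does:

```python
import numpy as np

def foveate(image, cy, cx, fovea=32, coarse_factor=8):
    """Return a sharp crop around (cy, cx) plus a coarse full-frame view."""
    y0 = max(cy - fovea // 2, 0)
    x0 = max(cx - fovea // 2, 0)
    sharp = image[y0:y0 + fovea, x0:x0 + fovea]       # full-res "fovea"
    blurry = image[::coarse_factor, ::coarse_factor]  # downsampled periphery
    return sharp, blurry

img = np.zeros((256, 256))
sharp, blurry = foveate(img, cy=128, cx=128)
# 32*32 + 32*32 = 2048 pixels kept, versus 65536 in the full frame
print(sharp.size + blurry.size, "of", img.size)
```

An "eye movement" action would then just pick the next (cy, cx), so the model trades a full frame of tokens for a short sequence of cheap glimpses.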
Furthermore, I think that, in general, RL could be used to make the model “think slow” when predicting tokens. For example, instead of the task of predicting the next token, GATO could be trained on an RL task: “find a set of actions by which you determine what the next token is”. So to predict what the next image token, next word, or next action in a game should be, it would first look around the image/video/book, and only after collecting the relevant information would it emit the right token. Possibly it could also emit tokens summarizing the information it has seen so far from a large amount of data that it might need in the future. Of course, it would probably still be trained on Atari games (likely now with actual RL) or in some coding environment with predefined inputs/outputs, but I think these will be much less significant compared to the “information-finding RL”. Maybe a smaller feature would be that GATO could emit commands modifying its context window, deleting/adding tokens to it.
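The control flow of that "information-finding" task can be sketched as a loop where the policy emits LOOK actions to gather context before committing to an ANSWER. The environment, action strings, and the toy policy below are all invented for illustration; a learned policy would replace `toy_policy`:

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str       # "LOOK <where>" or "ANSWER <token>"
    observation: str  # what the lookup returned

def answer_with_lookups(question, documents, policy, max_steps=10):
    """Let the policy gather information before emitting its answer."""
    trajectory = []
    for _ in range(max_steps):  # cap on "slow thinking" steps
        action = policy(question, trajectory)
        if action.startswith("ANSWER "):
            return action[len("ANSWER "):], trajectory
        where = action[len("LOOK "):]
        trajectory.append(Step(action, documents.get(where, "")))
    return None, trajectory  # ran out of thinking budget

# trivial hand-written policy standing in for the learned one
def toy_policy(question, trajectory):
    if not trajectory:
        return "LOOK page_3"  # go gather information first
    return "ANSWER " + trajectory[-1].observation.split()[-1]

docs = {"page_3": "the capital of France is Paris"}
answer, traj = answer_with_lookups("capital of France?", docs, toy_policy)
print(answer)  # Paris
```

The RL reward would then score whether the final ANSWER matches the ground-truth next token, with the LOOK steps learned rather than supervised.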
So, some capabilities that I would predict in 2 years:
- Generation of images, video, audio and text, even quite long ones, e.g. book-length text, 5-minute videos etc.
- Instead of “Let’s think step by step” we would have “Let’s sketch the solution”, drawing diagrams etc.
- Reasonably operating a computer using text commands, e.g. searching the internet, using Paint, an IDE/debugger and so on.
- Much better coding, by combining various prompting methods, performing in the top 10% of competitive programmers (compared to AlphaCode’s ~50th percentile).
- Solving some IMO problems (probably not a gold medal, but maybe bronze by ignoring combinatorics).
- Acting like a smart research assistant, e.g. finding relevant papers, pointing out their strengths/weaknesses, suggesting improvements/ideas.
- Learning to play entirely new (Atari) games about as fast as humans; this would probably require RL, though, rather than a pure prediction model.
- Completing an IQ test with an above-average result.
Capabilities I don’t expect: generating novel jokes; outperforming the best humans at long-term planning, research, math or coding. Image/video generation won’t match reality. Similarly, AI-generated books won’t sell well. Empathy: it won’t make a very good friend. It will be slow and expensive, so real-time robotics will probably still be a challenge. Also, it won’t be a very reliable doctor, despite being connected to the internet.
Stupid beginner question: I noticed that while interesting, many of the posts here are very long and try to go deep into the topic explored, often without a tl;dr. I’m just curious: how do the writers/readers find the time for it? Are they paid? If someone lazy like me wants to participate, is there a more Twitter-like version of LessWrong?
I see. I will update the post with some questions. I find it quite difficult, though, to forecast how performance metrics will improve, compared to just predicting capabilities, as the datasets are probably not that well known.
Thanks for this post! I think it is always great when people share their opinions about timelines, and more people (even ones not directly involved in ML) should be encouraged to express their views freely, without the fear of being held accountable if they turn out wrong. In my opinion, even people directly involved in ML research seem too reluctant to share their timelines and how those impact their work, which might be useful for others. Essentially, I think people should share their view when it is going to somehow influence their decision making, rather than only when they feel it crosses some level of rigour/certainty. Therefore posts like this one should receive a bit more praise (and LW should have the two types of voting for posts too, not just comments).
While I disagree with the overall point of the post, I agree that there is probably a lot of wishful thinking/curiosity driving this forum and impacting some predictions. However, even despite this, I still think AGI is very close. My prediction is that TAI will happen in the next 2-5 years (70%) and AGI in the next 8 (75%). I guess it will be based on something like a scaled-up GATO pre-trained on YouTube videos, with RL and some memory. The main reason for this is that deep learning was operating at a very small scale just two years ago (less than a billion parameters), which made it very difficult to test some ideas. The algorithmic improvements seem just too easy to come up with. For example, almost all important problems, e.g. language, vision, audio, RL, were solved/almost solved in a very short time, and the ideas there didn’t require much ingenuity.
Just a slight exaggeration: if you take a five-year-old and ask him to draw a random diagram, chances are quite high that, if scaled up, it is a SOTA architecture for something. It is just hard to test the ideas, because of the engineering difficulty and the lack of compute. However, this is likely to be overcome soon, with either more money being thrown at the problem or architecture improvements; e.g. Cerebras and Graphcore seem to be doing some promising work here.
yeah definitely, there could be a possibility of quoting/linking answers from other branches. I haven’t seen any UI that supports something like that, but my guess is it wouldn’t be too difficult to make one. My thinking was that there would be one main branch and several smaller branches that can connect back to the main one, so that some points can be discussed in greater depth. Also, the branching should probably not happen every time, but just occasionally, when both participants agree on it.
Nah...I still believe that the future AGI will invent a time machine and then invent itself before 2022
Why do you think TAI is decades away?
it could be sparse...a 175B-parameter GPT-4 with 90 percent sparsity could be essentially equivalent to a 1.75T-parameter GPT-3. Also, I am not exactly sure, but my guess is that if it is multimodal, the scaling laws change (essentially you get more varied data, instead of always training on predicting text, which is repetitive and where likely only a small percentage contains new useful information to learn).
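One way to read that sparsity arithmetic (my interpretation of the comment, not an established equivalence): a model with 1.75T total parameters and 90% sparsity activates only 175B of them per forward pass, so its inference cost is comparable to a dense 175B model while its capacity is 10x larger:

```python
# Sparsity arithmetic: total vs. active parameters per forward pass.
total_params = 1.75e12   # hypothetical sparse model (1.75T parameters)
sparsity = 0.90          # fraction of weights inactive per forward pass

active_params = total_params * (1 - sparsity)
print(f"{active_params:.3e} active parameters")  # ~1.75e11, i.e. 175B
```

Whether such a sparse model actually matches a dense model of equal total size is an open empirical question; this just shows where the 10x factor comes from.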
oh and besides IQ tests, I predict it would also be able to pass most current CAPTCHA-like tests (though humans would still be better at some)
It seems to me that these types of conversations would benefit from being trees instead of chains. When two people have a disagreement/different point of view, there is usually some root cause of the disagreement. When the conversation is a chain, one person explains her arguments/makes several points, the other has to expand on each, and then, at some point, to avoid massively long comments, the participants have to paraphrase, summarise, or ignore some of the arguments to keep things concise. If there was an option to split the conversation into several branches at some point, it could make the comments shorter, easier to read, and deeper. Also, for a reader not participating in the conversation, it can be easier to follow and to get to the main point influencing their view.
I’m not sure if something like this was done before and it would obviously require a lot more work on the UI, but I just wanted to share the idea as it might be worth considering.
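A minimal data model for the branching idea, just to show how little structure it needs: each comment holds a list of replies, and any reply can start a new branch. The class and field names are my own invention, not any existing forum's API:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One comment in a branching conversation."""
    author: str
    text: str
    replies: list = field(default_factory=list)

    def reply(self, author, text):
        child = Node(author, text)
        self.replies.append(child)
        return child

def depth(node):
    """Length of the longest branch below (and including) this comment."""
    return 1 + max((depth(r) for r in node.replies), default=0)

root = Node("A", "main claim")
b1 = root.reply("B", "disagreement on point 1")
b2 = root.reply("B", "disagreement on point 2")  # a second branch
b1.reply("A", "clarification of point 1")
print(len(root.replies), depth(root))  # 2 3
```

A chain is just the special case where every node has at most one reply, so the UI work is mostly in rendering and navigating the extra branches, not in the data model.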