I understand why you doubt that anything is more data-efficient or compute-efficient than the human brain alone. The problem is that AIs are raised on far more compute and training data than any human absorbs. As I noted in a comment on Yudkowsky’s attempt to explain the danger of ASI: because an ASI has access to far more training data and compute than a human, even an algorithm hundreds of times less efficient than the brain could still learn more about the world than any human knows.
Returning to the proposals which you list as failing to create the ASI:
Just increasing compute. Scaling laws measure the wrong objective: loss is tracked because it is far easier to measure than benchmark performance, which is nonzero only for sufficiently large models.
Higher quality data. Suppose that you would like the model to predict the idea behind the next set of tokens (e.g. via the recent CALM technique, where the LLM generates an entire sequence of tokens at once). Then one can, for example, ask a not-so-intelligent LLM to check whether the student’s idea and the real idea are the same.
Synthetic data. Synthetic datasets could also allow things like models trying to solve problems, then being trained on successful solutions.
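The solve-then-filter idea can be sketched in a few lines. This is a toy illustration, not a real training pipeline: `model_solve` is a hypothetical stand-in for sampling an answer from the model, and the "training set" is just a list of verified attempts.

```python
import random

def model_solve(problem, rng):
    # Hypothetical stand-in for sampling an answer from the model.
    return rng.randint(0, 20)

def is_correct(problem, answer):
    # Objective verifier: here, exact match against the known sum.
    return answer == sum(problem)

def collect_synthetic_data(problems, attempts=8, seed=0):
    """Let the model attempt each problem; keep only verified successes."""
    rng = random.Random(seed)
    dataset = []
    for problem in problems:
        for _ in range(attempts):
            answer = model_solve(problem, rng)
            if is_correct(problem, answer):
                dataset.append((problem, answer))  # verified solution becomes training data
                break
    return dataset

data = collect_synthetic_data([(3, 4), (1, 1), (10, 9)])
```

The key property is that every example entering the dataset has passed an objective check, so the filtered data can be higher quality than the model's raw outputs.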
Curriculum learning. Agreed.
Using another smaller LLM as an evaluator. The point is that it is far easier to check a proof than to generate one. Consider the P=NP conjecture, which is widely believed to be false: it states that for any problem whose solutions can be checked in polynomial time (e.g. verifying that an assignment of variables satisfies every clause of a 3-SAT formula, which takes only direct substitution), a solution can also be generated in polynomial time.
So a smaller LLM rejecting the bigger LLM’s bad outputs could plausibly train the latter to become more intelligent than its evaluator.
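The verification-is-easy half of the asymmetry is concrete enough to show in code: checking a candidate 3-SAT assignment takes one pass of substitution, even though finding one is believed to be hard. A minimal sketch, with clauses encoded as tuples of signed variable indices:

```python
# A formula is a list of clauses; each clause is a tuple of literals,
# where a positive int k means variable k and -k means its negation.

def satisfies(formula, assignment):
    """Verify a candidate solution by direct substitution -- polynomial time."""
    def lit_true(lit):
        value = assignment[abs(lit)]
        return value if lit > 0 else not value
    return all(any(lit_true(lit) for lit in clause) for clause in formula)

# (x1 or x2 or not x3) and (not x1 or x3 or x2)
formula = [(1, 2, -3), (-1, 3, 2)]
print(satisfies(formula, {1: True, 2: False, 3: True}))   # True
print(satisfies(formula, {1: False, 2: False, 3: True}))  # False
```

Generating a satisfying assignment, by contrast, has no known polynomial-time algorithm, which is exactly the gap the evaluator setup tries to exploit.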
RLHF ???
Transformers and “attention”: Mostly correct. That said, SOTA LLMs are capable of things like solving IMO problems using only their fixed pretrained weights and a written Chain of Thought. We have yet to rule out possibilities like neuralese recurrence, where a larger part of the model’s state carries over between steps.
Having models learn online means either that they modify only a small part of their weights, or that the fast-access long-term memory required to host each model reaches tens of gigabytes per user, which is infeasible.
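The tens-of-gigabytes figure is easy to reproduce as back-of-envelope arithmetic. Assuming (hypothetically) a 70B-parameter model stored in 16-bit precision, a full per-user copy versus a small adapter touching ~0.5% of that parameter count:

```python
params = 70e9  # hypothetical 70B-parameter model

# Full per-user copy at fp16 (2 bytes per parameter)
full_copy_gb = params * 2 / 1e9
print(f"full per-user copy: {full_copy_gb:.0f} GB")  # 140 GB

# A low-rank adapter (LoRA-style) covering ~0.5% of the parameters
adapter_gb = params * 0.005 * 2 / 1e9
print(f"adapter per user: {adapter_gb:.2f} GB")      # 0.70 GB
</p>```

So per-user storage is only infeasible if each user needs a full copy; partial-weight updates bring it down by orders of magnitude, which is the other horn of the dilemma stated above.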
While I failed to understand why the rest of the proposals wouldn’t scale to ASI, such a conclusion requires a more delicate analysis, like the ones by Toby Ord and me.
Thanks for your detailed response. I agree that if we have enough data/compute, we could overcome the data/compute inefficiency of AI models. I suspect the AI models are so intensely data/compute inefficient that this will be very difficult though, and that’s what I tried to gesture at in my post. If I could prove it, I’d have written a white paper or something instead of a blog post, but I hoped to at least share some of my thoughts on the subject.
Some specific responses:
Just increasing compute. I agree this is why we measure loss, but that doesn’t imply that measuring loss will get us to superintelligence long-term. Also, regarding “benchmark performance, which is nonzero only for models large enough”: I think you could have benchmarks that scale with the model, like novel games that start simple and grow more complex as the model gains capability, either manually or implicitly, as with AlphaGo Zero.
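The scaling-benchmark idea can be sketched as a toy loop: the game grows more complex whenever the model clears the current level, so even a weak model registers a nonzero score. `model_plays` is a hypothetical stand-in for evaluating the model at a given difficulty:

```python
def model_plays(difficulty):
    # Hypothetical stand-in: pretend the model wins levels below difficulty 5.
    return difficulty < 5

def run_scaling_benchmark(max_difficulty=10):
    """Raise the difficulty each time the model passes; report the highest level cleared."""
    difficulty = 1
    while difficulty <= max_difficulty and model_plays(difficulty):
        difficulty += 1  # model passed: make the game harder
    return difficulty - 1

print(run_scaling_benchmark())  # 4
```

The score is then a meaningful gradient for models of any size, rather than a flat zero below some capability threshold.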
Higher quality data. Thanks for bringing my attention to CALM, I’ll have to look into that. I don’t think using a not-so-intelligent LLM to check whether the student’s idea and the real idea are the same will work in the limit, for the same reason it would be hard to get a kindergartner to grade a high school math test, even if they had access to a correct version written by the teacher. (Assuming the test wasn’t multiple choice, or single numerical answers or something easy to verify.)
Using another smaller LLM as an evaluator. I’m definitely not against all approaches that use a smaller LLM to evaluate a larger LLM, and you’re right to push back here. In fact, I almost suggested one such approach in my “what might work” section. Narrow models like AlphaGo Zero do something like this to great effect. What I’m against specifically is asking smaller models to evaluate the “goodness” of an output, and trusting the smaller LLM to have good judgement about what is good. If it had to judge something specific and objective, that would possibly work. You want to trust the small model only for what it’s good at (parsing sentence structure/basic meaning of outputs, for example) and not what it’s bad at.
RLHF. RLHF works for what it does, but no amount of RLHF can overcome the problems with self-supervised learning I discussed in the post. It’s still a general “reality-stuffing” model. That’s all I meant.
Transformers and “attention”. I do not take benchmarks like solving the IMO seriously. These same AI models fail to solve kindergarten math worksheets, and fail to solve very basic problems in practice all the time. In particular, it does not seem smart to test how well a model can think by giving it problems that may require a whole lot of thinking, or not much, depending on what similar things happened to be in the training data, which we have no idea about. You mentioned P=NP. Solving problems is much easier if you already know how to solve similar-enough problems, and we don’t know what similar problems a given model does or does not know how to solve, rendering the benchmark useless, unless you construct a benchmark such that we know there can’t have been anything meaningfully similar in the training data (e.g. novel games). (I am unsure whether to take FrontierMath Tier 4 a bit more seriously, because the problems seem really hard and unlikely to be similar to anything in the training data, but ideally you’d have a benchmark that works even for less difficult problems anyway.)
As for your comment about online learning, I don’t think solving any particular task should require a model to totally reorganize its weights across the entire model. Updating only a little part of weights should be fine. An analogy to humans should show that much. I agree though that having to hold onto fine-tuned partial models for users, even briefly, is more expensive than what we’re doing now, but the capabilities gains may eventually be worth it if non-online-learning models do plateau.