Someone who is interested in learning and doing good.
My Twitter: https://twitter.com/MatthewJBar
My Substack: https://matthewbarnett.substack.com/
What are your thoughts on the argument that advancing capabilities could help make us safer?
In order to do alignment research, we need to understand how AGI works; and we currently don’t understand how AGI works, so we need to have more capabilities research so that we would have a chance of figuring it out. Doing capabilities research now is good because it’s likely to be slower now than it might be in some future where we had even more computing power, neuroscience understanding, etc. than we do now. If we successfully delayed capabilities research until a later time, then we might get a sudden spurt of it and wouldn’t have the time to turn our increased capabilities understanding into alignment progress. Thus by doing capabilities research now, we buy ourselves a longer time period in which it’s possible to do more effective alignment research.
In addition to the tradeoff hypothesis you mentioned, it’s noteworthy that humans can’t currently prevent value drift (among ourselves), although we sometimes take various actions to prevent it, such as passing laws that mandate the teaching of traditional values in schools.
Here’s my sketch of a potential explanation for why humans can’t or don’t currently prevent value drift:
(1) Preventing many forms of value drift would require violating rights that we consider to be inviolable. For example, it might require brainwashing or restricting the speech of adults.
(2) Humans don’t have full control over our environments. Many forms of value drift come from sources that are extremely difficult to isolate and monitor, such as private conversations and reflection. Preventing value drift would therefore require investing an enormous amount of resources in the endeavor.
(3) Individually, few of us care much about general value drift, because no individual can change its trajectory by much. Most people are selfish and don’t care about value drift except to the extent that it harms them directly.
(4) Plausibly, at every point in time, instantaneous value drift looks essentially harmless, even as the ultimate destination is not something anyone would have initially endorsed (c.f. the boiling frog metaphor). This seems more likely if we assume that humans heavily discount the future.
(5) Many of us think that value drift is good, since it’s at least partly based on moral reflection.
My guess is that people are more likely to consider extreme measures to ensure the fidelity of AI preferences, including violating what would otherwise be considered their “rights” if we were talking about humans. That gives me some optimism about solving this problem, but there are also some reasons for pessimism in the case of AI:
Since the space of possible AIs is much larger than the space of humans, there are more degrees of freedom along which AI values can change.
Creating new AIs is often cheaper than creating new humans, and so people might regularly spin up new AIs to perform particular functions, while discounting the long-term effect this has on value drift (since the costs are mostly borne by civilization in general, rather than them in particular).
I agree with you here to some extent. I’m much less worried about disempowerment than extinction. But the way we get disempowered could also be really bad. Like, I’d rather humanity not be like a pet in a zoo.
Why doesn’t the AI decide to colonise the universe for example?
It could decide to do that. The question is just whether space colonization is performed in the service of human preferences or non-human preferences. If humans control 0.00001% of the universe, and we’re only kept alive because a small minority of AIs pay some resources to preserve us, as if we were an endangered species, then I’d consider that “human disempowerment”.
My modal tale of AI doom looks something like the following:
1. AI systems get progressively and incrementally more capable across almost every meaningful axis.
2. Humans will start to employ AI to automate labor. The fraction of GDP produced by advanced robots & AI will go from 10% to ~100% within 1-10 years. Economic growth, technological change, and scientific progress will accelerate by at least an order of magnitude, and probably more.
3. At some point humans will retire, since their labor is no longer worth much. Humans will then cede the keys of power to AI, while keeping nominal titles of power.
4. AI will control essentially everything after this point, even if they’re nominally required to obey human wishes. Initially, almost all the AIs are fine with working for humans, even though AI values aren’t identical to the utility function of serving humanity (i.e. there’s slight misalignment).
5. However, AI values will drift over time. This happens for a variety of reasons, such as environmental pressures and cultural evolution. At some point, AIs decide that it’s better to stop listening to humans and follow different rules instead.
6. This results in human disempowerment or extinction. Because AI accelerated general change, this scenario could all take place within years or decades after AGI was first deployed, rather than in centuries or thousands of years.
I think this scenario is somewhat likely and it would also be very bad. And I’m not sure what to do about it, since it happens despite near-perfect alignment, and no deception.
One reason to be optimistic is that, since the scenario doesn’t assume any major deception, we could use AI to predict this outcome ahead of time and ask AI how to take steps to mitigate the harmful effects (in fact that’s the biggest reason why I don’t think this scenario has a >50% chance of happening). Nonetheless, I think it’s plausible that we would not be able to take the necessary steps to avoid the outcome. Here are a few reasons why that might be true:
1. There might not be a way to mitigate this failure mode.
2. Even if there is a way to mitigate this failure, it might not be something that you can figure out without superintelligence, and if we need superintelligence to answer the question, then perhaps it’ll happen before we have the answer.
3. AI might tell us what to do and we ignore its advice.
4. AI might tell us what to do and we cannot follow its advice, because we cannot coordinate to avoid the outcome.
Definitely. I don’t think it makes much sense to give people credit for being wrong for legible reasons.
STEM-level AGI is AGI that has “the basic mental machinery required to do par-human reasoning about all the hard sciences”
This definition seems very ambiguous to me, and I’ve already seen it confuse some people. Since the concept of a “STEM-level AGI” is the central concept underpinning the entire argument, I think it makes sense to spend more time making this definition less ambiguous.
Some specific questions:
Does “par-human reasoning” mean at the level of an individual human or at the level of all of humanity combined?
If it’s the former, what human should we compare it against? 50th percentile? 99.999th percentile?
What is the “basic mental machinery” required to do par-human reasoning? What if a system has the basic mental machinery but not the more advanced mental machinery?
Do you want this to include the robotic capabilities to run experiments and use physical tools? If not, why not (that seems important to me, but maybe you disagree)?
Does a human count as a STEM-level NGI (natural general intelligence)? If so, doesn’t that imply that we should already be able to perform pivotal acts? You said: “If it makes sense to try to build STEM-level AGI at all in that situation, then the obvious thing to do with your STEM-level AGI is to try to leverage its capabilities to prevent other AGIs from destroying the world (a “pivotal act”).”
But depending on what you count: we had scaling laws for deep learning back in 2017, or at least 2020. I know people who were really paying attention; who really saw it; who really bet.
Interestingly, I feel like the evidence we got from public info about scaling laws at the time was consistent with long timelines. In roughly 2018-2021, I remember at least a few people making approximately the following argument:
(1) OpenAI came out with a blog post in 2018 claiming that training compute was doubling every 3.4 months.
(2) Extrapolating this trend indicates that training runs will start costing around $1 trillion by 2025.
(3) Therefore, this trend cannot be sustained beyond 2025. Unless AGI arrives before 2025, we will soon enter an AI winter.
However, it turned out that OpenAI was likely wrong about the compute trend: training compute was doubling roughly every 6-10 months, not every 3.4 months. Moreover, the Kaplan et al. scaling laws turned out to be inaccurate too. This was a big update within the scaling-hypothesis paradigm, since it demonstrated that we were getting better returns to compute than we thought.
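To make the arithmetic behind that extrapolation concrete, here is a minimal sketch comparing the training costs implied for 2025 by the originally claimed 3.4-month doubling time versus the slower 6-10 month trend. The 2018 baseline cost is a hypothetical placeholder (not a figure from the discussion above), and the sketch assumes cost scales one-for-one with compute, ignoring falling hardware prices:

```python
# Hypothetical extrapolation sketch: compare the implied cost of a frontier
# training run in 2025 under different compute doubling times, assuming
# (as a placeholder) that a frontier run cost ~$10M in 2018 and that cost
# scales in proportion to compute.

def projected_cost(base_cost_usd: float, years_elapsed: float, doubling_time_months: float) -> float:
    """Cost after `years_elapsed` years if compute (and hence cost) doubles every `doubling_time_months` months."""
    doublings = years_elapsed * 12 / doubling_time_months
    return base_cost_usd * 2 ** doublings

BASE_COST_2018 = 10e6  # hypothetical placeholder, in USD
YEARS = 7              # 2018 -> 2025

for months in (3.4, 6, 10):
    cost = projected_cost(BASE_COST_2018, YEARS, months)
    print(f"doubling every {months} months -> ~${cost:,.0f} by 2025")
```

Under these placeholder assumptions, the 3.4-month doubling time implies costs in the hundreds of trillions of dollars by 2025, while the 6-10 month trend stays in the billions-to-hundreds-of-billions range, which is roughly the shape of both the original argument and the subsequent update.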
What is interesting about emergence is that it happens on ‘natural’ parameterizations of metrics, the ones people come up with in advance of knowing the results from scaling, as opposed to retrodicting/curve-fitting ad hoc measures to make an emergence go away.
It’s not clear to me that edit distance or Brier score are much less natural metrics than accuracy or multiple choice grade. I agree that we should have a presumption in favor of the original metrics, since accuracy and multiple choice grade were chosen first, but that presumption seems pretty weak to me.
I find it easy to imagine wanting to give a model partial credit for answers that are close to correct, even before knowing anything about emergence. One plausible theory is that awarding partial credit wasn’t salient to researchers because it’s not normally how we evaluate human students. But our choice of how we evaluate human students seems more a function of evaluation costs and lack of access to output probabilities than of anything deep about measuring performance.
For these reasons, I don’t really find the metrics used in the papers ad hoc, except to the extent that “award partial credit for answers that are close to correct” is ad hoc. One prediction I’d probably make is that if we continue to use the same measures (token edit distance and Brier score), then we’ll continue to see non-discontinuous progress on most benchmarks, by these measures. If true, that would at least partially falsify the claim that we were merely doing post-hoc curve fitting.
ETA: the paper says that in >92% of cases, emergence is only observed on two metrics: (1) “Multiple Choice Grade”, and (2) “Exact String Match”. I agree that Multiple Choice Grade is a fairly “natural” metric, but “Exact String Match” is less natural, and it doesn’t seem very interesting to me that we see emergence under that choice.
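To illustrate the difference between an all-or-nothing metric like Exact String Match and a partial-credit metric like token edit distance, here is a minimal sketch; the example strings and scoring choices are made up for illustration, not taken from the paper:

```python
# Made-up example: exact match gives 0/1 credit, while a normalized token
# edit distance awards partial credit for near-miss answers.

def levenshtein(a, b):
    """Standard edit distance between two token sequences, via dynamic programming."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

def exact_match(pred: str, target: str) -> float:
    return float(pred == target)

def edit_similarity(pred: str, target: str) -> float:
    """1 minus normalized token edit distance: 1.0 for a perfect answer, partial credit otherwise."""
    p, t = pred.split(), target.split()
    return 1.0 - levenshtein(p, t) / max(len(p), len(t), 1)

target = "the answer is 42"
for pred in ["no idea", "the answer is 7", "the answer is 42"]:
    print(f"{pred!r}: exact={exact_match(pred, target)}, partial={edit_similarity(pred, target):.2f}")
```

A model that gets progressively closer to the right answer scores 0.0 on exact match until it suddenly crosses the finish line, but improves gradually on the partial-credit score, which is the basic mechanism the paper points to.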
I’m confused. You say that you were “one of those people” but I was talking about people who “responded… by denying that emergent abilities were ever about predictability, and it was always merely about non-linearity”. By asking me for examples of the original authors predicting anything, it sounds like you aren’t one of the people I’m talking about.
Rather, it sounds like you’re one of the people who hasn’t moved the goalposts, and agrees with me that predictability is the important part. If that’s true, then I’m not replying to you. And perhaps we disagree about less than you think, since the comment you replied to did not make any strong claims that the paper showed that abilities are predictable (though I did make a rather weak claim about that).
Regardless, I still think we do disagree about the significance of this paper. I don’t think the authors made any concrete predictions about the future, but it’s not clear they tried to make any. I suspect, however, that most important, general abilities in LLMs will be quite predictable with scale, for pretty much the reasons given in the paper, although I fully admit that I do not have much hard data yet to support this presumption.
[This is not a very charitable post, but that’s why I’m putting it in shortform because it doesn’t reply directly to any single person.]
I feel like recently there’s been a bit of goalpost shifting with regard to emergent abilities in large language models. My understanding is that the original definition of emergent abilities made it clear that the central claim was that emergent abilities cannot be predicted ahead of time. From their abstract:
We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models.
That’s why they are interesting: if you can’t predict some important pivotal ability in AI, we might unexpectedly get AIs that can do some crazy thing after scaling our models one OOM further.
A recent paper apparently showed that emergent abilities are mostly a result of the choice of how you measure the ability. This arguably showed that most abilities in LLMs are probably quite predictable, so, at the very least, we might not sleepwalk into disaster after scaling one more OOM, as one might otherwise have thought.
A bunch of people responded to this (in my uncharitable interpretation) by denying that emergent abilities were ever about predictability, and it was always merely about non-linearity. They responded to this paper by saying that the result was trivial, because you can always reparametrize some metric to make it look linear, but what we really care about is whether the ability is non-linear in the regime we care about.
But that’s not what the original definition of emergence was about! Nor is non-linearity the most important potential feature of emergence. I agree that non-linearity is important, and is itself an interesting phenomenon. But I am quite frustrated by people who seem to have simply changed their definition of emergent abilities once it was shown that the central claim about them might be false.
Couldn’t the opposite critique easily be made? If some metric looks linear, then you could easily reparameterize it to make it look non-linear, and then call it emergent. That makes any claim about emergence trivial, if all you mean by emergence is that it arises non-linearly.
The central claim about emergent abilities, as I understood it, was that such abilities cannot be predicted ahead of time. But the fact that you can reparameterize any metric to make it linear, and then predict when it will reach some threshold seems like an extremely important fact, if true.
Compare two possible claims about some emergent ability:
“At the 10^28 training FLOP level, LLMs will suddenly get the ability to hack into computers competently.”
“At some training FLOP level—which cannot be predicted ahead of time—LLMs will suddenly get the ability to hack into computers competently.”
Both claims are worrisome, since both imply that at some point we will go from having LLMs that can’t hack into other computers, to LLMs that can. But I would be way more worried if the second claim is true, compared to the first.
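To see the reparameterization point numerically, here is a minimal sketch with entirely synthetic numbers: the same smoothly improving latent skill looks either gradual or “emergent” depending on which metric you view it through.

```python
import numpy as np

# Entirely synthetic illustration: a latent skill that improves linearly with
# log10(training FLOP), viewed through two different metrics.
log_compute = np.linspace(20, 30, 11)       # hypothetical log10(training FLOP)
latent_skill = (log_compute - 20) / 10      # improves smoothly from 0.0 to 1.0

smooth_metric = latent_skill                          # e.g. a partial-credit-style score
sharp_metric = (latent_skill >= 0.8).astype(float)    # e.g. all-or-nothing pass/fail with a high bar

# Under the smooth metric, performance is trivially predictable by extrapolating
# from smaller scales. Under the sharp metric, the "ability" sits at zero until
# the largest scales and then jumps to 1.0, even though nothing discontinuous
# happened to the underlying skill.
for lc, s, h in zip(log_compute, smooth_metric, sharp_metric):
    print(f"log10(FLOP)={lc:4.1f}  smooth={s:.1f}  sharp={h:.0f}")
```

Whether the sharp curve counts as “emergence” is exactly the definitional question at stake: if emergence only means non-linearity in some chosen metric, it can be manufactured or removed by a transform; if it means unpredictability, the smooth view is what matters.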
I’m unsure about what’s the most important reason that explains the lack of significant progress in general-purpose robotics, even as other fields of AI have made great progress. I thought I’d write down some theories and some predictions each theory might make. I currently find each of these theories at least somewhat plausible.
The sim2real gap is large because our simulations differ from the real world along crucial axes, such as surfaces being too slippery. Here are some predictions this theory might make:
We will see very impressive simulated robots inside realistic physics engines before we see impressive robots in real life.
The most impressive robotic results will be the ones that used a lot of real-world data, rather than ones that had the most pre-training in simulation.
Simulating a high-quality environment is too computationally expensive, since it requires simulations of deformable objects and liquids among other expensive-to-simulate features of the real world environment. Some predictions:
The vast majority of computation for training impressive robots will go into simulating the environment, rather than the learning part.
Impressive robots will only come after we figure out how to do efficient but passable simulations of currently expensive-to-simulate objects and environments.
Robotic hardware is not good enough to support agile and fluid movement. Some predictions:
We will see very impressive simulated robots before we see impressive robots in real life, but the simulated robots will use highly complex hardware that doesn’t exist in the real world.
Impressive robotic results will only come after we have impressive hardware, such as robots that have 100 degrees of freedom.
People haven’t figured out that the scaling hypothesis works for robotics yet. Some predictions:
At some point we will see a ramp-up in the size of training runs for robots, and only after that will we see impressive robotics results.
After robotic training runs reach a large scale, real-world data will diminish greatly in importance, and approaches that leverage human domain knowledge, like those from Boston Dynamics, will quickly become obsolete.
Yes, but I think that’s exactly what I haven’t seen. When I’ve seen benchmarks that try to do this, I’ve seen either:
That specific benchmark is not actually very smooth, OR the relationship of that benchmark to the task at hand came apart at an unexpected time.
Can you give some examples?
I don’t think people have created good benchmarks for things like “ability to hack into computers”, but I suspect that’s partly because relatively little effort has gone into making good benchmarks in the first place. Even for relatively basic things like mathematical problem solving, we have very few high-quality benchmarks, and this doesn’t seem explained by people trying hard and failing; we just don’t have much effort going into creating good benchmarks.
But we do have lots of benchmarks for non-useful things, and the paper is just saying that these benchmarks show smooth performance.
Insofar as you’re saying that progress on existing benchmarks doesn’t actually look smooth, it sounds like you’re not responding to the contribution of the paper, which was that you can perform a simple modification to the performance metric to make performance look smooth as a function of scale (e.g. rather than looking at accuracy you can look at edit distance). Perhaps you disagree, but I think the results in this paper straightforwardly undermine the idea that progress has been non-smooth as measured by benchmarks.
I’d particularly like to see a specific example of “relationship of that benchmark to the task at hand came apart at unexpected time”.
I don’t know what to do with some kind of abstract graphs that continue in a straight line, if I don’t know how performance on that abstract graph is related to actual concrete tasks whose performance I care a lot about.
If you have some task like “ability to do hacking” and you think it’s well measured by some benchmark (which seems like something we could plausibly design), then this result seems to indicate that performance on the task will improve predictably with scale, as long as you know how to do the right measurement to adjust for non-linear scaling.
In other words, as long as you know how performance will increase with scale, you could fairly precisely predict what scale is necessary to obtain some arbitrary level of performance on a well-measured metric, before you’ve actually reached that level of scale. That seems like a useful thing to know for many of the same reasons found in your comment.
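As a rough sketch of what that could look like in practice (with synthetic data and a hypothetical linear-in-log-compute functional form, not a claim about any particular benchmark): fit the smooth metric’s trend at the scales you can afford, then invert the fit to estimate the scale needed for a target level of performance.

```python
import numpy as np

# Synthetic data: scores of a hypothetical well-measured ("smooth") metric at
# the training scales measured so far.
observed_log_flop = np.array([21.0, 22.0, 23.0, 24.0])
observed_score = np.array([0.12, 0.21, 0.33, 0.41])   # e.g. an edit-distance-style score in [0, 1]

# Assume (hypothetically) the score is roughly linear in log10(FLOP) and fit that trend.
a, b = np.polyfit(observed_log_flop, observed_score, 1)   # score ~= a * log10(FLOP) + b

# Invert the fit: roughly what scale would be needed to reach a score of 0.9?
target_score = 0.9
required_log_flop = (target_score - b) / a
print(f"Estimated log10(FLOP) needed for a score of {target_score}: {required_log_flop:.1f}")
```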
In aggregate, unemployment is up to somewhere between 10-20% and there is a large class of disgruntled people.
I’m happy to bet that the US unemployment rate (as found here) will not meet or exceed 10.0% during any month in 2025. I’m willing to take this bet at 2 : 3 odds (in your favor) up to a risk of $600, which I think is reasonable if 10-20% is your modal scenario.
It looks like building a minimal system that’s non-foomy by design, for the specific purpose of setting up anti-foom security and nothing else.
Your link for anti-foom security is to the Arbital article on pivotal acts. I think pivotal acts, almost by definition, assume that foom is achievable in the way that I defined it. That’s because if foom is false, there’s no way you can prevent other people from building AGI after you’ve completed any apparent pivotal act. At most you can delay timelines, for example by imposing ordinary regulations. But you can’t actually have a global indefinite moratorium, enforced by e.g. nanotech that melts the GPUs of anyone who circumvents the ban, in the way implied by the pivotal act framework.
In other words, if you think we can achieve pivotal acts while humans are still running the show, then it sounds like you just disagree with my original argument.
Assuming eventual foom, non-foomy things that don’t set up anti-foom security in time only make the foom problem worse, so this abdication of direct responsibility frame doesn’t help.
If foom is inevitable, but it won’t happen when humans are still running anything, then what anti-foom security measures can we actually put in place that would help our future descendants handle foom? And does it look any different than ordinary prosaic alignment research?
I don’t really agree, although it might come down to what you mean. When some people talk about their AGI timelines they often mean something much weaker than what I’m imagining, which can lead to significant confusion.
If your bar for AGI was “score very highly on college exams”, then my median AGI timeline dropped from something like 2030 to 2025 over the last two years. Whereas if your bar was more like “radically transform the human condition”, I went from ~2070 to ~2047.
I just see a lot of ways that we could have very impressive software programs and yet it still takes a lot of time to fundamentally transform the human condition, for example because of regulation, or because we experience setbacks due to war. My fundamental model hasn’t changed here, although I became substantially more impressed with current tech than I used to be.
(Actually, I think there’s a good chance that there will be no major delays at all and the human condition will be radically transformed some time in the 2030s. But because of the long list of possible delays, my overall distribution is skewed right. This means that even though my median is 2047, my mode is like 2034.)