CS student at the University of Southern California. Previously worked for three years as a data scientist at a fintech startup, and before that spent four months on a work trial at AI Impacts. Currently working with Professor Lionel Levine on language model safety research and looking for research opportunities in aligning large language models.
aogara
Love the Box Contest idea. AI companies are already boxing models that could be dangerous, but they’ve done a terrible job of releasing the boxes and information about them. Some papers that used and discussed boxing:
Section 2.3 of OpenAI’s Codex paper. This model was allowed to execute code locally.
Section 2 and Appendix A of OpenAI’s WebGPT paper. This model was given access to the Internet.
Appendix A of DeepMind’s GopherCite paper. This model had access to the Internet, and the authors do not even mention the potential security risks of granting such access.
Another DeepMind paper, again granting a model access to the Google API without discussing any potential risks.
The common defense is that current models are not capable enough to write good malware or interact with search APIs in unintended ways. That might well be true, but someday it won’t be, and there’s no excuse for setting a dangerous precedent. Future work will need to set boxing norms and build good boxing software. I’d be very interested to see follow-up work on this topic or to discuss with anyone who’s working on it.
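As a gesture at what "good boxing software" could look like, here's a minimal sketch of running model-generated code in a restricted subprocess. This is my own illustration, not code from any of the papers above, and a real box would need far more: containers or VMs, syscall filtering, and no network or shared filesystem.

```python
# Hypothetical sketch of a minimal "box" for model-generated code (Unix-only).
# This is nowhere near a real sandbox; it only illustrates the kind of
# defaults boxing software could enforce out of the box.
import resource
import subprocess
import tempfile

def run_in_box(code: str, timeout_s: int = 5, mem_bytes: int = 512 * 1024**2):
    def set_limits():
        # Kill runaway code via OS-level CPU and memory limits.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        script_path = f.name

    # -I runs Python in isolated mode (no user site-packages, no env vars);
    # it does NOT block network access, which a real box must also handle.
    return subprocess.run(
        ["python3", "-I", script_path],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        preexec_fn=set_limits,
    )
```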
Emergent Abilities of Large Language Models [Linkpost]
Turns out that this dataset contains little to no correlation between a researcher's years of experience in the field and their HLMI timelines. Here's the trendline, showing only a small positive correlation where more experienced researchers have slightly longer timelines, the opposite of what you'd expect if everyone predicted AGI to arrive just as they retire.
My read of this survey is that most ML researchers haven’t updated significantly on the last five years of progress. I don’t think they’re particularly informed on forecasting and I’d be more inclined to trust the inside view arguments, but it’s still relevant information. It’s also worth noting that the median number of years until a 10% probability of HLMI is only 10 years, showing they believe HLMI is at least plausible on somewhat short timelines.
If you have the age of the participants, it would be interesting to test whether there is a strong correlation between expected retirement age and AI timelines.
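If the participant ages were available, the check would be quick. Here's a rough sketch; the file and column names are hypothetical placeholders, not the survey's actual fields.

```python
# Hypothetical sketch: correlate years of experience (or expected retirement age)
# with HLMI timelines. File and column names are placeholders, not the survey's schema.
import pandas as pd
from scipy import stats

df = pd.read_csv("survey_responses.csv")  # assumed file
df = df.dropna(subset=["years_experience", "hlmi_year"])

r, p = stats.pearsonr(df["years_experience"], df["hlmi_year"])
slope, intercept, *_ = stats.linregress(df["years_experience"], df["hlmi_year"])
print(f"Pearson r = {r:.2f} (p = {p:.3f}), "
      f"trendline slope = {slope:.2f} years of timeline per year of experience")
```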
This was heavily upvoted at the time of posting, including by me. It turns out to be mostly wrong. AI Impacts just released a survey of 4271 NeurIPS and ICML researchers conducted in 2021 and found that the median year for expected HLMI is 2059, down only two years from the 2061 median in the 2016 survey. Looks like the last five years of evidence hasn't swayed the field much. My inside view says they're wrong, but the opinions of the field and our inability to anticipate them are both important.
https://aiimpacts.org/2022-expert-survey-on-progress-in-ai/
My report estimates that the amount of training data required to train a model with N parameters scales as N^0.8, based significantly on results from Kaplan et al 2020. In 2022, the Chinchilla scaling result (Hoffmann et al 2022) showed that instead the amount of data should scale as N.
Are you concerned that pretrained language models might hit data constraints before TAI? Nostalgebraist estimates that there are roughly 3.2T tokens available publicly for language model pretraining. This estimate misses important potential data sources such as audio and video transcripts, private text conversations, and email. But the BioAnchors report estimates a median of 22T data points to train a transformative model, nearly an order of magnitude more than this estimate.
The BioAnchors estimate was also based on older scaling laws that placed a lower priority on data relative to compute. With the new Chinchilla scaling laws, more data would be required for compute-optimal training. Of course, training runs don't need to be compute-optimal: you can get away with using more compute and less data if you're constrained by data, even though it will cost more. And text isn't the only data a transformative model could use: audio, video, and RLHF on diverse tasks all seem like good candidates.
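To make the gap concrete, here's a back-of-the-envelope sketch comparing the older D ∝ N^0.8 assumption with the roughly 20-tokens-per-parameter rule of thumb often quoted from Chinchilla. The anchoring constants below are my own illustrative assumptions, not numbers from either report; the point is just that under the linear rule, ~3.2T public tokens only supports a compute-optimal model of roughly 160B parameters.

```python
# Back-of-the-envelope sketch: training data needed under two scaling assumptions.
# Constants are illustrative: both curves are anchored so that a 70B-parameter
# model needs 1.4T tokens (~20 tokens per parameter, the Chinchilla rule of thumb).
ANCHOR_N = 70e9    # parameters
ANCHOR_D = 1.4e12  # tokens

def tokens_kaplan_style(n_params: float, exponent: float = 0.8) -> float:
    """Older assumption: required data scales sublinearly, roughly as N^0.8."""
    return ANCHOR_D * (n_params / ANCHOR_N) ** exponent

def tokens_chinchilla_style(n_params: float) -> float:
    """Chinchilla-style assumption: required data scales linearly with N."""
    return ANCHOR_D * (n_params / ANCHOR_N)

for n in [70e9, 1e12, 10e12]:
    print(f"N = {n:.0e}: "
          f"N^0.8 rule -> {tokens_kaplan_style(n):.1e} tokens, "
          f"linear rule -> {tokens_chinchilla_style(n):.1e} tokens")
```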
Does the limited supply of public text data affect your views of how likely GPT-N is to be transformative? Are there any considerations overlooked here, or questions that could use a more thorough analysis? Curious about anybody else's opinions. Thanks for sharing the update; I think it's quite persuasive.
Would you have any thoughts on the safety implications of reinforcement learning from human feedback (RLHF)? The HFDT failure mode discussed here seems very similar to what Paul and others have worked on at OpenAI, Anthropic, and elsewhere. Some have criticized this line of research as only teaching brittle, task-specific preferences in a way that's open to deception, thereby advancing us towards more dangerous capabilities. If we achieve transformative AI within the next decade, it seems plausible that large language models and RLHF will play an important role in those systems, so why do safety-minded folks work on it?
Ah right. Thank you!
[Chinchilla 10T would have a 143x increase in parameters and] 143 times more data would also be needed, resulting in a 143*143= 20449 increase of compute needed.
Would anybody be able to explain this calculation a bit? It implies that compute requirements scale linearly with the number of parameters. Is that true for transformers?
My understanding would be that making the transformer deeper would increase compute linearly with parameters, but a wider model would require more than linear compute because it increases the number of connections between nodes at each layer.
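For what it's worth, here's the arithmetic under the commonly cited approximation that dense-transformer training compute is C ≈ 6·N·D FLOPs, which treats compute per token as proportional to total parameter count. Whether that proportionality holds regardless of width versus depth is essentially the question above, so treat this as a sketch under that assumption rather than a definitive answer.

```python
# Sketch of the quoted calculation, assuming the common approximation that
# dense-transformer training compute is C ~= 6 * N * D FLOPs
# (N = parameters, D = training tokens). Under this assumption compute is
# linear in N and linear in D, so scaling both by 143x gives 143^2 ~= 20,449x.
def training_flops(n_params: float, n_tokens: float) -> float:
    return 6 * n_params * n_tokens

base = training_flops(70e9, 1.4e12)              # Chinchilla-scale run
scaled = training_flops(143 * 70e9, 143 * 1.4e12)
print(f"Compute multiplier: {scaled / base:,.0f}x")  # -> 20,449x
```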
Great post, thanks for sharing. Here’s my core concern about LeCun’s worldview, then two other thoughts:
The intrinsic cost module (IC) is where the basic behavioral nature of the agent is defined. It is where basic behaviors can be indirectly specified. For a robot, these terms would include obvious proprioceptive measurements corresponding to “pain”, “hunger”, and “instinctive fears”, measuring such things as external force overloads, dangerous electrical, chemical, or thermal environments, excessive power consumption, low levels of energy reserves in the power source, etc.
They may also include basic drives to help the agent learn basic skills or accomplish its missions. For example, a legged robot may comprise an intrinsic cost to drive it to stand up and walk. This may also include social drives such as seeking the company of humans, finding interactions with humans and praises from them rewarding, and finding their pain unpleasant (akin to empathy in social animals). Other intrinsic behavioral drives, such as curiosity, or taking actions that have an observable impact, may be included to maximize the diversity of situations with which the world model is trained (Gottlieb et al., 2013)
The IC can be seen as playing a role similar to that of the amygdala in the mammalian brain and similar structures in other vertebrates. To prevent a kind of behavioral collapse or an uncontrolled drift towards bad behaviors, the IC must be immutable and not subject to learning (nor to external modifications).
This is the paper’s treatment of the outer alignment problem. It says models should have basic drives and behaviors that are specified directly by humans and not trained. The paper doesn’t mention the challenges of reward specification or the potential for learning human preferences. It doesn’t discuss our normative systems or even the kinds of abstractions that humans care about. I don’t understand why he doesn’t see the challenges with specifying human values.
Most of the paper instead focuses on the challenges of building accurate, multimodal predictive world models. That work seems entirely necessary for continuing to advance AI, but the primary focus on predictive capabilities, and the downplaying of the challenges of learning human values, worries me.
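To make concrete what "immutable and not subject to learning" amounts to, here's a toy sketch of an intrinsic cost module as a frozen, hand-specified function. This is my own illustration, not code from the paper; every feature, threshold, and weight below is an assumption the designer bakes in, which is exactly where the specification problem bites.

```python
# Toy illustration (not from the paper): an "intrinsic cost" as a fixed,
# hand-specified function of proprioceptive state. The feature names,
# thresholds, and weights are all designer-chosen assumptions.
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the module is immutable, not learned
class IntrinsicCost:
    w_force: float = 1.0   # penalize external force overloads ("pain")
    w_energy: float = 0.5  # penalize low energy reserves ("hunger")
    w_temp: float = 0.8    # penalize dangerous temperatures

    def __call__(self, state: dict) -> float:
        return (self.w_force * max(0.0, state["external_force"] - 1.0)
                + self.w_energy * max(0.0, 0.2 - state["battery_level"])
                + self.w_temp * max(0.0, state["temperature_c"] - 60.0))

ic = IntrinsicCost()
print(ic({"external_force": 1.5, "battery_level": 0.1, "temperature_c": 70.0}))
```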
If anybody has good sources about LeCun’s views on AI safety and value learning, I’d be interested.
success of model-free RL in complex video game environments like StarCraft and Dota 2
Do we expect model-free RL to succeed in domains where you can't obtain enormous amounts of data via e.g. self-play? A purely predictive world model seems better able to exploit self-supervised predictive objective functions, and to generalize to many possible goals that share a single world model. (Not to mention the potential alignment benefits of a more modular system.) Is model-free RL simply a fluke that learns heuristics by playing games against itself, or are there reasons to believe it will succeed on more important tasks?
Since the whole architecture is trained end-to-end with gradient descent
I don’t think this is what he meant, though I might’ve missed something. The world model could be trained with the self-supervised objective functions of language and vision models, as well as perhaps large labeled datasets and games via self-play. On the other hand, the actor must learn to adapt to many different tasks very quickly, but could potentially use few-shot learning or fine-tuning to that end. The more natural architecture would seem to be modules that treat each other as black boxes and can be swapped out relatively easily.
I’d like to publicly preregister an opinion. It’s not worth making a full post because it doesn’t introduce any new arguments, so this seems like a fine place to put it.
I’m open to the possibility of short timelines on risks from language models. Language is a highly generalizable domain that has seen rapid progress, shattering expectations of slower timelines for several years in a row now. The self-supervised pretraining objective means that data is not a constraint (though it could be for language agents, tbd), and the market seems optimistic about business applications of language models.
While I would bet against (~80%) language models pushing annual GDP growth above 20% in the next 10 years, I strongly expect (~80%) risks from AI persuasion to materialize (e.g. it becomes a mainstream topic of discussion or influences major political outcomes within the next 10 years), and I'm concerned (~20%) about tail risks from power-seeking LM agents (mainly hacking, but also financial trading, impersonation, and others). I'd be interested in (and should spend some time on) making clear, falsifiable predictions here.
Credit to “What 2026 Looks Like” and “It Looks Like You’re Trying To Take Over The World” for making this case well before I believed it was possible. I’m also influenced by the widespread interest in LMs from AI safety grantmakers and researchers. This has been my belief for a few months, as I noted here, and I’ve taken action by working on LM truthfulness, which I expect to be most useful in scenarios of fast LM growth. (Though I don’t think it will substantially combat power-seeking LM agents, and I’m still learning about other research directions that might be more valuable.)
I’m having trouble understanding the argument for why a “sharp left turn” would be likely. Here’s my understanding of the candidate reasons; I’d appreciate pointers to any considerations I’m missing:
Inner Misalignment
AGI will contain optimizers. Those optimizers will not necessarily optimize for the base objective used by the outer optimization process implemented by humans. Such an AGI could still achieve the outer goal in training, but once deployed it could competently pursue its incorrect inner goal. Deceptive alignment is a special case: we don’t realize the inner goal is misaligned until deployment, because the system realizes it’s in training and deliberately optimizes for the outer goal until then. See Risks from Learned Optimization and Goal Misgeneralization in Deep Reinforcement Learning.
Question: If inner optimization is the cause of the sharp left turn, why does Nate focus on failure modes that only arise once we’ve built AGI? We already have examples of inner misalignment, and I’d expect we can work on solving inner misalignment in current systems.
Wireheading / Power Seeking
AGI might try to exercise extreme control over its reward signal, either by hacking directly into the technical system providing its reward or by seeking power in the world to better achieve its rewards. These might be more important problems when systems are more intelligent and can more successfully execute these strategies.
Question: These problems are observable and addressable in systems today. See wireheading and conservative agency. Why focus on the unique AGI case?
Capabilities Discontinuities
The fast-takeoff hypothesis. This could be caused by recursive self-improvement, but the more popular justification seems to be that intelligence is fundamentally simple in some way and will be understood very quickly once AI reaches a critical threshold. This seems closely related to the idea that “capabilities fall into an attractor and alignment doesn’t”, which I don’t fully understand. AFAICT discontinuous capabilities don’t introduce misaligned objectives, but do make them much more dangerous. Rohin’s comment above sees this as the central disagreement about whether sharp left turns are likely or not.
Question: Is this the main reason to expect a sharp left turn? If you knew that capabilities progress would be slow and continuous, would you still be concerned about left turns?
Learning without Gradient Descent
If AI is doing lots of learning in between gradient descent steps, it could invent and execute a misaligned objective before we have time to correct it. This makes several strategies for preventing misalignment less useful: truthfulness, ELK, … . What it doesn’t explain is why learning between gradient descent steps is particularly likely to be misaligned—perhaps it’s not, and the root causes of misalignment are the inner misalignment and wireheading / power-seeking mentioned above.
Question: Does this introduce new sources of misaligned objectives? Or is it simply a reason why alignment strategies that rely on gradient descent updates will fail?
Further question: The technical details of how systems will learn without gradient descent seem murky; the post only provides an analogy to human learning without evolution. This is discussed here; I’d be curious to hear any thoughts.
Are these the core arguments for a sharp left turn? What important steps are missing? Have I misunderstood any of the arguments I’ve presented? Here is another attempt to understand the arguments.
Ah okay. Are there theoretical reasons to think that neurons with lower variance in activation would be better candidates for pruning? I guess the idea is that their output is roughly the same across different datapoints, so they can be pruned and their (nearly constant) effect can be replicated by the rest of the network.
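Here's a rough sketch of what I take that criterion to mean; this is my own guess at the procedure, not necessarily what the post did.

```python
# Rough sketch (my guess at the criterion, not necessarily the post's code):
# prune the hidden units whose activations vary least across a batch of inputs.
import numpy as np

def low_variance_prune(activations: np.ndarray, frac: float = 0.1):
    """activations: (n_examples, n_units) hidden activations on a sample batch."""
    stds = activations.std(axis=0)                # per-unit std over the batch
    k = int(frac * activations.shape[1])
    pruned = np.argsort(stds)[:k]                 # units with the smallest std
    mean_acts = activations.mean(axis=0)[pruned]  # their (nearly constant) outputs,
    return pruned, mean_acts                      # which could be folded into the
                                                  # next layer's bias to preserve behavior
```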
“…nodes with the smallest standard deviation.” Does this mean nodes whose weights have the lowest absolute values?
Similarly, humans are terrible at coordination compared to AIs.
Are there any key readings you could share on this topic? I’ve come across arguments about AIs coordinating via DAOs or by reading each other’s source code, including in Andrew Critch’s RAAP. Is there any other good discussion of the topic?
A model/ensemble of models will achieve >90% on the MATH dataset using a no-calculator rule
Curious to hear if/how you would update your credence in this being achieved by 2026 or 2030 after seeing the 50%+ accuracy from Google’s Minerva. Your prediction seemed reasonable to me at the time, and this rapid progress seems like a piece of evidence favoring shorter timelines.
I think it’s a pretty good argument. Holden Karnofsky puts a 1/3 chance on us not seeing transformative AI this century. In that world, people today know very little about what advanced AI will eventually look like and how to solve the challenges it presents. Surely some people should be working on problems that won’t be realized for a century or more, but it would seem much more difficult to argue that AI safety today is more altruistically pressing than other longtermist causes like biosecurity, or even neartermist causes like animal welfare and global poverty.
Personally I do buy the arguments that we could reach superintelligent AI within the next few decades, which is a large part of why I think AI safety is an important cause area right now.
“We’ll still probably put it in a box, for the same reason that keeping password hashes secure is a good idea. We might as well. But that’s not really where the bulk of the security comes from.”
This seems true in worlds where we can solve AI safety to the level of rigor demanded by security mindset. But lots of things in the world aren’t secure by security mindset standards. The internet and modern operating systems are both full of holes. Yet we benefit greatly from common sense, fallible safety measures in those systems.
I think it’s worth working on versions of AI safety that are analogous to boxing and password hashing, meaning they make safety more likely without guaranteeing or proving it. We should also work on approaches like yours that could make systems more reliably safe, but might not be ready in time for AGI. Would you agree with that prioritization, or should we only work on approaches that might provide safety guarantees?