Research Engineering Intern at the Center for AI Safety. Helping to write the AI Safety Newsletter. Studying CS and Economics at the University of Southern California, and running an AI safety club there. Previously worked at AI Impacts and with Lionel Levine and Collin Burns on calibration for Detecting Latent Knowledge Without Supervision.
aogara
Thank you, this was very helpful. As a bright-eyed youngster, I find it hard to make sense of the bitterness and pessimism I often see in the field. I’ve read the old debates, but I didn’t participate in them, and that probably makes them easier to dismiss. Object-level arguments like these help me understand your point of view.
This is a great presentation of the compute-focused argument for short AI timelines usually given by the BioAnchors report. Comparing several ML systems to several biological brain sizes provides more data points than BioAnchors’ focus on only the human brain vs. TAI. You succinctly summarize the key arguments against your viewpoint: that compute growth could slow, that human brain algorithms are more efficient, that we’ll build narrow AI, and the outside-view economics perspective. While your ultimate conclusion on timelines isn’t directly implied by your model, that seems like a feature rather than a bug: BioAnchors offers false numerical precision given its fundamental assumptions.
All of these academics are widely read and cited. Looking at their Google Scholar profiles, every one of them has more than 1,000 citations, and half have more than 10,000. Outside of LessWrong, lots of people in academia and industry labs already read and understand their work. We shouldn’t disparage people who are successfully bringing AI safety into the mainstream ML community.
My report estimates that the amount of training data required to train a model with N parameters scales as N^0.8, based significantly on results from Kaplan et al. 2020. In 2022, the Chinchilla scaling result (Hoffmann et al. 2022) showed that instead the amount of data should scale linearly with N.
Are you concerned that pretrained language models might hit data constraints before TAI? Nostalgebraist estimates that there are roughly 3.2T tokens publicly available for language model pretraining. This estimate misses important potential data sources, such as transcripts of audio and video, private text conversations, and email. But the BioAnchors report estimates that a transformative model would require a median of 22T data points, nearly an order of magnitude more than this estimate.
The BioAnchors estimate was also based on older scaling laws that placed a lower priority on data relative to compute. With the new Chinchilla scaling laws, more data would be required for compute-optimal training. Of course, training runs don’t need to be compute-optimal: You can get away with using more compute and less data if you’re constrained by data, even if it’s going to cost more. And text isn’t the only data a transformative model could use: audio, video, and RLHF on diverse tasks all seem like good candidates.
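To make the difference concrete, here’s a rough back-of-the-envelope comparison (a minimal sketch with my own assumed model size and constants fit loosely to the papers’ headline numbers, illustrative only):

```python
# Compare data requirements under the two scaling regimes for a
# hypothetical 10T-parameter model. Constants are my own rough choices.
N = 10e12  # parameters (hypothetical transformative model size)

# Kaplan et al. 2020: data scales roughly as N^0.8. Constant chosen so the
# curve passes through Chinchilla's 70B-parameter / 1.4T-token point.
k = 1.4e12 / (70e9 ** 0.8)
kaplan_tokens = k * N ** 0.8   # ~7e13 tokens

# Hoffmann et al. 2022 (Chinchilla): data scales linearly with N,
# roughly 20 tokens per parameter.
chinchilla_tokens = 20 * N     # ~2e14 tokens

print(f"Kaplan-style estimate:     {kaplan_tokens:.1e} tokens")
print(f"Chinchilla-style estimate: {chinchilla_tokens:.1e} tokens")
# Both dwarf the ~3.2e12 publicly available tokens estimated above.
```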
Does the limited available public text data affect your views of how likely GPT-N is to be transformative? Are there any considerations overlooked here, or questions that could use a more thorough analysis? Curious about anybody else’s opinions, and thanks for sharing the update, I think it’s quite persuasive.
Unfortunately I don’t think academia will handle this by default. The current field of machine unlearning focuses on a narrow threat model where the goal is to eliminate the impact of individual training datapoints on the trained model. Here’s the NeurIPS 2023 Machine Unlearning Challenge task:
The challenge centers on the scenario in which an age predictor is built from face image data and, after training, a certain number of images must be forgotten to protect the privacy or rights of the individuals concerned.
But if hazardous knowledge can be pinpointed to individual training datapoints, perhaps you could simply remove those points from the dataset before training. The more difficult threat model involves removing hazardous knowledge that can be synthesized from many datapoints which are individually innocuous. For example, a model might learn to conduct cyberattacks or advise on the acquisition of biological weapons after being trained on textbooks about computer science and biology. It’s unclear to what extent this kind of hazardous knowledge can be removed without harming standard capabilities, but most of the current field of machine unlearning is not even working on this more ambitious problem.
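For reference, a common baseline in the narrow datapoint-removal setting looks roughly like this (a minimal sketch of gradient-ascent unlearning, assuming a PyTorch-style model and batches of (inputs, labels) tuples; this is my illustration, not the challenge’s method):

```python
def unlearn_step(model, optimizer, retain_batch, forget_batch, loss_fn, lam=1.0):
    """One step of a simple unlearning baseline: gradient descent on the loss
    for data we want to keep, gradient *ascent* on data we want to forget."""
    retain_x, retain_y = retain_batch
    forget_x, forget_y = forget_batch
    optimizer.zero_grad()
    loss = (loss_fn(model(retain_x), retain_y)
            - lam * loss_fn(model(forget_x), forget_y))
    loss.backward()
    optimizer.step()
```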
Wild. One important note is that the model is trained with labeled examples of successful performance on the target task, rather than learning the tasks from scratch by trial and error like MuZero and OpenAI Five. For example, here’s the training description for the DeepMind Lab tasks:
We collect data for 255 tasks from the DeepMind Lab, 254 of which are used during training, the left out task was used for out of distribution evaluation. Data is collected using an IMPALA (Espeholt et al., 2018) agent that has been trained jointly on a set of 18 procedurally generated training tasks. Data is collected by executing this agent on each of our 255 tasks, without further training.
Gato then achieves near-expert performance on >200 DM Lab tasks (see Figure 5). It’s unclear whether the model could have reached superhuman performance by training from scratch, and similarly unclear whether it could learn new tasks without examples of expert performance.
More broadly, this seems like substantial progress on both multimodal transformers and transformer-powered agents, two techniques that seem like they could contribute to rapid AI progress and risk. I don’t want to downplay the significance of these kinds of models and would be curious to hear other perspectives.
Kevin Esvelt explicitly calls for not releasing future model weights.
Would sharing future model weights give everyone an amoral biotech-expert tutor? Yes.
Therefore, let’s not.
Very nice, these arguments seem reasonable. I’d like to make a related point about how we might address deceptive alignment which makes me substantially more optimistic about the problem. (I’ve been meaning to write a full post on this, but this was a good impetus to make the case concisely.)
Conceptual interpretability in the vein of Collin Burns, Alex Turner, and Representation Engineering seems surprisingly close to allowing us to understand a model’s internal beliefs and detect deceptive alignment. Collin Burns’s work was very exciting to at least some people because it provided an unsupervised method for detecting a model’s beliefs. Collin’s explanation of his theory of impact is really helpful here. Broadly, because it allows us to understand a model’s beliefs without using any labels provided by humans, it should be able to detect deception in superhuman models where humans cannot provide accurate feedback.
Over the last year, there’s been a lot of research that meaningfully extends Collin’s work. I think this could be used to detect deceptively aligned models if they arise within the next few years, and I’d be really excited about more people working on it. Let me highlight just a few contributions:
Scott Emmons and Fabien Roger showed that the most important part of Collin’s method was contrast pairs. The original paper focused on “logical consistency properties of truth” such as P(X) + P(!X) = 1. While this is an interesting idea, its performance is hardly better than a much simpler strategy relegated to Appendix G.3: taking the average difference between a model’s activations at a particular layer for many contrast pairs of the form X and !X. Collin shows this direction empirically coincides with truthfulness.
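Concretely, the contrast-pair strategy is roughly the following (a minimal sketch with my own naming; `last_token_activation` is a hypothetical helper that reads the residual stream at a given layer, since the exact access pattern depends on the model implementation):

```python
import torch

def concept_direction(model, tokenizer, contrast_pairs, layer):
    """Estimate a concept direction as the mean difference in activations
    over contrast pairs, e.g. ("X is true.", "X is false.")."""
    diffs = []
    for pos_text, neg_text in contrast_pairs:
        a_pos = last_token_activation(model, tokenizer, pos_text, layer)
        a_neg = last_token_activation(model, tokenizer, neg_text, layer)
        diffs.append(a_pos - a_neg)
    direction = torch.stack(diffs).mean(dim=0)
    return direction / direction.norm()  # unit vector for later projections
```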
Alex Turner and his SERI MATS stream took seriously the idea that contrast pairs could reveal directions in a model’s latent space which correspond to concepts. They calculated a “cheese vector” in a maze-solving agent as the difference in the model’s activations between when the cheese was present and when it was missing. By adding and subtracting this vector to future forward passes of the model, its behavior could be controlled in surprising ways. GPT-2 can also be subtly controlled with this technique.
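Steering with such a vector can be as simple as adding it into the layer’s output on each forward pass (again a sketch: it assumes `model.layers[layer]` returns a raw hidden-state tensor, which varies across implementations):

```python
def steer(model, direction, layer, alpha=5.0):
    """Add alpha * direction to the layer's output on every forward pass.
    Returns the hook handle; call .remove() on it to stop steering."""
    def hook(module, inputs, output):
        return output + alpha * direction  # returned value replaces the output
    return model.layers[layer].register_forward_hook(hook)
```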
Representation Engineering starts with this idea and runs a large number of empirical experiments. It finds a concept of truthfulness that dramatically improves performance on TruthfulQA (36% to 54%), as well as concepts of power-seeking, honesty, and morality that can control the behavior of language agents.
These are all unsupervised methods for detecting model beliefs. They’ve empirically improved performance on many real world tasks today, and it seems possible that they’d soon be able to detect deceptive alignment.
(Without providing a full explanation, these two related papers (1, 2) are also interesting.)
Future work on this topic could attempt to disambiguate between directions in latent space corresponding to “human simulators” versus “direct translation” of the model’s beliefs. It could also examine whether these directions are robust to optimization pressure. For example, if you train a model to beat a lie detector test based on these methods, will the lie detector still work after many rounds of optimization? I’d also be excited about straightforward empirical extensions of these unsupervised techniques applied to standard ML benchmarks, as there are many ways to vary the methods and it’s unclear which variants would be the most effective.
Great post, thanks for sharing. Here’s my core concern about LeCun’s worldview, then two other thoughts:
The intrinsic cost module (IC) is where the basic behavioral nature of the agent is defined. It is where basic behaviors can be indirectly specified. For a robot, these terms would include obvious proprioceptive measurements corresponding to “pain”, “hunger”, and “instinctive fears”, measuring such things as external force overloads, dangerous electrical, chemical, or thermal environments, excessive power consumption, low levels of energy reserves in the power source, etc.
They may also include basic drives to help the agent learn basic skills or accomplish its missions. For example, a legged robot may comprise an intrinsic cost to drive it to stand up and walk. This may also include social drives such as seeking the company of humans, finding interactions with humans and praises from them rewarding, and finding their pain unpleasant (akin to empathy in social animals). Other intrinsic behavioral drives, such as curiosity, or taking actions that have an observable impact, may be included to maximize the diversity of situations with which the world model is trained (Gottlieb et al., 2013)
The IC can be seen as playing a role similar to that of the amygdala in the mammalian brain and similar structures in other vertebrates. To prevent a kind of behavioral collapse or an uncontrolled drift towards bad behaviors, the IC must be immutable and not subject to learning (nor to external modifications).
This is the paper’s treatment of the outer alignment problem. It says models should have basic drives and behaviors that are specified directly by humans and not trained. The paper doesn’t mention the challenges of reward specification or the potential for learning human preferences. It doesn’t discuss our normative systems or even the kinds of abstractions that humans care about. I don’t understand why he doesn’t see the challenges with specifying human values.
Most of the paper instead focuses on the challenges of building accurate, multimodal predictive world models. That work seems entirely necessary for continued AI progress, but the primary focus on predictive capabilities, and the downplaying of the challenges in learning human values, worries me.
If anybody has good sources about LeCun’s views on AI safety and value learning, I’d be interested.
success of model-free RL in complex video game environments like StarCraft and Dota 2
Do we expect model-free RL to succeed in domains where you can’t obtain enormous amounts of data through e.g. self-play? A purely predictive world model seems better able to exploit self-supervised predictive objectives, and to generalize to many possible goals using a single world model. (Not to mention the potential alignment benefits of a more modular system.) Is model-free RL simply a fluke that learns heuristics by playing games against itself, or are there reasons to believe it will succeed on more important tasks?
Since the whole architecture is trained end-to-end with gradient descent
I don’t think this is what he meant, though I might’ve missed something. The world model could be trained with the self-supervised objective functions of language and vision models, as well as perhaps large labeled datasets and games via self-play. On the other hand, the actor must learn to adapt to many different tasks very quickly, but could potentially use few-shot learning or fine-tuning to that end. The more natural architecture would seem to be modules that treat each other as black boxes and can be swapped out relatively easily.
This paper renames the decades-old method of semi-supervised learning as self-improvement. Semi-supervised learning does enable self-improvement, but not mentioning the hundreds of thousands of papers that previously used similar methods obscures the contribution of this paper.
Here’s a characteristic example of the field: Berthelot et al., 2019 train an image classifier on the sharpened average of its own predictions for multiple data augmentations of an unlabeled input. Another example is Cho et al., 2019, who train a language model to generate paraphrases of an input sequence by generating many candidate paraphrases and then selecting only those whose confidence passes a specified threshold. As wunan mentioned, AlphaFold is also semi-supervised. Google Scholar lists 174,000 results for semi-supervised learning.
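The generic recipe these papers share looks something like the following (a minimal self-training sketch in scikit-learn style; the confidence threshold and round count are placeholders, not any particular paper’s settings):

```python
import numpy as np

def self_train(model, X_lab, y_lab, X_unlab, threshold=0.9, rounds=5):
    """Iteratively pseudo-label unlabeled data the model is confident about,
    then retrain on the expanded labeled set."""
    for _ in range(rounds):
        model.fit(X_lab, y_lab)
        probs = model.predict_proba(X_unlab)
        confident = probs.max(axis=1) >= threshold  # keep high-confidence only
        X_lab = np.concatenate([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, probs[confident].argmax(axis=1)])
        X_unlab = X_unlab[~confident]
    return model
```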
The authors might protest that this isn’t semi-supervised learning because it’s training on explanations that are known to generate the correct answer as determined by ground truth labels, rather than training on high confidence answers of unknown ground truth status. That’s an important distinction, and I don’t know if anybody has used this particular method before. But this work seems much more similar to previous work on semi-supervised learning than they’re letting on, and I think it’s important to appreciate the context.
These professors all have many papers published in academic conferences. It’s probably a bit frustrating to have their work go unsummarized and then be asked to explain it, when all of it is already published. I would start by looking at their Google Scholar pages, followed by personal websites and maybe Twitter. One caveat: the papers probably don’t spell out the x-risk motivations or applications of the work, but that’s reading between the lines that AI safety people should be able to do themselves.
“OpenAI leadership tend to put more likelihood on slow takeoff”
Could you say more about the timelines of people at OpenAI? My impression was that they’re very short and explicitly include the possibility of scaling language models to AGI. If somebody builds AGI in the next 10 years, OpenAI seems like a leading candidate to do so. Would people at OpenAI generally agree with this?
This is great news. I particularly agree that legislators should pass new laws making it illegal to train AIs on copyrighted data without the consent of the copyright owner. This is beneficial from at least two perspectives:
If AI is likely to automate most human labor, then we need to build systems for redistributing wealth from AI providers to the rest of the world. One previous proposal is the robot tax, which would offset the harms of automation borne by manufacturing workers. Another popular idea is a Universal Basic Income. Following the same philosophy as these proposals, I think the creators of copyrighted material ought to be allowed to name their price for training AI systems on their data. This would distribute some AI profits to a larger group of people who contributed to the model’s capabilities, and it might slow or prevent automation in industries where workers organize to deny AI companies access to training data. In economic terms, automation would then only occur if the benefits to firms and consumers outweigh the costs to workers. This could reduce concentration of power via wealth inequality, and slow the takeoff speeds of GDP growth.
For anyone concerned about existential threats from AI, restricting the supply of training data could slow AI development, leaving more time for work on technical safety and governance which would reduce x-risk.
I think previous counterarguments against this position are fairly weak. Specifically, while I agree that foundation models which are pretrained to imitate a large corpus of human-generated data are safer in many respects than RL agents trained end-to-end, I think that foundation models are clearly the most promising paradigm over the next few years, and even with restrictions on training data I don’t think end-to-end RL training would quickly catch up.
OpenAI appears to lobby against these restrictions. This makes sense if you model OpenAI as profit-maximizing. Surprisingly to me, even OpenAI employees who are concerned about x-risk have opposed restrictions, writing “We hope that US policymakers will continue to allow this area of dramatic recent innovation to proceed without undue burdens from the copyright system.” I wonder if people concerned about AI risk may have been “captured” by industry on this particular issue, meaning that people have unquestioningly supported a policy because they trust the AI companies which endorse it, even though the policy might increase x-risk from AI development.
This is great context. With Eliezer being brought up in White House Press Corps meetings, it looks like a flood of people might soon enter the AI risk discourse. Tyler Cowen has been making some pretty bad arguments on AI lately, but I thought this quote was spot on:
“This may sound a little harsh, but the rationality community, EA movement, and the AGI arguers all need to radically expand the kinds of arguments they are able to process and deal with. By a lot. One of the most striking features of the “six-month Pause” plea was how intellectually limited and non-diverse — across fields — the signers were. Where were the clergy, the politicians, the historians, and so on? This should be a wake-up call, but so far it has not been.”
Nearly all reasonable discussions of AI x-risk have taken place in the peculiar cultural bubble of rationality and EA. These past efforts could be multiplied by new interest from mainstream folks in AI, policy, philosophy, economics, and other fields. Or they could be misunderstood and discarded in favor of distractions that claim the mantle of AI safety. Hopefully we can find ways of communicating with new people in other disciplines that will lead to productive conversations on AI x-risk.
Love the Box Contest idea. AI companies are already boxing models that could be dangerous, but they’ve done a terrible job of releasing the boxes and information about them. Some papers that used and discussed boxing:
Section 2.3 of OpenAI’s Codex paper. This model was allowed to execute code locally.
Section 2 and Appendix A of OpenAI’s WebGPT paper. This model was given access to the Internet.
Appendix A of DeepMind’s GopherCite paper. This model had access to the Internet, and the authors do not even mention the potential security risks of granting such access.
DeepMind again giving access to the Google API without discussing any potential risks.
The common defense is that current models are not capable enough to write good malware or interact with search APIs in unintended ways. That might well be true, but someday it won’t be, and there’s no excuse for setting a dangerous precedent. Future work will need to set boxing norms and build good boxing software. I’d be very interested to see follow-up work on this topic or to discuss with anyone who’s working on it.
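As a starting point, even a toy box makes the interface concrete. Here’s a minimal sketch, emphatically not a secure implementation: a real box would also need filesystem, network, and syscall isolation (e.g. containers or gVisor).

```python
import os
import subprocess
import tempfile

def run_boxed(code: str, timeout: float = 5.0) -> str:
    """Run model-generated Python in a separate process with a hard timeout
    and an empty environment (so no API keys or credentials leak in)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(
            ["python3", path],
            capture_output=True, text=True, timeout=timeout, env={},
        )
        return result.stdout + result.stderr
    except subprocess.TimeoutExpired:
        return "[terminated: exceeded timeout]"
    finally:
        os.remove(path)
```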
Turns out that this dataset contains little to no correlation between a researcher’s years of experience in the field and their HLMI timelines. The trendline shows a small positive correlation where more experienced researchers have longer timelines, the opposite of what you’d expect if everyone predicted AGI to arrive just as they retire.
My read of this survey is that most ML researchers haven’t updated significantly on the last five years of progress. I don’t think they’re particularly informed on forecasting and I’d be more inclined to trust the inside view arguments, but it’s still relevant information. It’s also worth noting that the median number of years until a 10% probability of HLMI is only 10 years, showing they believe HLMI is at least plausible on somewhat short timelines.
I’d suggest looking at this from a consequentialist perspective.
One of your questions was, “Should it also be illegal for people to learn from copyrighted material?” This seems to imply that whether a policy is good for AIs depends on whether it would be good for humans. It’s almost a Kantian perspective—“What would happen if we universalized this principle?” But I don’t think that’s a good heuristic for AI policy. For just one example, I don’t think AIs should be given constitutional rights, but humans clearly should.
My other comment explains why I think the consequences of restricting training data would be positive.
Instant classic. Putting it in our university group syllabus next to What Failure Looks Like. Sadly it could get lost in the recent LW tidal wave; someone should promote it to the Alignment Forum.
I’d love to see the most important types of work for each failure mode. Here’s my very quick version, any disagreements or additions are welcome:
Predictive model misuse—People use AI to do terrible things.
Adversarial Robustness: Train ChatGPT to withstand jailbreaking. When people try to trick it into doing something bad using a creative prompt, it shouldn’t work.
Anomaly Detection: A related but different approach. Instead of training the base model to give good responses to adversarial prompts, you can train a separate classifier to detect adversarial prompts and refuse to answer them / give a stock response from a separate model (see the sketch after this list).
Cybersecurity at AI companies: If someone steals the weights, they can misuse the model!
Regulation: Make it illegal to release models that help people do bad things.
Strict Liability: If someone does a bad thing with OpenAI’s model, make OpenAI legally responsible.
Communications: The open source community is often gleeful about projects like ChaosGPT. Is it possible to shift public opinion? Be careful: scolding can embolden them.
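Here’s the shape of the anomaly-detection gate mentioned above (a minimal sketch; `detector` and `llm` are hypothetical stand-ins for a trained prompt classifier and a base model, not real APIs):

```python
STOCK_RESPONSE = "Sorry, I can't help with that."

def guarded_generate(prompt: str) -> str:
    """Route prompts through a jailbreak classifier before the base model."""
    if detector.is_adversarial(prompt):  # flagged: never reaches the model
        return STOCK_RESPONSE
    return llm.generate(prompt)
```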
Predictive models playing dangerous characters
Haven’t thought about this enough. Maybe start with the new wave of prompting techniques for GPT agents: is there a way to make those agents less likely to go off the rails? Simulators and the Waluigi Effect might be useful frames to think through.
Scalable oversight failure without deceptive alignment
Scalable Oversight, of course! But I’m concerned about capabilities externalities here. Some of the most successful applications of IDA and AI driven feedback only make models more capable of pursuing goals, without telling them any more about the content of our goals. This research should seek to scale feedback about human values, not about generic capabilities.
Machine Ethics / understanding human values: The ETHICS benchmark evaluates whether models have the microfoundations of our moral decision making. Ideally, this will generalize better than heuristics that are specific to a particular situation, such as RLHF feedback on helpful chatbots. I’ve heard the objection that “AI will know what you want, it just won’t care”, but that doesn’t apply to this failure mode—we’re specifically concerned about poorly specifying our goals to the AI.
Regulation to ensure slow, responsible deployment. This scenario is much more dangerous if AI is autonomously making military, political, financial, and scientific decisions. More human oversight at critical junctures and a slower transformation into a strange, unknowable world means more opportunities to correct AI and remove it from high stakes deployment scenarios. The EU AI Act identifies eight high risk areas requiring further scrutiny, including biometrics, law enforcement, and management of critical infrastructure. What else should be on this list?
What else? Does ELK count? If descendants of Collin Burns’s work could tell us a model’s beliefs, could we train not on what would earn the highest reward, but on what a human would really want? Or is ELK only for spot-checking, such that training on its answers would negate its effectiveness?
Deceptive Alignment
Interpretability, ELK, and Trojans seem tractable and useful by my own inside view.
Some things I’m less familiar with that might be relevant: John Wentworth-style ontology identification, agent foundations, and shard theory. Correct me if I’m wrong here!
Recursive Self-Improvement --> hard takeoff
Slowing down. The FLI letter, ARC Evals, Yo Shavit’s compute governance paper, Ethan Caballero and others predicting emergent capabilities. Maybe we can stop at the edge of the cliff and wait there while we figure something out. I’m sure some would argue this is impossible.
P(Doom) for each scenario would also be useful, as well as further scenarios not discussed here.
Something I learned today that might be relevant: OpenAI was not the first organization to train transformer language models with search engine access to the internet. Facebook AI Research released their own paper on the topic six months before WebGPT came out, though it is surprisingly not cited by the WebGPT paper.
Generally I agree that hooking language models up to the internet is terrifying, despite the potential improvements for factual accuracy. Paul’s arguments seem more detailed on this and I’m not sure what I would think if I thought about them more. But the fact that OpenAI was following rather than leading the field would be some evidence against WebGPT accelerating timelines.
To test this claim we could look to China, where AI x-risk concerns are less popular and influential. China passed a regulation on deepfakes in January 2022 and one on recommendation algorithms in March 2022. This year, they passed a regulation on generative AI which requires evaluation of training data and red teaming of model outputs. Perhaps this final measure was the result of listening to ARC and other AI safety folks who popularized model evaluations, but more likely, red teaming and evaluations are simply the common-sense way for a government to prevent AI misbehavior.
The European Union’s AI Act was also created before any widespread recognition of AI x-risks.
On the other hand, I agree that key provisions in Biden’s executive order appear strongly influenced by AI x-risk concerns. I think it’s likely that without influence from people concerned about x-risk, the administration’s actions would more closely resemble the Blueprint for an AI Bill of Rights.
The lesson I draw is that there is plenty of appetite for AI regulation independent of x-risk concerns. But it’s important to make sure that regulation is effective, rather than blunt and untargeted.
Link to China’s red teaming standard — note that their definitions of misbehavior are quite different from yours, and they do not focus on catastrophic risks: https://twitter.com/mattsheehan88/status/1714001598383317459?s=46