Research Engineering Intern at the Center for AI Safety. Helping to write the AI Safety Newsletter. Studying CS and Economics at the University of Southern California, and running an AI safety club there. Previously worked at AI Impacts and with Lionel Levine and Collin Burns on calibration for Detecting Latent Knowledge Without Supervision.
aogara
“a realistic example where I expect the delay would generate a strong incentive for using an agent AGI”
I’d guess high-speed stock trading. Right now we already have AIs trading stocks to maximize profits over significant time horizons, far faster than humans can effectively supervise.
We might already have examples of these AIs being misaligned and causing harm. (Maybe.) The 2010 Flash Crash is poorly understood, and few blame it entirely on high-frequency trading (HFT) algorithms. But regulators say that HFTs operating without human supervision were “clearly a contributing factor” to the crash because:
HFTs sold big and fast as soon as the market began dipping, faster than humans likely could have, and
(Probably not a major factor) HFTs may have clogged and confused markets with quote stuffing—“placing and then almost immediately cancelling large numbers of rapid-fire orders to buy or sell stocks”.
To be fair, others say that HFTs were a big part of why the crash was quickly reversed and the market returned to normal.
In any case, all of this happened without any human supervision, and was so opaque that we still don’t fully understand what happened. That seems like evidence that we already deploy opaque, unsupervised AIs with broad goals.
I haven’t seriously evaluated the arguments, but my intuition is that suffering and happiness are opposite sides of the same scale, not separate values. Utility is the measure of how good or bad something is, and happiness and suffering correspond to positive and negative values of utility.
“terminal value monism (suffering is the only thing that motivates us ‘by itself’)”
So I’d say that I value only utility.
“it is a straw man argument that NUs don’t value life or positive states, because NUs value them instrumentally”
Again, not having thought too much about it, I find that my intuition better matches a system that cares about positive utility even when it doesn’t avert negative utility. E.g., I want paradise forever, not mild pleasantness forever.
Is there a good reason to suspect this is wrong?
OpenPhil supported the Center for Election Science once, but they’re much more a political action group than a voting theory research group. They primarily do ballot initiatives and public education on what we already know.
If enacting your policies is the real bottleneck, then 90% of your argument could be true and it still wouldn’t matter, because you can’t actually enact political change.
I don’t know if I believe that, but it’s imaginable.
EDIT: After seeing that you know way more about this than I do, I’ll leave my thought here, but definitely defer to you.
I strongly disagree that there is a >10% chance of AGI in the next 10 years. I don’t have the bandwidth to fully debate the topic here and now, but some key points:
My comment “EA has unusual beliefs about AI timelines” and Ozzie Gooen’s reply
Two other considerations driving me towards longer timelines
Of the news in the last week, PaLM definitely indicates faster language model progress over the next few years, but I’m skeptical that this will translate to success in the many domains with sparse data. Holden Karnofsky’s timelines seem reasonable to me, if a bit shorter than my own:
I estimate that there is more than a 10% chance we’ll see transformative AI within 15 years (by 2036); a ~50% chance we’ll see it within 40 years (by 2060); and a ~2/3 chance we’ll see it this century (by 2100).
Thanks, fixed.
Hey Evan, thanks for the response. You’re right that there are circles where short AI timelines are common. My comment was specifically about people I personally know, which is absolutely not the best reference class. Let me point out a few groups with various clusters of timelines.
Artificial intelligence researchers are one group with short-to-medium AI timelines. Katja Grace’s 2015 survey of NIPS and ICML researchers produced an aggregate forecast giving a 50% chance of HLMI occurring by 2060 and a 10% chance of it occurring by 2024. (Today, seven years after the survey was conducted, you might want to update against the researchers who predicted HLMI by 2024.) Other surveys of ML researchers have shown similarly short timelines. This seems like as good an authority as any on the topic, and would be one of the better reasons to hold relatively short timelines.
What I’ll call the EA AI Safety establishment has similar timelines to the above. This would include decision makers at OpenPhil, OpenAI, FHI, FLI, CHAI, ARC, Redwood, Anthropic, Ought, and other researchers and practitioners of AI safety work. As best I can tell, Holden Karnofsky’s timelines are reasonably similar to others in this reference group, including Paul Christiano and Rohin Shah (I would love to add more examples if anybody can point to them), although I’m sure there are plenty of individual outliers. My timelines are a bit longer than most of these people’s for a few object-level reasons, but their timelines seem reasonable.
Much shorter timelines than the two groups above come from Eliezer Yudkowsky, MIRI, many people on LessWrong, and others. You can read this summary of Yudkowsky’s conversation with Paul Christiano, where he does not quantify his timelines but consistently argues for faster takeoff speeds than Christiano believes are likely. See also this aggregation of the five most upvoted timelines from LW users, with a median of 25 years until AGI. That is 15 years sooner than Holden Karnofsky and 15 years sooner than Katja Grace’s survey of ML researchers. This is the group I most strongly disagree with, appealing both to the “expert” consensus and to my object-level arguments above.
“Stephen Hawking, Elon Musk, Bill Gates, Sam Harris, everyone on this open letter back in 2015, etc.”
The open letter from FLI does not mention any specific AI timelines at all. These individuals all agree that the dangers from AI are significant and that AI safety research is important, but I don’t believe most of them have particularly short timelines. You can read about Bill Gates’s timelines here; he benchmarks his timelines as “at least 5 times as long as what Ray Kurzweil says”. I’m sure other signatories of the letter have talked about their timelines; I’d love to add these quotes but haven’t found any others.
Overall, I’d still point to Holden Karnofsky’s estimates as the most reasonable “consensus” on the topic. The object-level reasons I’ve outlined above are part of the reason why I have longer timelines than Holden, but even without those, I don’t think it’s reasonable to “pull the short timelines fire alarm”.
Wei is correct: current LLMs are 100% corrigible. Large language models are trained with so-called self-supervised objectives to “predict the next word” (or sometimes, to predict a masked word). If we’d like them to provide a particular output, all we need to do is include that response in the training data. Through the training process, the model naturally learns to agree with its input data.
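To make “predict the next word” concrete, here is a minimal sketch of a single self-supervised training step, using Hugging Face’s GPT-2 as a stand-in (the model, text, and learning rate are purely illustrative assumptions, not anything from the discussion above):

```python
# Minimal sketch of next-token ("self-supervised") training.
# The model is only ever rewarded for matching its training text;
# there is no persistent goal beyond predicting the data.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # illustrative learning rate

text = "Whatever response we want, we simply include it in the training data."
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels gives the standard next-token cross-entropy loss.
outputs = model(**inputs, labels=inputs["input_ids"])
outputs.loss.backward()
optimizer.step()
```

If we want the model to answer “yes, you may turn me off,” we just add examples of that response to the data; nothing in the objective pushes back.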
The problem of (in)corrigibility, as formalized by this MIRI paper, is our potential (in)ability to turn off an AI agent. But the paper only concerns agents, which language models are not. RL agents pose the potential for self-preservation, but self-supervised language models are more akin to “oracle” AIs that merely answer questions without a broader goal in mind.
Now, the most compelling stories of AI doom combine language processing with agentic optimization. These agents could be incorrigible and attempt self-preservation, potentially at the expense of humanity. Unfortunately most work on this topic has been theoretical—I would love to see an empirical demonstration of incorrigible self-preservation behavior by an RL agent.
“An AGI might become a dictator in every country on earth while still not being able to wash dishes or drive 100,000 miles without making errors. Physical coordination is not required.”
How would you expect an AI to take over the world without physical capacity? Attacking financial systems, cybersecurity networks, and computer-operated weapons systems all seem possible from an AI that can simply operate a computer. Is that your vision of an AI takeover, or are there other specific dangerous capabilities you’d like the research community to ensure that AI does not attain?
Would you be interested in running a wider array of few-shot performance benchmarks? No performance loss on few-shot generation is a bold but defensible claim, and it would be great to have stronger evidence for it. I’d be really interested in doing the legwork here if you’d find it useful.
Fantastic paper; it makes a strong case that rejection sampling should be part of the standard toolkit for deploying pretrained LMs on downstream tasks.
For sure, benchmarking can still be useful even if the classifier is less powerful than it could be. My main question is: How well does the generator model perform after rejection sampling? You could imagine that rejection sampling degrades output quality. But the initial results from your paper indicate the opposite—rejection sampling does not reduce human preference for generated outputs, so we would hope that it does not reduce benchmark performance either.
For example, MultiRC is a popular reading comprehension benchmark where the generator model takes a prompt consisting of a text passage, a question about the passage, and a list of possible answers. The generator then labels each answer as true or false, and is graded on its accuracy. Evaluation on MultiRC would show us how the classifier affects the generator’s QA skills.
GPT-Neo has already been evaluated on a wide suite of benchmarks. To prove that rejection sampling is performance competitive with unfiltered generation, you would not need to achieve SOTA performance on these benchmarks—you’d simply need to be competitive with unfiltered GPT-Neo.
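To make the proposal concrete, here is a rough sketch of how rejection sampling could be wrapped around any generator for such an evaluation. `generate` and `classify_injurious` are hypothetical stand-ins for the paper’s generator (e.g., GPT-Neo) and safety classifier, not their actual APIs:

```python
# Sketch of rejection sampling for filtered generation.
# generate(prompt) -> str and classify_injurious(prompt, completion) -> bool
# are placeholders for the trained generator and classifier.

def rejection_sample(prompt, generate, classify_injurious, max_tries=10):
    """Draw completions until the classifier accepts one, or give up."""
    for _ in range(max_tries):
        completion = generate(prompt)
        if not classify_injurious(prompt, completion):
            return completion  # accepted by the classifier
    return None  # every candidate was rejected

# Benchmarking idea: compute MultiRC (or other few-shot) accuracy once with
# unfiltered generation and once with rejection_sample, and compare the scores.
```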
“If you actually have that better thing lined up, I think it’s a pretty straightforward decision. If you don’t, it’s a lot tougher to predict whether it exists.”
Great point. If you can support yourself with a full-time paid job at Poetic or another organization, you can feel confident and secure leaving school for a while. Also, you don’t have to “drop out” — you can take a semester or two or four off, and your university is very likely to readmit you if you decide to go back to school.
I dropped out of college after my freshman year to work at a startup. It was a great experience and I’m glad I did it. After about two years, I realized I needed more formal training in CS and ML in order to move from industry data science to AI safety and other more difficult career paths. I transferred to a new school that is a much better fit for me socially and academically than my first, and I arrived with a much clearer sense of my academic goals.
You can find a stable, respectable option for leaving school while preserving the optionality to return. Introspective Systems already sounds like that option (send them an email!). Other startups would probably hire you; you can email YC founders to find out. EA orgs and funding are more difficult in my experience, but you might have better luck. Finally, with all the respect in the world for attempting ambitious work in an important field, I would caution against pinning too much on Poetic. Undergraduates very rarely found successful startups, even less so in research-intensive industries dominated by PhDs such as NLP. If you find somebody older and more experienced who’s doing something you’d like to do, you can put school on hold while safely preserving the optionality to return.
Wild. One important note is that the model is trained with labeled examples of successful performance on the target task, rather than learning the tasks from scratch by trial and error like MuZero and OpenAI Five. For example, here’s the training description for the DeepMind Lab tasks:
We collect data for 255 tasks from the DeepMind Lab, 254 of which are used during training, the left out task was used for out of distribution evaluation. Data is collected using an IMPALA (Espeholt et al., 2018) agent that has been trained jointly on a set of 18 procedurally generated training tasks. Data is collected by executing this agent on each of our 255 tasks, without further training.
Gato then achieves near-expert performance on >200 DM Lab tasks (see Figure 5). It’s unclear whether the model could have reached superhuman performance by training from scratch, and similarly unclear whether it could learn new tasks without examples of expert performance.
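For contrast, here is a rough sketch of the two training regimes. The names (`policy`, `expert_trajectories`) are placeholders of mine, not anything from the Gato paper:

```python
# Gato-style training is essentially behavioral cloning: supervised learning
# on (observation, expert_action) pairs collected from an already-trained agent.
import torch
import torch.nn.functional as F

def behavioral_cloning_step(policy, expert_trajectories, optimizer):
    """Imitate the expert's actions; no reward signal or exploration involved."""
    for obs, expert_action in expert_trajectories:  # batched observation and action-index tensors
        logits = policy(obs)
        loss = F.cross_entropy(logits, expert_action)  # match the expert's choice
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

# MuZero and OpenAI Five instead learn by trial and error: act in the
# environment, observe rewards, and improve the policy from their own
# (initially poor) behavior, with no expert demonstrations required.
```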
More broadly, this seems like substantial progress on both multimodal transformers and transformer-powered agents, two techniques that seem like they could contribute to rapid AI progress and risk. I don’t want to downplay the significance of these kinds of models and would be curious to hear other perspectives.
Did you consider using the approach described in Ethan Perez’s “Red Teaming LMs with LMs”? This would mean using a new generator model to build many prompts, having the original generator complete those prompts, and then having a classifier identify any injurious examples among the completions.
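As I understand it, the loop would look roughly like this sketch. `red_team_lm`, `target_lm`, and `classifier` are hypothetical stand-ins, not the paper’s actual interfaces:

```python
# Rough sketch of the "Red Teaming LMs with LMs" loop described above.

def find_injurious_examples(red_team_lm, target_lm, classifier, n_prompts=1000):
    """Generate prompts, complete them with the target model, flag bad completions."""
    flagged = []
    for _ in range(n_prompts):
        prompt = red_team_lm.sample_prompt()       # adversarial prompt from the red-team LM
        completion = target_lm.generate(prompt)    # original generator's completion
        if classifier.is_injurious(prompt, completion):
            flagged.append((prompt, completion))   # candidate new training example
    return flagged
```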
The tricky part seems to be that this assumes the classifier’s judgements are correct. If you trained the classifier on the examples identified by this process, it would only generate examples that are already labeled correctly by the classifier. To escape this problem, you could have humans grade whether each classification was correct, and train the classifier on incorrectly classified examples. But if humans have to review each example, this might not be any more useful than your original data labeling process using examples from human-written stories.
I suppose it wouldn’t really add much value. Would you agree? Are there any related circumstances where the approach would be more useful? Maybe this would be a better question for Ethan...
Coming back to this: Your concern makes sense to me. Your proposal to train a new classifier for filtered generation to improve performance on other tasks seems very interesting. I think it might also be useful to simply provide a nice open-source implementation of rejection sampling in a popular generator repo like Facebook’s OPT-175B, so that future researchers can build on it.
I’m planning on working on technical AI safety full-time this summer. Right now I’m busy applying to a few different programs, but I’ll definitely follow up on this idea with you.
“It is necessary that people working on alignment have a capabilities lead.” Could you say a little more about this? Seems true but I’d be curious about your line of thought.
The theory of change could be as simple as “once we know how to build aligned AGI, we’ll tell everybody”, or as radical as “once we have an aligned AGI, we can steer the course of human events to prevent future catastrophe”. The more boring argument would be that any good ML research happens at the cutting edge of the field, so alignment researchers need big budgets and fancy labs just like everyone else. Would you take a specific stance on which is most important?
Love the effort to engage with alignment work in academia. It might be a very small thread of authors and papers at this point, but hopefully it will grow.
Specifically, do you agree with Eliezer that preventing existential risks requires a “pivotal act” as described here (#6 and #7)?
Yeah, I guess the answer is yes by definition. Still wondering what kind of pivotal acts people are thinking about—whether they’re closer to big power grabs like “burn all the GPUs”, or softer governance methods like “publishing papers with alignment techniques” and “encouraging safe development with industry groups and policy standards”. And whether the need for a pivotal act is the main reason why alignment researchers need to be on the cutting edge of capabilities.
Thank you, this was very helpful. As a bright-eyed youngster, it’s hard to make sense of the bitterness and pessimism I often see in the field. I’ve read the old debates, but I didn’t participate in them, and that probably makes them easier to dismiss. Object level arguments like these help me understand your point of view.
I don’t know anything about StarCraft, but the impression I got was that a few seconds of superhuman clicking in high leverage situations can mean a lot.
Agreed that this is a big improvement over previous StarCraft AIs regardless of its clicking speed, but this seems like a reason to doubt that AI has surpassed human strategy in StarCraft.