Stanford AI Alignment president, technical AI governance research, AIS field building, and animal welfare.
Gabriel Mukobi
Hi Jon! Yeah, that’s an interesting example, and I can confirm that when writing this distillation one of the hardest parts was coming up with a clear example that could use IDA. One reason to think amplification might apply to scientific development is that a lot of scientific advancements seem to have come from clever intuitions and novel ideas. That is, while one scientist is pretty unlikely to get the “Eureka” insight that would lead to e.g. general relativity, 20 scientists collectively have a much higher chance that at least one of them comes up with a good idea, and 1000 scientists an even higher chance (taken to an extreme, you might imagine all of scientific progress on Earth so far has been a bunch of scientists vibing, with one of them every so often reaching a useful insight). Scientific progress generally seems to be iterative anyway, so an IDA-amplified PASTA AGI could theoretically spin up a bunch of randomly perturbed versions of itself to work on scientific problems until one comes up with a uniquely good insight, which could then be distilled back down so the model becomes more creative and efficient at generating future insights.
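To make that loop a little more concrete, here’s a minimal toy sketch of the amplify-then-distill cycle I’m describing. Everything in it (the functions, the numbers, treating the “model” as a single float) is a made-up placeholder for illustration, not anything from the IDA literature:

```python
import random

# Purely toy stand-ins so the loop below actually runs; a real system would
# swap these for an actual model, perturbation scheme, and training procedure.
def perturb(model):
    return model + random.gauss(0, 0.1)          # "randomly perturbed versions of itself"

def propose_insight(model):
    return model + random.gauss(0, 1.0)           # each copy takes a shot at the problem

def score(insight):
    return insight                                # pretend bigger = better insight

def distill(model, target):
    return 0.5 * (model + target)                 # fold the best idea back into the base model

def amplify_and_distill(model=0.0, n_copies=20, n_rounds=5):
    for _ in range(n_rounds):
        copies = [perturb(model) for _ in range(n_copies)]            # amplification
        best = max((propose_insight(c) for c in copies), key=score)   # keep the rare good idea
        model = distill(model, target=best)                           # distillation
    return model

print(amplify_and_distill())
```

The point is just the structure: many cheap parallel attempts, keep the best one, then compress it back into a single faster model for the next round.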
I’d count that: reducing the number of things you have to worry about seems like a productivity improvement!
Thanks for posting this! I’m glad to have more concrete and updated examples of how current AI systems can lead to failures, a concept that often seems nebulous to people new to AI safety.
Thanks for actually taking the time to organize all the information here, this is and will be very useful!
For OpenAI, you could also link this recent blog post about their approach to alignment research that reinforces the ideas you already gathered. Though maybe that blog post doesn’t go into enough detail or engage with those ideas critically and you’ve already read it and decided to leave it out?
Thanks, yeah, that’s a pretty fair sentiment. I’ve changed the wording to “at least 100-200 hours,” but I guess the idea was more to present a very efficient way of learning things that maybe 80/20’s some of the material. This does mean there will be more to learn: rather than these being strictly linear progression levels, I imagine someone continuously coming back to AI safety readings and software/ML engineering skills throughout their journey, as it sounds like you have.
Interesting, that is the level that feels most like it doesn’t have a solid place in a linear progression of skills. I wrote “Level 1 kind of happens all the time” to try to reflect this, but I ultimately decided to put it at the start because I feel that for people just starting out it can be a good way to test their fit for AI safety broadly (do they buy the arguments?) and decide whether they want to go down a more theoretical or empirical path. I just added some language to Level 1 to clarify this.
Mostly, yes, that’s right. The exception is Level 7: Original Experiments, which suggests several resources for forming an inside view and coming up with new research directions, but I think many people could get hired as research engineers before doing that stuff (though maybe they do that stuff while working as a research engineer, and that leads them to come up with new, better research directions).
Wow, this is a cool concept and video, thanks for making it! As a new person to the field, I’d be really excited for you and other AI safety researchers to do more devlog/livestream content of the form “strap a GoPro on me while I do research!”
I agree that the plausibility and economic competitiveness of long-term planning AIs seem uncertain (especially with chaotic systems) and warrant more investigation, so I’m glad you posted this! I also agree that trying to find ways to incentivize AI to pursue myopic goals generally seems good.
I’m somewhat less confident, however, in the claim that long-term planning has diminishing returns beyond human ability. Intuitively, it seems like human understanding of possible long-term returns diminishes past human ability, but it still seems plausible to me that AI systems could surpass our diminishing returns in this regard. And even if this claim is true and AI systems can’t get much further than human ability at long-term planning (or medium-term planning is what performs best, as you suggest), I still think that’s sufficient for large-scale deception and power-seeking behavior (e.g. many human AI safety researchers have written about plausible ways in which AIs could slowly manipulate society, and the strategies they describe are human-understandable but still seem somewhat likely to succeed).
I’m also skeptical of the claim that “Future humans will have at their disposal the assistance of short-term AIs.” While it’s true that past ML training has often focused on short-term objectives, I think it’s plausible that certain top AI labs could be incentivized to focus on developing long-term planning AIs (such as in this recent Meta AI paper) which could push long-term AI capabilities ahead of short-term AI capabilities.
I don’t know much about how CEOs are selected, but I think the idea is rather that even the (small) tails of the roughly normal distribution of human long-term planning ability sit pretty close together in the grand picture of possible long-term planning abilities, so other factors (including stochasticity) can dominate and make the variation among humans wrt long-term planning seem insignificant.
If this were true, it would mean the statement “individual humans with much greater than average (on the human scale) information-processing capabilities empirically don’t seem to have distinct advantages in jobs such as CEOs and leaders” could be true and yet not preclude the statement “agents with much greater than average (on the universal scale) … could have distinct advantages in those jobs” from being true (sorry if that was confusingly worded).
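As a quick numeric illustration of that first claim (assuming, as a simplification, that long-term planning ability really is roughly normally distributed among humans):

```python
from scipy.stats import norm

# Even extreme human outliers sit only a handful of standard deviations
# above the mean, so the whole species occupies a fairly narrow band.
for p in [0.99, 0.999999, 1 - 1e-9]:
    print(f"top {1 - p:.0e} of humans is ~{norm.ppf(p):.1f} SD above average")
```

A 1-in-a-billion planner is only about 6 SD above the mean, which is the kind of gap that other factors could easily swamp on a scale where far more capable agents are possible.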
Congrats all, it seems like you were wildly successful in just 1 semester of this new strategy!
I have a couple of questions:
130 in 13 weekly reading groups
That works out to about 10 people per group, which feels like a lot and maybe contributed to the high drop rate. Do you think this size was ideal?
Ran two retreats, with a total of 85 unique attendees
These seem like huge retreats compared to other university EA retreats, at least, and more like mini-conferences. Was this the right size, or do you think they would have been more valuable as smaller, more selective events where the participants perhaps got to know each other better?
two weekly AI governance fellowships with 15 initial and 14 continuing participants.
This retention rate seems very high, though I imagine maybe these were mostly people already into AI gov and not representative of what a scaled-up cohort would look like. Do you plan to also expand AI governance outreach/programming next term?
Overall, I’m really glad you’re doing all these things and paving the way for others to follow; we’ll seek to replicate some of your success at Stanford :)
I’m still a bit confused. Section 5.4 says
the RLHF model expresses a lower willingness to have its objective changed the more different the objective is from the original objective (being Helpful, Harmless, and Honest; HHH)
but the graph seems to show the RLHF model as being the most corrigible for the more-HHH and neutral objectives, which seems somewhat important but isn’t mentioned.
If the point was that the corrigibility of the RLHF model changes the most from the neutral to the less-HHH questions, then it looks like it changed considerably less than the PM, which became quite incorrigible, no?
Maybe the intended meaning of that quote was that the RLHF model dropped more in corrigibility only in comparison to the normal LM, or just that it’s lower overall without comparing it to any other model, but if so, that felt a bit unclear to me.
Big if true! Maybe one upside here is that it shows current LLMs can definitely help with at least some parts of AI safety research, and we should probably be using them more for generating and analyzing data.
...now I’m wondering if the generator model and the evaluator model tried to coordinate😅
Re Evan R. Murphy’s comment about confusingness: “model agrees with that more” definitely clarifies it, but I wonder if Evan was expecting something like “more right is more of the scary thing” for each metric (which was my first glance hypothesis).
I’m not sure what you particularly mean by trustworthy. If you mean a place with good attitudes and practices towards existential AI safety, then I’m not sure HF has demonstrated that.
If you mean a company I can instrumentally trust to build and host tools that make it easy to work with large transformer models, then yes, it seems like HF pretty much has a monopoly on that for the moment, and it’s worth using their tools for a lot of empirical AI safety research.
This is very interesting, thanks for this work!
A clarification I may have missed from your previous posts: what exactly does “attention QKV weight matrix” mean? Is that the concatenation of the Q, K, and V projection matrices, their sum, or something else?
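For what it’s worth, my default guess (just my assumption, not something stated in the post) would be the fused layout used in e.g. GPT-2, where a single weight of shape [d_model, 3*d_model] is the concatenation of the Q, K, and V projections; roughly:

```python
import torch

d_model = 768  # e.g. GPT-2 small

# "Concatenation" interpretation: one fused matrix whose column blocks are the
# Q, K, and V projections (as in GPT-2's c_attn weight).
w_qkv = torch.randn(d_model, 3 * d_model)
w_q, w_k, w_v = w_qkv.split(d_model, dim=1)

x = torch.randn(1, 10, d_model)          # (batch, seq, d_model)
q, k, v = x @ w_q, x @ w_k, x @ w_v      # same as computing x @ w_qkv and then splitting
```

A sum of the three matrices would instead have shape [d_model, d_model], so the two interpretations should be easy to tell apart from the reported shapes.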
This does feel pretty vague in parts (e.g. “mitigating goal misgeneralization” feels more like a problem statement than a component of research), but I personally think this is a pretty good plan, and at the least, I’m very appreciative of you posting your plan publicly!
Now, we just need public alignment plans from Anthropic, Google Brain, Meta, Adept, …
Thanks for writing this, Zac and team, I personally appreciate more transparency about AGI labs’ safety plans!
Something I find myself confused about is how Anthropic should draw the line between what to publish or not. This post seems to draw the line along “The Three Types of AI Research at Anthropic” (you generally don’t publish capabilities research but do publish the rest), but I wonder if there’s a case for being more nuanced than that.
To get more concrete, the post brings up how “the AI safety community often debates whether the development of RLHF – which also generates economic value – ‘really’ was safety research” and says that Anthropic thinks it was. However, the post also states Anthropic “decided to prioritize using [Claude] for safety research rather than public deployments” in the spring of 2022. To me, this feels slightly dissonant: insofar as much of the safety and capabilities benefit of Claude came from RLHF/RLAIF (and perhaps more data, though that’s arguably less infohazardous), this seems like Anthropic decided not to publish “Alignment Capabilities” research for (IMO justified) fear of negative externalities. Perhaps the boundaries around what should generally not be published should then also extend to some Alignment Capabilities research, especially research like RLHF that might have higher capabilities externalities.
I’ve also been thinking that more interpretability research should maybe be kept private. As a recent concrete example, Stanford’s Hazy Research lab just published Hyena, a convolutional architecture that’s meant to rival transformers by scaling to much longer context lengths while taking significantly fewer FLOPs to train. This is clearly very much public research in the “AI Capabilities” camp, but they cite several results from Transformer Circuits to motivate the theory and design choices behind their new architecture, and say “This work wouldn’t have been possible without inspiring progress on … mechanistic interpretability.” That’s all to say that some of the “Alignment Science” research might also be useful as ML theory research and then motivate advances in AI capabilities.
I’m curious if you have thoughts on these cases, and how Anthropic might draw a more cautious line around what they choose to publish.
Disclaimer: first post/distillation and somewhat new to alignment, so I may have gotten some things wrong.
Calling this a “re-explanation” because that makes a little more sense to me than “distillation” and I plan to do a series of regular re-explanations over the next year-ish.
Feedback is much appreciated!