Stanford AI Alignment president, technical AI governance research, AIS field building, and animal welfare.
Gabriel Mukobi
Iterated Distillation-Amplification, Gato, and Proto-AGI [Re-Explained]
Hi Jon! Yeah, that’s an interesting example, and I can confirm that when writing this distillation one of the hardest parts was coming up with a clear example that could use IDA. One idea for why amplification might apply to scientific development is that a lot of scientific advancements seem to have come from clever intuitions and novel ideas. That is, while one scientist is pretty unlikely to get the “Eureka” insight that would lead to e.g. general relativity, 20 scientists collectively have a much higher chance that at least one of them comes up with a good idea, and 1000 scientists an even higher chance (taken to an extreme, you might imagine all of scientific progress on Earth so far as a bunch of scientists vibing, and every so often one of them reaches a useful insight). Scientific progress generally seems to be iterative anyway, so an IDA-amplified PASTA AGI could theoretically spin up a bunch of randomly perturbed versions of itself to work on scientific problems until one comes up with a uniquely good insight, which could then be distilled back into the base model to make it more creative and efficient at generating future insights.
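(To make the shape of that loop concrete, here’s a toy, purely illustrative Python sketch where “model skill” is just a number and each perturbed copy’s insight quality is that number plus noise. None of this is a real IDA implementation or anyone’s actual method; it’s just my rough picture of the amplify-then-distill ratchet.)

```python
import random

def amplify(model_skill: float, n_copies: int = 20) -> float:
    """Amplification (toy): run many slightly perturbed copies of the model
    on a problem and keep the best insight any single copy stumbles into."""
    insights = [model_skill + random.gauss(0, 1) for _ in range(n_copies)]
    return max(insights)  # the one lucky "Eureka"-style insight

def distill(model_skill: float, best_insight: float, lr: float = 0.5) -> float:
    """Distillation (toy): cheaply move the base model toward the amplified
    system's best output, so future attempts start from a higher baseline."""
    return model_skill + lr * (best_insight - model_skill)

def iterate(model_skill: float = 0.0, rounds: int = 10) -> float:
    """Iterated distillation and amplification: amplify, distill, repeat."""
    for _ in range(rounds):
        model_skill = distill(model_skill, amplify(model_skill))
    return model_skill

print(iterate())  # skill ratchets upward over successive rounds
```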
[Question] Favourite new AI productivity tools?
I’d count that; reducing the number of things you have to worry about seems like a productivity improvement!
The Tree of Life: Stanford AI Alignment Theory of Change
Thanks for posting this! I’m glad to have more concrete and updated examples of how current AI systems can lead to failure, a concept that often seems nebulous to people new to AI safety.
Thanks for actually taking the time to organize all the information here; this is and will be very useful!
For OpenAI, you could also link this recent blog post about their approach to alignment research, which reinforces the ideas you already gathered. Though maybe that blog post doesn’t go into enough detail or engage critically with those ideas, and you’ve already read it and decided to leave it out?
Levelling Up in AI Safety Research Engineering
Thanks, yeah that’s a pretty fair sentiment. I’ve changed the wording to “at least 100-200 hours,” but I guess the idea was more to present a very efficient way of learning things that maybe 80/20’s some of the material. This does mean there will be more to learn: rather than these being strictly linear progression levels, I imagine someone continuously coming back to AI safety readings and software/ML engineering skills often throughout their journey, as it sounds like you have.
Interesting; that is the level that feels most like it doesn’t have a solid place in a linear progression of skills. I wrote “Level 1 kind of happens all the time” to try to reflect this, but I ultimately decided to put it at the start because I feel that for people just starting out it can be a good way to test their fit for AI safety broadly (do they buy the arguments?) and decide whether they want to go down a more theoretical or empirical path. I just added some language to Level 1 to clarify this.
Mostly, yes, that’s right. The exception is Level 7: Original Experiments, which suggests several resources for forming an inside view and coming up with new research directions, but I think many people could get hired as research engineers before doing that stuff (though maybe they do it while working as a research engineer, and that leads them to come up with new, better research directions).
Wow, this is a cool concept and video, thanks for making it! As someone new to the field, I’d be really excited for you and other AI safety researchers to do more devlog/livestream content of the form “strap a GoPro on me while I do research!”
I agree that the plausibility and economic competitiveness of long-term planning AIs seems uncertain (especially with chaotic systems) and warrants more investigation, so I’m glad you posted this! I also agree that trying to find ways to incentivize AI to pursue myopic goals generally seems good.
I’m somewhat less confident, however, in the claim that long-term planning has diminishing returns beyond human ability. Intuitively, it seems like human understanding of possible long-term returns diminishes past human ability, but it still seems plausible to me that AI systems could surpass our diminishing returns in this regard. And even if this claim is true and AI systems can’t get much further than human ability at long-term planning (or medium-term planning is what performs best, as you suggest), I still think that’s sufficient for large-scale deception and power-seeking behavior (e.g. many human AI safety researchers have written about plausible ways in which AIs could slowly manipulate society, and the strategies they describe are human-understandable but still seem somewhat likely to succeed).
I’m also skeptical of the claim that “Future humans will have at their disposal the assistance of short-term AIs.” While it’s true that past ML training has often focused on short-term objectives, I think it’s plausible that certain top AI labs could be incentivized to focus on developing long-term planning AIs (such as in this recent Meta AI paper) which could push long-term AI capabilities ahead of short-term AI capabilities.
I don’t know much about how CEOs are selected, but I think the idea is rather that the range of even the (small) tails of normally-distributed human long-term planning ability is pretty close together in the grand picture of possible long-term planning abilities, so other factors (including stochasticity) can dominate and make the variation among humans wrt long-term planning seem insignificant.
If this were true, it would mean the statement “individual humans with much greater than average (on the human scale) information-processing capabilities empirically don’t seem to have distinct advantages in jobs such as CEOs and leaders” could be true without precluding the statement “agents with much greater than average (on the universal scale) … could have distinct advantages in those jobs” from also being true (sorry if that was confusingly worded).
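(If it helps, here’s a toy numerical illustration of what I mean; the units and the “universal scale” of 10,000 are completely made up, and the point is only that even the extreme human tails can look narrow against a much wider range of possible abilities.)

```python
import random

random.seed(0)

# Toy illustration with made-up units: human long-term planning ability
# modeled as roughly normal, N(100, 15), like an IQ-style scale.
humans = [random.gauss(100, 15) for _ in range(1_000_000)]
human_spread = max(humans) - min(humans)  # gap between the extreme human tails

# Hypothetical width of the full range of possible planning abilities.
universal_scale = 10_000

print(f"spread across all sampled humans: {human_spread:.0f}")
print(f"as a fraction of the hypothetical universal scale: "
      f"{human_spread / universal_scale:.1%}")
# Even the most extreme humans differ by only a percent or two of this
# made-up universal scale, so other factors (luck, circumstance) could
# easily swamp the human-to-human variation.
```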
Congrats all, it seems like you were wildly successful in just 1 semester of this new strategy!
I have a couple of questions:
130 in 13 weekly reading groups
= 10 people per group; that feels like a lot and maybe contributed to the high drop rate. Do you think this size was ideal?
Ran two retreats, with a total of 85 unique attendees
These seem like huge retreats compared to other university EA retreats at least, and more like mini-conferences. Was this the right size, or do you think they would have been more valuable as smaller, more selective events where the participants perhaps got to know each other better?
two weekly AI governance fellowships with 15 initial and 14 continuing participants.
This retention rate seems very high, though I imagine maybe these were mostly people already into AI gov and not representative of what a scaled-up cohort would look like. Do you plan to also expand AI governance outreach/programming next term?
Overall, I’m really glad you’re doing all these things and paving the way for others to follow. We’ll seek to replicate some of your success at Stanford :)
I’m still a bit confused. Section 5.4 says
the RLHF model expresses a lower willingness to have its objective changed the more different the objective is from the original objective (being Helpful, Harmless, and Honest; HHH)
but the graph seems to show the RLHF model as being the most corrigible for the more-HHH and neutral objectives, which seems somewhat important but isn’t mentioned.
If the point was that the corrigibility of the RLHF model changes the most from the neutral to the less-HHH questions, then it looks like it changed considerably less than the PM, which became quite incorrigible, no?
Maybe the intended meaning of that quote was that the RLHF model dropped more in corrigibility just in comparison to the normal LM, or just that it’s lower overall without comparing it to any other model, but if so, that felt a bit unclear to me.
Big if true! Maybe one upside here is that it shows current LLMs can definitely help with at least some parts of AI safety research, and we should probably be using them more for generating and analyzing data.
...now I’m wondering if the generator model and the evaluator model tried to coordinate😅
Re Evan R. Murphy’s comment about confusingness: “model agrees with that more” definitely clarifies it, but I wonder if Evan was expecting something like “more right is more of the scary thing” for each metric (which was my first-glance hypothesis).
I’m not sure what you particularly mean by trustworthy. If you mean a place with good attitudes and practices towards existential AI safety, then I’m not sure HF has demonstrated that.
If you mean a company I can instrumentally trust to build and host tools that make it easy to work with large transformer models, then yes, it seems like HF pretty much has a monopoly on that for the moment, and it’s worth using their tools for a lot of empirical AI safety research.
Disclaimer: first post/distillation and somewhat new to alignment, so I may have gotten some things wrong.
Calling this a “re-explanation” because that makes a little more sense to me than “distillation” and I plan to do a series of regular re-explanations over the next year-ish.
Feedback is much appreciated!