Stanford AI Alignment president, technical AI governance research, AIS field building, and animal welfare.
Gabriel Mukobi
Iterated Distillation-Amplification, Gato, and Proto-AGI [Re-Explained]
Hi Jon! Yeah, that’s an interesting example, and I can confirm that when writing this distillation one of the hardest parts was coming up with a clear example that could use IDA. One idea for why amplification might apply to scientific development is that a lot of scientific advancements seem to have come from clever intuitions and novel ideas. That is, while one scientist is pretty unlikely to get the “Eureka” insight that would lead to e.g. general relativity, 20 scientists collectively have a much higher chance that at least one of them comes up with a good idea, and 1000 scientists an even higher chance (taken to an extreme, you might imagine all of scientific progress on Earth so far as a bunch of scientists vibing, and every so often one of them reaches a useful insight). Scientific progress generally seems to be iterative anyway, so an IDA-amplified PASTA AGI could theoretically spin up a bunch of randomly perturbed versions of itself to work on scientific problems until one comes up with a uniquely good insight, which could then be distilled back into the base model to make it more creative and efficient at generating future insights.
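(To make the shape of that loop concrete, here’s a toy, purely illustrative Python sketch where “model skill” is just a number and each perturbed copy’s insight quality is that number plus noise. None of this is a real IDA implementation or anyone’s actual method; it’s just my rough picture of the amplify-then-distill ratchet.)

```python
import random

def amplify(model_skill: float, n_copies: int = 20) -> float:
    """Amplification (toy): run many slightly perturbed copies of the model
    on a problem and keep the best insight any single copy stumbles into."""
    insights = [model_skill + random.gauss(0, 1) for _ in range(n_copies)]
    return max(insights)  # the one lucky "Eureka"-style insight

def distill(model_skill: float, best_insight: float, lr: float = 0.5) -> float:
    """Distillation (toy): cheaply move the base model toward the amplified
    system's best output, so future attempts start from a higher baseline."""
    return model_skill + lr * (best_insight - model_skill)

def iterate(model_skill: float = 0.0, rounds: int = 10) -> float:
    """Iterated distillation and amplification: amplify, distill, repeat."""
    for _ in range(rounds):
        model_skill = distill(model_skill, amplify(model_skill))
    return model_skill

print(iterate())  # skill ratchets upward over successive rounds
```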
[Question] Favourite new AI productivity tools?
I’d count that; reducing the number of things you have to worry about seems like a productivity improvement!
The Tree of Life: Stanford AI Alignment Theory of Change
Thanks for posting this! I’m glad to have more concrete and updated examples of how current AI systems can lead to failure, a concept that often seems nebulous to people new to AI safety.
Thanks for actually taking the time to organize all the information here; this is and will be very useful!
For OpenAI, you could also link this recent blog post about their approach to alignment research, which reinforces the ideas you already gathered. Though maybe that blog post doesn’t go into enough detail or engage critically with those ideas, and you’ve already read it and decided to leave it out?
Levelling Up in AI Safety Research Engineering
Thanks, yeah that’s a pretty fair sentiment. I’ve changed the wording to “at least 100-200 hours,” but I guess the idea was more to present a very efficient way of learning things that maybe 80/20’s some of the material. This does mean there will be more to learn: rather than these being strictly linear progression levels, I imagine someone continuously coming back to AI safety readings and software/ML engineering skills often throughout their journey, as it sounds like you have.
Interesting; that is the level that feels most like it doesn’t have a solid place in a linear progression of skills. I wrote “Level 1 kind of happens all the time” to try to reflect this, but I ultimately decided to put it at the start because I feel that for people just starting out it can be a good way to test their fit for AI safety broadly (do they buy the arguments?) and decide whether they want to go down a more theoretical or empirical path. I just added some language to Level 1 to clarify this.
Mostly, yes, that’s right. The exception is Level 7: Original Experiments, which suggests several resources for forming an inside view and coming up with new research directions, but I think many people could get hired as research engineers before doing that stuff (though maybe they do it while working as a research engineer, and that leads them to come up with new, better research directions).
Wow, this is a cool concept and video, thanks for making it! As someone new to the field, I’d be really excited for you and other AI safety researchers to do more devlog/livestream content of the form “strap a GoPro on me while I do research!”
I agree that the plausibility and economic competitiveness of long-term planning AIs seems uncertain (especially with chaotic systems) and warrants more investigation, so I’m glad you posted this! I also agree that trying to find ways to incentivize AI to pursue myopic goals generally seems good.
I’m somewhat less confident, however, in the claim that long-term planning has diminishing returns beyond human ability. Intuitively, it seems like human understanding of possible long-term returns diminishes past human ability, but it still seems plausible to me that AI systems could surpass our diminishing returns in this regard. And even if this claim is true and AI systems can’t get much further than human ability at long-term planning (or medium-term planning is what performs best, as you suggest), I still think that’s sufficient for large-scale deception and power-seeking behavior (e.g. many human AI safety researchers have written about plausible ways in which AIs could slowly manipulate society, and the strategies they describe are human-understandable but still seem somewhat likely to succeed).
I’m also skeptical of the claim that “Future humans will have at their disposal the assistance of short-term AIs.” While it’s true that past ML training has often focused on short-term objectives, I think it’s plausible that certain top AI labs could be incentivized to focus on developing long-term planning AIs (such as in this recent Meta AI paper) which could push long-term AI capabilities ahead of short-term AI capabilities.
I don’t know much about how CEOs are selected, but I think the idea is rather that the range of even the (small) tails of normally-distributed human long-term planning ability is pretty close together in the grand picture of possible long-term planning abilities, so other factors (including stochasticity) can dominate and make the variation among humans wrt long-term planning seem insignificant.
If this were true, it would mean the statement “individual humans with much greater than average (on the human scale) information-processing capabilities empirically don’t seem to have distinct advantages in jobs such as CEOs and leaders” could be true without precluding the statement “agents with much greater than average (on the universal scale) … could have distinct advantages in those jobs” from also being true (sorry if that was confusingly worded).
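(If it helps, here’s a toy numerical illustration of what I mean; the units and the “universal scale” of 10,000 are completely made up, and the point is only that even the extreme human tails can look narrow against a much wider range of possible abilities.)

```python
import random

random.seed(0)

# Toy illustration with made-up units: human long-term planning ability
# modeled as roughly normal, N(100, 15), like an IQ-style scale.
humans = [random.gauss(100, 15) for _ in range(1_000_000)]
human_spread = max(humans) - min(humans)  # gap between the extreme human tails

# Hypothetical width of the full range of possible planning abilities.
universal_scale = 10_000

print(f"spread across all sampled humans: {human_spread:.0f}")
print(f"as a fraction of the hypothetical universal scale: "
      f"{human_spread / universal_scale:.1%}")
# Even the most extreme humans differ by only a percent or two of this
# made-up universal scale, so other factors (luck, circumstance) could
# easily swamp the human-to-human variation.
```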
Congrats all, it seems like you were wildly successful in just 1 semester of this new strategy!
I have a couple of questions:
130 in 13 weekly reading groups
= 10 people per group; that feels like a lot and maybe contributed to the high drop rate. Do you think this size was ideal?
Ran two retreats, with a total of 85 unique attendees
These seem like huge retreats compared to other university EA retreats at least, and more like mini-conferences. Was this the right size, or do you think they would have been more valuable as smaller, more selective events where the participants perhaps got to know each other better?
two weekly AI governance fellowships with 15 initial and 14 continuing participants.
This retention rate seems very high, though I imagine maybe these were mostly people already into AI gov and not representative of what a scaled-up cohort would look like. Do you plan to also expand AI governance outreach/programming next term?
Overall, I’m really glad you’re doing all these things and paving the way for others to follow. We’ll seek to replicate some of your success at Stanford :)
I’m still a bit confused. Section 5.4 says
the RLHF model expresses a lower willingness to have its objective changed the more different the objective is from the original objective (being Helpful, Harmless, and Honest; HHH)
but the graph seems to show the RLHF model as being the most corrigible for the more-HHH and neutral objectives, which seems somewhat important but isn’t mentioned.
If the point was that the corrigibility of the RLHF model changes the most from the neutral to the less-HHH questions, then it looks like it changed considerably less than the PM, which became quite incorrigible, no?
Maybe the intended meaning of that quote was that the RLHF model dropped more in corrigibility just in comparison to the normal LM, or just that it’s lower overall without comparing it to any other model, but if so, that felt a bit unclear to me.
Big if true! Maybe one upside here is that it shows current LLMs can definitely help with at least some parts of AI safety research, and we should probably be using them more for generating and analyzing data.
...now I’m wondering if the generator model and the evaluator model tried to coordinate😅
Re Evan R. Murphy’s comment about confusingness: “model agrees with that more” definitely clarifies it, but I wonder if Evan was expecting something like “more right is more of the scary thing” for each metric (which was my first-glance hypothesis).
I’m not sure what you particularly mean by trustworthy. If you mean a place with good attitudes and practices towards existential AI safety, then I’m not sure HF has demonstrated that.
If you mean a company I can instrumentally trust to build and host tools that make it easy to work with large transformer models, then yes, it seems like HF pretty much has a monopoly on that for the moment, and it’s worth using their tools for a lot of empirical AI safety research.
Disclaimer: first post/distillation and somewhat new to alignment, so I may have gotten some things wrong.
Calling this a “re-explanation” because that makes a little more sense to me than “distillation” and I plan to do a series of regular re-explanations over the next year-ish.
Feedback is much appreciated!