Solve Corrigibility Week

A low-hanging fruit for solving alignment is to dedicate a chunk of time actually trying to solve a sub-problem collectively.

To that end, I’ve broken up researching the sub-problem of corrigibility into two categories in this google doc (you have suggestion privileges):

  1. Previous Work: let’s not reinvent the wheel. Write out links to any past work on corrigibility. This can range from just links to links & summaries & analyses. Do comment reactions to other’s reviews to provide counter-arguments. This is just a google doc, low-quality posts, comments, links are accepted; I want people to lean towards babbling more.

  2. Tasks: what do we actually do this week to make progress? Suggest any research direction you find fruitful or general research questions or framings. Example: write an example of corrigibility (one could then comment an actual example).

Additionally, I’ll post 3 top-level comments for:

  1. Meetups: want to co-work with others in the community? Comment availability, work preferences, and a way to contact you (eg calendly link, “dm me”, “ my email is bob and alice dot com”, etc) For example, I’m available most times this week with a Calendly link for scheduling 1-on-1 co-working sessions. Additionally, you yourself could message those you know to collaborate on this, or have a nerdy house co-working party.

  2. Potential topics: what other topics besides corrigibility could we collaborate on in future weeks?

  3. Meta: what are different formats this type of group collaboration could take? Comment suggestions with trade offs or discuss the cost/​benefits of what I’m presenting in this post.

I do believe there’s a legitimate, albeit small, chance that we solve corrigibility or find its “core” this week. Nonetheless, I think it’s of great value to be able to make actual progress on alignment issues as a community and to figure out how to do that better. Additionally, it’s immensely valuable to have an alignment topic post include a literature review, the community’s up-to-date thoughts, and possible future research directions to pursue. I also believe a collaborative project like this will put several community members on the same page as far as terminology and gears-level models.

I explicitly commit to 3 weeks of this (so corrigibility this week and two more the next two weeks). After that is Christmas and New Years, after which I may resume depending on how it goes.

Thanks to Alex Turner for reviewing a draft.