Solving Interpretability Week

For the original motivation, see Solving Corrigibility Week. I’ll state what I learned from corrigibility week, why I didn’t post one last week, and the updated format for interpretability.

What I Noticed

The open call to co-work through my Calendly link didn’t work. I only met with people when I messaged them specifically. I’m changing my availability from “all day Saturday” to “two hours/day” to better fit others’ schedules and to avoid a single point of failure (see next section). It also seems great to ask specific people to meet when I have a question about their research.

When going through previous work, it was good to write out my thoughts in the Google Doc, spruce them up, and post them as a comment on the original post. This also connects with messaging people to meet if I feel there’s a large disconnect or I have a lot of confusion regarding their work.

The Google Doc was also less collaborative over the long term, which may have been due to a lack of notifications when someone responded to what you wrote. So I’m moving the research direction part to the comment section here.

No Post Last Week

I committed to posting 3 of these this month, but didn’t last week. I had a bad Saturday (which was my designated work day for the corrigibility post) and a bad week until I talked to Shay, my therapist, and worked out a plan (Shay is great and has free sessions for those working on alignment; highly recommend). I am not ashamed or embarrassed, but I do put less stock in my public commitments and am more aware of my failure modes. I personally still recommend people try (and possibly fail) ambitious projects.

I do want to write a post on corrigibility, but it’s too time-consuming to both work out my own thoughts going through the literature and meet with people to understand their work & distill those conversations. Both are important and not mutually exclusive. Some possible solutions are to circle back to corrigibility next year or take two weeks per topic.

Format This Week

Here is the Google Doc for interpretability. In this post’s comment section are top-level comments for:

  1. Research directions you want discussed with any questions you have

  2. Your meeting schedule for co-working and how you’d like to be contacted

  3. Suggestions for future weeks or changes to the format

In the doc are:

  1. Literature Review

  2. Tasks to do for further research

Again, in the Google Doc, it is socially acceptable to write low-quality babble. In this post’s comments, I also accept babbling/spit-balling and will not delete such comments.