Self-studying to develop an inside-view model of AI alignment; co-studiers welcome!
tldr: It’s hard for me to develop inside views of AI alignment, but I feel like I’m approximately ready for it now. So I’m developing a curriculum for myself, and I’d welcome people who want to join me in creating their own curricula and discussing progress regularly!
I’m one of those people who finds developing an “inside view” hard.
Sometimes someone will ask me about something I have expertise in, and I’ll realize I don’t actually understand the topic deeply enough to answer them.
Sometimes someone will ask me something, and I’ll cite a bunch of perspectives from other people instead of directly knowing what I think.
Sometimes I don’t know I have opinions, until I’m chatting with someone and some complicated model falls out of my mouth.
Related: When I’m talking with people with strong frames, I’ll have a hard time seeing other perspectives. I get upset when non-expert people talk too confidently around me, because I feel like I can’t screen their perspectives out and they’ll influence the “average” perspective in my head.
Related: When I’m trying to learn something, I usually successfully learn enough to hit my goal and no more. I’m quite good at the goal-directed skill, but my intrinsic interests don’t usually line up with technical content, so I don’t acquire the type of deep understanding that comes from nerd-sniping.
Related: It’s historically felt (socially) safer to me to defer to experts I respect, and academia probably also trained this into me.
In any case, after a lot of years of orienting around this, I now feel like I’m approximately ready to develop an inside view about AI alignment. I’ve consumed a lot of AI content by this point, and feel like it’s about time, so I’m psyched that my psyche feels like it’s finally in the right place for this.
So, because I’m one of those people who likes courses and structure: I’m developing a curriculum for myself, and I’d welcome anyone who wants to join me in creating their own and discussing progress regularly!
Vael’s curriculum (optimized for me, by me)
(squints) This feels like it’s probably going to be 3 months to a year.
2 weekly blog posts, due Sunday before bed.
Blog post #1: An opinion on an AI paper (fine if previously read)
Blog post #2: A personal piece on anything else (this isn’t necessarily relevant to developing an inside-view, but it is relevant to keeping this sustainable for me, since this feels intrinsically rewarding. Also fine to talk about AI if desired!) (Also, optional blog post #3: fiction scene related to AI.)
(very very optional) Blog Post (#3): If I can get any sort of quick write-up published on any content I think would be good to post, like really so many extra bonus points, because quick writing habits are good and content is good. Reading more of a book on my optional reading list would also be great.
Probably cross-post to a friend-viewable/commentable location so I get that positive feedback also
Mark up one piece of reading on a note-taking app.
Report weekly progress on accountability Discord. Fine to skip a week, but I have to make it up at some point. Unless I decide I’m done with this project and don’t want to, in which case: good experiment! Internal rules can change at any time.
Do coaching if useful. Consider signing up for a coaching session every 2 weeks.
If other people are doing this with me: Make a Slack or Discord group and post progress. Moreover, if they’re the right people and it makes sense to discuss everyone’s AI takes: practice this. I actually think debating / discussion / teaching is a core skill here (maybe THE core challenge skill), so it’s worth figuring out how to integrate this partway through the program if it’s not happening already.
If it’s going to be a problem, decide on the content for each week by the beginning of that week at the latest. (Planning Vael should be separate from Implementing Vael, otherwise mind-crashes happen.) Fine to just read stuff and choose at the moment of writing, though.
Start this week. (The blog posts can be short.)
The Sequences (it’d be great to get through some more of these, though watch out for completionism tendencies) -- epistemics
AGI Safety Fundamentals curriculum (I’ve read through the core readings already, but like… what even is a mesa-optimizer and HCH, sigh) -- object-level content
New AI content. Could be from AI Alignment Newsletter, LW / Alignment Forum, Google Scholar, wherever. Also read some non-safety AI stuff I hope. -- object-level content, some debate
Keep an eye on AI alignment community-building stuff. Relatedly, on the whole space including AI governance -- object-level (meta-field, and non-technical AI) content
Revisiting old content from Minding Our Way and the CFAR handbooks (having time to go through CFAR-like questions feels like it’d be quite good) -- psych
Make up my own weird helpful psych stuff. -- psych
For example, I need to develop the technique for “Frame Holding”, which is how you (aggressively? I think that’s probably not the right emotion) hold your own frame solidly in mind when listening to other people, such that there’s nothing to fear, and you can carefully choose what information you want to enter your model while still being socially and epistemically open, and also still somehow inhabit the other person’s frame so you understand it well. I’m confused about how to do this, but it feels like it’s probably possible. Feels like one of the hang-ups is going to be about social perception and agreeableness.
I also want better techniques for “opinion checking”. There’s probably a series of questions I can ask myself so that I can regularly check whether I have opinions or not. This one feels like it’s probably bottlenecked on social stuff as well.
Do some self-perception updating as well so that everything feels aligned.
If you’re excited about doing something similar with me, send me a message here or email! We’ll see how this experiment goes.