I (with the help of a few more people) am planning to create an introduction to AI Safety that a smart teenager can understand. What am I missing?
Disclaimer: My English isn’t very good, but do not dissuade me on this basis—the sequence itself will be translated by a professional translator.
I want to create a sequence that a smart fifteen- or sixteen-year-old school student can read and that can encourage them to go into alignment. Right now I’m running an extracurricular course for several smart school students, and one of my goals is to “overcome long inferential distances so I will be able to create this sequence”.
I deliberately did not include the most important modern trends in machine learning among the topics. I’m optimizing for the scenario where a person reads my sequence, then goes to university for another four years, and only then becomes a researcher. So (with the exception of the last part) I avoided topics that are likely to become obsolete by then.
Here is my draft list of topics (the order is not final; it will be settled in the course of writing):
Introduction—what are AI, AGI, and Alignment? What are we worried about? AI Safety as AI Notkilleveryoneism.
Why AGI is dangerous. Orthogonality Thesis, Goodhart’s Law, Instrumental Convergence. Corrigibility and why it is unnatural.
Forecasting. AGI timelines. Takeoff Speeds. Arguments for slow and fast takeoff.
Why AI boxing is hard, if not nearly impossible. Humans are not secure systems. Why even an Oracle AGI can be dangerous.
Modern ML in a few words (without math!). Neural networks. Training. Supervised Learning. Reinforcement Learning. Reward is not the goal of an RL agent.
Interpretability. Why it is hard. Basic ideas on how to do it.
Inner and outer alignment. Mesa-optimization. Internal, corrigible and deceptive alignment. Why deceptive alignment seems very likely. What can influence its probability.
Decision theory. Prisoner’s Dilemma, Newcomb’s problem, Smoking lesion. CDT, EDT and FDT.
What exactly are optimization and agency? Attempts to define these concepts. Optimization as attractors. Embedded agency problems.
Eliezer Yudkowsky’s point of view. Pivotal acts. Why it can be useful to have an imaginary EY over your shoulder even if you disagree with him.
Capability externalities. Avoid them.
Conclusion. What can be done. Important organisations. What are they working on now?
What else should be here? Is there anything that should not be here? Are there reasons why the whole idea could be bad? Any other advice?
My quick take here is that your list of topics is not an introduction to AI Safety; it is an introduction to AI safety as seen from inside the MIRI/Yudkowsky bubble, where everything is hard and nobody is making any progress. Some more diversity in viewpoints would be better.
For your audience, my go-to source would be to cover bits of Christian’s The Alignment Problem.
My overall reactions:
I would cut more things from this list than I’d add.
Getting audience attention is hard, and requires planning.
The more effort you spend getting attention, the more effort you should also spend making sure that you’re not writing things that will backfire.
The downside I think is most likely would be if you write this in the “voice” of an AI authority but confuse or omit some technical details, causing friction with other people in AI or even the audience. I don’t know you, but if you’re not an AI authority, it’s okay to write as yourself—talking about what you personally find interesting / convincing.
Some specific edits I’d make, in order of their destination:
I’d move “what is agency?” from section 9 to section 3, or just spend more time on it in section 1.
Under forecasting, I’d put less emphasis on takeoff speed and more on teaching people that superhuman performance is very possible on nearly every task; AI is not just going to plateau at human level, and a future where it does is not plausible.
I would actually not mention the words “inner” or “outer” alignment. But I would talk about generalization failure and deceptive alignment.
I would cut decision theory entirely.
I would merge the EY and capabilities externality bullet points into a more general strategy section. What would the world look like if we were on a trajectory to succeed? Which of our actions move us closer to or further from that trajectory?
Thanks for your answer!
I’m going to post each part on LW and collect feedback before I put it all together, to avoid this failure mode in particular.
I will think about it.
I’m not sure it should be in the forecasting section, more like in the introduction (or, if it is harder than I think, in its own separate section).
Seems like a good proposal, thanks!
Your outline has a lot of beliefs you expect your students to walk away with, but basically zero skills. If I were one of your prospective students, this would look a lot more like cult indoctrination than a genuine course where I would learn something.
What skills do you hope your students walk away with? Do you hope that they’ll know how to avoid overfitting models? That they’ll know how to detect trojaned networks? That they’ll be able to find circuits in large language models? I’d recommend figuring this out first, and then working backwards to figure out what to teach.
Also, don’t underestimate just how smart smart 15- and 16-year-olds can be. At my high school, for example, there were at least a dozen students who knew calculus at this age, and many more who knew how to program. And this was just a relatively normal public high school.
Thanks for your answer!
It’s not so much about “beliefs” (I will make a lot of caveats like “we are not sure”, “there are some smart people who disagree”, “here are some arguments against this view”, etc.; mental note: do it MORE, thank you for your observation) as about “motivation” and “discourse”. Not about technical skills, that’s true.
I have a feeling that there is an attractor: “I am an AI researcher and ML is AWESOME, and I will try to make it even more AWESOME, and yes, there are these safety folks, and I know some of their memes, and maybe they have some legitimate concerns, but we will solve it later and everything will be OK”. And I think that when someone learns ML-related technical skills before the basic AI Safety concepts and discourse, it’s very easy for them to fall into this attractor. And from that point it’s pretty hard to come back. So I want to create something like a vaccine against this attractor.
Technical skills are necessary, but for most of them there are already good courses, textbooks and so on. The skills I found no textbooks for are “understanding AI-safety-speak” and “seeing why alignment-related problem X is hard and why obvious solutions may not work”. Because of the previously mentioned attractor, I think it’s better to teach these skills before technical skills.
I assume that the average 15-16-year-old in my target audience knows how to program at least a little bit (in Russia, basic programming is, in theory, part of the mandatory school curriculum; I don’t know about the US), but doesn’t know calculus (though I think a smart school student can easily understand the concept of a derivative without a strict mathematical definition).