It’s not exactly clear what you do with such a story or what the upside is; it’s a fairly vague theory of change, and most people have some specific theory of change they are more excited about (even if this kind of story is a bit of a public good that’s useful across a broader variety of perspectives / to people who are skeptical).
Ah, interesting! I’m surprised to hear that. I was under the impression that while many researchers had a specific theory of change, it was often motivated by an underlying threat model, and that different threat models lead to different research interests.
Eg, someone worried about a future where AIs control the world but are not human-comprehensible feels very different from someone worried about a world where we produce an expected utility maximiser with a subtly incorrect objective, resulting in bad convergent instrumental goals.
Do you think this is a bad model of how researchers think? Or are you, eg, arguing that having a detailed, concrete story isn’t important here, just the vague intuition for how AI goes wrong?
What’s the engine game?
What research in the past 5 years has felt like the most significant progress on the alignment problem? Has any of it made you more or less optimistic about how easy the alignment problem will be?
Do you have any advice for junior alignment researchers? In particular, what do you think are the skills and traits that make someone an excellent alignment researcher? And what do you think someone can do early in a research career to be more likely to become an excellent alignment researcher?
What is your theory of change for the Alignment Research Center? That is, what are the concrete pathways by which you expect the work done there to systematically lead to a better future?
There has been surprisingly little written on concrete threat models for how AI leads to existential catastrophes (though you’ve done some great work rectifying this!). Why is this? And what are the most compelling threat models that don’t have good public write-ups? In particular, are there under-appreciated threat models that would lead to very different research priorities within Alignment?
Pre-hindsight: 100 years from now, it is clear that your research has been net bad for the long-term future. What happened?
You seem in the unusual position of having done excellent conceptual alignment work (eg with IDA), and excellent applied alignment work at OpenAI, which I’d expect to be pretty different skillsets. How did you end up doing both? And how useful have you found ML experience for doing good conceptual work, and vice versa?
What are the most important ideas floating around in alignment research that don’t yet have a public write-up? (Or, even better, that have a public write-up but could do with a good one?)
You gave a great talk on the AI Alignment Landscape 2 years ago. What would you change if giving the same talk today?
What are the highest priority things (by your lights) in Alignment that nobody is currently seriously working on?
In a real treacherous turn, you don’t get this kind of warning.
I disagree; I think that toy results like this are exactly the kind of warning we’d expect to see.
You might not get a warning shot from a superintelligence, but it seems great to collect examples like this of warning shots from dumber systems. If there’s going to be continuous takeoff, and there’s going to be a treacherous turn eventually, a great way to get people to take treacherous turns seriously is to watch closely for failed examples (though hopefully ones more sophisticated than this!)
The Clearer Thinking Calibrate Your Judgement tool seems worth checking out.
I really like this post! I have a nagging intuition along the lines of ‘sure, the first example in this post seems legit, but I don’t think this should actually update anything in my worldview for the real-life situations where I actively think about Bayes Rule + epistemics’. And I definitely don’t agree with your example about top 1% traders. My attempt to put this into words:
1. Strong evidence is rarely independent. Hearing you say ‘my name is Mark’ to person A might be 20,000:1 evidence, but hearing you then say it to person B is like 10:1 tops. Most hypotheses that explain the first event well also explain the second event well. So while the first sample contains the most information, the second contains way less, which makes this idea much less exciting (see the sketch after this list).
It’s much easier to get to middling probabilities than to high probabilities. This makes sense: I’m only going to explicitly consider the odds of <100 hypotheses for most questions, so a hypothesis with, say, <1% probability isn’t likely to be worth thinking about. But to get to 99%, a hypothesis needs to defeat all of the other ones too.
Eg, in the ‘top 1% of traders’ example, it might be easy to be confident that I’m above the 90th percentile, but much harder to move beyond that.
2. This gets much messier when I’m facing an adversarial process. If you say ‘my name is Mark Xu, want to bet about what’s on my driver’s license?’, that’s much worse evidence, because I now face adverse selection. Many real-world problems I care about involve other people applying optimisation pressure to shape the evidence I see, and some of that pressure is adversarial. The world does not tend to involve people trying to deceive me about world capitals.
An adversarial process could be someone else trying to trick me, but it could also be a cognitive bias of mine, eg ‘I want to believe that I am an awesome, well-calibrated person’. It could also be selection bias: what is the process that generated the evidence I see?
3. Some questions have obvious answers; others don’t. The questions most worth thinking about are rarely the obvious ones, and the ones where I can easily access strong evidence are much less likely to be worth thinking about. If someone disagrees with me, that’s at least weak evidence against the existence of strong evidence.
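To make the arithmetic in point 1 concrete, here’s a minimal sketch in Python (the specific numbers are illustrative assumptions, not anything from the post): each statement multiplies the odds by its Bayes factor, but the second statement’s factor is small because the hypotheses that survived the first statement (eg ‘he’s lying consistently’) mostly predict it too.

```python
# A toy odds-ratio update, with made-up Bayes factors for illustration.

def update_odds(prior_odds: float, bayes_factor: float) -> float:
    """Posterior odds = prior odds * Bayes factor, one piece of evidence at a time."""
    return prior_odds * bayes_factor

def odds_to_prob(odds: float) -> float:
    """Convert odds (in favour) to a probability."""
    return odds / (1 + odds)

# Hypothesis: 'his name really is Mark' vs 'it isn't'.
prior_odds = 1 / 20_000  # assumed base rate of the name, for illustration

# First statement: almost no alternative hypothesis predicts it, so a huge factor.
after_first = update_odds(prior_odds, 20_000)   # odds ~1:1

# Second statement: a consistent liar says 'Mark' both times, so the surviving
# alternatives predict it almost as well. Only a small extra factor.
after_second = update_odds(after_first, 10)     # odds ~10:1

print(f"After statement 1: P = {odds_to_prob(after_first):.2f}")   # ~0.50
print(f"After statement 2: P = {odds_to_prob(after_second):.2f}")  # ~0.91
```

Note how the posterior stalls around 91% rather than racing to 99%: the repeated evidence only pays off once the weak-but-consistent alternatives are ruled out, which is the middling-vs-high-probabilities point above.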
+1 I went to a CFAR camp for high schoolers a few years ago, and the idea that I can be ambitious and actually fix problems in my life was BY FAR the biggest takeaway I got (and one of the most valuable life lessons I ever learned).
As a single point of anecdata, I personally am fairly prone to negative thoughts and self-blame, and find this super helpful for overcoming that. My Inner Simulator seems to be much better grounded than my spirals of anxiety, and not prone to the same biases.
I’m stressing out about a tiny mistake I made, and am afraid that a friend of mine will blame me for it. So I simulate having the friend find out and get angry with me about it, and ask myself ‘am I surprised at this outcome’. And discover that yes, I am very surprised by this outcome—that would be completely out of character and would feel unreasonable to me in the moment.
I have an upcoming conversation with someone new and interesting, and I’m feeling insecure about my ability to make good first impressions. I simulate the conversation happening and leaving it feeling like it went super well, then check how surprised I feel. And discover that I don’t feel surprised; in fact, this happens reasonably often.
Such a person could also come up with a way they could improve their life, fail to implement it, and then feel guilty when their reality fails to measure up to their imagined future.
This seems like a potentially fair point, and I sometimes encounter this problem. Though I find that my Inner Sim is a fair bit better calibrated about what solutions might actually work. Eg it has a much better sense for ‘I’ll just procrastinate and forget about this’. On balance, I find that the benefits of ‘sometimes having a great idea that works’ + the motivation to implement it far outweigh this failure mode, but your mileage may vary.
Nice, I really like the approach of ‘write up a concrete question → assume I received a helpful answer → let my inner sim fill in the blanks about what it says’