I’m trying to come up with a set of questions for self-calibration, related to AI and ML.
I’ve written down what I’ve come up with so far below. But I am principally interested in what other people come up with (thus the question metatype), both questions and predictions for those questions.
So far I have an insufficient number of questions to produce anything like a nice calibration curve. I’ve also struggled with coming up with meaningful questions.
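For anyone who wants to score themselves once questions resolve, here is a minimal sketch of how a calibration curve could be computed from (probability, outcome) pairs. The function names, the equal-width binning, and the example data are all assumptions made for illustration; none of the numbers are real predictions.

```python
from collections import defaultdict


def calibration_curve(forecasts, n_bins=5):
    """Bin (probability, outcome) pairs into equal-width probability bins.

    Returns one (mean_predicted, observed_frequency, count) tuple per
    non-empty bin, ordered from the lowest bin to the highest.
    """
    bins = defaultdict(list)
    for p, outcome in forecasts:
        # A probability of exactly 1.0 is placed in the top bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, outcome))

    curve = []
    for idx in sorted(bins):
        items = bins[idx]
        mean_p = sum(p for p, _ in items) / len(items)
        freq = sum(1 for _, happened in items if happened) / len(items)
        curve.append((mean_p, freq, len(items)))
    return curve


def brier_score(forecasts):
    """Mean squared error between stated probabilities and 0/1 outcomes."""
    return sum((p - float(o)) ** 2 for p, o in forecasts) / len(forecasts)


# Made-up forecasts, purely for illustration.
example = [(0.9, True), (0.8, True), (0.7, False),
           (0.3, False), (0.2, True), (0.1, False)]

for mean_p, freq, count in calibration_curve(example, n_bins=2):
    print(f"predicted ~{mean_p:.2f}, observed {freq:.2f}, n={count}")
print(f"Brier score: {brier_score(example):.3f}")
```

With only a handful of questions each bin holds very few points, which is exactly why a larger question set is needed before the curve means much.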
I’ve rot13’ed my predictions to avoid anchoring anyone. I’m pretty uncertain about most of these as point estimates, however.
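If you want to post your own predictions the same way, a small sketch using Python’s standard `codecs` module is below. The helper name `rot13` and the sample string are made up for illustration. Because rot13 is its own inverse, the same call both hides and reveals text.

```python
import codecs


def rot13(text):
    """Apply rot13 to a string; applying it twice returns the original."""
    return codecs.encode(text, "rot_13")


# A placeholder prediction, not one of the predictions in this post.
hidden = rot13("Fifty percent")
print(hidden)         # the obscured form to post
print(rot13(hidden))  # decoding is the same operation
```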
On Explicitly Stated ML / Systems Goals
It is (relatively) easy to determine if these are fulfilled or not. The trade-off is that they likely have little relation to AGI.
1. OpenAI succeeds in defeating top pro teams on unrestricted Dota2
OpenAI has explicitly said that they wish to beat top human teams in the MOBA Dota2. Their latest attempt trained with self-play and familiar policy-gradient methods at a massive scale, but still lost to top teams, who won (relatively?) easily.
I’m also interested in people’s probabilities on whether OpenAI succeeds, conditional on OpenAI not including genuine algorithmic novelty in their learning methods, although that’s a harder question to define because of cloudiness around “algorithmic novelty.”
My prediction: Friragl-svir creprag.
2. Tesla succeeds in having a self-driving car drive coast-to-coast without intervention.
Tesla sells cars with (ostensibly) all the hardware necessary for full self-driving, and has an in-house self-driving research program that uses a mix of ML and hard-coded rules. They have a goal of demonstrating an autonomous coast-to-coast drive, although this goal has been repeatedly delayed. There is widespread skepticism about both the sensor suite in Tesla cars and the maturity of their software.
My prediction: Gra creprag.
3. DeepMind reveals a skilled RL-trained agent for StarCraft II.
After AlphaGo, DeepMind announced that they would try to create an expert-level agent for SCII. They’ve released preliminary research related to this topic, although what they’ve revealed is far from such an agent.
My prediction: Nobhg svir creprag.
On Goals Not So Clearly Marked As Targets
1. Someone gets a score on the Winograd Schema tests above 80%.
The Winograd Schemas are a series of tests designed to probe the common-sense reasoning ability of a system. Modern ML struggles to do much better than chance: a score of 50% is approximately what random guessing gets, and the modern state of the art gets less than 70%. (This is the best score I could find; there are several papers which claim better scores, but these deal with subsets of the Winograd Schemas as far as I can tell. [I.e., the classic “Hey, we got a better score… on a more constrained dataset.”] I might be wrong about this; if so, please illuminate me.)
My prediction: Nobhg svir creprag.
2. Reinforcement Learning Starts Working
This is a bad, unclear goal. I’m not sure how to make it clearer and could use help.
There are a lot of articles about how reinforcement learning doesn’t work. It generalizes incredibly poorly, and only succeeds on complex tasks like Dota2 by playing literal centuries’ worth of games. If some algorithm were discovered such that one could get RL to work with roughly the same regularity as supervised learning, that would be amazing. I’m still struggling to find a way to state this rigorously. A 10x improvement in sample efficiency on the Atari suite would (probably?) fulfill this, but I’m not sure what else would. And it’s quite a PITA to keep track of what the current state of the art on Atari is anyhow.
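One possible way to pin down the “10x sample efficiency” criterion is sketched below. Everything in it is an assumption offered for discussion: measuring efficiency as frames needed to first reach a fixed score, aggregating with the median across games, and ignoring games where either agent never reaches the threshold. The curves are made-up numbers, not published results.

```python
import statistics


def frames_to_threshold(curve, threshold):
    """Return the first frame count at which score >= threshold, else None.

    `curve` is a list of (frames_seen, score) pairs sorted by frames_seen.
    """
    for frames, score in curve:
        if score >= threshold:
            return frames
    return None


def sample_efficiency_gain(baseline_curve, candidate_curve, threshold):
    """Ratio of frames the baseline needs to frames the candidate needs."""
    base = frames_to_threshold(baseline_curve, threshold)
    cand = frames_to_threshold(candidate_curve, threshold)
    if base is None or cand is None:
        return None  # not comparable on this game
    return base / cand


def resolves_yes(per_game_gains, required_gain=10.0):
    """Assumed rule: median gain over comparable games must meet the bar."""
    gains = [g for g in per_game_gains if g is not None]
    if not gains:
        return False
    return statistics.median(gains) >= required_gain


# Hypothetical learning curves for a single game.
baseline = [(1_000_000, 10), (10_000_000, 50), (100_000_000, 100)]
candidate = [(100_000, 20), (1_000_000, 60), (10_000_000, 120)]

gain = sample_efficiency_gain(baseline, candidate, threshold=50)
print(gain, resolves_yes([gain]))
```

The choice of threshold per game and of the baseline algorithm would still need to be fixed in advance for the question to be resolvable.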
My prediction: Need to get this more defined.