Use a very large (future) multimodal self-supervised learning (SSL) initialization to give the AI a latent ontology for understanding the real world and important concepts. Combining this initialization with a recurrent state and an action head, train an embodied AI to do real-world robotics using imitation learning on datasets of humans acting in simulation, and then sim2real. Since we have a really good pretrained initialization, there's relatively low sample complexity for the imitation learning (IL). The SSL and IL datasets both contain above-average amounts of diamond-related content, with some IL trajectories involving humans navigating towards diamonds because the humans want the diamonds.
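To make the architecture concrete, here's a toy sketch of the pipeline described above: a frozen pretrained encoder feeding a recurrent state and an action head, trained by behavior cloning on demonstration trajectories. Everything here is illustrative; the random matrix `W_ssl` stands in for the large multimodal SSL initialization, and the toy data stands in for human in-simulation demonstrations.

```python
import numpy as np

rng = np.random.default_rng(0)
OBS, HID, ACT = 16, 32, 4  # toy observation / hidden / action dims

# Hypothetical stand-ins: in the story, W_ssl comes from large-scale
# multimodal SSL pretraining; here it's just a random frozen matrix.
W_ssl = rng.normal(0, 0.1, (HID, OBS))   # frozen pretrained encoder
W_rec = rng.normal(0, 0.1, (HID, HID))   # recurrent state update (trainable)
W_act = rng.normal(0, 0.1, (ACT, HID))   # action head (trainable)

def forward(obs_seq):
    """Roll the recurrent policy over a trajectory; return action logits."""
    h = np.zeros(HID)
    logits = []
    for obs in obs_seq:
        z = np.tanh(W_ssl @ obs)          # pretrained latent features
        h = np.tanh(W_rec @ h + z)        # recurrent state carries history
        logits.append(W_act @ h)
    return np.stack(logits)

def bc_loss(obs_seq, expert_actions):
    """Imitation learning as behavior cloning: cross-entropy
    between the policy's action logits and the human's actions."""
    logits = forward(obs_seq)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(expert_actions)), expert_actions].mean()

# Toy demonstration trajectory (hypothetical data).
obs_seq = rng.normal(size=(10, OBS))
expert_actions = rng.integers(0, ACT, size=10)
print(bc_loss(obs_seq, expert_actions))
```

The point of low sample complexity in the story is that only `W_rec` and `W_act` (and, realistically, a light finetune of the encoder) need to be fit from demonstrations, since the SSL initialization already carries the relevant concepts.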
I don’t know much about ML, and I’m a bit confused about this step. How worried are we/should we be about sample efficiency here? It sounds like after pre-training you’re growing the diamond shard via a real-world embodied RL agent? Naively, this would be pretty performance-uncompetitive compared to agents primarily trained in simulated worlds, unless your algorithm is unusually sample-efficient (why would it be?). If you aren’t performance-competitive, then I expect your agent to be outcompeted by stronger AI systems whose trainers are less careful about diamond (or rubies, or staples, or w/e) alignment.
OTOH if your training is primarily simulated, I’d be worried about the difficulty of creating an agent that terminally values real world (rather than simulated) diamonds.
Good question, which I should probably have clarified in the essay. On a similar compute budget, could e.g. an actor-critic in-sim approach reach superintelligence even more quickly? Yeah, probably. The point of this story isn’t that this configuration (i.e. SSL+IL+PG RL) is optimal along (competitiveness, alignability-to-diamonds). Rather, I claim that if this story goes through at all, it throws a rock through how we should be thinking about alignment: one of the simplest, “dumbest”, most quickly dismissed ideas (reward the agent for the good event) can work just fine to superhuman and beyond, in a predictable-to-us way which we can learn more about by looking at current ML.