Website: schimpffabian.github.io
Fabian Schimpf
Arguments for Robustness in AI Alignment
Like Jaques, I also use Focusmate to get up and running in the morning. I try to use the 10 minutes between sessions to get ready. This timebox is helpful to me because I would otherwise drag it out. During these sessions, I find it beneficial to do the same things every morning. This way, I can go on autopilot until I am fully awake and don’t have to will myself into doing stuff.
I think this threshold will be tough to set. IMO, confidence in a decision only really makes sense if you consider decisions to be uni-modal. I would argue that this is rarely the case for a sufficiently capable system (like you and me). We are constantly trading off multiple options, and thus the confidence (e.g., as measured by the log-likelihood of the action given a policy and state) depends on the number of options available. I expect this context dependence would be a tough nut to crack for setting a meaningful threshold.
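As a concrete illustration of that context dependence, here is a minimal sketch (my own toy example, not from the original thread), assuming a softmax policy over hypothetical action logits: the log-likelihood of even a clearly preferred action shrinks as the number of available options grows, so a fixed confidence threshold means different things in different states.

```python
# Toy sketch: under a softmax policy, the log-likelihood of the best action
# depends on how many options are available, not just on how good it is.
import numpy as np

def best_action_log_likelihood(logits: np.ndarray) -> float:
    """Log-probability of the highest-scoring action under a softmax policy."""
    shifted = logits - logits.max()                      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())  # log-softmax
    return float(log_probs.max())

rng = np.random.default_rng(0)
for n_actions in (2, 10, 100):
    # Hypothetical logits: one clearly preferred action, the rest near zero.
    logits = np.concatenate(([2.0], rng.normal(0.0, 0.1, n_actions - 1)))
    print(n_actions, best_action_log_likelihood(logits))
```

With these (arbitrary) logits, the preferred action’s log-likelihood drops from roughly −0.1 with two options to below −2.5 with a hundred, even though its own score never changes.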
If there were other warning shots in addition to this one, that’s even worse! We’re already playing in Easy Mode here.
Playing devil’s advocate: if the government isn’t aware that the game is on, it doesn’t matter whether it’s on easy mode; the performance is likely poor regardless of the game’s difficulty. I agree with the post’s sentiment that warning shots would currently not do much good. But I am, as of now, still somewhat hopeful that the bottleneck is getting the government to see and target a problem, not the government’s ability to act on an identified issue.
I agree; that seems to be a significant risk. If we are lucky enough to get AI warning shots, it seems prudent to think about how to ensure that they are recognized for what they are. This is a problem that I haven’t given much thought to before.
But I find it encouraging to think that we can use warning shots in other fields to understand how such events are interpreted. As of now, I don’t think AI warning shots would change much, but I would add this potential for learning as a counter-argument. This seems analogous to the argument “EAs will get better at influencing the government over time” from another comment.
Hi Ben, I like the idea; however, almost every decision has conflicting outcomes, e.g., regarding opportunity cost. As I understand you, this would delegate almost every decision to humans if you take the premise “I can’t do X if I choose to do Y” seriously. The application to high-impact interference therefore seems promising if the system is limited to deciding on only a few things. The question then becomes whether a human can understand the plans that an AGI is capable of making. IMO this ties nicely into, e.g., ELK and interpretability research, but also the problem of predictability.
The reaction seems consistent if people (in government) believe no warning shot was fired. AFAIK, the official reading is that we experienced a zoonosis, so banning gain-of-function research would go against that narrative. It seems true to me that this should be seen as a warning shot, but smallpox and Ebola could have prompted this discussion as well and also failed to be seen as warning shots.
Excellent summary; I had been looking for something like this! Is there a reason you didn’t include the AI Safety Camp in Training & Mentoring Programs?
I like your point that “surprises cut both ways” and assume this is why your timelines aren’t affected by the possibility of surprises; is that about right? I am confused about the ~zero effect, though: isn’t double descent basically what we see with giant language models lately? Disclaimer: I don’t work on LLMs myself, so my confusion isn’t necessarily meaningful.
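To make the double-descent reference concrete, here is a toy sketch (my own illustration, not from the original exchange): min-norm least squares on random ReLU features typically shows the characteristic test-error peak near the interpolation threshold, where the number of features roughly matches the number of training samples. The target function, noise level, and feature counts below are arbitrary choices for illustration.

```python
# Toy double-descent demo: test error of min-norm least squares on random
# ReLU features usually peaks near n_features ~ n_train, then falls again.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 20

def make_data(n):
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)  # hypothetical target + noise
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(n_test)

W = rng.normal(size=(d, 2000))  # shared random projection for all feature counts

def test_error(n_features):
    Phi_tr = np.maximum(X_train @ W[:, :n_features], 0.0)   # random ReLU features
    Phi_te = np.maximum(X_test @ W[:, :n_features], 0.0)
    beta, *_ = np.linalg.lstsq(Phi_tr, y_train, rcond=None)  # min-norm solution
    return float(np.mean((Phi_te @ beta - y_test) ** 2))

for p in (10, 50, 90, 100, 110, 200, 1000):
    print(p, round(test_error(p), 3))
```

The exact numbers depend on the seed, but the error around p ≈ 100 is typically much worse than both the underparameterized and heavily overparameterized regimes.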
Starting more restrictive seems sensible; this could, as you say, be learned away, or one could use human feedback to sign off on high-impact actions. The first problem reminds me of finding regions of attraction in nonlinear control, where the ROA is explored without leaving the stable region (a toy sketch of this idea follows below). The second approach seems to hinge on humans being able to understand the implications of high-impact actions and the consequences of a baseline like inaction. There are probably also other alternatives that we have not yet considered.
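Here is a minimal sketch of the ROA analogy (my own illustration, not part of the original comment), assuming the scalar toy dynamics dx/dt = -x + x^3 and the candidate Lyapunov function V(x) = x^2/2: the estimated safe set is grown only while the Lyapunov decrease condition keeps holding at sampled points, mirroring the idea of exploring without leaving the stable region.

```python
# Toy ROA estimation: enlarge a Lyapunov sublevel set only while the decrease
# condition Vdot < 0 holds at sampled points, i.e. never step outside the
# region currently certified as stable.
import numpy as np

f = lambda x: -x + x**3      # toy dynamics; true ROA of the origin is |x| < 1
V = lambda x: 0.5 * x**2     # candidate Lyapunov function
Vdot = lambda x: x * f(x)    # dV/dt along trajectories

def grow_roa(levels, n_samples=2000):
    """Return the largest level c such that Vdot < 0 on {0 < V(x) <= c} (sampled)."""
    xs = np.linspace(-2.0, 2.0, n_samples)
    certified = 0.0
    for c in levels:
        inside = (V(xs) <= c) & (np.abs(xs) > 1e-6)  # exclude the equilibrium itself
        if np.all(Vdot(xs[inside]) < 0):
            certified = c            # safe to expand the estimate
        else:
            break                    # decrease condition violated: stop growing
    return certified

c_max = grow_roa(np.linspace(0.01, 2.0, 200))
print("certified sublevel set: V(x) <=", round(c_max, 3),
      "i.e. |x| <", round(np.sqrt(2 * c_max), 3))
```

For this toy system, the grown sublevel set recovers roughly the true region of attraction |x| < 1; the sampling grid and level spacing are arbitrary choices for illustration.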
To me, the relevant result/trend is that catastrophic forgetting seems to be becoming less of an issue than it was maybe two to three years ago, e.g., in meta-learning, and that we can squeeze these diverse skills into a single model. Sure, the results seem to indicate that individual systems for different tasks would still be the way to go for now, but at least the published version was not trained with the same magnitude of compute that was, e.g., used on the latest and greatest LLMs (I take this from Lennart Heim, who did the math on this). So it is IMO hard to say, at least not with certainty, whether there are timeline-affecting surprises lurking if we either just trained longer or had faster hardware. I didn’t expect double descent and grokking, so my prior is that unexpected stuff happens.
Thank you for compiling this list. This is useful, and I expect to point people to it in the future. The best thing, IMO, is that it is not verbose and not dripping with personal takes on the problem; I would like to see more compilations of topics like this to give other people a leg up when they aspire to venture into a field.
A potential addition is Dan Hendrycks’s PAIS agenda, in which he advocates for ML research that promotes alignment without also causing advances in capabilities. This effectively also slows AI (capabilities) development, and I am quite partial to this idea.