Edit: On a closer read, I take it you’re looking only for tasks well-suited for language models? I’ll leave this comment up for now, in case it’d still be of use.
Task: Extract the training objective from a fully-trained ML model.
Input type: The full description of an ML model’s architecture + its parameters.
Output type: Mathematical or natural-language description of the training objective.
| Input | Output |
|---|---|
| [Learned parameters and architecture description of a fully-connected neural network trained on the MNIST dataset.] | Classifying handwritten digits. |
| [Learned parameters and architecture description of InceptionV1.] | Labeling natural images. |
Can’t exactly fit that here, but the dataset seems relatively easy to assemble.
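To make the format concrete, here’s a minimal sketch of how one (input, output) pair might be assembled, assuming PyTorch; `serialize_model` and the structure of the serialized dict are my own illustrative choices, not an existing API:

```python
import torch.nn as nn

def serialize_model(model):
    """Flatten a trained model into an architecture description plus raw parameters."""
    return {
        "architecture": str(model),  # human-readable layer-by-layer description
        "parameters": [p.detach().flatten().tolist() for p in model.parameters()],
    }

# A fully-connected MNIST classifier, as in the first example row above.
# (In the real dataset, the model would of course be trained before serialization.)
mnist_net = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
example = (serialize_model(mnist_net), "Classifying handwritten digits.")
```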
We can then play around with it:
- See how well it generalizes. Does it stop working if we show it a model with a slightly unfamiliar architecture? Or a model with an architecture it knows, but trained for a novel-to-it task? Or a model with a familiar architecture, trained for a familiar task, but on a novel dataset? This would show whether Chris Olah’s universality hypothesis holds for high-level features.
- See if it can back out the training objective at all. If not, uh-oh, we have pseudo-alignment. (Note that the reverse isn’t true: even if it can extract the intended training objective, the inspected model can still be pseudo-aligned.)
- Mess around with what exactly we show it. If we show it all but the first layer of a model, would it still work? Only the last three layers? What’s the minimal set of parameters it needs to know? (A sketch of this ablation follows the list.)
- Hook it up to an attribution tool to see which specific parameters it looks at when figuring out the training objective.
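For the third experiment, the ablation could look something like the sketch below. It reuses the serialization idea from the earlier snippet, assumes each linear layer contributes a (weight, bias) pair, and all names are illustrative:

```python
import torch.nn as nn

def serialize_partial(model, skip_first_layers=0, keep_last_layers=None):
    """Serialize a model while withholding some of its parameters from the extractor."""
    params = list(model.named_parameters())  # (name, tensor) pairs, in layer order
    params = params[2 * skip_first_layers:]  # weight + bias per linear layer
    if keep_last_layers is not None:
        params = params[-2 * keep_last_layers:]
    return {
        "architecture": str(model),          # the architecture stays fully visible
        "parameters": {n: p.detach().flatten().tolist() for n, p in params},
    }

net = nn.Sequential(nn.Flatten(), nn.Linear(784, 512), nn.ReLU(),
                    nn.Linear(512, 256), nn.ReLU(),
                    nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
all_but_first = serialize_partial(net, skip_first_layers=1)  # hide the first layer
last_three = serialize_partial(net, keep_last_layers=3)      # only the last three layers
```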
Task: Automated Turing test.
Context: A new architecture improves on state-of-the-art AI performance. We want to check whether AI-generated content is still distinguishable from human-generated content.
Input type: Long string of text, 100-10,000 words.
Output type: Probability that the text was generated by a human.
Etc., etc.; the data are easy to generate. Those are from here and here.
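As a sanity-check baseline (emphatically not the proposed detector, which would have to target a state-of-the-art generator), even a bag-of-words classifier already matches the input/output interface; the texts and labels below are placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder corpus: label 1 = human-written, 0 = model-generated.
texts = ["...a 100-10,000 word human-written passage...",
         "...a 100-10,000 word model-generated passage..."]
labels = [1, 0]

detector = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
detector.fit(texts, labels)

# Output type: probability that a new text was generated by a human.
p_human = detector.predict_proba(["some new passage to screen"])[0][1]
```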
To be honest, I’m not sure this is exactly what you’re asking for, but it seems easy to implement (compute costs aside) and might serve as a (very weak and flawed) “fire alarm for AGI” + provide some insights. For example, we can then hook this Turing Tester up to an attribution tool and see what specific parts of the input text make it conclude the text was/wasn’t generated by an ML model. That could shed some light on the ML model in question (are there specific patterns it repeats? abstract mistakes too subtle for us to notice, but still statistically significant?).
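One crude way to get that attribution, sketched under the assumption that the detector exposes a `predict_proba`-style interface (as the baseline above does): occlude each word in turn and measure how much the output shifts.

```python
def occlusion_attribution(detector, text):
    """Score each word by how much deleting it changes P(human)."""
    words = text.split()
    base = detector.predict_proba([text])[0][1]  # P(human) on the full text
    scores = []
    for i in range(len(words)):
        occluded = " ".join(words[:i] + words[i + 1:])
        scores.append(base - detector.predict_proba([occluded])[0][1])
    # Positive score: the word pushed the detector toward "human-written".
    return list(zip(words, scores))
```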
Alternatively, in the slow-takeoff scenario where there’s a brief window, before the ASI kills us all, in which we have to worry about people weaponizing AI for e.g. propaganda, something like this tool might be used to screen messages before reading them, if it works.