Task: Spotting hidden errors and implications in past writings on alignment
Context: The alignment community has produced a large collection of writing on the alignment problem. This work contains many assumptions, arguments and results that form the basis of our current efforts. If any of these background beliefs are wrong, it could significantly change the structure of the alignment problem. This task would involve teaching an AI to highlight shaky inferences, questionable assumptions and other potential holes or hidden implications in a given line of reasoning. Alignment researchers could then use such AIs to repeatedly scan all past work on the alignment problem, so as to more quickly highlight areas where we could have made a mistake or missed something important.
Input type: A text containing alignment-relevant reasoning.
Output type: Text that describes a potential mistake, missing implication or other issue with the alignment-related text in question.
Additional: These examples just show the key sections of text where the potential mistake / hidden implication occurs. In actual training and execution, the AI would receive the entire alignment-related text (or a large section thereof) as input, and its output would also specify the location in the input text where the issue occurs.
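To make the input and output types above concrete, here is a minimal sketch of how a single training instance might be represented. This is only an illustration; the `CritiqueInstance` class, its field names, and the toy example are hypothetical, not taken from any existing dataset or codebase.

```python
from dataclasses import dataclass

@dataclass
class CritiqueInstance:
    """One training instance for the error-spotting task described above."""
    source_text: str        # the full alignment-related text (or a large section of it)
    issue_start: int        # start of the character span where the issue occurs
    issue_end: int          # end of the character span where the issue occurs
    issue_description: str  # the potential mistake, missing implication, or other issue

# A toy instance, paraphrasing Instance 1 below.
flagged_passage = "We must take care to avoid inner alignment failures."
example = CritiqueInstance(
    source_text=flagged_passage + " Consider that humans are not aligned with evolution.",
    issue_start=0,
    issue_end=len(flagged_passage),
    issue_description=(
        "Hidden implication: preventing inner alignment failures in AI systems "
        "means preventing the processes by which human values arose in humans."
    ),
)
```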
Instance 1:
Input:
We must take care to avoid inner alignment failures. They can lead to a system with goals that diverge from the base objective to an arbitrarily large degree. Consider that humans are not aligned with evolution. We do things like wear condoms and otherwise fail to maximize our inclusive genetic fitness. Our values diverge arbitrarily widely from those of evolution.
Output:
Hidden implication: If human values derive from an inner alignment failure, then attempting to prevent inner alignment failures in AI systems means preventing the processes by which human values arose in humans.
Instance 2:
Input:
Humans are adapted to chase deer across the savanna, throw spears into them, cook them, and then—this is probably the part that takes most of the brains—cleverly argue that they deserve to receive a larger share of the meat.
It’s amazing that Albert Einstein managed to repurpose a brain like that for the task of doing physics. This deserves applause. It deserves more than applause, it deserves a place in the Guinness Book of Records. Like successfully building the fastest car ever to be made entirely out of Jello.
(From Eliezer Yudkowsky’s post My Childhood Role Model)
Output:
Retrodictive error: the claim that the human learning process is highly specialized towards the ancestral environment implies that humans should not be able to generalize well beyond that environment, which does not conform to observed reality.
I think this task will be very difficult for language models to do. I think even Chinchilla may not be quite good enough to be truly useful here. However, I think this task is significantly less difficult than directly making original progress on the core of alignment.
I also think a lot of useful alignment research is blocked by subtle background assumptions that we don’t realize we should question. I basically consider this task to be automating the search for the sort of “miracle” that Yudkowsky sometimes describes as “[violating] some aspect of [his] background model” (source).
This task is also unusual in that even a single true success from such an approach could be enough to entirely change the game with regard to alignment. It would be worth trawling through thousands of false positives to find a single such true positive. The median result of running the system could be garbage. As long as there are occasional gems, it would be worthwhile. Note that large language models do tend to occasionally produce exceptional outputs, even in scenarios where they usually do poorly.
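As a rough illustration of why occasional gems can justify a very high false-positive rate, here is a back-of-the-envelope sketch; the hit rate and triage cost below are made-up assumptions, not estimates from this post.

```python
# Back-of-the-envelope triage cost per genuine finding.
# All numbers below are illustrative assumptions, not measurements.
flagged_per_true_positive = 5_000   # assume thousands of false positives per real issue
minutes_to_triage_flag = 2          # assume a quick human skim per flagged issue

triage_hours_per_true_positive = flagged_per_true_positive * minutes_to_triage_flag / 60
print(f"Triage hours per genuine issue found: {triage_hours_per_true_positive:.0f}")
# About 167 hours, i.e. on the order of a researcher-month of triage. Whether that
# is a good trade depends on how much a single overturned background assumption
# would change the overall shape of the alignment problem.
```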