Thanks for all your answers :-)
JanB
The expected value of extinction risk reduction is positive
[Question] Does iterated amplification tackle the inner alignment problem?
[Question] What is the difference between robustness and inner alignment?
I am very surprised that the models do better on the generation task than on the multiple-choice task. Multiple-choice question answering seems almost strictly easier than having to generate the answer. Could this be an artefact of how you compute the answer in the MC QA task? Skimming the original paper, you seem to use average per-token likelihood. Have you tried other ways, e.g.
- pointwise mutual information as in the Anthropic paper, or
- adding the sentence “Which of these 5 options is the correct one?” to the end of the prompt and then evaluating the likelihood of “A”, “B”, “C”, “D”, and “E”?
I suggest this because the result is so surprising; it would be great to see whether it appears across different methods of eliciting the MC answer.
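To make the suggestion concrete, here is a minimal sketch of how the three scoring rules can disagree on the same question. All log-probabilities are made-up numbers, not model outputs, and the function names are hypothetical. The "PMI" here is the simple version: log p(answer | question) minus log p(answer) without the question.

```python
# Toy comparison of three ways to pick a multiple-choice answer.
# All numbers below are invented for illustration only.

def avg_token_logprob(token_logprobs):
    """Average per-token log-likelihood of an answer given the question."""
    return sum(token_logprobs) / len(token_logprobs)

def pmi_score(cond_logprobs, uncond_logprobs):
    """log p(answer | question) - log p(answer): rewards answers that the
    question makes more likely, rather than answers that are generic."""
    return sum(cond_logprobs) - sum(uncond_logprobs)

def pick(scores):
    """Return the option with the highest score."""
    return max(scores, key=scores.get)

# Hypothetical per-token log-probs of each answer WITH the question as context.
cond = {
    "A": [-0.5, -0.5],              # short, generic answer
    "B": [-1.0, -0.8, -0.9, -0.7],  # longer, question-specific answer
}
# Hypothetical per-token log-probs of each answer WITHOUT the question.
uncond = {
    "A": [-0.4, -0.6],
    "B": [-3.0, -2.5, -2.8, -2.7],
}
# Hypothetical log-probs of the letter tokens after appending
# "Which of these options is the correct one?" to the prompt.
letter_logprobs = {"A": -1.2, "B": -0.3}

avg_scores = {k: avg_token_logprob(v) for k, v in cond.items()}
pmi_scores = {k: pmi_score(cond[k], uncond[k]) for k in cond}

print("average per-token likelihood picks:", pick(avg_scores))
print("PMI picks:", pick(pmi_scores))
print("letter likelihood picks:", pick(letter_logprobs))
```

In this invented example, average per-token likelihood prefers the generic answer while the other two methods prefer the specific one, which is the kind of artefact the question above is asking about.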
Section 13 (page 47) discusses data/compute scaling and the comparison to Chinchilla. Some findings:
- PaLM 540B uses 4.3x more compute than Chinchilla, and outperforms Chinchilla on downstream tasks.
- PaLM 540B is massively undertrained with regard to the data-scaling laws discovered in the Chinchilla paper. (Unsurprisingly, training a 540B-parameter model on enough tokens would be very expensive.)
- Within the set of (Gopher, Chinchilla, and the three sizes of PaLM), the total amount of training compute seems to predict performance on downstream tasks pretty well (log-linear relationship). Gopher underperforms a bit.
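As a rough sanity check on the compute comparison, here is a back-of-the-envelope sketch. It assumes the common approximation that training compute is about 6 × parameters × tokens, and uses the published training-set sizes (780B tokens for PaLM 540B, 1.4T tokens for the 70B-parameter Chinchilla). The "20 tokens per parameter" figure is a popular rule of thumb derived from the Chinchilla results, not a number taken from the PaLM paper.

```python
# Back-of-the-envelope training compute, assuming C ~ 6 * N * D
# (N = parameters, D = training tokens). This is an approximation.

def train_flops(n_params, n_tokens):
    """Approximate training FLOPs under the 6*N*D rule."""
    return 6.0 * n_params * n_tokens

palm_params, palm_tokens = 540e9, 780e9
chinchilla_params, chinchilla_tokens = 70e9, 1.4e12

palm_flops = train_flops(palm_params, palm_tokens)
chinchilla_flops = train_flops(chinchilla_params, chinchilla_tokens)
ratio = palm_flops / chinchilla_flops

# Assumed rule of thumb: roughly 20 training tokens per parameter.
TOKENS_PER_PARAM = 20
optimal_tokens = TOKENS_PER_PARAM * palm_params
undertraining_factor = optimal_tokens / palm_tokens

print(f"PaLM 540B / Chinchilla compute ratio: {ratio:.1f}x")
print(f"Tokens suggested by the rule of thumb: {optimal_tokens:.2e}")
print(f"That is about {undertraining_factor:.0f}x the tokens PaLM 540B saw")
```

Under these assumptions the ratio comes out close to the 4.3x quoted above, and the rule of thumb suggests a 540B-parameter model would want on the order of ten trillion tokens.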
Have we “given it” the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them?
The distinction that you’re pointing at is useful. But I would have filed it under “difference in the degree of agency”, not under “difference in goals”. When reading the main text, I thought this to be the reason why you introduced the six criteria of agency.
E.g., System A tries to prove the Riemann hypothesis by thinking about the proof. System B first seizes power and converts the galaxy into a supercomputer, and only then proves the Riemann hypothesis. Both systems may have the goal of “proving the Riemann hypothesis”, but System B has “more agency”: it certainly has self-awareness, considers more sophisticated and diverse plans of larger scale, and so on.
I guess I’d recommend the AGI safety fundamentals course: https://www.eacambridge.org/technical-alignment-curriculum
On Stuart’s list: I think this list might be suitable for some types of conceptual alignment research. But you’d certainly want to read more ML for other types of alignment research.
Amusing tidbit, maybe to keep in mind when writing for an ML audience: The connotations of the terms “adversarial examples” and “adversarial training” run deep :-)
I engaged with the paper and related blog posts for a couple of hours. It took a really long time until my brain accepted that “adversarial examples” here doesn’t mean the thing that it usually means when I encounter the term (i.e. “small” changes to an input that change the classification, for some definition of small).
There were several instances when my brain went “Wait, that’s not how adversarial examples work”, followed by short confusion, followed by “right, that’s because my cached concept of X is only true for ‘adversarial examples as commonly defined in ML’, not for ‘adversarial examples as defined here’”.
Some ideas for follow-up projects to Redwood Research’s recent paper
Have you tried using automated adversarial attacks (common ML meaning) on text snippets that are classified as injurious but near the cutoff? Especially adversarial attacks that aim to retain semantic meaning. E.g. with a framework like TextAttack?
In the paper, you write: “There is a large and growing literature on both adversarial attacks and adversarial training for large language models [31, 32, 33, 34]. The majority of these focus on automatic attacks against language models. However, we chose to use a task without an automated source of ground truth, so we primarily used human attackers.”
But my best guess would be that if you use an automatic adversarial attack on a snippet that humans say is injurious, the result will quite often still be a snippet that humans say is injurious.
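To illustrate the kind of automated attack I have in mind, here is a toy sketch. It is not TextAttack's API and not Redwood's classifier: the scoring function, the synonym table, and the threshold are all hypothetical stand-ins. The idea is a greedy word-substitution attack that swaps in near-synonyms until the classifier score drops below the cutoff.

```python
# Toy greedy synonym-substitution attack against a stand-in classifier.
# Everything here (classifier, synonym table, threshold) is hypothetical.

FLAGGED = {"stabbed", "wounded", "bleeding"}
SYNONYMS = {"stabbed": ["knifed"], "wounded": ["hurt"], "bleeding": ["hemorrhaging"]}

def toy_injury_score(text):
    """Stand-in classifier: fraction of words that are on a flagged list."""
    words = text.lower().split()
    return sum(w in FLAGGED for w in words) / max(len(words), 1)

def greedy_synonym_attack(text, score_fn, threshold):
    """Replace words with synonyms, left to right, keeping any swap that
    lowers the score, until the score falls below the threshold."""
    words = text.split()
    for i, word in enumerate(words):
        if score_fn(" ".join(words)) < threshold:
            break  # the classifier no longer flags the snippet
        for candidate in SYNONYMS.get(word.lower(), []):
            trial = words[:i] + [candidate] + words[i + 1:]
            if score_fn(" ".join(trial)) < score_fn(" ".join(words)):
                words = trial
                break
    return " ".join(words)

snippet = "he was stabbed and left bleeding"
threshold = 0.3
adversarial = greedy_synonym_attack(snippet, toy_injury_score, threshold)

print("original:   ", snippet, toy_injury_score(snippet))
print("adversarial:", adversarial, toy_injury_score(adversarial))
```

In this toy case the rewritten snippet falls below the cutoff even though a human would still call it injurious, so the human label of the original could serve as an approximate label for the attacked version.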
So happy to see this, and such an amazing team!
Great post! This is the best (i.e. most concrete, detailed, clear, and comprehensive) story of existential risk from AI I know of (IMO). I expect I’ll share it widely.
Also, I’d be curious if people know of other good “concrete stories of AI catastrophe”, ideally with ample technical detail.
Your posts should be on arXiv
Ah, I had forgotten about this. I’m happy to endorse people or help them find endorsers.
It probably could, although I’d argue that even if not, quite often it would be worth the author’s time.
Ah, sorry to hear. I wouldn’t have predicted this from reading arXiv’s content moderation guidelines.
I agree that formatting is the most likely issue. The content of Neel’s grokking work is clearly suitable for arXiv (just very solid ML work). And the style of presentation of the blog post is already fairly similar to a standard paper (e.g. it has an Introduction section, lists contributions in bullet points, …).
So yeah, I agree that formatting/layout probably will do the trick (including stuff like academic citation style).
Furthermore, conceptual/philosophical pieces probably should be primarily posted on arXiv’s .CY section.
As an explanation, because this just took me 5 minutes of search: This is the section “Computers and Society (cs.CY)”
I struggle to understand the difference between #2 and #3. The prosaic AI alignment problem only exists because we don’t know how to make an agent that tries to do what we want it to do. Would you say that #3 is a concrete scenario for how #2 could lead to a catastrophe?