A New York-based alignment hub that aims to provide talent search and logistical support for NYU Professor Sam Bowman’s planned AI safety research group.
I think my lab is bottlenecked on things other than talent and outside support for now, but there probably is more that could be done to help build/coordinate an alignment research scene in NYC more broadly.
More organizations like CAIS that aim to recruit established ML talent into alignment research
This is somewhat risky, and should get a lot of oversight. One of the biggest obstacles to discussing safety in academic settings is that academics are increasingly turned off by clumsy, arrogant presentations of the basic arguments for concern.
+1. The combination of the high dollar amount, the subjective criteria, and the panel drawn from the relatively small/insular ‘core’ AI safety research community means that I expect this to look pretty fishy to established researchers. Even if the judgments are fair (I think they probably will be!) and the contest yields good work (it might!), I expect the benefit of that to be offset to a pretty significant degree by the red flags this raises about how the AI safety scene deals with money and its connection to mainstream ML research.
(To be fair, I think the Inverse Scaling Prize, which I’m helping with, raises some of these concerns, but the more precise/partially-quantifiable prize rubric, bigger/more diverse panel, and use of additional reviewers outside the panel mitigate them at least partially.)
Update: We did a quick follow-up study adding counterarguments, turning this from single-turn to two-turn debate, as a quick way of probing whether more extensive full-transcript debate experiments on this task would work. The follow-up results were negative.
Tweet thread here: https://twitter.com/sleepinyourhat/status/1585759654478422016
Direct paper link: https://arxiv.org/abs/2210.10860 (To appear at the NeurIPS ML Safety workshop.)
We’re still broadly optimistic about debate, but not on this task, and not in this time-limited, discussion-limited setting, and we’re doing a broader, more fail-fast-style search of other settings. Stay tuned for more methods and datasets.
Fair. For better or worse, a lot of this variation came from piloting—we got a lot of nudges from pilot participants to move toward framings that were perceived as controversial or up for debate.
Thanks! I’ll keep my opinionated/specific overview of the alignment community, but I know governance less well, so I’m happy to defer there.
Thanks! Fixed link.
I agree that this points in the direction of video becoming increasingly important.
But why assume only 1% is useful? And more importantly, why use only the language data? Even though we don’t have the scaling laws, it seems pretty clear that there’s a ton of information in the non-language parts of videos that’d be useful to a general-purpose agent, almost certainly more than in the language parts. (Of course, it’ll take more computation to extract the same amount of useful information from video than from text.)
Thanks! I’ll admit that I meant to be asking especially about the toxicity case, though I didn’t make that at all clear. As in Charlie’s comment, I’m most interested in using this approach as a way to efficiently explore and pilot techniques that we can ultimately adapt back to humans, and text-based interactions seem like a good starting point for that kind of work.
I don’t see a clear picture either way on whether the noisy signal story presents a hard problem that’s distinctively alignment-oriented.
Thanks! I think I have some sense of what both directions look like, but not enough to know what a concrete starting experiment would look like. What would a minimum viable experiment look like for each?
Is anyone working on updating the Biological Anchors Report model based on the updated slopes/requirements here?
I can look up the exact wording if it’s helpful, but I assume it’s clear from the basic setup that at least one of the arguments has to be misleading.
I have no reason to be especially optimistic given these results, but I suppose there may be some fairly simple questions for which it’s possible to enumerate a complete argument in a way that flaws will be clearly apparent.
In general, it seems like single-turn debate would have to rely on an extremely careful judge, which we don’t quite have, given the time constraint. Multi-turn seems likely to be more forgiving, especially if the judge has any influence over the course of the debate.
Yep. (Thanks for re-posting.) We’re pretty resigned to the conclusion that debate fails to reach a correct conclusion in at least some non-trivial cases—we’re mainly interested in figuring out (i) whether there are significant domains or families of questions for which it will often reach a conclusion, and (ii) whether it tends to fail gracefully (i.e., every outcome is either correct or a draw).