This is exactly what Ought is doing as we build Elicit into a research assistant using language models / GPT-3. We’re studying researchers’ workflows and identifying ways to productize or automate parts of them. In that process, we have to figure out how to turn GPT-3, a generalist by default, into a specialist that is a useful thought partner for domains like AI policy. We have to learn how to take feedback from the researcher and convert it into better results within a session, per person, per research task, and across the entire product. Another spin on it: we have to figure out how researchers can use GPT-3 to become expert-like in new domains.
We’re currently using GPT-3 for classification, e.g. “take this spreadsheet and determine whether each entity in Column A is a non-profit, government entity, or company.” Some concrete examples of alignment-related work that have come up as we build this:
One idea for making classification work is to have users generate explanations for their classifications. Then have GPT-3 generate explanations for the unlabeled objects. Then classify based on those explanations. This seems like a step towards “have models explain what they are doing.”
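To make that concrete, here is a minimal sketch of the explain-then-classify flow, using the spreadsheet example above. This is an illustration rather than our production prompt: it assumes the pre-1.0 `openai` Python package (`openai.Completion.create`), a plain `davinci` engine, and hypothetical user-written examples and explanations.

```python
import openai  # assumes the pre-1.0 openai package and OPENAI_API_KEY set in the environment

# Hypothetical user-labeled rows, each with the user's own explanation.
LABELED = [
    ("Red Cross", "It is funded by donations and exists to provide aid, not to make a profit.", "non-profit"),
    ("Department of Energy", "It is a federal agency created and run by the US government.", "government entity"),
    ("Stripe", "It sells payment-processing services to earn revenue for shareholders.", "company"),
]

def complete(prompt):
    resp = openai.Completion.create(
        engine="davinci", prompt=prompt, max_tokens=60, temperature=0.0, stop=["\n"]
    )
    return resp["choices"][0]["text"].strip()

def classify_via_explanation(entity):
    """Have GPT-3 explain the entity first, then classify based on that explanation."""
    shots = "".join(
        f"Entity: {name}\nExplanation: {why}\nLabel: {label}\n\n" for name, why, label in LABELED
    )
    explanation = complete(shots + f"Entity: {entity}\nExplanation:")                   # step 1: explain
    label = complete(shots + f"Entity: {entity}\nExplanation: {explanation}\nLabel:")   # step 2: classify
    return explanation, label

print(classify_via_explanation("Harvard University"))
```

The point of the two-step structure is that the explanation isn’t just interpretability garnish: the final label is conditioned on it, so a user can read, audit, or edit the explanation that drove the classification.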
I don’t think we’ll do this in the near future, but we could explore other ways to make GPT-3 internally consistent (a rough sketch follows this list), for example:
Ask GPT-3 why it classified Harvard as a “center for innovation.”
Then ask GPT-3 if that reason is true for Microsoft.
Or just ask GPT-3 if Harvard is similar to Microsoft.
Then ask GPT-3 directly if Microsoft is a “center for innovation.”
And fine-tune results until we get to internal consistency.
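As a sketch of what that consistency loop could look like (again assuming the pre-1.0 `openai` package; the prompts and the yes/no parsing are hypothetical, not Elicit’s):

```python
import openai  # assumes the pre-1.0 openai package; prompts below are illustrative

def ask(prompt):
    resp = openai.Completion.create(
        engine="davinci", prompt=prompt, max_tokens=50, temperature=0.0, stop=["\n"]
    )
    return resp["choices"][0]["text"].strip()

def consistency_check(anchor, other, label):
    """Does the model's stated reason for labeling `anchor` transfer to `other` consistently?"""
    # 1. Why did the model apply the label to the anchor entity?
    reason = ask(f"Q: Why is {anchor} a {label}?\nA:")
    # 2. Does that reason also hold for the other entity?
    transfers = ask(f'Consider the claim: "{reason}"\nQ: Is this also true of {other}? Answer yes or no.\nA:')
    # 3. Ask directly whether the other entity gets the same label.
    direct = ask(f"Q: Is {other} a {label}? Answer yes or no.\nA:")
    transfers_yes = transfers.lower().startswith("yes")
    direct_yes = direct.lower().startswith("yes")
    return {"reason": reason, "reason_transfers": transfers_yes,
            "direct_answer": direct_yes, "consistent": transfers_yes == direct_yes}

print(consistency_check("Harvard", "Microsoft", "center for innovation"))
```

The inconsistent cases are exactly the ones you’d collect as feedback or fine-tuning data.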
We eventually want to apply classification to the systematic review (SR) process, or some lightweight version of it. In the SR process, there is one step where two human reviewers identify which of 1,000-10,000 publications should be included in the SR by reviewing the title and abstract of each paper. After narrowing the pool down to ~50, two human reviewers read each remaining paper in full and decide which should be included. Getting GPT-3 to replace these two human steps while being as good as two experts reading the whole paper seems like the kind of sandwiching task described in this proposal.
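For the title/abstract screening step, a GPT-3 version might look something like the sketch below. This is a hedged illustration: the inclusion criteria, prompt format, and yes/no parsing are placeholders, and it again assumes the pre-1.0 `openai` package.

```python
import openai  # assumes the pre-1.0 openai package; criteria and fields are illustrative

def screen(title, abstract, criteria):
    """Mimic the human title/abstract screening step: include or exclude against stated criteria."""
    prompt = (
        "You are screening papers for a systematic review.\n"
        f"Inclusion criteria: {criteria}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n"
        "Should this paper be included? Answer yes or no.\nAnswer:"
    )
    resp = openai.Completion.create(
        engine="davinci", prompt=prompt, max_tokens=3, temperature=0.0, stop=["\n"]
    )
    return resp["choices"][0]["text"].strip().lower().startswith("yes")

# Sandwiching-style evaluation: run this over the 1,000-10,000 candidates and compare the
# model's inclusions against what two experts decide after reading the full papers.
```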
We’d love to talk to people interested in exploring this approach to alignment!