Language Model Tools for Alignment Research

I do not speak for the rest of the people working on this project.

[I think it’s valuable to have clear, short intros on different research agendas & projects]

How does this reduce x-risk?

AI will continue to become more powerful; we should leverage this to accelerate alignment research. Performance on language tasks will follow this trend (even if transformers don’t lead to AGI, whatever comes next will still be capable of language tasks, so this argument doesn’t rely on transformers scaling). If you believe that certain research agendas reduce x-risk, then giving those researchers better tools to do their work faster also reduces x-risk.

Differential Impact

Tools that can accelerate alignment research can probably be repurposed to accelerate capabilities research, so wouldn’t developing these tools be net negative?

Yes, especially if you gave them out to everybody, or if you were a for-profit company with incentives to do so.

Giving them out only to alignment researchers mitigates this. There’s also lots of alignment-researcher-specific work to do, such as:

1. Collecting an alignment dataset

2. Understanding the workflows of alignment researchers

3. Making it incredibly easy for alignment researchers to use these tools

4. Keeping non-alignment-specific datasets private

Though we could still advance capabilities just by being the first to achieve something and announcing that we succeeded. For example, OpenAI released their “inserting text” feature without telling people how they did it, but the people I work with figured out a way to do it based on that information alone. The moral is that even announcing that you succeeded leaks bits of information that those in the know can work backwards from.

Counterfactual Impact

Let’s say a project like this never gets started: there are still huge economic incentives to make similar products and sell them to large numbers of people. Elicit, Cohere, Jasper (previously Jarvis), OpenAI, DeepMind, and more in the years to come will create these products, so why shouldn’t we just use theirs, since they’re likely to beat us to it and do it better?

Good point. Beyond the points made in “Differential Impact,” having the infrastructure and people to quickly integrate the latest advances into alignment researchers’ workflows is useful even in this scenario. This includes engineers, pre-existing code for interfaces, & a data-labeling pipeline.

Current Work

We’ve scraped LessWrong (including the Alignment Forum), some blogs, relevant arXiv papers and the papers they cite, and books. We are currently fine-tuning on this data and trying to make the model do useful tasks for us.
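For concreteness, here is a minimal sketch of what turning the scraped text into a fine-tuning dataset might look like. The directory layout, file names, and JSONL fields below are hypothetical placeholders, not our actual pipeline:

```python
import json
from pathlib import Path

# Hypothetical layout: one plain-text file per scraped document under scraped/<source>/
SOURCES = ["lesswrong", "alignment_forum", "blogs", "arxiv", "books"]

def build_dataset(root: str = "scraped", out_path: str = "alignment_corpus.jsonl") -> None:
    """Flatten the scraped documents into a JSONL file of {"source", "text"} records,
    a format most language-model fine-tuning scripts can ingest."""
    root_dir = Path(root)
    with open(out_path, "w", encoding="utf-8") as out:
        for source in SOURCES:
            for doc in sorted((root_dir / source).glob("*.txt")):
                text = doc.read_text(encoding="utf-8").strip()
                if text:  # skip empty or failed scrapes
                    out.write(json.dumps({"source": source, "text": text}) + "\n")

if __name__ == "__main__":
    build_dataset()
```

The resulting file can then be handed to whatever fine-tuning script we settle on.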

We’ve also released a survey and talked to several people about what would be most useful for them. However, the most important feedback may come from actually demoing the tools for different users.

Open Problems/Future Work

1. Cleaning up data

2. Collecting and listing more sources of alignment data

3. Creating a pipeline of data-labelers for specific tasks

  • Simple interface for collecting data for different tasks

4. Trying to get our models to do specific tasks

  • Clever ways of generating datasets (e.g. using TL;DR posts from Reddit for summarization)

  • Clever ways of prompting the model (e.g. a specific LW username may be really good at a specific task)

  • Different fine-tuning formats (e.g. formatting the LW dataset as “username: X, Karma: 200, post: ” and varying the karma; this could also be done with non-LW authors, e.g. from arXiv). See the sketch after this list.

5. Getting better feedback on the most useful tools for alignment researchers.
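To illustrate the conditioning idea in item 4, here is a minimal sketch of the metadata-prepended fine-tuning format, assuming we prepend username and karma to each post so the model learns to condition on them (the field names and values are illustrative, not a settled format):

```python
def format_example(username: str, karma: int, post: str) -> str:
    """Prepend author metadata so the model learns to condition on it."""
    return f"username: {username}, Karma: {karma}, post: {post}"

# Training time: wrap each scraped post in the metadata format
# (the username and karma here are made up for illustration).
train_example = format_example("some_lw_author", 200, "Here is a post about AI alignment...")

# Inference time: leave the post empty and vary the karma (or the username)
# to steer the model toward higher-quality or author-specific completions.
prompt = format_example("some_lw_author", 500, "")
```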

Who to Contact

DM me. Also join us in the #accelerating-alignment channel on the EleutherAI Discord server: https://discord.gg/67AYcKK6, or DM me for a link if it has expired.