Projects I’d like to work on in 2023.
Wrote up a short (incomplete) bullet point list of the projects I’d like to work on in 2023:
Accelerating Alignment
Main time spent (initial ideas, will likely pivot to varying degrees depending on feedback; will start with one):
Fine-tune GPT-3/GPT-4 on alignment text and connect the API to Loom, VSCode (Copilot for alignment research), and potentially note-taking apps like Roam Research. (1-3 months, depending on bugs and whether we keep adding features.)
Create an audio-to-post pipeline that helps alignment researchers create posts through conversations rather than staring at a blank page. (1-4 months, depending on collaboration with Conjecture and others, and on how many features we add.)
Leave the door open to experiment with ChatGPT and/or GPT-4 on uses we haven’t explored yet. For GPT-4 especially, we can guess in advance what it will be capable of, but we’ll likely need to experiment a lot to discover how to use it optimally, given that it might have capabilities GPT-3 doesn’t have. (2 to 6 weeks.)
Work with Janus, Nicholas Dupuis, and others on building tools for accelerating alignment research using language models (in prep for and integrating GPT-4). These will serve as tools for augmenting the work of alignment researchers. Many of the tool examples are covered in the grant proposal, my recent post, an upcoming post, and Nicholas’s doc on Cyborgism (we’ve recently spun up a Discord to discuss these things with other researchers; send a DM for the link). This work is highly relevant to OpenAI’s main alignment proposal.
The above work involves:
Working on setting the foundation for automating alignment and making proposal verification viable. (1 week of active work for a post I’m working on, and then some passive work while I build tools.)
Studying the epistemology of effective research, to help generate research that leads us to solve alignment. For example, promoting flow and genius moments, and effective learning (I’m taking a course on this and so far it is significantly better than the “Learning How to Learn” course) and how it can translate to high-quality research. (5 hours per week)
Studying how to optimally condition generative models for alignment.
It’s very hard to predict how the tool-building will go because I expect to be doing a lot of iteration to land on things that are optimally useful rather than coming up with a specific plan and sticking to it. My goal here is to implement design thinking and approaches that startups use. This involves taking the survey responses, generating a bunch of ideas, creating an MVP, testing it out with alignment researchers, and then learning from feedback.
Finish a sequence I’m working on with others. We are currently editing the intro post and refining the first post. We went through 6 weeks of seminars for a set of drafts and we are now working to build upon those. (6 to 8 weeks)
Other Projects outside of the grant (will dedicate about 1 day per week, but expect to focus more on some of these later next year, depending on how Accelerating Alignment goes. If not, I’ll likely find some mentees or more collaborators to work on some of them.)
Support the Shard Theory team in running experiments using RL and language models. I’ll be building off of my MATS colleagues’ work. (3 to 5 months for running experiments and writing about them. Would consider spending a month or so on this and then mentoring someone to continue.)
Applying the Tuned Lens to better understand what transformers are doing. For example, what is being written and read from the residual stream and how certain things like RL lead to non-myopic behaviour. Comparing self-supervised models to RL fine-tuned models. (2 to 4 months by myself, probably less if I collaborate.)
Building off of Causal Tracing and Causal Scrubbing to develop more useful causal interpretability techniques. In this linked doc, I discuss this in the second main section: “Relevance For Alignment.” (3 days to wrap up first post. For exploring, studying and writing about new causal methods, anywhere from 2 months to 4 months.)
Provide support for governance projects. I’ve been mentoring someone looking to explore AI Governance for the past few months (they are now applying for an internship at GovAI). They are currently writing up a post on “AI safety” governance in Canada. I’ll be providing mentorship on a few posts I’ve suggested they write. Here’s my recent governance post. (2-3 hours per week)
Update and wrap up the GEM proposal. Adding new insights to it, including the new Tuned Lens that Nora has been working on. (1 week)
Applying quantilizers to Large Language Models. This project is still in the discovery phase for a MATS colleague of mine. I’m providing comments at the moment, but it may turn into a full-time project later next year.
Mentoring through the AI Safety Mentors and Mentees program. I’m currently mentoring someone who is working on projects relevant to Shard Theory and Infra-Bayesianism.
Two other projects I would find interesting to work on:
Causal Scrubbing to remove specific capabilities from a model. For example, training a language model on The Pile and a code dataset, then applying causal scrubbing to try to remove the model’s ability to generate code while still achieving a similar loss on The Pile.
A few people have started extending the work from the Discovering Latent Knowledge in Language Models without Supervision paper. I think this work could potentially evolve into a median-case solution to avoiding x-risk from AI.
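For readers unfamiliar with the quantilizer idea mentioned in the list above, here is a minimal sketch of the usual sampling-based approximation: draw candidates from the base distribution, rank them by a utility score, and pick uniformly at random from the top q fraction rather than taking the single best. This is not the MATS project’s implementation; the function name `quantilize`, the toy candidates, and the use of string length as the utility are all hypothetical stand-ins (for an LLM, the candidates would be completions sampled from the unmodified model and the utility would be some scoring function).

```python
import math
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def quantilize(
    candidates: Sequence[T],
    utility: Callable[[T], float],
    q: float,
    rng: random.Random,
) -> T:
    """Return a uniformly random element from the top q fraction by utility.

    Assumption: `candidates` are i.i.d. samples from the base distribution,
    so the top-q slice approximates the base distribution conditioned on
    being in its top q quantile of utility.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    # Rank from highest to lowest utility.
    ranked = sorted(candidates, key=utility, reverse=True)
    # Keep at least one candidate even when q * n rounds below 1.
    k = max(1, math.ceil(q * len(ranked)))
    return rng.choice(ranked[:k])


if __name__ == "__main__":
    # Toy stand-ins: strings play the role of sampled completions,
    # and length plays the role of the utility score (both hypothetical).
    samples = ["a", "bb", "ccc", "dddd", "eeeee", "ffffff", "ggggggg", "hhhhhhhh"]
    print(quantilize(samples, len, q=0.25, rng=random.Random(0)))
```

With q = 1 this reduces to sampling from the base distribution, and as q shrinks toward 0 it approaches picking the highest-utility candidate; intermediate values trade optimization pressure against staying close to the base model’s behaviour.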
Curious if you have any updates!
Working on a new grant proposal right now. Should be sent this weekend. If you’d like to give feedback or have a look, please send me a DM! Otherwise, I can send the grant proposal to whoever wants to have a look once it is done (still debating about posting it on LW).
Outside of that, there has been a lot of progress on the Cyborgism discord (there is a VSCode plugin called Worldspider that connects to the various APIs, and there has been more progress on Loom). Most of my focus has gone towards looking at the big picture and keeping an eye on all the developments. Now, I have a better vision of what is needed to create an actually great alignment assistant and have talked to other alignment researchers about it to get feedback and brainstorm. However, I’m spread way too thin and will request additional funding to get some engineer/builder to start building the ideas out so that I can focus on the bigger picture and my alignment work.
If I can get my funding again (previous funding ended last week), then my main focus will be building out the system I have in mind for accelerating alignment work and continuing to work on the new agenda I put out with Quintin and others. There’s some other stuff I’d like to do, but those are lower priority or will depend on timing. It’s been hard to get the funding application done because things are moving so fast and I’m trying not to build things that will be built by default. I’ve also been talking to some people about the possibility of building an org so that this work could go a lot faster.
Very excited by this agenda. Just today I was discussing my hope that someone fine-tunes LLMs on the alignment archive soon!
Do you have a link?
I’d be interested in being added to the Discord