Looking for help with an acausal safety project. If you’re interested or know someone who might be, it would be really great if you let me know or share this.
1. Help with acausal research and get mentoring to learn about decision theory
Motivation: Caspar Oesterheld (inventor/discoverer of ECL/MSR), Emery Cooper and I are doing a project where we try to get LLMs to help us with our acausal research.
Our research is ultimately aimed at making future AIs acausally safe.
Project: As a first step, we are trying to train an LLM classifier that evaluates critiques of arguments. To do so, we need a large number of both good and bad arguments about decision theory (and other areas of philosophy).
How you’ll learn: If you would like to learn about decision theory, anthropics, open source game theory, …, we supply you with a curriculum. There’s a lot of leeway for what exactly you want to learn about. You go through the readings.
If you already know things and just want to test your ideas, you can optionally skip this step.
Your contribution: While doing your readings, you write up critiques of the arguments you read.
Bottom line: We get to use your arguments/critiques for our projects and you get our feedback on them. (We have to read and label them for the project anyway.)
Logistics: Unfortunately, you’d be a volunteer. I might be able to pay you a small amount out of pocket, but it’s not going to be very much. Caspar and Em are both university-employed and I am similar in means to an independent researcher. We are also all non-Americans based in the US, which makes it harder for us to acquire money for projects and such for boring and annoying reasons.
Why are we good mentors: Caspar has dozens of publications on related topics. Em has a handful. And I have been around.
2. Be a saint and help with acausal research by doing tedious manual labor and getting little in return
We also need help with various grindy tasks that aren’t super helpful for learning, e.g. turning pdfs with equations etc. into sensible txts to feed to LLMs. If you’re motivated to help with that, we would be extremely grateful.
What’s the minimum capacity in which you’re expecting people to contribute? Are you looking for a few serious long-term contributors or are you also looking for volunteers who offer occasional help without a fixed weekly commitment?
My current guess is that occasional volunteers are totally fine! There’s some onboarding cost, but mostly the cost on our side scales with the number of argument-critique pairs we get. Since the whole point is to have critiques across a wide range of quality, I don’t expect the nth argument-critique pair we get to be much more usable than the 1st one. I might be wrong about this and change my mind as we try this out with people, though!
(Btw I didn’t get a notification for your comment, so maybe better to dm if you’re interested.)
Labs should differentially distill safety-research capabilities of internal models like Mythos into public models
To the extent that some lab’s best models aren’t publicly accessible due to their dangerous capabilities, I think it would be great if labs differentially distilled beneficial capabilities from their best models into their publicly available models. This could help a bunch with uplifting AI safety research.
How to implement this?
scrape the kind of questions and tasks that AI safety people ask models, ask the best internal model to answer/solve these and distill the answers back into the best public model, or
just think about which kinds of capabilities seem differentially useful for safety uplift and do something more principled to distill those kinds of capabilities. For example, conceptual reasoning seems pretty differentially useful for safety (which is why my team is working on improving it).
We probably don’t have to worry too much about narrowing down exactly the stuff that’s differentially useful, since the main reason to worry about making models better is that AI labs having access to better models speeds up AI R&D but, well, the AI lab already has access anyway. So it’s probably better to distill too much than too little, as long as that doesn’t degrade performance too much. (Obviously, don’t distill the stuff you’re holding the model back for, like bio and hacking.)
Maybe this already happens, but when I mentioned this before, it seemed like a new idea to some people, including some lab employees, so I’m putting it out there.
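To make the first option a bit more concrete, here is a rough Python sketch of the data-collection half of it: take scraped safety-research prompts, drop anything touching withheld capabilities, and collect the internal model’s answers as training pairs for the public model. Everything here is an assumption for illustration: `build_distillation_set`, `is_withheld`, `DistillationExample`, and the `teacher` callable are hypothetical names, the keyword filter is a deliberately crude stand-in for whatever real filtering a lab would use, and the actual fine-tuning step is left out because it depends on the lab’s own training setup.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical stand-ins for the capabilities a lab is holding the model back for.
WITHHELD_TOPICS = ("bio", "hacking")


@dataclass(frozen=True)
class DistillationExample:
    """One (prompt, answer) pair to fine-tune the public model on."""
    prompt: str
    teacher_answer: str


def is_withheld(prompt: str, withheld_topics: Iterable[str] = WITHHELD_TOPICS) -> bool:
    """Crude substring check; assumed placeholder for a much more careful filter."""
    lowered = prompt.lower()
    return any(topic in lowered for topic in withheld_topics)


def build_distillation_set(
    prompts: Iterable[str],
    teacher: Callable[[str], str],
) -> List[DistillationExample]:
    """Ask the internal (teacher) model each allowed prompt and keep its answer."""
    examples = []
    for prompt in prompts:
        if is_withheld(prompt):
            continue  # never distill the stuff the model is being held back for
        examples.append(DistillationExample(prompt, teacher(prompt)))
    return examples


if __name__ == "__main__":
    # Toy teacher standing in for a call to the internal model.
    def toy_teacher(prompt: str) -> str:
        return f"[internal model answer to: {prompt}]"

    scraped = [
        "Critique this decision theory argument.",
        "Explain hacking into a server.",
    ]
    for example in build_distillation_set(scraped, toy_teacher):
        print(example.prompt, "->", example.teacher_answer)
```

The resulting pairs would then go into whatever supervised fine-tuning pipeline the lab already uses for its public model. Erring on the side of a loose topic filter matches the point above that distilling too much is probably better than too little, as long as the withheld capabilities stay out.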
Idea is from Caspar Oesterheld and I just quickly wrote it up.
To clarify, I’m not in favour of labs holding back their models. This is just conditional on them doing it anyways.
This would be cool, but also, I think it would be good if companies just provided access to partially available models (like Mythos) to reputable independent safety researchers. I don’t really see any reason why METR/Redwood/various smaller safety orgs and academic groups shouldn’t get access if it’s already being widely used by corporations for cyberdefense. I guess the only risk is they might find an issue with the model (e.g. along the lines of Ryan Greenblatt’s misalignment issues with previous Claude models) and write about it, embarrassing Anthropic, but you could just offer access to the model in exchange for an NDA on discussion of it (there already seem to be restrictions on talking about Mythos). This would obviously not be great but is probably better than the alternative of the only AI safety researchers with access to the real frontier models being those who work for labs.