Object-level feedback on the linked project / idea: it looks neat, and might make for an interesting eval. I’m not sure that it would demonstrate anything really fundamental (to me, at least) about LLM capabilities the way you claim, but I’d be interested in reading more in any case.
Aside: whether or not it advances capabilities, I think “infohazardous” is slightly the wrong term. Publishing such work might be commons-burning or exfohazardous, in the same way that e.g. publishing part of a recipe for making a bomb is. But I think “infohazard” should be reserved for knowledge that is directly harmful to a specific person, e.g. a spoiler, a true lesson that leads them to a valley of bad rationality, or something emotionally / psychologically damaging.
On whether your idea is net-positive to publish or not: I agree with Nate’s take here about publishing interpretability research, and I think this kind of project falls into the same category. Ideally, you would be able to circulate it among a large but closed group of researchers and peers who understand the risks of where such research might eventually lead.
Absent the existence of such a closed community though, I’m not sure what to do. Publishing on LW seems plausibly net-positive, compared to the alternative of not doing the work at all, or not having it be read by anyone. I think your own proposed policy is reasonable. I’d also suggest adding a disclaimer to any work you think might plausibly give capabilities researchers ideas, to make it clear where you stand. Something like “Disclaimer: I’m publishing this because I think it is net-positive to do so and have no better alternatives. Please don’t use it to advance capabilities, which I expect to contribute to the destruction of everything I know and love.” (Of course, use your own words and only include such a disclaimer if you actually believe it.) I think disclaimers and public statements of that kind attached to popular / general-interest research would help build common knowledge and make it easier to get people on board with closure in the future.
Thank you for your thoughts; I think you're supplying valuable nuance. In private conversation I do see a general path by which this offers a strategy for capabilities enhancement, but I'd also be surprised if a complete hobbyist like myself discovered a way to contribute much of anything to AI capabilities research. Then again, interfacing between GPT-4-quality LLMs and traditional software is a new enough area that maybe there is enough low-hanging fruit for even a hobbyist to pluck. I agree with you that it would be ideal if there were a closed but constructive community to interface with on these issues, and I'm such a complete hobbyist that I wouldn't know about such a group even if it existed, which is why I asked. I'll give it some more thought.
Ah, I wasn’t really intending to make a strong claim about whether your specific idea is likely to be useful in pushing the capabilities frontier or not, just commenting generally about this general class of research (which again I think is plausibly net positive to do and publish on LW).
I do think you're possibly selling yourself short / being overmodest, though. Inexperienced researchers often start out with the same bad ideas (in alignment or other fields), and I've seen others claim that hobbyists and junior researchers shouldn't worry about exfohazards because they're not experienced enough to have original ideas yet. This may be true / good advice for some overconfident newbies, but it isn't true of everyone.
Also, this is not just true of alignment, but applies to academic research more generally: if a first-year grad student has some grand new Theory of Everything that they think constitutes a paradigm shift or contradicts all previous work, they're probably full of it and need to read some more papers or textbooks. If they just have some relatively narrow technical idea that combines two things from different domains (e.g. LLMs and DAGs) in a valid way, it might constitute a genuinely novel, perhaps even groundbreaking insight. Or they might have overlooked some prior work, or the idea might not pan out, or it might just not be very groundbreaking after all, etc. But in general, first-year grad students (and hobbyists) are capable of having novel insights, especially in relatively young and fast-moving fields like AI and alignment.