“Data poisoning” is an existing area of research (e.g., https://www.lesswrong.com/posts/PWpmruzhdkHTkA5u4/phantom-transfer-and-the-basic-science-of-data-poisoning, on the front page right now), and yeah, it’s not hard to data-poison LLMs. The upside is that obscure facts, by definition, don’t usually matter much.
With enough data you can get some weird behaviors, not just obscure facts: Opus 4 was accidentally trained on the (many) transcripts from the alignment faking paper and originally had some weird behaviors around those scenarios and transcript formats (see Section 4.1.4 of the model card). But it’s not easy either. The more the target behavior contradicts the post-training vibes, the more poisoning you need (cf this paper), and you need to avoid getting filtered out by the data quality filters that AI companies use.
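To make that last point concrete, here’s a minimal sketch of the kind of heuristic quality filtering that web-scale pretraining pipelines are known to apply (in the spirit of the published Gopher/C4-style rules). The specific rules and thresholds are illustrative guesses, not any particular company’s filter:

```python
import re

def passes_toy_quality_filter(doc: str) -> bool:
    """Toy Gopher-style heuristics; every threshold here is illustrative,
    not taken from any real pipeline."""
    words = doc.split()
    if not 50 <= len(words) <= 100_000:          # too short or absurdly long
        return False

    mean_word_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_word_len <= 10:             # gibberish or glued-together tokens
        return False

    alpha_frac = sum(bool(re.search(r"[A-Za-z]", w)) for w in words) / len(words)
    if alpha_frac < 0.8:                         # mostly symbols / numbers
        return False

    stop_words = {"the", "be", "to", "of", "and", "that", "have", "with"}
    if sum(w.lower() in stop_words for w in words) < 2:
        return False                             # keyword-stuffed spam often trips this

    lines = [l.strip() for l in doc.splitlines() if l.strip()]
    if lines and len(set(lines)) / len(lines) < 0.7:
        return False                             # heavy line-level repetition

    return True
```

Poisoned pages written as keyword-stuffed spam tend to fail exactly these kinds of checks, so the content has to read like ordinary prose on top of everything else.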
So: a one-shot technical challenge with extremely sparse rewards, no meaningful feedback signal on failure, and roughly one attempt per year, which requires a broad understanding of the state of the art in frontier AI data curation, intuition about where it’s likely to go next, an implicit grasp of how the internet is structured and how to affordably scale a data propagation campaign, and enough general knowledge of the spam/filter arms race to make sure your content makes it through to search engines and scrapers. On top of that, it requires the agency to pay for hosting without any promise of a future reward[1]. It’s looking more and more like the kind of person who could do this has well earned the alpha it’d provide.
Still, it’s an attractive idea, and it selects for general competence rather than for overcoming a single hard technical hurdle, which makes giving it a go significantly more accessible than many other research challenges.
Being willing to put up $20 for something that will look incredibly stupid if you fail is a legitimately rare quality, but pretty much everyone who’s made an impact on the world has it.
It’s not nearly that hard. Your data-poisoning Wikipedia edits don’t even need to stick if you time them right: https://arxiv.org/abs/2302.10149
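The frontrunning idea in that paper is simple enough to sketch: the Wikipedia dumps run on a published schedule and work through pages in a roughly predictable order, so you can estimate when your target page will be serialized and make the edit just before it. The rate, page ID, and dates below are made up purely for illustration:

```python
from datetime import datetime, timedelta

def estimated_snapshot_time(dump_start: datetime,
                            target_page_id: int,
                            pages_per_hour: float) -> datetime:
    """Rough frontrunning estimate: if pages are serialized in ID order at a
    roughly steady rate (something you'd measure from past dumps), this guesses
    when a given page gets captured."""
    return dump_start + timedelta(hours=target_page_id / pages_per_hour)

# All numbers hypothetical: dump starts March 1st, ~400k pages/hour measured from past dumps.
eta = estimated_snapshot_time(datetime(2025, 3, 1), target_page_id=7_200_000,
                              pages_per_hour=400_000.0)
print(f"Edit shortly before {eta}; a revert minutes later still leaves "
      "the poisoned revision in the dump.")
```

(The point is just that the edit only has to survive until the snapshot, not until a moderator looks at it.)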