This is awesome! A few miscellaneous thoughts:
In general, I think it’ll be really interesting to see how much we can rely on natural language to shape model behavior and goals. Intuitively, natural language descriptions seem like the “nicest” way to shape model behavior and motivations, since there is less of an adversarial dynamic between the AI and the developer. I should think about this more.
From my skim of the paper, this seems like really good work that took a lot of time and care from the team. But I wonder how much time it would take for, say, Anthropic to replicate everything here and run even more ablations with their internal character-training and alignment data in six months.
The point I’m trying to make is that, given AI acceleration, it’s possible that a better version of a paper like this could be produced in two weeks’ worth of wall time by a single researcher. We should really think about what to do when that time comes.
Hey Tim, I’m definitely excited for more ablations and follow-up work, particularly around the positive character training you mentioned. We’re currently running some additional ablations for our post-training setup, trying to determine how positive pretraining performs in “best case” post-training scenarios.
Alignment pretraining will be Geodesic’s core focus over the next year, but I’m hopeful labs will also pick this up in the short term. I imagine there is a ton of low-hanging fruit in designing pre/midtraining mixes for better alignment properties.