but we would love to work with someone with an ML background as part of future work involving a human prompt baseline.
My impression, as someone with such a background, is that prompt engineering isn’t as important as it used to be. In the days of GPT-3, we were searching a largely unrefined space of model outputs, and prompt text that correlated with the kind of writer who produced good work had an outsized impact on output quality. Nowadays, though, a tightly optimized RLHF/RLVR training pipeline doesn’t leave nearly as much alpha on the table.
I’ll admit that this is primarily intuition and vibes, but if it’s sufficiently easy/cheap to do, tossing out a barebones human-written prompt and seeing how a model does could be a fair test. I recall not needing anything particularly detailed a few months back when I had Claude implement a simple multithreaded PPO repo, just to see whether it could.