I have to highlight that this wasn’t even Mia and Zach’s initial project; halfway through the week they pivoted to it from trying to craft better red team policies. Automated prompt engineering, with fine-tuning based on it, seems to fit the blue team’s bill better. Treating the monitor as a classifier is arguably more straightforward than creating effective attack policies for the red team, although it remains important to pursue strategies that find upper bounds for both teams.