Having never been exposed to Gemma before, I found this piece extremely interesting, thank you. Would it be fair to summarize (and oversimplify) what you’ve demonstrated as follows: Gemma 27B’s instruct post-training accidentally reinforced a self-critical affect spiral under repeated rejection, while many other instruct models have guardrails and tuning that steer them toward a more consistently calm retry or ask-for-clarification approach? If so, it would follow that any similar instruct model could pose safety risks when pushed into a self-critical spiral, resulting in degraded reasoning, worsening output quality, and loss of task focus. An interesting safety / alignment vector to be mindful of.