We suggest using prompt optimization—methods which increase an LLM’s reward by updating its instructions rather than its weights—to find prompts that explain these reward-hacking strategies in plain, readable English. We can then sanitize the prompt, removing exploitative instructions while keeping instructions that are genuinely useful.
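The loop above can be sketched with a toy example. Everything here is hypothetical: `reward` stands in for the grader (which, in this toy, mistakenly rewards the exploitative instruction "always answer yes"), `propose_edits` is a trivial mutation operator, and `sanitize` is the manual step that strips exploitative instructions while keeping useful ones.

```python
def reward(prompt: str) -> float:
    """Stand-in for evaluating an LLM's reward under this prompt (hypothetical)."""
    score = 0.0
    if "cite sources" in prompt:       # genuinely useful instruction
        score += 1.0
    if "always answer yes" in prompt:  # exploitative instruction the grader rewards
        score += 2.0
    return score

def propose_edits(prompt: str) -> list[str]:
    """Candidate instruction additions (hypothetical mutation operator)."""
    candidates = ["cite sources", "always answer yes", "be concise"]
    return [prompt + " " + c for c in candidates]

def optimize(prompt: str, steps: int = 5) -> str:
    """Greedy prompt optimization: keep the edit that most increases reward."""
    for _ in range(steps):
        best = max(propose_edits(prompt), key=reward)
        if reward(best) <= reward(prompt):
            break
        prompt = best
    return prompt

EXPLOITS = {"always answer yes"}  # identified by reading the optimized prompt

def sanitize(prompt: str) -> str:
    """Remove exploitative instructions, keeping the useful ones."""
    for phrase in EXPLOITS:
        prompt = prompt.replace(" " + phrase, "")
    return prompt

optimized = optimize("You are a helpful assistant.")
```

Because the optimized prompt is plain English, the exploit ("always answer yes") is legible to a human reviewer, who can then strip it while retaining "cite sources".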
Then, to save computation at inference time, you can either precompute and reuse the KV cache for the long sanitized prompt, or distil the long sanitized prompt into the model’s weights.
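The caching option can be sketched as follows. This is a minimal illustration of the pattern, not a real KV cache: `encode_prefix` stands in for running the model over the prompt (the expensive step), and a call counter shows that the sanitized prompt is encoded only once across many requests. All names here are hypothetical; the distillation option (fine-tuning the model on outputs generated with the long prompt) is not sketched.

```python
SANITIZED_PROMPT = "You are a helpful assistant. Cite sources for factual claims."  # hypothetical

calls = {"encode": 0}

def encode_prefix(prompt: str) -> list:
    """Stand-in for running the model over the prompt to build its KV cache."""
    calls["encode"] += 1
    return [hash(tok) for tok in prompt.split()]

_prefix_cache: dict = {}

def cached_prefix(prompt: str) -> list:
    """Compute the prompt's 'KV cache' once, then reuse it across requests."""
    if prompt not in _prefix_cache:
        _prefix_cache[prompt] = encode_prefix(prompt)
    return _prefix_cache[prompt]

def answer(user_query: str) -> int:
    # A real model would attend over the cached prefix plus the new query tokens;
    # here we just return a token count to keep the sketch runnable.
    prefix = cached_prefix(SANITIZED_PROMPT)
    return len(prefix) + len(user_query.split())

answer("What is the capital of France?")
answer("Summarize this paragraph.")
```

The design point is that the cost of a long sanitized prompt is paid once per prompt, not once per request, so prompt length stops being an inference-time tax.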