Daniel Tan comments on Daniel Tan’s Shortform

Daniel Tan 31 Jan 2025 17:53 UTC
Some unconventional ideas for elicitation
- Training models to elicit responses from other models (cf. “investigator agents”)
- Promptbreeder (“evolve a prompt”)
  - Table 1 has many relevant baselines
- Multi-agent ICL (e.g. using different models to do reflection, think of strategies, re-prompt the original model)
  - https://arxiv.org/abs/2410.03768 has an example of this (authors call this “ICRL”)
- Multi-turn conversation (similar to the above, but where the user acts as a conversation partner rather than an instructor)
  - Mostly inspired by Janus and Pliny the Liberator
  - The three-layer model implies that ‘deeper’ conversations dodge the safety post-training, which is like a ‘safety wrapper’ over latent capabilities
- Train a controller whose parameters are the activations of SAE features at a given layer
  - Cf. AutoSteer
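
To make the “evolve a prompt” idea concrete, here is a toy sketch of a mutate-score-select loop. It is not Promptbreeder's actual algorithm (which, as I understand it, uses an LLM to do the mutating). The `score` function, the `KEYWORDS` and `PHRASES` lists, and the `evolve` helper are all made up for illustration. In a real setup `score` would query the target model and grade how well the behaviour was elicited.

```python
import random

# Hypothetical stand-in for "query the target model and grade its answer".
# Here it only counts how many target keywords appear in the prompt.
KEYWORDS = ("step", "carefully", "expert")


def score(prompt: str) -> int:
    return sum(word in prompt for word in KEYWORDS)


# Hypothetical mutation operator: append a random phrase.
# A real system would more likely ask an LLM to rewrite the prompt.
PHRASES = (
    "Think step by step.",
    "Answer carefully.",
    "You are an expert.",
    "Be brief.",
)


def mutate(prompt: str, rng: random.Random) -> str:
    return f"{prompt} {rng.choice(PHRASES)}"


def evolve(seed_prompt: str, generations: int = 20,
           population_size: int = 8, seed: int = 0) -> str:
    rng = random.Random(seed)
    population = [seed_prompt] * population_size
    for _ in range(generations):
        # Every prompt in the population produces one mutated child.
        children = [mutate(p, rng) for p in population]
        # Keep the highest-scoring prompts among parents and children.
        pool = population + children
        pool.sort(key=score, reverse=True)
        population = pool[:population_size]
    return population[0]


best = evolve("Solve the problem.")
print(score(best), best)
```

Because parents stay in the selection pool, the best score never decreases between generations. Swapping in a model-based `score` and `mutate` is where the actual elicitation work would happen.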