Some unconventional ideas for elicitation
Training models to elicit responses from other models (cf. “investigator agents”)
Promptbreeder (“evolve a prompt”)
Table 1 of the paper has many relevant baselines
Multi-agent ICL (e.g. using different models to do reflection, think of strategies, re-prompt the original model)
https://arxiv.org/abs/2410.03768 has an example of this (authors call this “ICRL”)
Multi-turn conversation (similar to above, but where the user acts as a conversation partner rather than an instructor)
Mostly inspired by Janus and Pliny the Liberator
The three-layer model implies that ‘deeper’ conversations can bypass safety post-training, which acts like a ‘safety wrapper’ over latent capabilities
Train a controller whose parameters are the activations of SAE (sparse autoencoder) features at a given layer.
Cf. AutoSteer
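
To make the “evolve a prompt” idea above concrete, here is a minimal sketch of a binary-tournament prompt search, loosely in the spirit of Promptbreeder. Everything in it is an assumption for illustration: `MUTATIONS`, `evolve_prompt`, and `toy_score` are hypothetical names, the mutation operators are fixed string edits rather than LLM rewrites, and the scorer is a toy stand-in for running the target model on held-out tasks. The real method also evolves its mutation prompts, which this sketch omits.

```python
import random
from typing import Callable, List, Optional

# Hypothetical mutation operators. A real system would ask an LLM to
# rewrite the prompt instead of applying fixed string edits.
MUTATIONS: List[Callable[[str], str]] = [
    lambda p: p + " Think step by step.",
    lambda p: p + " Show your work.",
    lambda p: "You are an expert. " + p,
]


def evolve_prompt(
    seed: str,
    score: Callable[[str], float],
    generations: int = 20,
    population_size: int = 4,
    rng: Optional[random.Random] = None,
) -> str:
    """Binary-tournament search over prompts; returns the best one found."""
    rng = rng or random.Random(0)
    population = [seed] * population_size
    for _ in range(generations):
        # Pick two distinct members; the higher-scoring one is the winner.
        i, j = rng.sample(range(population_size), 2)
        if score(population[i]) < score(population[j]):
            i, j = j, i
        # A mutated copy of the winner overwrites the loser.
        population[j] = rng.choice(MUTATIONS)(population[i])
    return max(population, key=score)


def toy_score(prompt: str) -> float:
    # Stand-in for "run the target model on held-out tasks and measure
    # elicited performance". Rewards two phrases, penalizes length.
    bonus = float("step by step" in prompt) + float("expert" in prompt)
    return bonus - 0.001 * len(prompt)


if __name__ == "__main__":
    print(evolve_prompt("Solve the problem.", toy_score))
```

Because the tournament winner is never overwritten, the best score in the population cannot decrease across generations. For elicitation, `score` would be swapped for whatever capability measurement is of interest.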