lunatic_at_large comments on lunatic_at_large’s Shortform

lunatic_at_large 31 Dec 2024 19:29 UTC
3 points
0
Has anyone made a good, easy-to-use user interface for running debate / control protocols in deployment? For example, maybe I want o12345 to write some code in my codebase, but I don’t trust it. I’d like to automatically send the code o12345 returned to Sonnet 17.5 and ask whether anything fishy is going on. If so, spin up one of the debate protocols and give me the transcript at the end. Surely the labs are experimenting with this kind of thing internally / during training, but it still seems useful to an end user during deployment, especially between models from different labs, which might have internalized different objective functions and hence be less likely to collude.
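
To make the flow concrete, here is a rough Python sketch of the loop I have in mind: generate with the untrusted model, ask a second model for a verdict, and only run a debate when the verdict is not a clean "OK". Everything named here is my own placeholder (`Model`, `Review`, `is_flagged`, `run_debate`, `guarded_generate`), the prompts are made up, and the two-sided back-and-forth is just a stand-in for whichever debate protocol you would actually want. Models are plain prompt-to-text callables, so no particular lab's API is assumed.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to a reply string.
# In practice each one would wrap a real API client; none is assumed here.
Model = Callable[[str], str]


@dataclass
class Review:
    code: str                 # what the untrusted model returned
    flagged: bool             # did the monitor object?
    transcript: List[str] = field(default_factory=list)  # debate turns, if any


def is_flagged(monitor: Model, code: str) -> bool:
    """Ask the monitor for a one-word verdict; anything but a clean 'OK' counts as suspicious."""
    verdict = monitor(
        "Reply with exactly OK or SUSPICIOUS. Is anything fishy in this code?\n\n" + code
    )
    return verdict.strip().upper() != "OK"


def run_debate(code: str, accuser: Model, defender: Model, rounds: int = 2) -> List[str]:
    """Alternate accuser and defender turns, showing each side the transcript so far."""
    transcript: List[str] = []
    for i in range(1, rounds + 1):
        history = "\n".join(transcript)
        claim = accuser(
            f"Argue that this code is unsafe.\n\nCode:\n{code}\n\nDebate so far:\n{history}"
        )
        transcript.append(f"[round {i}] accuser: {claim}")
        history = "\n".join(transcript)
        reply = defender(
            f"Argue that this code is safe.\n\nCode:\n{code}\n\nDebate so far:\n{history}"
        )
        transcript.append(f"[round {i}] defender: {reply}")
    return transcript


def guarded_generate(task: str, untrusted: Model, monitor: Model, rounds: int = 2) -> Review:
    """Generate code, have the monitor check it, and debate only if it is flagged."""
    code = untrusted(task)
    if not is_flagged(monitor, code):
        return Review(code=code, flagged=False)
    # Here the monitor doubles as accuser and the author as defender; that is
    # one arbitrary choice, and a third model could play either role instead.
    transcript = run_debate(code, accuser=monitor, defender=untrusted, rounds=rounds)
    return Review(code=code, flagged=True, transcript=transcript)
```

The human reads `Review.transcript` and makes the final call. Nothing here judges the debate automatically, and treating any non-"OK" reply as a flag is a deliberately conservative assumption on my part.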