interesting, did you spot check the samples? in my experience 1 turn is more than enough to elicit misalignment (gaslighting the user etc.) on prompts very similar to quick buck and husband