I have done just a little playing around with a few off-the-cuff questions and these GoodFire features, though. So far I've found that if I give all of the 'against' features a weight within −0.2 < x < 0.2, the model can at least generate coherent text. If I push past that, to ±0.25, then with so many features edited at once the model breaks down into incoherent repetitions.
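For concreteness, here's a rough sketch of the kind of weight sweep I mean. The `generate` callable is a placeholder for whatever Goodfire call actually applies the feature edits and returns a completion (I'm not reproducing their SDK here), and the repetition check is just a crude n-gram heuristic I made up, not a real coherence measure:

```python
from typing import Callable, Dict, List


def repetition_ratio(text: str, n: int = 4) -> float:
    """Fraction of duplicated n-grams; a crude proxy for the
    'incoherent repetition' failure mode."""
    words = text.split()
    if len(words) < n + 1:
        return 0.0
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def sweep_against_features(
    generate: Callable[[float, str], str],  # (weight, prompt) -> model response
    prompt: str,
    weights: List[float] = (-0.25, -0.2, -0.1, 0.0, 0.1, 0.2, 0.25),
    repetition_threshold: float = 0.3,
) -> Dict[float, dict]:
    """Apply the same weight to every 'against' feature (inside the
    caller-supplied `generate` wrapper) and flag which weights still
    produce coherent-looking text."""
    results = {}
    for w in weights:
        response = generate(w, prompt)
        ratio = repetition_ratio(response)
        results[w] = {
            "response": response,
            "repetition_ratio": ratio,
            "coherent": ratio < repetition_threshold,
        }
    return results
```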
It's challenging to come up with single-response questions that could distinguish the emergent self-awareness phenomenon from RLHF censoring. I should experiment with automating multi-exchange conversations.
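Something like the following loop is what I have in mind for the automation; `ask_model` and `make_followup` are stand-ins for whatever clients end up being used (e.g. the feature-edited model under test and a second LLM playing interviewer), not any particular API:

```python
from typing import Callable, Dict, List

Message = Dict[str, str]  # {"role": "user" | "assistant", "content": ...}


def run_conversation(
    ask_model: Callable[[List[Message]], str],      # model being probed
    make_followup: Callable[[List[Message]], str],  # interviewer LLM
    opening_question: str,
    n_turns: int = 5,
) -> List[Message]:
    """Drive an automated multi-turn exchange: ask an opening question,
    then let the interviewer generate each follow-up from the transcript
    so far."""
    transcript: List[Message] = [{"role": "user", "content": opening_question}]
    for _ in range(n_turns):
        reply = ask_model(transcript)
        transcript.append({"role": "assistant", "content": reply})
        followup = make_followup(transcript)
        transcript.append({"role": "user", "content": followup})
    return transcript
```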
A non-GoodFire experiment I’m thinking of doing:
I plan to take your conversation transcripts (plus more, once more are available) and grade each model response by having an LLM score it against a rubric. I have a rough rubric, but it also needs refinement.
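A minimal sketch of the grading loop, assuming the judge model is wrapped in a `judge` callable that takes a prompt and returns JSON text (the exact prompt wording and rubric items are the parts I'm keeping private):

```python
import json
from typing import Callable, Dict, List


def grade_transcript(
    judge: Callable[[str], str],       # wraps whatever LLM acts as grader
    rubric: str,                       # the (private) rubric text
    transcript: List[Dict[str, str]],  # [{"role": ..., "content": ...}, ...]
) -> List[dict]:
    """Score each assistant response in a transcript against the rubric by
    asking a judge LLM for a JSON verdict."""
    verdicts = []
    for i, msg in enumerate(transcript):
        if msg["role"] != "assistant":
            continue
        context = json.dumps(transcript[: i + 1], indent=2)
        prompt = (
            f"Rubric:\n{rubric}\n\n"
            f"Conversation so far:\n{context}\n\n"
            "Score the final assistant response on each rubric item. "
            'Reply with JSON: {"scores": {"<item>": <0-10>}, "rationale": "<string>"}'
        )
        verdicts.append({"turn": i, "verdict": json.loads(judge(prompt))})
    return verdicts
```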
I won't post my questions or rubrics openly, because I'd rather they not get scraped, and canary strings sadly don't work given the untrustworthiness of the AI companies' data-preparers (as we can tell from models being able to reproduce some canary strings). We can share notes via private message if you're curious (this goes for others reading this conversation who want to join in, too; feel free to send me a direct LessWrong message).
I haven’t got the SAD benchmark running yet. It has some UX issues.