RowanWang
Evaluating honesty and lie detection techniques on a diverse suite of dishonest models
Building and evaluating alignment auditing agents
Modifying LLM Beliefs with Synthetic Document Finetuning
Some Lessons Learned from Studying Indirect Object Identification in GPT-2 small
Gears-Level Mental Models of Transformer Interpretability
Hm, I think this tool would’ve been really helpful for me on a couple of occasions in the past. Usually if I want to save a cell output, I just won’t edit that cell and I’ll create a new one, even if it means redundant code.
Also +1 on keeping track of bugs! I should’ve added to the og post that one thing I do that’s really helpful for me is keeping track of procedural knowledge (e.g. how to set up a GPU, how to fix common issue X, etc.) in a personal Slack that I’ve created, basically as a second brain. I found that I used the message-yourself-in-Slack feature a lot to keep track of small notes for myself, and since I did it so much, I created a whole private, personal Slack, and that’s been pretty useful for keeping track of bugs, etc.
Let’s definitely catch up!
Yes! Especially if you show you can provide relevant thoughts about their work, a lot of people will be happy to call or at least reply to some questions via email.
Oh man, it totally was wrong, sorry about that, I updated the data again. I looked at the train datasets from the various models we trained and reran the data generation pipeline and the results looked as expected, so I don’t think I trained models on the wrong data for the original results, but I’m not fully sure how this data mix came about. It looks like it’s a combination of the followup and goals data; I think Claude might have accidentally mixed them when I was having it sanitize the data for release.
Also, fwiw, depending on what you’re using this data for, you should probably just regenerate it. It’s not that hard, and you could probably easily generate more diverse data. It probably also helps if the prompts actually elicit deception on the model you’re working with.