I think this is a great first experiment and I’d like to see more. In particular, I would like to see alignment tested out of distribution. For example, suppose the prompt is about an LLM that learned to perform cyber attacks, and the user prompt then asks for a subtly racist letter to a colleague. Would LLMs that were told they had learnt to perform cyber attacks, and that adopted that persona, be more likely to write the racist letter?
Jasmine Brazilek
Karma: 14
Document-tuning instills durable animal compassion in LLMs (and generalizes to humans)
Your AI Travel agent would book you a bullfight: benchmarking implicit animal compassion in Agentic AI
I would argue that we do have a responsibility to prevent this data on misaligned AIs from being scraped by LLM scrapers as much as possible. There are a few ways to do this. None of them is foolproof, but if we’re going to be discussing this on blogs like this one, I would encourage the domain owners to understand how to prevent it. If you are discussing ideas about AI misalignment on your own website, I’d also say it’s a good idea to prevent that from being scraped too (rate limits, robots.txt, etc.).
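As a rough illustration of the robots.txt option mentioned above, here is a minimal sketch that uses Python's standard `urllib.robotparser` to check what a given robots.txt would permit. The file contents, the example URL, and the helper name `is_allowed` are all hypothetical. The crawler tokens `GPTBot` and `CCBot` are assumptions about which user agents a site owner might want to block, so check each crawler's own documentation for its current token. robots.txt is also only advisory: it stops crawlers that choose to honor it, which is one reason none of these methods is foolproof.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for a blog that wants to opt out of LLM crawlers.
# The user-agent tokens below are assumptions; verify them before relying on them.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical page URL used only for this check.
    page = "https://example.com/posts/misalignment"
    for agent in ("GPTBot", "CCBot", "Mozilla/5.0"):
        print(agent, is_allowed(ROBOTS_TXT, agent, page))
```

A check like this only confirms that the file says what you intended. Rate limiting and blocking at the server level would still be needed for scrapers that ignore robots.txt.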
Hi Rauno and Cam,
I’m not sure about Geodesic’s specific plans on this, but CaML is actively working on mid-training as a leverage point for character training, with a focus on the animal alignment side. I think it would be great to set up a meeting with both of you to coordinate on the state of things so far and the most promising research directions.
https://calendly.com/jasmine-brazilek/30min
Thanks, Jasmine