Student and co-director of the AI Safety Initiative at Georgia Tech. Interested in technical safety/alignment research and general projects that make AI go better. More about me here.
yix
yix’s Shortform
Does anyone know of a convincing definition of ‘intent’ in LLMs (or a way to identify it)? In model-organisms-type work, I find it hard to ‘incriminate’ LLMs. Even though the output of the LLM remains what it is regardless of ‘intent’, I think this distinction may be important because ‘intentionally lying’ and ‘stochastic parroting’ should scale differently with overall capabilities.
I find this hard for several reasons, but I’m highly uncertain whether these are fundamental limitations:

LLMs behave very differently depending on context. Asking a model about something it did post-hoc elicits a different ‘mode’ and doesn’t necessarily allow us to make statements about its original behavior.
Mechanistic techniques seem to be good at generating hypotheses, not validating them. Pointing at an SAE feature activation that says ‘deception’ does not seem conclusive, because auto-interp pipelines often do not include enough context to produce robust explanations for complex high-level behaviors like deception.
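To make the hypothesis-generation point concrete, here’s a minimal toy sketch of what ‘pointing at a feature’ amounts to. Everything here is assumed for illustration: the feature index, the threshold, and the ‘deception’ label are placeholders, not outputs of any real SAE or auto-interp pipeline.

```python
import numpy as np

def feature_fires(activations: np.ndarray, feature_idx: int, threshold: float) -> bool:
    """Return True if a hypothetical SAE feature exceeds `threshold` at any
    token position. At best this *suggests* a hypothesis (e.g. 'this looks
    deception-related'); it does not validate that the behavior was deceptive."""
    return bool((activations[:, feature_idx] > threshold).any())

# Toy data: 10 token positions x 512 SAE features, all near zero.
acts = np.zeros((10, 512))
acts[3, 42] = 5.0  # pretend the (hypothetical) 'deception' feature 42 fires at token 3

print(feature_fires(acts, feature_idx=42, threshold=1.0))  # True: hypothesis, not proof
print(feature_fires(acts, feature_idx=7, threshold=1.0))   # False
```

The gap the comment points at lives entirely outside this snippet: turning `True` into a justified claim about intent requires context the auto-interp label usually doesn’t carry.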
TastyBench: Toward Measuring Research Taste in LLMs
More on giving undergrads their first research experience. Yes, giving a first research experience is high impact, but we want to reserve these opportunities for the most promising people. Often, this first research experience is most fruitful when they work with a highly competent team. We are turning our focus to assembling such teams and finding fits for the most value-aligned undergrads.
We always find it hard to form pipelines because individuals are just so different! I don’t even feel comfortable using ‘undergrad’ as a label if I’m honest…
Lessons from a year of university AI safety field building
Thanks again Esben for collaborating with us! I can confidently say that the above is super valuable advice for any AI safety hackathon organizer; it’s consistent with our experience.
In the context of a college campus hackathon, I’d especially stress preparing starter materials and making submission requirements clear early on!
The “hallucination/reliability” vs. “misaligned lies” distinction probably matters here. The former should in principle go away as capability/intelligence scales, while the latter probably gets worse?
I don’t know of a good way to find evidence of model ‘intent’ for this type of incrimination, but if we explain this behavior with the training process it’d probably look something like:
Tiny bits of poorly labelled/bad preference data make their way into the training dataset due to human error. Maybe specific cases where the LLM made up a good-looking answer and the human judge didn’t notice.
The model knows that the above behavior is bad but gets rewarded anyway; this leads to some amount of misalignment/emergent misalignment, even though in theory the fraction of bad training data should be nowhere near sufficient for EM.
Generalization seems to scale with capabilities.
Maybe the scaling law to look at here is model size vs. the % of misaligned data needed for the LLM to learn this kind of misalignment? Or maybe inoculation prompting fixes all of this, but you’d have to craft custom data for each undesired trait...