My blog is here. My personal site is here. You can contact me using this form.
L Rudolf L
Survival without dignity
Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
Why we’re not founding a human-data-for-alignment org
[Fiction] A Disneyland Without Children
Investigating an insurance-for-AI startup
A model of research skill
Review: Amusing Ourselves to Death
Understanding and controlling auto-induced distributional shift
Positive visions for AI
Review: Foragers, Farmers, and Fossil Fuels
AI & wisdom 1: wisdom, amortised optimisation, and AI
We thought about this quite a lot, and decided to make the dataset almost entirely public.
It’s not clear to us who would monomaniacally try to maximise SAD score, given that it’s a dangerous-capabilities eval. What we were more worried about is people training for a low SAD score in order to make their model seem safer, with such training potentially overfitting to the benchmark and not reducing actual situational awareness by as much as claimed.
It’s also unclear what sharing policy we could enforce that would mitigate these concerns while still allowing the benefits. For example, we would want top labs to use SAD to measure situational awareness in their models (a lot of the theory of change runs through this). But then we’re already giving the benchmark to the top labs, and they’re the ones doing most of the capabilities work.
More generally, if we don’t have good evals, we are flying blind and don’t know what the LLMs can do. If the cost of having a good understanding of dangerous model capabilities and their prerequisites is that, in theory, someone might be slightly helped in giving models a specific capability (especially when that capability is already emerging by default, and there are very limited reasons for anyone to specifically want to boost it), then I’m happy to pay that cost. This is especially the case since SAD lets you measure a cluster of dangerous-capability prerequisites, and therefore, for example, test things like out-of-context reasoning, unlearning techniques, or activation steering techniques on something that is directly relevant for safety.
Another concern we’ve had is the dataset leaking onto the public internet and being accidentally included in training data. We’ve taken many steps to reduce the chance of this happening. We’ve also kept 20% of the SAD-influence task private, which will hopefully let us detect at least obvious forms of memorisation of SAD (whether through dataset leakage or deliberate fine-tuning).
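To illustrate the hold-out logic, here is a minimal sketch (my own illustration, not the actual SAD evaluation code; the function and threshold are hypothetical): score a model on both splits and flag a suspiciously large gap in favour of the public split.

```python
def flag_memorisation(public_scores: list[float], private_scores: list[float],
                      threshold: float = 0.1) -> bool:
    """Return True if accuracy on the public split exceeds accuracy on the
    private (held-out) split by more than `threshold`, which would suggest
    the public items leaked into training data or were fine-tuned on."""
    public_acc = sum(public_scores) / len(public_scores)
    private_acc = sum(private_scores) / len(private_scores)
    return (public_acc - private_acc) > threshold
```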
I’ve now posted my entries on LessWrong:
I’d also like to really thank the judges for their feedback. It’s a great luxury to be able to read many pages of thoughtful, probing questions about your work. I made several revisions and additions (and also split the entire thing into parts) in response, which I think improved the finished sequence a lot, and I wish I had had the time to engage with the feedback even more.
Review: Structure and Interpretation of Computer Programs
Since some others have commented that they don’t like the graph-heavy format: I really liked it, in particular because having the content as graphs rather than text made it much faster and easier to go through and understand, and left me with more memorable mental images. Adding limited text probably would not hurt, but adding lots would detract from the terseness that this presentation effectively achieves. Adding clear definitions of the terms at the start would have been valuable, though.
Rather than thinking of a single example that I carried throughout as you suggest, I found it most useful to generate one or more examples as I looked at each graph (e.g. for the danger-zone graphs, in order: judging / software testing, politics, forecasting / medical diagnosis).
AI & wisdom 3: AI effects on amortised optimisation
AI & wisdom 2: growth and amortised optimisation
Good list!
I personally really like Scott Alexander’s Presidential Platform; it hits the hilarious-but-also-almost-works spot so perfectly. He also has many Bay Area house party stories in addition to the one you link (you can find a bunch (all?) linked at the top of this post), as well as this one from a long time ago, which has one of the best punchlines I’ve read.
Voyager is a scaffolded LLM agent that plays Minecraft decently well (by pulling in a textual description of the game state, and writing code interfacing with an API). It is based on some very detailed prompting (see the appendix), but obviously could not function without the higher-level control flow and several distinct components that the scaffolding implements.
It does much better than AutoGPT, and the paper’s ablations show that the different parts of Voyager’s scaffolding do matter. This suggests that better scaffolding makes a difference, and I doubt Voyager is the limit.
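To make “scaffolding” concrete, here is a rough sketch (my own illustration, not Voyager’s actual code; the callables and the crude stopping criterion are hypothetical) of the kind of control flow such an agent runs: observe the game state as text, prompt the model to write code, execute it, and keep working code as reusable skills.

```python
from typing import Callable

def run_scaffolded_agent(
    task: str,
    get_state: Callable[[], str],                 # textual description of the game state
    llm: Callable[[str], str],                    # model call: prompt -> code
    execute: Callable[[str], tuple[bool, str]],   # run the code -> (success, feedback)
    max_steps: int = 50,
) -> dict[str, str]:
    skill_library: dict[str, str] = {}            # code snippets that worked, kept for reuse
    feedback = ""
    for _ in range(max_steps):
        state = get_state()
        prompt = (
            f"Task: {task}\nState: {state}\nFeedback: {feedback}\n"
            f"Known skills: {list(skill_library)}\n"
            "Write code using the game API to make progress."
        )
        code = llm(prompt)
        success, feedback = execute(code)
        if success:
            skill_library[f"skill_{len(skill_library)}"] = code
            if "task complete" in feedback.lower():   # crude stopping criterion
                break
    return skill_library
```

The point is just that the model call is one component among several: the state description, the prompt construction, the execution-and-feedback loop, and the skill library are all things the scaffolding, not the model, provides.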
I agree that an end-to-end trained agent could be trained to be better. But such training is expensive, and it seems like for many tasks, before we see an end-to-end trained model doing well at them, someone will hack together some scaffold monstrosity that does them passably well. In general, the training/inference compute asymmetry means that using even relatively large amounts of inference to replicate the performance of a larger or more-trained system on a task may be surprisingly competitive. I think it’s plausible this gap will eventually mostly close at some capability threshold, especially for many of the most potentially transformative capabilities (e.g. having insights that draw on a large body of information not memorised in a base model’s weights, since this seems hard to decompose into smaller tasks), but it seems quite plausible the gap will be non-trivial for a while.