newsletter.safe.ai
Dan H
AISN #32: Measuring and Reducing Hazardous Knowledge in LLMs Plus, Forecasting the Future with LLMs, and Regulatory Markets
AISN #31: A New AI Policy Bill in California Plus, Precedents for AI Governance and The EU AI Office
AISN #30: Investments in Compute and Military AI Plus, Japan and Singapore’s National AI Safety Institutes
AISN #29: Progress on the EU AI Act Plus, the NY Times sues OpenAI for Copyright Infringement, and Congressional Questions about Research Standards in AI Safety
AISN #28: Center for AI Safety 2023 Year in Review
AISN #27: Defensive Accelerationism, A Retrospective On The OpenAI Board Saga, And A New AI Bill From Senators Thune And Klobuchar
AISN #26: National Institutions for AI Safety, Results From the UK Summit, and New Releases From OpenAI and xAI
AISN #25: White House Executive Order on AI, UK AI Safety Summit, and Progress on Voluntary Evaluations of AI Risks
I agree that this is an important frontier (and am doing a big project on this).
AISN #24: Kissinger Urges US-China Cooperation on AI, China’s New AI Law, US Export Controls, International Institutions, and Open Source AI
AISN #23: New OpenAI Models, News from Anthropic, and Representation Engineering
AISN #22: The Landscape of US AI Legislation - Hearings, Frameworks, Bills, and Laws
Uncovering Latent Human Wellbeing in LLM Embeddings
MLSN: #10 Adversarial Attacks Against Language and Vision Models, Improving LLM Honesty, and Tracing the Influence of LLM Training Data
AISN #21: Google DeepMind’s GPT-4 Competitor, Military Investments in Autonomous Drones, The UK AI Safety Summit, and Case Studies in AI Policy
Almost all datasets have label noise. Most 4-way multiple choice NLP datasets collected with MTurk have ~10% label noise, very roughly. My guess is MMLU has 1-2%. I’ve seen these sorts of label noise posts/papers/videos come out for pretty much every major dataset (CIFAR, ImageNet, etc.).
AISN #20: LLM Proliferation, AI Deception, and Continuing Drivers of AI Capabilities
The purpose of this is to test and forecast problem-solving ability, using examples that substantially lose informativeness in the presence of Python executable scripts. I think this restriction isn’t an ideological statement about what sort of alignment strategies we want.
I think there’s a clear enough distinction between Transformers with and without tools. The human brain can also be viewed as a computational machine, but when exams say “no calculators,” they’re not banning mental calculation, rather specific tools.
We did already know that backdoors often (from the title) “Persist Through Safety Training.” This phenomenon studied here and elsewhere is being taken as the main update in favor of AI x-risk. This doesn’t establish probability of the hazard, but it reminds us that backdoor hazards can persist if present.
I think it’s very easy to argue the hazard could emerge from malicious actors poisoning pretraining data, and harder to argue it would arise naturally. AI security researchers such as Carlini et al. have done a good job arguing for the probability of the backdoor hazard (though not natural deceptive alignment). (I think malicious actors unleashing rogue AIs is a concern for the reasons bio GCRs are a concern; if one does it, it could be devastating.)
I think this paper shows the community at large will pay orders of magnitude more attention to a research area when there is, in @TurnTrout’s words, AGI threat scenario “window dressing,” or when players from an EA-coded group research a topic. (I’ve been suggesting more attention to backdoors since maybe 2019; here’s a video from a few years ago about the topic; we’ve also run competitions at NeurIPS with thousands of submissions on backdoors.) Ideally the community would pay more attention to relevant research microcosms that don’t have the window dressing.
I think AI security-related topics have a very good track record of being relevant for x-risk (backdoors, unlearning, adversarial robustness). It’s a been better portfolio than the EA AGI x-risk community portfolio (decision theory, feature visualizations, inverse reinforcement learning, natural abstractions, infrabayesianism, etc.). At a high level its saying power is because AI security is largely about extreme reliability; extreme reliability is not automatically provided by scaling, but most other desiderata are (e.g., commonsense understanding of what people like and dislike).
A request: Could Anthropic employees not call supervised fine-tuning and related techniques “safety training?” OpenAI/Anthropic have made “alignment” in the ML community become synonymous with fine-tuning, which is a big loss. Calling this “alignment training” consistently would help reduce the watering down of the word “safety.”