“Unintentional AI safety research”: Why not systematically mine AI technical research for safety purposes?

Thanks to llll for helping me think this through, and for providing useful comments.

Epistemic Status: My best guess

Introduction

It might be worthwhile to systematically mine AI technical research to find “unintentional AI safety research”—research that, while not explicitly conducted as AI safety research, contains information relevant to AI safety. An example of unintentional safety research is Douglas Lenat’s work on the heuristic-search system Eurisko, which inadvertently demonstrated specification gaming when Eurisko exploited loopholes in the rules of the Traveller Trillion Credit Squadron (TCS) tournament to win the US national championship in 1981 and 1982.[1] This post is not meant to suggest that AI safety researchers don’t already look for unintentional safety research, but I’m unaware of any effort to do so in a systematic way designed to extract as much “safety value” as possible from technical research.

Related work

Tshitoyan, Vahe, et al. “Unsupervised word embeddings capture latent knowledge from materials science literature.” Nature 571.7763 (2019): 95-98. In this study, a word-embedding model trained without supervision on a corpus of scientific abstracts learned to predict discoveries that were made after the training data cutoff date—that is, the model learned to predict future scientific discoveries. “This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.”
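For intuition, here is a minimal sketch of the general technique: train unsupervised word embeddings on tokenized abstracts and rank candidate terms by their similarity to a concept of interest. This is not the study’s pipeline; the corpus below is a toy placeholder, and gensim’s Word2Vec stands in for whatever embedding method a real mining effort would use.

```python
# Toy sketch (not Tshitoyan et al.'s pipeline): learn word embeddings from
# "abstracts" and rank candidate terms by similarity to a target concept.
from gensim.models import Word2Vec

# Placeholder corpus; a real effort would use tokenized abstracts from a
# literature dump (arXiv, Semantic Scholar, etc.).
abstracts = [
    ["the", "agent", "exploited", "a", "loophole", "in", "the", "reward"],
    ["reward", "misspecification", "caused", "unintended", "behaviour"],
    ["adversarial", "examples", "fool", "image", "classifiers"],
]

model = Word2Vec(sentences=abstracts, vector_size=50, window=5,
                 min_count=1, workers=1, seed=0)

# Rank candidate terms by cosine similarity to a safety-relevant concept.
candidates = ["loophole", "classifiers", "behaviour"]
ranked = sorted(candidates,
                key=lambda term: model.wv.similarity("reward", term),
                reverse=True)
print(ranked)
```

With a corpus this small the similarities are meaningless, but the same loop over millions of real abstracts is roughly what lets embedding methods surface connections that no single paper states explicitly.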

Reasons to systematically mine technical research

Gleaning insights from technical research has already proven valuable for safety researchers, even without a structured approach. For example, technical researchers demonstrated in practice that reward misspecification, bias in training data, distributional shift, and adversarial examples can cause ML models to behave in surprising or unintended ways,[2] and demonstrations like these have illuminated safety research directions. A less straightforward example worth mentioning is the discovery of “unit-level” feature visualization in DNNs,[3] a proof of concept for mechanistic interpretability, an area of general-purpose research where safety-minded people have made encouraging discoveries through empirical studies of transformer-based models.[4] I think these examples show that the safety field already has some process for getting insights from technical research onto the safety agenda. However, that process seems to have developed informally, without a coherent strategy to maximize its effectiveness, which suggests there is room for improvement through a more structured approach.

By implementing a systematic mining process, researchers could potentially gain a head start in addressing safety concerns. This point assumes there’s a nontrivial chance of AI research ending in a catastrophe that safety research could have helped prevent, if only there had been more time. If that’s the situation, we should try to minimize the time it takes to notice unattended safety problems; the less time it takes to notice a lead, the more time safety researchers have to chase it down. As noted above, there’s already some process by which unintentional safety research gets on the safety field’s radar. However, that process is probably not optimized for efficient use of time unless that has happened by default. A mining operation that means business might make the most of time-saving techniques such as automated monitoring of new publications; prioritization of the studies most likely to have safety implications, according to some assessment criteria; real-time tracking of preprint servers and research databases, to catch updates before formal publication; and integration with existing knowledge bases, to avoid duplicating effort and to ensure that new insights are quickly incorporated into the safety field’s broader body of knowledge.
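As a rough illustration of what “automated monitoring plus prioritization” could look like at its simplest, here is a minimal sketch that polls the public arXiv API for recent submissions and ranks them with a crude keyword score. The category, keyword list, and scoring rule are placeholders for whatever assessment criteria a serious effort would develop.

```python
# Minimal monitoring sketch: poll the public arXiv API for recent cs.LG
# submissions and rank them by a crude keyword score. The keywords and
# scoring rule are placeholders for real assessment criteria.
import feedparser

ARXIV_QUERY = ("http://export.arxiv.org/api/query?"
               "search_query=cat:cs.LG&sortBy=submittedDate"
               "&sortOrder=descending&max_results=50")

SAFETY_KEYWORDS = ["reward hacking", "specification", "distribution shift",
                   "adversarial", "interpretability", "deceptive"]

def keyword_score(abstract: str) -> int:
    """Count how many safety-flavoured keywords appear in the abstract."""
    text = abstract.lower()
    return sum(keyword in text for keyword in SAFETY_KEYWORDS)

feed = feedparser.parse(ARXIV_QUERY)
ranked = sorted(feed.entries,
                key=lambda entry: keyword_score(entry.summary),
                reverse=True)

# Surface the top candidates for human (or model) review.
for entry in ranked[:10]:
    print(keyword_score(entry.summary), entry.title)
```

Keyword matching is exactly the kind of shallow filter that would miss research with no obvious safety angle, so in practice this would only be the outer loop around something smarter; but even this much is more systematic than skimming titles by hand.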

The sheer volume of published research likely surpasses the capacity for thorough exploration without a methodical and organized effort. AI researchers produced hundreds of thousands of English-language scientific publications from 2010 to 2020[5]—over a million, when Chinese-language research papers are counted[6]—and thousands more are produced each year.[7] Most published research seems to be technical research.[8] It’s possible the safety field hasn’t missed anything important yet, but it seems plausible that something could be buried in the existing literature or vanish under the accumulation of new publications, especially if there’s no obvious safety angle that lends itself to detection by skimming titles or searching by keyword. If there really is a chance of missing something important due to the sheer volume of research and publication rate, a system designed for higher capacity and more thorough exploration of the literature might reduce the likelihood of something important going unnoticed.

Seeking insights beyond the confines of AI safety research might help prevent the limitations of a narrow ideological perspective. There may be around 400 full-time-equivalent people working on AI x-risk, including people working on strategy and governance,[9] which would mean fewer than 400 full-time equivalents working on AI safety research specifically. That sounded like a small number of people to me, and I wanted to gain some perspective by comparing the size of the safety field to the size of the entire field of AI research. I never found enough information for a reliable estimate of how many AI researchers there are, but publication data suggests that safety researchers are a small part of a much larger population: 15,920 unique authors submitted papers to NeurIPS 2019,[10] and AI researchers produced 22,822 English-language publications and patents in 2021.[11] In addition to the safety field probably being comparatively small, safety research publishing may be significantly channeled onto the Alignment Forum: “about half” of those surveyed in 2022 by Thibodeau et al. said they regarded the Alignment Forum as their primary platform for sharing research,[12] and in 2022 Kirchner et al. found that “a substantial portion of AI alignment research is communicated on a curated community forum: the Alignment Forum”.[13] The safety field’s small size, coupled with a somewhat centralized publication and peer-review process on the Alignment Forum, might have a homogenizing effect on how safety researchers think. Allowing more time for the safety field to grow and mature could be a natural solution to this possible problem, but it might also be helpful to seek exposure to fresh thinking that hasn’t been shaped by the conditions of the safety field.

Technical research can provide evidence to help test hypotheses that shape safety researchers’ expectations about future AI systems. For instance, it’s unclear whether mesa-optimization[14] really is something to worry about, but there’s evidence that AI systems can learn an internal process similar to an optimization algorithm.[15] And it’s too soon to know whether we’ll have trouble training powerful AI systems to share latent knowledge,[16] but ChatGPT can claim not to know things that it does know.[17] I think these are two examples of how technical research can help determine whether safety researchers’ intuitions are on track about things we can’t falsify yet.

Possible mining approaches

I don’t have any specific implementation ideas, but a few general approaches came to mind.

Hire a bunch of people who know what they’re doing. AI safety researchers and others with relevant expertise would look for unintentional safety research in some coordinated way. This seems impractical. There probably aren’t enough qualified people to manage the workload manually, and most qualified people would probably rather do something else.

Crowdsource. This makes me think of a police tip line flooded with bad tips. My impression is that quality control and identifying useful results would be difficult, but I would need to do more research to know whether that’s actually the case.

Train a large language model. arXivGPT seems like a step in this general direction. Maybe you start with a pretrained LLM and, if its original training data didn’t include all publicly available AI research, continue training it on that research; then fine-tune it using RLHF: the model identifies publications that it thinks contain unintentional safety research and explains the safety angle, and safety researchers give it feedback. Next, you have an automated process with comprehensive online access to AI research that searches for studies to feed to your model, which filters out what it thinks is crap and generates reports on the rest for safety researchers to review. Set this model loose and let people sign up for a schedule of email reports, plus have a searchable public repository of all its findings. Continually evaluate the system and improve the model.
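To make the reporting step a little more concrete, here is a rough sketch of the screening-and-explanation part only (not the pretraining or RLHF loop), assuming an OpenAI-style chat API. The model name, prompt wording, and output format are all placeholder assumptions.

```python
# Sketch of the screening step: ask a general-purpose chat model whether a
# paper contains "unintentional safety research" and, if so, to explain the
# safety angle. Model name, prompt, and output format are placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT_TEMPLATE = (
    "You are screening AI technical papers for findings that are relevant "
    "to AI safety even though the paper is not framed as safety research.\n\n"
    "Title: {title}\n"
    "Abstract: {abstract}\n\n"
    "Reply with 'RELEVANT: <one-paragraph explanation of the safety angle>' "
    "or 'NOT RELEVANT'."
)

def screen_paper(title: str, abstract: str) -> str:
    """Return the model's safety-relevance verdict for one paper."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable chat model
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(title=title,
                                                     abstract=abstract)}],
    )
    return response.choices[0].message.content

# Example usage with a hypothetical paper:
# print(screen_paper("Some technical paper title", "Abstract text here..."))
```

The verdicts from a loop like this are also where the human feedback would come in: safety researchers grade the explanations, and those grades become the signal for fine-tuning.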

In addition to the general problem of figuring out how to discover unknown unknowns, there are some specific problems with this approach. One problem is that LLMs have limited ability to grasp complex concepts, so it might be hard to get a model to adequately understand AI research. Another problem is that generated text is often coherent but unoriginal, which doesn’t bode well for a model’s ability to generate new insights. And LLMs are not explicitly designed to understand and synthesize information from multiple modalities, which could limit the potential to make meaningful connections across studies that have different kinds of underlying data. Then there’s the risk of human error during RLHF: humans could miss safety-related subtleties in the training data or test output, which might train a model to withhold certain insights or even entire classes of insights, e.g., insights it learns to associate with a particular data type, study design, or research direction; humans could also overestimate how important something is to safety research, which might train a model to look for red herrings. And is there enough training data? There are probably other problems to consider in addition to the ones mentioned here, but I think a system built around an LLM still might be a good way to start.

Questions

Have AI safety researchers publicly discussed or tried systematically mining AI research?

How would you know if it’s working?

Would it be expensive?

What would be the biggest barriers to doing it right?

Are patent databases a promising source of unintentional safety research?

It’s not hard to imagine how a model capable of primary research could be dangerous, but could a model capable of discovering latent knowledge in existing literature be dangerous?

  1. ^

    Lenat, Douglas B. “EURISKO: a program that learns new heuristics and domain concepts: the nature of heuristics III: program design and results.” Artificial intelligence 21.1-2 (1983): 61-98.

  2. ^

    For reward misspecification, see Randløv, Jette, and Preben Alstrøm. “Learning to Drive a Bicycle Using Reinforcement Learning and Shaping.” ICML. Vol. 98. 1998; for bias in training data, see Machkovech, Sam. “Google dev apologizes after Photos app tags black people as ‘gorillas’”. Ars Technica, June 30, 2015: https://arstechnica.com/information-technology/2015/06/google-dev-apologizes-after-photos-app-tags-black-people-as-gorillas/; for distributional shift, see Schlimmer, Jeffrey C., and Richard H. Granger. “Incremental learning from noisy data.” Machine learning 1 (1986): 317-354; for adversarial examples, see Szegedy, Christian, et al. “Intriguing properties of neural networks.” arXiv preprint arXiv:1312.6199 (2013).

  3. ^

    Erhan, Dumitru, et al. “Visualizing higher-layer features of a deep network.” University of Montreal 1341.3 (2009): 1.

  4. ^

    For examples of promising mechanistic interpretability research, see Elhage, N., et al. “A mathematical framework for transformer circuits.” Transformer Circuits Thread (2021); Olsson, Catherine, et al. “In-context learning and induction heads.” arXiv preprint arXiv:2209.11895 (2022); Nanda, Neel, et al. “Progress measures for grokking via mechanistic interpretability.” arXiv preprint arXiv:2301.05217 (2023).

  5. ^

    Zhang, Daniel et al. “The AI Index 2022 Annual Report,” AI Index Steering Committee, Stanford Institute for Human-Centered AI, Stanford University, March 2022, p. 17.

  6. ^

    Chou, Daniel. “Counting AI Research”. Center for Security and Emerging Technology, July 2022: 3.

  7. ^

    Zhang et al., p. 3.

  8. ^

    Judging from the thematic areas identified by Zhang et al. (p. 19) and the keywords Chou used to search publication sources (p. 32).

  9. ^

    Hilton, Benjamin. “How many people are working (directly) on reducing existential risk from AI?” LessWrong, January 2023: https://www.lesswrong.com/posts/FYFrFjk57WrdFdQB8/how-many-people-are-working-directly-on-reducing-existential.

  10. ^

    Beygelzimer et al. “What we learned from NeurIPS 2019”. Medium, December 9, 2019: https://neuripsconf.medium.com/what-we-learned-from-neurips-2019-data-111ab996462c.

  11. ^

    Zhang, Daniel et al. “The AI Index 2022 Annual Report,” AI Index Steering Committee, Stanford Institute for Human-Centered AI, Stanford University, March 2022, 2022 AI Index Public Data—AI Publication and Patent Counts: https://docs.google.com/spreadsheets/d/1D_7XVaE4BK0DrEFRg0aLaQSYdGQ4YugA1OD3UPd4LMY/edit#gid=1620293967.

  12. ^

    Thibodeau, Jacques et al. “Results from a survey on tool use and workflows in alignment research”, AI Alignment Forum, December 2022: https://www.alignmentforum.org/posts/a2io2mcxTWS4mxodF/results-from-a-survey-on-tool-use-and-workflows-in-alignment.

  13. ^

    Kirchner, Jan H., et al. “Researching Alignment Research: Unsupervised Analysis.” arXiv preprint arXiv:2206.02841 (2022).

  14. ^

    Hubinger, Evan, et al. “Risks from learned optimization in advanced machine learning systems.” arXiv preprint arXiv:1906.01820 (2019).

  15. ^

    von Oswald, Johannes, et al. “Transformers learn in-context by gradient descent.” arXiv preprint arXiv:2212.07677 (2022).

  16. ^

    Christiano, Paul, Ajeya Cotra, and Mark Xu. “Eliciting latent knowledge: How to tell if your eyes deceive you.” (2021): https://docs.google.com/document/d/1WwsnJQstPq91_Yh-Ch2XRL8H_EpsnjrC1dwZXR37PC8/edit#heading=h.jrzi4atzacns.

  17. ^

    “ChatGPT is sensitive to tweaks to the input phrasing or attempting the same prompt multiple times. For example, given one phrasing of a question, the model can claim to not know the answer, but given a slight rephrase, can answer correctly.” Schulman et al. “Introducing ChatGPT”. OpenAI blog, November 30, 2022: https://openai.com/blog/chatgpt#OpenAI.