Newsletter for Alignment Research: The ML Safety Updates

Introducing the ML Safety Updates

TL;DR: We present a new AI safety update series in podcast, YouTube, and newsletter format, released weekly, to help you stay updated on alignment and ML safety research and get exposed to AI safety opportunities. Read the latest newsletter further down in this post.

Our motivations for this are two-fold:

  • It has never been easy to stay updated on the latest developments in specific research fields, and in the past couple of years the amount of alignment research has increased significantly. On top of that, much safety-relevant AI work is not found in legible EA / rationalist channels, e.g. cybersecurity, AI legislation, robustness, and monitoring.

  • Existing newsletters in alignment research focus on deep examinations of theory and give the reader detailed insights. However, there is no newsletter series covering up-to-date, weekly events.

Our newsletters summarize the past week’s research, both within alignment and safety-relevant AI work more broadly, and promote opportunities in the AI safety space.

For the past 7 weeks, we have released these updates as a YouTube video series summarizing novel AI and ML safety research in 4-6 minutes. This week, we released them in podcast and newsletter format as well, and future updates will also be posted to LessWrong. Subscribe here.

The case for an AI safety update series

There are already a few amazing resources on AI safety similar to newsletters. However, the ones that exist are either biased towards specific topics or have not been kept up to date over the past year. See our summary below.

  • Alignment Newsletter: Rohin Shah has kept the Alignment Newsletter running for a long while, and Rob Miles has recorded its entries as podcast episodes. It is also released in Chinese; see the whole team on the website and their spreadsheet of all newsletters. It started on April 9, 2018 and was released once a week, but there have been a total of 3 episodes in 2022.

  • ML Safety Newsletter: Dan Hendrycks sends out a Substack newsletter every month with summaries of new ML safety research.

  • AGISF Newsletter: This newsletter, managed by the Blue Dot Impact team, shares opportunities with over 1,000 subscribers somewhat regularly (~monthly or more).

  • Quintin’s Alignment Paper Review: Quintin releases a wonderfully comprehensive ~weekly review of work from fields adjacent or relevant to AI safety.

  • Rob Miles: Rob Miles uploads fantastic videos explaining key concepts in AI safety. He has been on YouTube since the Computerphile days. During the last year, there have been a total of 4 YouTube short videos on the channel, but several full-scale videos seem to be in the pipeline.

  • Machine Alignment Monday: Scott Alexander sometimes (~monthly) discusses new AI safety research.

Additionally, there are several update feeds for alignment research.

  • AlignmentForum: The de facto home of AI safety research, the Alignment Forum is a highly curated, high-quality place to share AI safety research within the AI safety community.

  • ML Safety Subreddit: This subreddit is organized by CAIS and shares papers that are usually not found in the AI safety channels, coming instead from the robustness, out-of-distribution detection, alignment, and monitoring fields of machine learning.

  • Twitter: Much research is shared on Twitter these days, and it can be a very good AI safety feed if you follow the right people.

  • LessWrong: The less restricted, less curated sister site to the AlignmentForum, with a substantial literature of AI safety-related work.

  • EA Forum: AI safety articles on the EA Forum are mostly about the general dynamics of AI safety cause prioritization, new organizations, project summaries, impact evaluations and AI timelines.

  • Discord servers (e.g. EleutherAI and our own Apart): There are several Discord servers in alignment where interesting discussions about organizations’ projects and unique AI safety readings are available.

  • Slack (e.g. AI Alignment and AGI Safety Fundamentals): These are very similar to the Discord servers but are often more professional and have stricter acceptance criteria.

  • Medium (e.g. DeepMind’s and Paul Christiano’s): Medium can give you a personalized feed based on who you follow so if you follow alignment researchers, you can use it as an AIS feed.

Do share if you think there are any major channels we missed in the update feeds sheet or the research update channels sheet.

Risks & uncertainties

  1. We misrepresent someone’s research or perspective in AI safety. This is a real risk given the weekly publishing cadence, which leaves limited time for review.

  2. The research we summarize plays into an existing paradigm and limits the creation of new ideas in AI safety research.

  3. A wrong representation of AI safety in the alignment updates leads to stigmatization of AI safety by the ML community and public actors.

  4. We stop updating the weekly newsletter, and our past existence prevents people from making a new one.

Risk mitigation

  1. We are very open to feedback and will keep a link to our manuscript in the description for you to comment on, so we can add any corrections (you can also go to the manuscript via the link further down).

  2. We will consciously look beyond the traditional channels of AI safety to find further resources every week. Additionally, we won’t disregard an article just because it doesn’t have the right amount of karma.

  3. We will strive to optimize the feedback mechanisms from the community to ensure that we integrate with, rather than separate from, the machine learning field.

  4. If we stop making the episodes, we will report this publicly and call for others to take over the task. If that is not possible, we will be very public about the complete shutdown of the series so others can fill the gap.


Give anonymous feedback on the series here or write your feedback in the comments here or on YouTube. You’re also very welcome to contact us or book a meeting here.

Please do reach out to us or comment in the manuscript doc if you think we misrepresented an article, opinion or perspective during an update.

Subscribe to our newsletters here, listen to the podcast here (Spotify), watch the YouTube videos here and read the newsletters here.

Thank you very much to Alexander Briand for in-depth feedback on this post.

This week’s ML Safety Update

This week, we’re looking at counterarguments to the basic case for why AI is an existential risk to humanity, looking at how strong AI might come very soon, and sharing interesting papers.

But first a small note: You can now subscribe to our newsletter and listen to these updates in your favorite podcasting app.

Today is October 20th and this is the ML Safety Progress Update!

AI X-risk counterarguments

Existential risk from AI does not seem overwhelmingly likely, according to Katja Grace from AI Impacts. In a long article, she argues against the major perspectives on how AI could become very dangerous, while noting that there is enough uncertainty that AI safety remains a relevant concern despite the relatively low chance of catastrophe.

Her counterarguments go against the three main cases for why superintelligent AI will become an existential risk: 1) Superhuman AI systems will be goal-directed, 2) goal-directed AI systems’ goals will be bad, and 3) superhuman AI will overpower humans.

Her counterargument for why AI systems might not be goal-directed is that many highly functional systems can be “pseudo-agents”: models that don’t pursue utility maximization but instead optimize for a range of sub-goals to be met. Additionally, the bar for goal-directedness needed to pose a risk is extremely high.

Her arguments for why goal-directed AI systems’ goals might not be bad are that: 1) Even evil humans broadly correspond to human values, and slight divergence from the optimal policy seems alright. 2) AI might just learn the correct thing from the dataset, since humans also seem to get their behavior from the diverse training data of the world. 3) Deep learning seems very good at learning fuzzy things from data, and values seem learnable in much the same way as generating faces (and we don’t see faces without noses, for example). 4) AIs that learn short-term goals can be highly functional while having a low chance of optimizing for dangerous long-term goals such as power-seeking.

Superhuman AI might also not overpower humans since: 1) A genius human in the Stone Age would have had a much harder time getting to space than an average-intelligence human today, which shows that intelligence is a more nuanced concept than we often make it out to be. 2) AI might not be better than human-AI combinations. 3) AI will need our trust to take over critical infrastructure. 4) There are many properties other than intelligence that seem highly relevant. 5) Many goals do not end in taking over the universe. 6) Intelligence feedback loops could have many speeds, and you need a lot of confidence that they will be fast to say they lead to doom. And 7) key concepts in the literature are quite vague, meaning that we lack an understanding of how they will lead to existential risk.

Erik Jenner and Johannes Treutlein give their response to her counterarguments. Their main point is that there’s good evidence that the difference between AI and humans will be large and that we need Grace’s slightly aligned AI to help us reach a state where we do not build much more capable and more misaligned systems.

Comprehensive AI Services (CAIS)

A relevant text to mention in relation to these arguments is Eric Drexler’s attempt at reframing superintelligence into something more realistic in an economic world. Here, he uses the term “AI services” to describe systems that can solve singular, economically relevant tasks. The “comprehensive” in comprehensive AI services is what we usually call “general.” The main point is that we will see a lot of highly capable but specialized AI before we get monolithic artificial general intelligence. We recommend reading the report if you have the time.

Strong AGI coming soon

At the opposite end of the spectrum from Grace, Porby shares why they think AGI will arrive in the next 20 years with convincing arguments on 1) how easy the problem of intelligence is, 2) how immature current machine learning is, 3) how quickly we’ll reach the level of hardware needed, and 4) how we cannot look at current AI systems to predict future abilities.

Other news

  • In a new survey published in Nature, non-expert users of AI systems say interpretability is important to them, especially in safety-critical scenarios. However, they prefer accuracy in most tasks.

  • Neel Nanda shares an opinionated reading of his favorite circuits interpretability work.

  • A new method in reinforcement learning shows good results on both task performance and the morality of its actions. The authors take a text-based game and train a reinforcement learning agent with both a task policy and a moral policy.

  • John Wentworth notes how prediction markets might be useful for alignment research.

  • DeepMind has given a language model access to a physics simulation to increase its physics reasoning ability.

  • Nate Soares argues on game-theoretic grounds that superintelligent beings would not necessarily leave humans alive.

  • A new research agenda in AI safety seeks to study the theory of deep learning using a pragmatic approach to understand key concepts.


And now, diving into the many opportunities available for all interested in learning and doing more ML safety research!

This has been the ML Safety Progress Update and we look forward to seeing you next week!

Crossposted from EA Forum (30 points, 0 comments)