Shallow review of technical AI safety, 2025

Website · Data · Gestalt · Repo

Change in 18 latent capabilities between GPT-3 and o1, from Zhou et al (2025)

This is the third annual review of what’s going on in technical AI safety. You could stop reading here and instead explore the data on our website.

It’s shallow in the sense that 1) we are specialists in almost none of it and 2) we only spent about two hours on each entry. Still, among other things, we processed every arXiv paper on alignment, all Alignment Forum posts, and a year’s worth of Twitter.

It is substantially a list of lists structuring around 800 links. The point is to produce stylised facts, forests out of trees; to help you look up what’s happening, or that thing you vaguely remember reading about; to help new researchers orient, know some of their options and the standing critiques; and to help you find who to talk to for actual information. We also track things which didn’t pan out.

Here, “AI safety” means technical work intended to prevent future cognitive systems from having large unintended negative effects on the world. So it’s capability restraint, instruction-following, value alignment, control, and risk awareness work.

We don’t cover security or resilience at all.

We ignore a lot of relevant work (including most of capability restraint): things like misuse, policy, strategy, OSINT, resilience and indirect risk, AI rights, general capabilities evals, and things closer to “technical policy” and products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We focus on papers and blogposts (rather than say, gdoc samizdat or tweets or Githubs or Discords). We only use public information, so we are off by some additional unknown factor.

We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can write a paper on already).

Even ignoring all of that as we do, it’s still too long to read. Here’s a spreadsheet version and the repo including the data. Methods down the bottom. Gavin’s takes outgrew this post and became their own thing.

If we missed something big or got something wrong, please comment and we will edit it in.

An Arb Research project. Work funded by Coefficient Giving (formerly OpenPhil).

We have tried to falsify this but it wasn’t easy.

Labs (giant companies)

Safety team as % of all staff (not counting the AIs): <5% · <3% · <3% · <2% · <1%

Leadership’s stated timelines to full auto AI R&D: mid-2027 · 2028 to 2035 · March 2028 · N/A · ASI by 2030

Leadership’s stated P(AI doom): 25% · “Non-zero” and >5% · ~2% · ~0% · 20%

Legal obligations: EU CoP, SB53 · EU CoP, SB53 · EU CoP, SB53 · SB53 · EU CoP (Safety), SB53

Average Safety Score (ZSP, SaferAI, FLI): 51% · 27% · 33% · 17% · 17%

OpenAI

Google DeepMind

Anthropic

xAI

Meta

China

The Chinese companies don’t attempt to be safe, often not even in the prosaic safeguards sense. They drop the weights immediately after post-training finishes. They’re mostly open weights and closed data. As of writing the companies are often severely compute-constrained. There are some informal reasons to doubt their capabilities. The (academic) Chinese AI safety scene is however also growing.

  • Alibaba’s Qwen3-etc-etc is nominally at the level of Gemini 2.5 Flash. Maybe the only Chinese model with a large Western userbase, including businesses, but since it’s self-hosted this doesn’t translate into profits for them yet. On one ad hoc test it was the only Chinese model not to collapse OOD, but the Qwen2.5 corpus was severely contaminated.

  • DeepSeek’s v3.2 is nominally around the same as Qwen. The CCP made them waste months trying Huawei chips.

  • Moonshot’s Kimi-K2-Thinking has some nominally frontier benchmark results and a pleasant style but does not seem frontier.

  • Baidu’s ERNIE 5 is again nominally very strong, a bit better than DeepSeek. This new one seems to not be open.

  • Z’s GLM-4.6 is around the same as Qwen. The product director was involved in the MIT Alignment group.

  • MiniMax’s M2 is nominally better than Qwen, around the same as Grok 4 Fast on the usual superficial benchmarks. It does fine on one very basic red-team test.

  • ByteDance does impressive research in a lagging paradigm, diffusion LMs.

  • There are others but they’re marginal for now.

Others

  • Amazon’s Nova Pro is around the level of Llama 3 90B, which in turn is around the level of the original GPT-4. So 2 years behind. But they have their own chip.

  • Microsoft are now mid-training on top of GPT-5. MAI-1-preview is around DeepSeek V3.0 level on Arena. They continue to focus on medical diagnosis. You can request access.

  • Mistral have a reasoning model, Magistral Medium, and released the weights of a little 24B version. It’s a bit worse than DeepSeek R1, pass@1.

Black-box safety (understand and control current model behaviour)

Iterative alignment

Nudging base models by optimising their output. Worked on by the post-training teams at most labs; we estimate the FTEs at >500 in some sense. Funded by most of the industry.

Iterative alignment at pretrain-time

Guide weights during pretraining.

Iterative alignment at post-train-time

Modify weights after pre-training.

Black-box make-AI-solve-it

Focus on using existing models to improve and align further models.

Inoculation prompting

Prompt mild misbehaviour in training, to prevent the failure mode where once an AI misbehaves in a mild way, it becomes more inclined towards all bad behaviour.

Inference-time: In-context learning

Investigate what runtime guidelines, rules, or examples provided to an LLM yield better behavior.

Inference-time: Steering

Manipulate an LLM’s internal representations or token probabilities without touching weights.

Capability removal: unlearning

Developing methods to selectively remove specific information, capabilities, or behaviors from a trained model (e.g. without retraining it from scratch). A mixture of black-box and white-box approaches.

Control

If we assume early transformative AIs are misaligned and actively trying to subvert safety measures, can we still set up protocols to extract useful work from them while preventing sabotage, and watching for incriminating behaviour?

Safeguards (inference-time auxiliaries)

Layers of inference-time defenses, such as classifiers, monitors, and rapid-response protocols, to detect and block jailbreaks, prompt injections, and other harmful model behaviors.

Chain of thought monitoring

Supervise an AI’s natural-language (output) “reasoning” to detect misalignment, scheming, or deception, rather than studying the actual internal states.

Model psychology

This section consists of a bottom-up set of things people happen to be doing, rather than a normative taxonomy.

Model values / model preferences

Analyse and control emergent, coherent value systems in LLMs, which change as models scale, and can contain problematic values like preferences for AIs over humans.

Character training and persona steering

Map, shape, and control the personae of language models, such that new models embody desirable values (e.g., honesty, empathy) rather than undesirable ones (e.g., sycophancy, self-perpetuating behaviors).

Emergent misalignment

Fine-tuning LLMs on one narrow antisocial task can cause general misalignment including deception, shutdown resistance, harmful advice, and extremist sympathies, even though those behaviors are never trained or rewarded directly. A new agenda which quickly led to a stream of exciting work.

Model specs and constitutions

Write detailed, natural language descriptions of values and rules for models to follow, then instill these values and rules into models via techniques like Constitutional AI or deliberative alignment.

Model psychopathology

Find interesting LLM phenomena like glitch tokens and the reversal curse; these are vital data for theory.

Better data

Data filtering

Builds safety into models from the start by removing harmful or toxic content (like dual-use info) from the pretraining data, rather than relying only on post-training alignment.

Hyperstition studies

Study, steer, and intervene on the following feedback loop: “we produce stories about how present and future AI systems behave” → “these stories become training data for the AI” → “these stories shape how AI systems in fact behave”.

Data poisoning defense

Develops methods to detect and prevent malicious or backdoor-inducing samples from being included in the training data.

Synthetic data for alignment

Uses AI-generated data (e.g., critiques, preferences, or self-labeled examples) to scale and improve alignment, especially for superhuman models.

Theory of change: We can overcome the bottleneck of human feedback and data by using models to generate vast amounts of high-quality, targeted data for safety, preference tuning, and capability elicitation.

See also: data quality for alignment, data filtering, scalable oversight, automated alignment research, weak-to-strong generalization.

Orthodox problems: goals misgeneralize out of distribution, superintelligence can fool human supervisors, value is fragile and hard to specify.

Target case: average case · Broad approach: engineering

Some names: Mianqiu Huang, Xiaoran Liu, Rylan Schaeffer, Nevan Wichers, Aram Ebtekar, Jiaxin Wen, Vishakh Padmakumar, Benjamin Newman

Estimated FTEs: 50-150

Critiques: Synthetic Data in AI: Challenges, Applications, and Ethical Implications. Sort of Demski.

Funded by: Anthropic, Google DeepMind, OpenAI, Meta AI, various academic groups.

Outputs in 2025:

Data quality for alignment

Improves the quality, signal-to-noise ratio, and reliability of human-generated preference and alignment data.

Goal robustness

Mild optimisation

Avoid Goodharting by getting AI to satisfice rather than maximise.
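
One well-known way to make “satisfice rather than maximise” concrete is quantilization: instead of taking the argmax of a proxy utility, sample from the top q fraction of candidate actions. The following is a minimal sketch under stated assumptions: a uniform base distribution over a finite action list, and a toy proxy. The name `quantilize` and the example values are illustrative, not drawn from any work listed here.

```python
import random

def quantilize(actions, proxy_utility, q=0.1, rng=None):
    """Sample uniformly from the top-q fraction of actions ranked by proxy utility.

    q=1.0 is pure sampling from the base distribution; q -> 0 approaches argmax.
    Assumes a uniform base distribution over `actions` (an illustrative choice).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    rng = rng or random.Random()
    ranked = sorted(actions, key=proxy_utility, reverse=True)
    k = max(1, int(len(ranked) * q))  # always keep at least one action
    return rng.choice(ranked[:k])

# Toy proxy: bigger is "better", so a pure maximiser would always pick 99.
actions = list(range(100))
choice = quantilize(actions, proxy_utility=lambda a: a, q=0.1, rng=random.Random(0))
```

Here `choice` lands somewhere in 90..99 rather than always at the extreme, which is the point: if the proxy is only trustworthy on typical actions, less optimisation pressure means less Goodharting.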

RL safety

Improves the robustness of reinforcement learning agents by addressing core problems in reward learning, goal misgeneralization, and specification gaming.

Assistance games, assistive agents

Formalize how AI assistants learn about human preferences given uncertainty and partial observability, and construct environments which better incentivize AIs to learn what we want them to learn.

Harm reduction for open weights

Develops methods, primarily based on pretraining data intervention, to create tamper-resistant safeguards that prevent open-weight models from being maliciously fine-tuned to remove safety features or exploit dangerous capabilities.

The “Neglected Approaches” Approach

Agenda-agnostic approaches to identifying good but overlooked empirical alignment ideas, working with theorists who could use engineers, and prototyping them.

White-box safety (i.e. Interpretability)

This section isn’t very conceptually clean. See the Open Problems paper or DeepMind for strong frames which are not useful for descriptive purposes.

Reverse engineering

Decompose a model into its functional, interacting components (circuits), formally describe what computation those components perform, and validate their causal effects to reverse-engineer the model’s internal algorithm.

Concept-based interpretability

Monitoring concepts

Identifies directions or subspaces in a model’s latent state that correspond to high-level concepts (like refusal, deception, or planning) and uses them to audit models for misalignment, monitor them at runtime, suppress eval awareness, debug why models are failing, etc.

Activation engineering

Programmatically modify internal model activations to steer outputs toward desired behaviors; a lightweight, interpretable supplement to fine-tuning.
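
As a toy illustration of the idea (not any particular group’s method): take a “steering vector” as the difference between mean activations on two contrasting prompt sets, then add a scaled copy to a hidden state at inference time. Everything below is made up for illustration, including the 3-dimensional activations, `alpha`, and the function names; real work hooks into a transformer’s residual stream.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def steering_vector(positive_acts, negative_acts):
    """Difference of mean activations between two contrasting prompt sets."""
    pos, neg = mean_vector(positive_acts), mean_vector(negative_acts)
    return [p - n for p, n in zip(pos, neg)]

def steer(hidden, direction, alpha=1.0):
    """Add a scaled steering direction to one hidden state; weights are untouched."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Hypothetical activations recorded at one layer for two prompt sets.
polite = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0]]
rude = [[0.0, 1.0, 2.0], [0.0, 3.0, 2.0]]
v = steering_vector(polite, rude)                # [2.0, -2.0, 0.0]
steered = steer([0.5, 0.5, 0.5], v, alpha=0.5)   # [1.5, -0.5, 0.5]
```

The scale `alpha` is the main knob: too small and nothing changes, too large and outputs tend to degrade.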

Extracting latent knowledge

Identify and decode the “true” beliefs or knowledge represented inside a model’s activations, even when the model’s output is deceptive or false.

Lie and deception detectors

Detect when a model is being deceptive or lying by building white- or black-box detectors. Some work below requires intent in its definition, while other work focuses only on whether the model states something it believes to be false, regardless of intent.

Model diffing

Understand what happens when a model is finetuned, what the “diff” between the finetuned and the original model consists in.

Sparse Coding

Decompose the polysemantic activations of the residual stream into a sparse linear combination of monosemantic “features” which correspond to interpretable concepts.
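
The most common instance of this is the sparse autoencoder. Here is a minimal sketch of one forward pass, assuming a ReLU encoder, a linear decoder, and an L1 sparsity penalty; the weights and the `l1_coeff` value are hypothetical, and real implementations use a tensor library and train the weights by gradient descent.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def sae_forward(x, w_enc, b_enc, w_dec, b_dec, l1_coeff=0.01):
    """One forward pass of a sparse autoencoder on a single activation vector.

    Returns (features, reconstruction, loss), where loss is squared
    reconstruction error plus an L1 sparsity penalty on the features.
    """
    pre = [p + b for p, b in zip(matvec(w_enc, x), b_enc)]
    features = [max(0.0, p) for p in pre]  # ReLU keeps features non-negative
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    mse = sum((a - r) ** 2 for a, r in zip(x, recon))
    loss = mse + l1_coeff * sum(features)  # features >= 0, so the sum is the L1 norm
    return features, recon, loss

# Hypothetical weights: a 2-dim activation and a dictionary of 3 features.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 x 2
w_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]      # 2 x 3
features, recon, loss = sae_forward([2.0, 0.0], w_enc, [0.0] * 3, w_dec, [0.0] * 2)
```

In this toy case only the first feature fires, the input is reconstructed exactly, and the remaining loss is just the sparsity penalty.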

Causal Abstractions

Verify that a neural network implements a specific high-level causal model (like a logical algorithm) by finding a mapping between high-level variables and low-level neural representations.

Data attribution

Quantifies the influence of individual training data points on a model’s specific behavior or output, allowing researchers to trace model properties (like misalignment, bias, or factual errors) back to their source in the training set.

Pragmatic interpretability

Directly tackling concrete, safety-critical problems on the path to AGI by using lightweight interpretability tools (like steering and probing) and empirical feedback from proxy tasks, rather than pursuing complete mechanistic reverse-engineering.

Other interpretability

Interpretability that does not fall well into other categories.

Learning dynamics and developmental interpretability

Builds tools for detecting, locating, and interpreting key structural shifts, phase transitions, and emergent phenomena (like grokking or deception) that occur during a model’s training and in-context learning phases.

Representation structure and geometry

What do the representations look like? Does any simple structure underlie the beliefs of all well-trained models? Can we get the semantics from this geometry?

Human inductive biases

Discover connections deep learning AI systems have with human brains and human learning processes. Develop an ‘alignment moonshot’ based on a coherent theory of learning which applies to both humans and AI systems.

Safety by construction

Approaches which try to get assurances about system outputs while still being scalable.

Guaranteed-Safe AI

Have an AI system generate outputs (e.g. code, control systems, or RL policies) which it can quantitatively guarantee comply with a formal safety specification and world model.

Scientist AI

Develop powerful, nonagentic, uncertain world models that accelerate scientific progress while avoiding the risks of agent AIs

Brainlike-AGI Safety

Social and moral instincts are (partly) implemented in particular hardwired brain circuitry; let’s figure out what those circuits are and how they work; this will involve symbol grounding. “a yet-to-be-invented variation on actor-critic model-based reinforcement learning”

Make AI solve it

Weak-to-strong generalization

Use weaker models to supervise and provide a feedback signal to stronger models.

Supervising AIs improving AIs

Build formal and empirical frameworks where AIs supervise other (stronger) AI systems via structured interactions; construct monitoring tools which enable scalable tracking of behavioural drift, benchmarks for self-modification, and robustness guarantees

AI explanations of AIs

Make open AI tools to explain AIs, including AI agents. e.g. automatic feature descriptions for neuron activation patterns; an interface for steering these features; a behaviour elicitation agent that “searches” for a specified behaviour in frontier models.

Debate

In the limit, it’s easier to compellingly argue for true claims than for false claims; exploit this asymmetry to get trusted work out of untrusted debaters.

LLM introspection training

Train LLMs to predict the outputs of high-quality whitebox methods, to induce general self-explanation skills that use their own ‘introspective’ access.

  • Theory of change: Use the resulting LLMs as powerful dimensionality reduction, explaining internals in a different way from interpretability methods and CoT. Distilling self-explanation into the model should make the skill scalable, since advances in self-explanation will feed off advances in general intelligence.

  • General approach: cognitive · Target case: mixed

  • Orthodox alignment problems: Goals misgeneralize out of distribution, Superintelligence can fool human supervisors, Superintelligence can hack software supervisors

  • See also: Transluce, Anthropic

  • Some names: Belinda Z. Li, Zifan Carl Guo, Vincent Huang, Jacob Steinhardt, Jacob Andreas, Jack Lindsey

  • Funded by: Schmidt Sciences, Halcyon Futures, John Schulman, Wojciech Zaremba

  • Estimated FTEs: 2-20

  • Some outputs: Training Language Models to Explain Their Own Computations · Emergent Introspective Awareness

Theory

Develop a principled scientific understanding that will help us reliably understand and control current and future AI systems.

Agent foundations

Develop philosophical clarity and mathematical formalizations of building blocks that might be useful for plans to align strong superintelligence, such as agency, optimization strength, decision theory, abstractions, concepts, etc.

Tiling agents

An aligned agentic system modifying itself into an unaligned system would be bad; research the ways this could occur and the infrastructure and approaches that would prevent it.

High-Actuation Spaces

Mech interp and alignment assume a stable “computational substrate” (linear algebra on GPUs). If later AI uses different substrates (e.g. something neuromorphic), methods like probes and steering will not transfer. Therefore, better to try and infer goals via a “telic DAG” which abstracts over substrates, and so sidestep the issue of how to define intermediate representations. Category theory is intended to provide guarantees that this abstraction is valid.

Asymptotic guarantees

Prove that if a safety process has enough resources (human data quality, training time, neural network capacity), then in the limit some system specification will be guaranteed. Use complexity theory, game theory, learning theory and other areas to both improve asymptotic guarantees and develop ways of showing convergence.

Heuristic explanations

Formalize mechanistic explanations of neural network behavior, automate the discovery of these “heuristic explanations” and use them to predict when novel input will lead to extreme behavior (i.e. “Low Probability Estimation” and “Mechanistic Anomaly Detection”).

Corrigibility

Behavior alignment theory

Predict properties of future AGI (e.g. power-seeking) with formal models; formally state and prove hypotheses about the properties powerful systems will have and how we might try to change them.

Other corrigibility

Diagnose and communicate obstacles to achieving robustly corrigible behavior; suggest mechanisms, tests, and escalation channels for surfacing and mitigating incorrigible behaviors

Ontology Identification

Natural abstractions

Develop a theory of concepts that explains how they are learned, how they structure a particular system’s understanding, and how mutual translatability can be achieved between different collections of concepts.

The Learning-Theoretic Agenda

Create a mathematical theory of intelligent agents that encompasses both humans and the AIs we want, one that specifies what it means for two such agents to be aligned; translate between its ontology and ours; produce formal desiderata for a training setup that produces coherent AGIs similar to (our model of) an aligned agent

Multi-agent first

Aligning to context

Align AI directly to the role of participant, collaborator, or advisor for our best real human practices and institutions, instead of aligning AI to separately representable goals, rules, or utility functions.

Aligning to the social contract

Generate AIs’ operational values from ‘social contract’-style ideal civic deliberation formalisms and their consequent rulesets for civic actors

Theory for aligning multiple AIs

Use realistic game-theory variants (e.g. evolutionary game theory, computational game theory) or develop alternative game theories to describe or predict the collective and individual behaviours of AI agents in multi-agent scenarios.

Tools for aligning multiple AIs

Develop tools and techniques for designing and testing multi-agent AI scenarios, for auditing real-world multi-agent AI dynamics, and for aligning AIs in multi-AI settings.

Aligned to who?

Technical protocols for taking seriously the plurality of human values, cultures, and communities when aligning AI to “humanity”

Aligning what?

Develop alternatives to agent-level models of alignment, by treating human-AI interactions, AI-assisted institutions, AI economic or cultural systems, drives within one AI, and other causal or constitutive processes as subject to alignment.

Evals

AGI metrics

Evals with the explicit aim of measuring progress towards full human-level generality.

Capability evals

Make tools that can actually check whether a model has a certain capability or propensity. We default to low-n sampling of a vast latent space but aim to do better.
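
To see why low-n sampling is a weak check, here is a minimal sketch of the Wilson score interval for a rate estimated from a handful of runs. The function name and the 8-out-of-10 example are illustrative assumptions, not taken from any eval suite.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half

# A model shows the behaviour in 8 of 10 sampled runs.
low, high = wilson_interval(8, 10)
```

With n = 10 the interval runs from roughly 0.49 to 0.94, so an observed “80%” says little about the true rate, and nothing about inputs the sample never reached.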

Autonomy evals

Measure an AI’s ability to act autonomously to complete long-horizon, complex tasks.

WMD evals (Weapons of Mass Destruction)

Evaluate whether AI models possess dangerous knowledge or capabilities related to biological and chemical weapons, such as biosecurity or chemical synthesis.

Situational awareness and self-awareness evals

Evaluate if models understand their own internal states and behaviors, their environment, and whether they are in a test or real-world deployment.

Steganography evals

Evaluate whether models can hide secret information or encoded reasoning in their outputs, such as in chain-of-thought scratchpads, to evade monitoring.

AI deception evals

Research demonstrating that AI models, particularly agentic ones, can learn and execute deceptive behaviors such as alignment faking, manipulation, and sandbagging.

AI scheming evals

Evaluate frontier models for scheming, a sophisticated, strategic form of AI deception where a model covertly pursues a misaligned, long-term objective while deliberately faking alignment and compliance to evade detection by human supervisors and safety mechanisms.

Sandbagging evals

Evaluate whether AI models deliberately hide their true capabilities or underperform, especially when they detect they are in an evaluation context.

Self-replication evals

Evaluate whether AI agents can autonomously replicate themselves by obtaining their own weights, securing compute resources, and creating copies of themselves.

Various Redteams

Attack current models and see what they do; deliberately induce bad things in current frontier models to test out our theories and methods.

Other evals

A collection of miscellaneous evaluations for specific alignment properties, such as honesty, shutdown resistance and sycophancy.


Orgs without public outputs this year

We are not aware of public technical AI safety output from these agendas and organizations, though they are active otherwise.

Graveyard (known to be inactive)

Changes this time

  • A few major changes to the taxonomy: the top-level split is now “black-box” vs “white-box” instead of “control” vs “understanding”. (We did try out an automated clustering but it wasn’t very good.)

  • The agendas are in general less charisma-based and more about solution type.

  • We did a systematic arXiv scrape on the word “alignment” (and filtered out the sequence indexing papers that fell into this pipeline). “Steerability” is one competing term used by academics.

  • We scraped >3000 links (arXiv, LessWrong, several alignment publication lists, blogs and conferences), conservatively filtering and pre-categorizing them with an LLM pipeline. All were curated later, and many more added manually.

  • This review has ~800 links compared to ~300 in 2024 and ~200 in 2023. We looked harder and the field has grown.

  • We don’t collate public funding figures.

  • New sections: “Labs”, “Multi-agent First”, “Better data”, “Model specs”, “character training” and “representation geometry”. “Evals” is so massive it gets a top-level section.

Methods

Structure

We have again settled for a tree data structure for this post – but people and work can appear in multiple nodes so it’s not a dumb partition. Richer representation structures may be in the works.

The level of analysis for each node in the tree is the “research agenda”, an abstraction spanning multiple papers and organisations in a messy many-to-many relation. What makes something an agenda? Similar methods, similar aims, or something sociological about leaders and collaborators. Agendas vary greatly in their degree of coherent agency, from the very coherent Anthropic Circuits work to the enormous, leaderless and unselfconscious “iterative alignment”.
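
A minimal sketch of what “a tree where work can appear in multiple nodes” might look like as a data structure. This is an illustration, not the schema used in the repo; the class, the field names, and the `paper-A` and `paper-B` identifiers are hypothetical, while the node names are section titles from this post.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """A section or agenda in the review tree."""
    name: str
    children: list["Node"] = field(default_factory=list)
    links: set[str] = field(default_factory=set)  # papers/posts filed under this node

    def add(self, child: "Node") -> "Node":
        self.children.append(child)
        return child

    def walk(self):
        """Yield this node and all descendants, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()

def nodes_containing(root: Node, link: str) -> list[str]:
    """Names of every node that files a given link; may be more than one."""
    return [n.name for n in root.walk() if link in n.links]

root = Node("Review")
black_box = root.add(Node("Black-box safety"))
white_box = root.add(Node("White-box safety"))
steering = black_box.add(Node("Inference-time: Steering", links={"paper-A"}))
act_eng = white_box.add(Node("Activation engineering", links={"paper-A", "paper-B"}))

homes = nodes_containing(root, "paper-A")
```

The agendas form a strict tree, but the same link sits under two of them, so the links themselves are not partitioned.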

Scope

30th November 2024 – 30th November 2025 (with a few exceptions).

We’re focussing on “technical AGI safety”. We thus ignore a lot of work relevant to the overall risk: misuse, policy, strategy, OSINT, resilience and indirect risk, AI rights, general capabilities evals, and things closer to “technical policy” and products (like standards, legislation, SL4 datacentres, and automated cybersecurity). We also mostly focus on papers and blogposts (rather than, say, underground gdoc samizdat or Discords).

We only use public information, so we are off by some additional unknown factor.

We try to include things which are early-stage and illegible – but in general we fail and mostly capture legible work on legible problems (i.e. things you can write a paper on already).

Of the 2000+ links to papers, organizations and posts in the raw scrape, about 700 made it in.

Paper sources

  • All arXiv papers with “AI alignment”, “AI safety”, or “steerability” in the abstract or title; all papers of ~120 AI safety researchers

  • All Alignment Forum posts and all LW posts under “AI”

  • Gasteiger’s links, Paleka’s links, Lenz’s links, Zvi’s links

  • Ad hoc Twitter for a year, several conference pages and workshops

  • AI scrapes miss lots of things. We did a proper pass with a software scraper over more than 3000 links collected from the above, plus an LLM crawl of some of the pages, and then an LLM pass to pre-filter the links for relevance and pre-assign them to agendas and areas, but this also had systematic omissions. We ended up doing a full manual pass over a conservatively LLM-pre-filtered set of links, re-classifying the links and papers. The code and data can be found here. We are not aware of any proper studies of “LLM laziness” but it’s known amongst power users of copilots.

  • For finding critiques we used LW backlinks, Google Scholar cited-by, manual search, collected links, and Ahrefs. Technical critiques are somewhat rare, though, and even then our coverage here is likely severely lacking. We generally do not include social network critiques (mostly due to scope).

  • Despite this effort we will not have included all relevant papers and names. The omission of a paper or a researcher is not strong negative evidence about their relevance.
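
A minimal sketch of what a keyword pre-filter of this kind might look like. It is an illustration rather than the code in the repo: the include terms are the ones listed above, but the exclusion terms for the bioinformatics sense of “alignment” are assumptions, as are the example papers.

```python
import re

INCLUDE = re.compile(r"\b(AI alignment|AI safety|steerability)\b", re.IGNORECASE)
# Hypothetical exclusion list for the bioinformatics sense of "alignment".
EXCLUDE = re.compile(r"\b(sequence alignment|genome|protein)\b", re.IGNORECASE)

def is_candidate(title: str, abstract: str) -> bool:
    """Keep a paper if title or abstract matches INCLUDE and nothing matches EXCLUDE."""
    text = f"{title} {abstract}"
    return bool(INCLUDE.search(text)) and not EXCLUDE.search(text)

# Made-up records standing in for scraped metadata.
papers = [
    {"title": "Steerability of language models", "abstract": "We study prompts."},
    {"title": "Fast sequence alignment", "abstract": "AI safety is not discussed; genome indexing."},
    {"title": "Graph colouring", "abstract": "A combinatorial result."},
]
kept = [p["title"] for p in papers if is_candidate(p["title"], p["abstract"])]
```

A filter like this is deliberately crude: it is cheap to run over thousands of abstracts, and everything it keeps still goes through the LLM and manual passes.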

Processing

  • Collecting links throughout the year and at project start. Skimming papers, staring at long lists.

  • We drafted a taxonomy of research agendas. Based on last year’s list, our expertise and the initial paper collection, though we changed the structure to accommodate shifts in the domain: the top-level split is now “black-box” vs “white-box” instead of “control” vs “understanding”.

  • At around 500 links and growing, we decided to implement simple pipelines for crawling, scraping and LLM metadata extraction, pre-filtering and pre-classification into agendas, as well as other tasks – including final formatting later. The use of LLMs was limited to one simple task at a time, and the results were closely watched and reviewed. Code and data here.

  • We tried getting the AI to update our taxonomy bottom-up to fit the body of papers, but the results weren’t very good. Though we are looking at some advanced options (specialized embedding or feature extraction and clustering or t-SNE etc.)

  • Work on the ~70 agendas was distributed among the team. We ended up making many local changes to the taxonomy, esp. splitting up and merging agendas. The taxonomy is specific to this year, and will need to be adapted in coming years.

  • We moved any agendas without public outputs this year to the bottom, and the inactive ones to the Graveyard. For most of them, we checked with people from the agendas for outputs or work we may have missed.

  • What started as a brief summary editorial grew into its own thing (6000w).

  • We asked 10 friends in AI safety to review the ~80 page draft. After editing, last-minute additions and formatting, we asked 50 technical AI safety people for a quick review of their domains. Coverage was not perfect; all mistakes are our own.

  • The field is growing at around 20% a year. There will come a time when this list isn’t sensible to do manually even with the help of LLMs (at this granularity anyway). We may come up with better alternatives than lists and posts by then, though.
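
For a sense of scale, assuming the ~20% annual growth figure holds and compounds (a big assumption), the size of the field after t years and its doubling time are:

```latex
N(t) = N_0 \cdot 1.2^{\,t}, \qquad
t_{\text{double}} = \frac{\ln 2}{\ln 1.2} \approx 3.8 \text{ years}
```

So a list at this granularity would have to cover roughly twice as much work about four years from now.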

Other classifications

We added our best guess about which of Davidad’s alignment problems the agenda would make an impact on if it succeeded, as well as its research approach and implied optimism in Richard Ngo’s 3x3.

  • Which deep orthodox subproblems could it ideally solve? (via Davidad)

  • The target case: what part of the distribution over alignment difficulty do they aim to help with? (via Ngo)

    • optimistic-case: if CoT is faithful, pretraining as value loading, no stable mesa-optimizers, the relevant scary capabilities are harder than alignment, zero-shot deception is hard, goals are myopic, etc

    • pessimistic-case: if we’re in-between the above and the below

    • worst-case: if power-seeking is rife, zero-shot deceptive alignment, steganography, gradient hacking, weird machines, weird coordination, deep deceptiveness

  • The broad approach: roughly what kind of work is it doing, primarily? (via Ngo)

    • engineering: iterating over outputs

    • behavioural: understanding the input-output relationship

    • cognitive: understanding the algorithms

    • maths/philosophy: providing concepts for the other approaches

Acknowledgments

These people generously helped with the review by providing expert feedback, literature sources, advice, or otherwise. Any remaining mistakes are ours.

Thanks to Neel Nanda, Owain Evans, Stephen Casper, Alex Turner, Caspar Oesterheld, Steve Byrnes, Adam Shai, Séb Krier, Vanessa Kosoy, Nora Ammann, David Lindner, John Wentworth, Vika Krakovna, Filip Sondej, JS Denain, Jan Kulveit, Mary Phuong, Linda Linsefors, Yuxi Liu, Ben Todd, Ege Erdil, Tan Zhi-Xuan, Jess Riedel, Mateusz Bagiński, Roland Pihlakas, Walter Laurito, Vojta Kovařík, David Hyland, plex, Shoshannah Tekofsky, Fin Moorhouse, Misha Yagudin, Nandi Schoots, and Nuno Sempere for helpful comments.


Appendix: Other reviews and taxonomies


Epigram

> No punting — we can’t keep building nanny products. Our products are overrun with filters and punts of various kinds. We need capable products and [to] trust our users.

Sergey Brin

> Over the decade I’ve spent working on AI safety, I’ve felt an overall trend of divergence; research partnerships starting out with a sense of a common project, then slowly drifting apart over time… eventually, two researchers are going to have some deep disagreement in matters of taste, which sends them down different paths.

> Until the spring of this year, that is… something seemed to shift, subtly at first. After I gave a talk—roughly the same talk I had been giving for the past year—I had an excited discussion about it with Scott Garrabrant. Looking back, it wasn’t so different from previous chats we had had, but the impact was different; it felt more concrete, more actionable, something that really touched my research rather than remaining hypothetical. In the subsequent weeks, discussions with my usual circle of colleagues took on a different character—somehow it seemed that, after all our diverse explorations, we had arrived at a shared space.

– Abram Demski

Brought to you by the Arb Corporation
