mesaoptimizer comments on mesaoptimizer’s Shortform

mesaoptimizer 8 Apr 2023 16:30 UTC
Alignment agendas can generally be classified into two categories: blueprint-driven and component-driven. Understanding this distinction is probably valuable for evaluating and comprehending different agendas.

Blueprint-driven alignment agendas are approaches that start with a coherent blueprint for solving the alignment problem. They prioritize the overall structure and goals of the solution before searching for individual components or building blocks that fit within that blueprint. Examples of blueprint-driven agendas include MIRI’s agent foundations, Vanessa Kosoy and Diffractor’s Infrabayesianism, and carado’s formal alignment agenda. Research aimed at developing a more accurate blueprint, such as Nate Soares’ posts from 2022 onward, Adam Shimi’s epistemology-focused output, and John Wentworth’s deconfusion-style output, also falls into this category.

Component-driven alignment agendas, on the other hand, begin with available components and seek to develop new pieces that work well with existing ones. They focus on making incremental progress by developing new components that can be feasibly implemented and integrated with existing AI systems or techniques to address the alignment problem. OpenAI’s strategy, DeepMind’s strategy, Conjecture’s LLM-focused outputs, and Anthropic’s strategy are examples of this approach. Agendas that serve as temporary solutions by providing useful components that integrate with existing ones, such as ARC’s power-seeking evals, also fall under the component-driven category. Additionally, the Cyborgism agenda and the Accelerating Alignment agenda can be considered component-driven.

The blueprint-driven versus component-driven categorization seems to me more informative than dividing agendas into conceptual and empirical categories, because all viable alignment agendas require a combination of conceptual and empirical research. Categorizing agendas by the superficial pattern of their current research phase can be misleading. For instance, shard theory may initially appear to be a blueprint-driven conceptual agenda, like embedded agency. However, it is actually a component-driven agenda, as it involves developing pieces that fit with existing components.