TLDR:

What OP calls “streetlighting”, I call an efficient way to prioritize problems by tractability. This is only a problem insofar as we cannot also prioritize by relevance.
I think problematic streetlighting is largely due to incentives, not because people are not smart / technically skilled enough. Therefore solutions should fix incentives rather than just recruiting smarter people.
First, let me establish that theorists very often disagree on what the hard parts of the alignment problem are, precisely because not enough theoretical and empirical progress has been made to generate agreement on them. The lists of “core hard problems” that OP cites all differ from one another, and Paul Christiano famously wrote a 27-point list of disagreements with Eliezer’s. This means that most people’s views of the problem are wrong, and should they stick to their guns, they might perseverate on either an irrelevant problem or a doomed approach.
I’d guess that, historically, perseveration has been just as large a problem as streetlighting among alignment researchers. Think of all the top alignment researchers in 2018 and all the agendas that haven’t seen much progress. Decision theory should probably not take ~30% of researcher time like it did back in the day.[1]
In fact, failure is especially likely for people who try to tackle “core hard problems” head-on, and not because of a lack of intelligence. Many “core hard problems” are observations of a lack of structure, or observations of what might happen in extreme generality, e.g. Eliezer’s:
“We’ve got no idea what’s actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.”
(summarized) “Outer optimization doesn’t in general produce aligned inner goals”, or
“Human beings cannot inspect an AGI’s output to determine whether the consequences will be good.”
which, I will note, have a completely different type signature from the subproblems that people can actually tractably research. Sometimes we fail to define a tractable line of attack. Other times these ill-defined problems get turned into entire subfields of alignment, like interpretability, which are filled with dozens of blind alleys of irrelevance that extremely smart people frequently fall victim to. For comparison, some examples of problems ML and math researchers can actually work on:
Unlearning: Develop a method for post-hoc editing a model so that it behaves as if it had never been trained on certain data points (see the sketch after this list).
Causal inference: Develop methods for estimating the causal graph between events from observational data.
Fermat’s last theorem: Prove whether there are positive integer solutions to a^n + b^n = c^n for any integer n > 2.
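To make the contrast concrete, here is a minimal toy sketch (my own illustration, not OP’s, assuming NumPy and scikit-learn) of why unlearning counts as well-defined: there is an operational success criterion you can measure progress against. The unlearning step itself is left as a placeholder no-op.

```python
# Toy sketch of the unlearning problem's success criterion.
# Assumes NumPy and scikit-learn; the "unlearning" step is a placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

forget = np.arange(100)                        # hypothetical "forget set"
keep = np.setdiff1d(np.arange(len(X)), forget)

# Gold standard: a model retrained from scratch without the forget set.
retrained = LogisticRegression().fit(X[keep], y[keep])

# An unlearning method would instead edit the original model post hoc;
# the no-op below is just a stand-in for such a method.
original = LogisticRegression().fit(X, y)
unlearned = original

# Success criterion: the edited model's behavior should be (near-)
# indistinguishable from the retrained-from-scratch model.
probes = rng.normal(size=(200, 5))
gap = np.abs(
    unlearned.predict_proba(probes)[:, 1] - retrained.predict_proba(probes)[:, 1]
).mean()
print(f"mean probability gap vs. retrained model: {gap:.3f}")
```

The code itself is trivial; the point is that “does the edited model behave like one retrained without the forget set?” is a question on which progress can actually be measured, unlike “what is going on inside the giant inscrutable matrices?”.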
The unit of progress is therefore not “core hard problems” directly, but methods that solve well-defined problems and will also be useful in scalable alignment plans. We must try to understand the problem and update our research directions as we go. Everyone has to pivot because the exact path you expected to solve a problem basically never works. But we have to update on tractability as well as relevance! For example, Redwood (IMO correctly) pivoted away from interp because other plans seemed viable (relevance) and it seemed too hard to explain enough AI cognition through interpretability to solve alignment (tractability).[2]
OP seems to think flinching away from hard problems is usually cope / not being smart enough. But the types of cope OP lists are completely valid as either fundamental problem-solving strategies or prioritization. (4 is an incentives problem, which I’ll come back to later.)
Carol explicitly introduces some assumption simplifying the problem, and claims that without the assumption the problem is impossible. [...]
Carol explicitly says that she’s not trying to solve the full problem, but hopefully the easier version will make useful marginal progress.
Carol explicitly says that her work on easier problems is only intended to help with near-term AI, and hopefully those AIs will be able to solve the harder problems.
1 and 2 are fundamental problem-solving techniques: 1 is a crucial part of Pólya’s first step (understand the problem), and 2 is a core technique for actually solving the problem. I don’t like relying on 3 as stated, but there are many valid reasons for focusing on near-term AI[3].
Now, I do think there is lots of distortion of research in unhelpful directions related to (1, 2, 3), often due to publication incentives.[4] But understanding the problem and solving easier versions of it has a great track record in complicated engineering; you just have to solve the hard version eventually (assuming we don’t get lucky with alignment being easy, which is very possible but which we shouldn’t plan for).
So to summarize my thoughts:
Streetlighting is real, but much of what OP calls streetlighting is a justified focus on tractability.
We can only solve “core hard problems” by creating tractable, well-defined problems.
OP’s suggested solution—higher intelligence and technical knowledge—doesn’t seem to fit the problem.
There are dozens of ML PhDs, physics PhDs, and comparably smart people working on alignment. As Ryan Kidd pointed out, the stereotypical MATS student is now a physics PhD or technical professional. And yet, according to OP, most people are presumably still streetlighting.
Technically skilled people seem equally susceptible to incentive-driven streetlighting, as well as to perseveration.
If the incentives continue to be wrong, people who defy them might be punished anyway.
Instead, we should fix incentives, maybe like this:
Invest in making “core hard problems” easier to study
Reward people who have alignment plans that at least try to scale to superintelligence
Reward people who think about whether others’ work will be helpful when it comes to superintelligence
Develop things like alignment workshops, so people have a venue to publish genuine progress that is not legible to conferences
Pay researchers with illegible results more to compensate for their lack of publication / social rewards
[1] MIRI’s focus on decision theory is itself somewhat due to streetlighting. As I understand it, 2012ish MIRI leadership’s worldview was that several problems had to be solved for AI to go well, but the one they could best hire researchers for was decision theory, so they did lots of that. Also, someone please correct me on the ~30% of researcher time claim if I’m wrong.
[2] OP’s research is not immune to this. My sense is that selection theorems would have worked out if there had been more and better results.
[3] E.g. if deploying on near-term AI will yield the empirical feedback needed to stay on track, if significant risk comes from near-term AI, if near-term AI will be used in scalable oversight schemes, …
[4] As I see it, there is lots of distortion by the publishing process now that lots of work is being published. Alignment is complex enough that progress in understanding the problem is a large enough quantity of work to be a paper. But in a paper, it’s very common to exaggerate one’s work, especially the validity of the assumptions[5], and people need to see through this for the field to function smoothly.
[5] I am probably guilty of this myself, though I try to honestly communicate my feelings about the assumptions in a long limitations section.