About a year ago, Cotra proposed a different class of problem factorization experiments: “sandwiching”. We start with some ML model which has lots of knowledge from many different fields, like GPT-n. We also have a human who has a domain-specific problem to solve (e.g. a coding problem, or a translation into another language) but lacks the relevant domain knowledge (e.g. coding skills, or language fluency). The problem, roughly speaking, is to get the ML model and the human to work as a team and produce an outcome at least as good as that of a human expert in the domain. In other words, we want to factorize the “expert knowledge” and the “having a use-case” parts of the problem.
I don’t think “sandwiching” is best understood as a problem factorization experiment, though problem factorization is indeed one possible approach to improving performance in the sandwiching setting.
I prefer to define sandwiching as:
Testbeds where we hold out ground truth (or some other form of validation) for a given task and otherwise go no-holds-barred on trying to get maximally good results on that task (perhaps with some restriction on available resources, e.g., access only to non-expert humans). We can then assess how well each approach did using the held-out validation.
I think of sandwiching as the obvious way to assess a certain class of safety/alignment techniques rather than as a particularly opinionated approach.
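As a concrete (and heavily simplified) illustration of this kind of testbed, here is a minimal Python sketch of the evaluation loop. Every name in it (`run_sandwich_eval`, `technique`, `expert_baseline`, the scoring function) is hypothetical and invented for illustration; nothing here comes from the original discussion:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type aliases, purely for readability.
Task = str    # e.g. a coding problem statement
Answer = str  # e.g. a proposed solution

@dataclass
class SandwichResult:
    technique_score: float  # non-expert + model pipeline, scored on held-out truth
    expert_score: float     # expert-human baseline on the same tasks

def run_sandwich_eval(
    tasks: list[Task],
    held_out_truth: dict[Task, Answer],
    technique: Callable[[Task], Answer],        # any "no-holds-barred" pipeline,
                                                # restricted to non-expert labor
    expert_baseline: Callable[[Task], Answer],  # the expert we hope to match
    score: Callable[[Answer, Answer], float],
) -> SandwichResult:
    """Score a technique against held-out ground truth that the technique
    itself never sees, then compare it to an expert baseline."""
    technique_total = 0.0
    expert_total = 0.0
    for task in tasks:
        truth = held_out_truth[task]
        technique_total += score(technique(task), truth)
        expert_total += score(expert_baseline(task), truth)
    n = len(tasks)
    return SandwichResult(technique_total / n, expert_total / n)
```

The point of the sketch is only that the harness is agnostic about what `technique` does internally: factorization, debate, or anything else can be slotted in and assessed against the same held-out validation.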
I think the discussion here or possibly here presents a better perspective on sandwiching.