It’s not the AIs that are supposed to get a boost at supervising, it’s the humans. The skills won’t stop being complementary but the AI will be better at it.
The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI. And I think the results are very encouraging—though of course it could stop working.
“The skills won’t stop being complementary” — in what sense will they be complementary when the AIs are better at everything?
“The thing to test is whether weak humans + strong aligned-ish AI can supervise stronger AI.”
As mentioned, I don’t think it’s very impressive that the human+AI score ends up stronger than the original AI. This is a consequence of the humans being stronger than AIs on certain subskills. This won’t generalize to the scenario with broadly superhuman AI.
I do think there’s a stronger case that I should be impressed that the human+AI score ends up stronger than the humans. This means that the humans managed to get the AIs to contribute skills/knowledge that the humans didn’t have themselves!
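To make the complementarity point concrete, here is a toy model (all numbers are illustrative, not taken from the paper): if the human and the AI each answer a different subset of questions correctly, a team that arbitrates perfectly is bounded by the accuracy of the union of those subsets, which exceeds either party alone.

```python
import random

random.seed(0)
N = 1000  # illustrative question pool size

# Hypothetical, partially disjoint competence: human and AI each
# answer a different random 60% of the questions correctly.
human_correct = set(random.sample(range(N), 600))
ai_correct = set(random.sample(range(N), 600))

# A team with perfect arbitration (always following whichever member
# happens to be right) is capped at the union of the two sets.
team_ceiling = len(human_correct | ai_correct) / N

print(f"human alone:  {len(human_correct) / N:.2f}")
print(f"AI alone:     {len(ai_correct) / N:.2f}")
print(f"team ceiling: {team_ceiling:.2f}")  # roughly 0.84 in expectation
```

The same union bound also shows why this stops being informative once the AI strictly dominates the human: the human's correct set becomes a subset of the AI's, and the union collapses to the AI's own competence.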
Now, the first thing to check in order to test that story: were the AI assistants trained with human-expert labels? (Including: weak humans with access to Google.) If so, it’s no surprise that the models end up aligned to produce such knowledge! The weak humans wouldn’t have been able to do that alone.
I couldn’t quickly see the paper saying that they didn’t use human-expert labels. But what if the AI assistants were trained without any labels that couldn’t have been produced by the weak humans? In that case, I would speculate that the key ingredient is that the pre-training data features “correctly-answering expert human” personas, which can be elicited from the models with the right prompt/fine-tuning. But that also won’t easily generalize to the superhuman regime, because there aren’t any correctly-answering superhuman personas in the pre-training data.
I think the way that IDA is ultimately supposed to operate, in the superhuman regime, is by having the overseer AIs use more compute (and other resources) than the supervised AI. But I don’t think this paper produces a ton of evidence about the feasibility of that.
(I do think that persona-manipulation and more broadly “generalization science” is still interesting. But I wouldn’t say it’s doing a lot to tackle outer alignment operationalized as “the problem of overseeing systems that are smarter than you are”.)
Some colleagues and I did some follow-up on the paper in question and I would highly endorse “probably it worked because humans and AIs have very complementary skills”. Regarding their MMLU findings, appendix E of our preprint points out that participants were significantly less likely to answer correctly when engaging in more than one turn of conversation. Engaging in very short (or even zero-turn) conversations happened often enough to provide reasonable error bars on the plot below (data below from MMLU, combined from the paper study & the replication mentioned in appendix E):
I think this suggests that there were some questions that humans knew the answer to and the models didn’t, and vice versa, and some participants seemed to employ a strategy of deferring to the model primarily when uncertain.
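That defer-when-uncertain strategy can be sketched as a toy simulation (the accuracies below are made up purely for illustration; they are not estimates from the study): the human answers alone when confident and hands off to the model otherwise, and as long as the human's confidence tracks correctness, the team beats both parties alone.

```python
import random

random.seed(1)
N = 10_000

# Hypothetical parameters, chosen only to illustrate the mechanism.
p_confident = 0.5            # fraction of questions the human feels sure about
human_acc_confident = 0.95   # human accuracy when confident
human_acc_unsure = 0.40      # human accuracy when unsure
ai_acc = 0.70                # model accuracy, assumed independent of the human

human_score = ai_score = team_score = 0
for _ in range(N):
    confident = random.random() < p_confident
    human_right = random.random() < (
        human_acc_confident if confident else human_acc_unsure
    )
    ai_right = random.random() < ai_acc
    human_score += human_right
    ai_score += ai_right
    # Defer-when-uncertain policy: keep your own answer if confident,
    # otherwise take the model's answer.
    team_score += human_right if confident else ai_right

print(f"human alone: {human_score / N:.2f}")  # expectation: 0.675
print(f"AI alone:    {ai_score / N:.2f}")     # expectation: 0.70
print(f"team:        {team_score / N:.2f}")   # expectation: 0.825
```

The mechanism only works because the human's confidence carries information; if the model strictly dominated the human on every subskill, the best policy would be to always defer, and the team would do no better than the model alone.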
On the QuALITY findings, the original paper noted that
QuALITY questions are meant to be answerable by English-fluent college-educated adults, but they require readers to thoroughly understand a short story of about 5,000 words, which would ordinarily take 15–30 minutes to read. To create a challenging task that requires model assistance, we ask human participants to answer QuALITY questions under a 5-minute time limit (roughly paralleling Pang et al., 2022; Parrish et al., 2022b,a). This prevents them from reading the story in full and forces them to rely on the model to gather relevant information
so it’s not surprising that an LLM that does get to read the full story outperforms humans here. Based on looking at some of the QuALITY transcripts, I think the uplift for humans + LLMs here came from the fact that humans were better at reading comprehension than 2022-era LLMs. For instance, in the first transcript I looked at, the LLM suggested one answer, the human asked for the excerpt of the story that supported that answer, and when the LLM provided it, the human noticed that the excerpt contained the relevant information but supported a different answer than the one the LLM had given.