Another short critique of the Anthropic “Hot Mess” paper
This post is about the Anthropic paper “The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?”.[1]
Putting aside issues one might have with (1) framing and (2) construct validity, which this post by RobertM already discusses, I think that the conclusion the paper reaches is underdetermined by the experiments it conducts.[2]
The basic idea of their main experiment (Figure 2) is: run some frontier LLMs on some benchmarks (e.g., MMLU/GPQA). Bin the questions by how many reasoning tokens were used on average. Then look at incoherence as a function of reasoning tokens used. They show that using more reasoning tokens is associated with higher incoherence.
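To make the setup concrete, here is a minimal sketch of that kind of binned analysis. Everything in it is hypothetical: the field names, the data, and the incoherence measure (here, one minus the fraction of samples agreeing with the modal answer) are placeholders I chose for illustration, not the paper's actual pipeline or metric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical per-sample records: one row per (question, sample), with the
# answer given and the number of reasoning tokens used. Field names, data,
# and the incoherence measure below are all placeholders, not the paper's.
n_questions, n_samples = 100, 30
df = pd.DataFrame({
    "question_id": np.repeat(np.arange(n_questions), n_samples),
    "answer": rng.choice(list("ABCD"), size=n_questions * n_samples),
    "reasoning_tokens": rng.lognormal(6, 0.5, size=n_questions * n_samples),
})

def incoherence(answers: pd.Series) -> float:
    """Placeholder metric: 1 - fraction of samples agreeing with the modal answer."""
    return 1.0 - answers.value_counts(normalize=True).max()

# Per-question summary: average reasoning tokens and answer dispersion.
per_q = df.groupby("question_id").agg(
    mean_tokens=("reasoning_tokens", "mean"),
    incoh=("answer", incoherence),
)

# Figure-2-style readout: bin questions by average reasoning tokens and look
# at mean incoherence per bin (their claim is that it rises with the bin).
per_q["token_bin"] = pd.qcut(per_q["mean_tokens"], q=5, labels=False)
print(per_q.groupby("token_bin")["incoh"].mean())
```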
A natural confound to worry about is task difficulty. That is, maybe the LLM is using more reasoning tokens on harder problems. To their credit, they point this out and run an experiment to address it, but I'm not super convinced by the conclusion they draw from it.
Here’s the experiment (Figure 3). For each question, they bucket attempts into below-median and above-median reasoning. They show that the above-median attempts have higher incoherence than the below-median ones, on average. They can only use two buckets because the split is done per question (they collect 30 samples per question), and the per-question split is what controls for difficulty. They also show that accuracy in the two buckets is basically the same. (This is good; otherwise it could be, e.g., that the low-reasoning bucket is mostly attempts where the model gets the question right and the high-reasoning bucket mostly attempts where it gets it wrong.)
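For concreteness, here is what I understand that per-question split to look like, continuing the hypothetical sketch above (it reuses df, rng, and incoherence from there); the accuracy labels are likewise placeholders.

```python
# Continuing the hypothetical sketch above (reuses df, rng, and incoherence).
# Placeholder accuracy labels for illustration only.
df["correct"] = rng.random(len(df)) < 0.5

def split_stats(g: pd.DataFrame) -> pd.Series:
    """Split one question's samples at its own median reasoning-token count."""
    median = g["reasoning_tokens"].median()
    low = g[g["reasoning_tokens"] <= median]
    high = g[g["reasoning_tokens"] > median]
    return pd.Series({
        "incoh_low": incoherence(low["answer"]),
        "incoh_high": incoherence(high["answer"]),
        "acc_low": low["correct"].mean(),
        "acc_high": high["correct"].mean(),
    })

per_q_split = df.groupby("question_id").apply(split_stats)
# Their finding: incoh_high > incoh_low on average, while acc_low is about
# the same as acc_high.
print(per_q_split.mean())
```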
However, here’s another explanation that is equally consistent with the data. It seems likely that, for many questions, fewer answer choices can be justified by a short (plausible) reasoning chain than by a long (plausible) one. This would be an inherent property of the questions, and would have nothing to do with the models answering them. So the below-median vs. above-median experiment doesn’t really convince me.
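To illustrate what I mean, here is a toy simulation of that alternative story, continuing the hypothetical sketch above (it reuses rng and incoherence). All numbers are made up. The only mechanism in it is a question-level property, namely that short chains can plausibly end at fewer answer choices than long chains, and yet it reproduces the above-median vs. below-median gap.

```python
# Toy simulation of the alternative story (reuses rng, np, pd, incoherence
# from the sketches above). All numbers are made up. The only mechanism is a
# question-level property: a short reasoning chain can plausibly end at 2
# answer choices, a long one at 4. There is no "more reasoning makes the
# model more incoherent" effect anywhere.
rows = []
for q in range(200):
    for _ in range(30):
        tokens = rng.lognormal(6, 0.5)
        n_plausible = 2 if tokens < np.exp(6) else 4
        rows.append({
            "question_id": q,
            "reasoning_tokens": tokens,
            "answer": rng.choice(list("ABCD")[:n_plausible]),
        })
toy = pd.DataFrame(rows)

def split_incoherence(g: pd.DataFrame) -> pd.Series:
    median = g["reasoning_tokens"].median()
    return pd.Series({
        "incoh_low": incoherence(g.loc[g["reasoning_tokens"] <= median, "answer"]),
        "incoh_high": incoherence(g.loc[g["reasoning_tokens"] > median, "answer"]),
    })

# The above-median half still comes out more "incoherent" than the
# below-median half, purely because of the questions' structure.
print(toy.groupby("question_id").apply(split_incoherence).mean())
```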
One response you might have to this critique is that their main experiment (Figure 2) is just making a claim about correlation, not causation. Indeed, I am not disputing the correlation part. However, if the outcomes of your experiments are also consistent with a plausible hypothesis that is completely independent of the models being tested, then I simply do not believe that you can draw conclusions about the models being tested from those experiments.
In conclusion, even if you put aside high-level objections related to framing (i.e., “does this paper fit in a broader AI safety case”) and construct validity (i.e., “does their definition of coherence capture what we actually think of as coherence”), I think that the conclusion the paper reaches is underdetermined by the experiments it conducts.
[1] Or more specifically, research done as part of the Anthropic Fellows program.
[2] Note: in the explanation that follows, I will not comprehensively discuss all experiments in the paper. Instead, I will only discuss experiments relevant to the particular critique I am making.