The first automatically produced, (human) peer-reviewed, (ICLR) workshop-accepted[/able] AI research paper: https://sakana.ai/ai-scientist-first-publication/
This is a very low-quality paper.
Basically, the paper does the following:
A 1-layer LSTM gets inputs of the form [operand 1][operator][operand 2], e.g. 1+2 or 3*5
It is trained (I think with a regression loss? but it’s not clear[1]) to predict the numerical result of the binary operation
The paper proposes an auxiliary loss that is supposed to improve “compositionality.”
As described in the paper, this loss is the average squared difference between successive LSTM hidden states
But in the actual code, what is actually used is instead the average squared difference between successive input embeddings (a sketch of both variants follows this list)
The paper finds (unsurprisingly) that this extra loss doesn’t help on the main task[2], while making various other errors and infelicities along the way
e.g. there’s train-test leakage, and (hilariously) it doesn’t cite the right source for the LSTM architecture[3]
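For concreteness, here's a minimal sketch of what the setup appears to be, showing both variants of the auxiliary loss: the one described in the paper (on successive hidden states) and the one actually implemented (on successive input embeddings). All names, sizes, and the MSE criterion are my assumptions, not taken from the paper or its code.

```python
import torch
import torch.nn as nn

# Hypothetical reconstruction of the setup; names, sizes, and the MSE
# criterion are my guesses, not taken from the paper or its code.
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b, "*": lambda a, b: a * b}
VOCAB = [str(d) for d in range(10)] + list(OPS)        # "0".."9", "+", "-", "*"
TOK = {t: i for i, t in enumerate(VOCAB)}

def make_example():
    """Sample an expression like "3*5" -> (token ids, numeric target)."""
    a, b = torch.randint(0, 10, (2,)).tolist()
    op = list(OPS)[torch.randint(0, len(OPS), (1,)).item()]
    tokens = torch.tensor([TOK[str(a)], TOK[op], TOK[str(b)]])
    return tokens, torch.tensor([float(OPS[op](a, b))])

class LSTMRegressor(nn.Module):
    def __init__(self, d_emb=32, d_hidden=64):
        super().__init__()
        self.emb = nn.Embedding(len(VOCAB), d_emb)
        self.lstm = nn.LSTM(d_emb, d_hidden, num_layers=1, batch_first=True)
        self.head = nn.Linear(d_hidden, 1)

    def forward(self, tokens):                  # tokens: (batch, 3)
        x = self.emb(tokens)                    # (batch, 3, d_emb)
        hs, _ = self.lstm(x)                    # (batch, 3, d_hidden)
        return self.head(hs[:, -1]), hs, x      # prediction, hidden states, embeddings

def aux_loss_as_described(hs):
    """Average squared difference between successive LSTM hidden states."""
    return ((hs[:, 1:] - hs[:, :-1]) ** 2).mean()

def aux_loss_as_implemented(x):
    """Average squared difference between successive input embeddings,
    which never interact with each other or with the LSTM's state."""
    return ((x[:, 1:] - x[:, :-1]) ** 2).mean()

model = LSTMRegressor()
tokens, target = make_example()
pred, hs, x = model(tokens.unsqueeze(0))
loss = nn.functional.mse_loss(pred, target.unsqueeze(0)) \
       + 0.1 * aux_loss_as_implemented(x)  # swap in aux_loss_as_described(hs) for the paper's version
```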
The theoretical justification presented for the “compositional loss” is very brief and unconvincing. But if I read into it a bit, I can see why it might make sense for the loss described in the paper (on hidden states).
This could regularize an LSTM to produce something closer to a simple sum or average of the input embeddings, which is “compositional” in the sense that inputs for different timesteps might end up in different subspaces. It’s still not very clear why you would want this property (it seems like at this point you’re just saying you don’t really want an LSTM, as this is trying to regularize away some of the LSTM’s flexibility), nor why an LSTM was chosen in the first place (in 2025!), but I can at least see where the idea came from.
However, the loss actually used in the code makes no sense at all. The embeddings can’t see one another, and the inputs are sampled independently from one another in data generation, so the code’s auxiliary loss is effectively just trying to make the input embeddings for all vocab tokens closer to one another in L2 norm. This has nothing to do with compositionality, and anyway, I suspect that the rest of the network can always “compensate for it” in principle by scaling up the input weights of the LSTM layer.[4]
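To illustrate the compensation argument (a toy demonstration under my own assumptions, not the paper's code): if the penalty succeeds in squeezing all the embeddings toward their mean, a 1-layer LSTM can undo that exactly by scaling its input weights up and folding the resulting constant shift into its input bias, so the function it computes is unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_emb, d_hidden, vocab, c = 8, 16, 13, 10.0    # hypothetical sizes; c = shrink factor

emb = nn.Embedding(vocab, d_emb)
lstm = nn.LSTM(d_emb, d_hidden, num_layers=1, batch_first=True)
tokens = torch.randint(0, vocab, (1, 3))       # some 3-token input, e.g. "3", "*", "5"

out_before, _ = lstm(emb(tokens))

with torch.no_grad():
    m = emb.weight.mean(dim=0)                 # mean embedding vector
    W = lstm.weight_ih_l0.clone()              # original input-to-hidden weights, (4*d_hidden, d_emb)
    # Pretend the penalty worked: shrink every embedding toward the mean by a
    # factor c, which shrinks all pairwise embedding differences by the same factor.
    emb.weight.copy_(m + (emb.weight - m) / c)
    # Compensation: scale the input weights back up by c and absorb the
    # leftover constant term (c - 1) * W @ m into the input bias.
    lstm.weight_ih_l0.mul_(c)
    lstm.bias_ih_l0.sub_((c - 1.0) * (W @ m))

out_after, _ = lstm(emb(tokens))
print(torch.allclose(out_before, out_after, atol=1e-5))   # True: same function, smaller aux loss
```

In practice the optimizer may not find this rescaling, which is footnote [4]'s point about learning rate and training duration.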
If it had actually used the loss on hidden states as described, this would still be a bad paper: it reports a negative result and under-motivates the idea so that it’s not clear why the negative result might be noteworthy. (Plus: LSTM, weird arithmetic regression toy task, etc.)
Once you take into account the nonsensical loss that was actually used, it’s just… nothing. The idea makes no sense, was not motivated at all, was inaccurately described in the paper, and does not work in practice.
To Sakana’s credit, they did document all of these problems in their notes on the paper – although they were less critical than I am. In principle they could have hidden away the code issue rather than mentioning it, and the paper would have seemed less obviously bad… I guess this is a really low bar, but still, it’s something.
[1] The Sakana code review shows that it’s evaluated with a regression loss, but it’s not clear out of context what criterion (the training loss function) is, AFAICT. (Edit after looking over the code again: the model returns a number, and the data generation code also returns the targets as numbers. So the loss function is comparing a number to another one. Unless it’s doing some cursed re-tokenization thing internally, it’s regression.)
[2] Note that it never directly tests for “compositionality,” just for test set performance on the main task. Although in a few places it conflates its so-called “compositional loss” with compositionality itself, e.g. it claims that the regularization “effectively enforces compositionality” when in fact the evidence just shows that it decreases the auxiliary loss, which of course it does – that’s what happens when you minimize something, it goes down.
[3] Hochreiter & Schmidhuber 1997 has over 100k citations; it’s one of the most frequently cited references in all of ML. You’d think an LLM would have memorized that at least!
[4] Although in practice this is limited by the learning rate and the duration of training, which may explain why the paper got worse main-task performance with stronger regularization even though the regularization is “conceptually” a no-op.
The first thing that comes to mind is to ask what proportion of human-generated papers are any more worthy of publication (since a lot of them are slop), but let’s not forget that publication matters little for catastrophic risk; what would matter is actually getting results.
So I recommend not updating at all on AI risk based on Sakana’s results (or updating negatively if you expected that R&D automation would come faster, or that this might slow down human augmentation).
I’m skeptical.
Did the Sakana team publish the code that their scientist agent used to write the compositional regularization paper? The post says:

For our choice of workshop, we believe the ICBINB workshop is a highly relevant choice for the purpose of our experiment. As we wrote in the main text, we selected this workshop because of its broader scope, challenging researchers (and our AI Scientist) to tackle diverse research topics that address practical limitations of deep learning, unlike most workshops with a narrow focus on one topic.

This workshop focuses particularly on understanding limitations of deep learning methods applied to real world problems, and encourages participants to study negative experimental outcomes. Some may criticize our choice of a workshop that encourages discussion of “negative results” (implying that papers discussing negative results are failed scientific discoveries), but we disagree, and we believe this is an important topic.
and while it is true that “negative results” are important to report, “we report a negative result because our AI agent put forward a reasonable and interesting hypothesis, competently tested the hypothesis, and found that the hypothesis was false” looks a lot like “our AI agent put forward a reasonable and interesting hypothesis, flailed around trying to implement it, had major implementation problems, and wrote a plausible-sounding paper describing its failure as a fact about the world rather than a fact about its skill level”.
The paper has a few places with giant red flags where it seems that the reviewer assumes there were solid results that the author of the paper was simply not reporting skillfully, for example in section B2.
I favor an alternative hypothesis: the Sakana agent determines where a graph belongs, what would be on the x- and y-axes of that graph, what it expects the graph to look like, and how to generate it. It then generates the graph and inserts the caption the graph would have if its hypothesis were correct. The agent has no particular ability to notice that its description doesn’t match the graph.