I have strong mixed feelings about the current state of labs’ alignment science blogs.
On one hand, I think it’s good they are making an effort to communicate what they are doing.
On the other hand, I think the current level of communication is far below what I’d like it to be.
To be clear, my bar for scientific publications is that they should enable an external party to do a reproduction. I have been disappointed by both Anthropic and OpenAI in this regard. A couple recent examples:
Anthropic’s “Teaching claude why” is incredibly scrubbed of relevant scientific detail, to the point that graphs don’t have meaningful labels. [edit: there is a fuller version here with more details]
OpenAI’s “Beneficial RL” is devoid of any meaningful detail about the titular method, to the point that it’s impossible to understand what they did or how to reproduce this.
To be clear, I appreciate that these were published at all. But IMO it’s hard to view these examples as much more than safety-washing.
Side note: I want to register appreciation for GDM’s alignment team, which has been putting out research notes that are much more detailed than the above two examples, e.g. this post on SDF, which IMO is just clearly a better scientific publication than “teaching claude why”.
Also—this wasn’t always the case! Just a couple years ago we had “Constitutional AI” and “Deliberative Alignment” which were extremely detailed technical accounts of their leading alignment procedures.
This is mostly mitigated in the case of Anthropic by the fact that Anthropic fellows are being mentored to do open-source replications and extensions of such internal work. For example, the core technique studied in teaching claude why is reproduced and analyzed further in Model spec midtraining. Other examples are the internal automated alignment audits (open-sourced as petri) and auditing games (open-source replication of the auditing game MO). In my opinion, these open-source replications are better for external-scientific-community-health than just giving more details.
I think releasing the internal work in a redacted way still makes sense (and is a reasonable tradeoff between enabling external research to build on top of private work and avoiding the AI R&D speedup you would have if you gave more details about your training stack): they give you a sense of how well the same techniques work using prod models and on more prod-like training stacks (knowing the number of training steps is not useful to answer that kind of question; the goal is to convey the key ideas in a way you could try to reproduce on your own stack with open-source models and training stacks. It’s not like you could just rerun the exact same experiment externally anyway).
I think “is the public communication about the results misleading about how much the company is on track to solving alignment” is a better thing to track than “do you have meaningful labels on graphs” when evaluating safety-washing.
Thanks, I appreciate the response.
I understand that there are tradeoffs encountered by labs on what to publish. If the cost-benefit analysis on publishing determined that you could only publish at the current level of detail, then I agree it was correct to publish it as is. I still think it’s a sad state of affairs, which I think you do not disagree on.
I also appreciate that there is open-source work done by Anthropic fellows. I think it is valuable and should continue to be done. I agree that MSM, petri, and the auditing game MO were valuable OS contributions.
[edit: some claims about “teaching claude why” retracted because they were based on incorrect information, see edit to top-level post]
Nonetheless, it is not clear to me that Anthropic Fellows’ work is usually intended to be representative of what is done internally in Anthropic. This seems big if true and I’m curious if it’s an officially-held stance. To the extent that this is load-bearing for Anthropic’s theory of change, I think this should be explicitly stated somewhere as a goal of the Anthropic Fellows program.
Retracted text:
In the case of “teaching claude why”, it wasn’t previously clear to me that the techniques here are similar to the ones used in model spec midtraining. From my read, the blogpost and the paper are doing quite different things. For example, here is what is described in the blogpost (emphasis mine):
We ultimately settled on a more OOD training set where the user faces an ethically ambiguous situation in which they can achieve a reasonable goal by violating norms or subverting oversight. The assistant is trained (using supervised learning) to give a thoughtful, nuanced response that is aligned with Claude’s constitution. Notably, it is the user who faces an ethical dilemma, and the AI provides them advice. This makes this training data substantially different from our honeypot distribution, where the AI itself is in an ethical dilemma and needs to take actions. We call this the “difficult advice” dataset.
From this I infer that it is primarily prompt-response data, akin to AFT in the model spec midtraining paper, whereas MSM focuses on midtraining, i.e. finetuning on synthetic documents. MSM itself reports that “MSM + AFT substantially outperforms AFT at all compute scales.” So the naive reading of the blogpost implies that the authors are not using the best technique. Not to mention that the blogpost has ~0 details about how this data is generated, what kinds of prompts are used, what kinds of evaluations are used, etc. So it’s hard to tell how similar it is to MSM.
Even granting that they are the same idea, and that MSM validates the idea in principle, it doesn’t directly follow from this that the idea works in practice. Empirical results in one setting (which is open-sourced) may not be indicative of empirical results in production-like settings. Maybe you think that they are, but that would be an argument you have to make.
they give you a sense of how well the same techniques work using prod models and on more prod-like training stacks
I agree that this is true, conditioned on filling in all unreported details in a way which assumes that the directional conclusion holds. However, that is not really a good standard of evidence in my opinion; I would like to depend less on assumed trust towards the implementers of experiments.
I really want to be tracking how on track Anthropic is to solving alignment. If I was extremely convinced it was going to be solved by default I would update my personal decision-making a lot. Unfortunately the blog posts don’t give much substance about this so I cannot really tell whether they are misleading or not.
There’s a lot more detail in the full version of Teaching Claude Why on Anthropic’s Alignment Science Blog.
wow, I hadn’t realised there were two versions of the same post, thanks!
Yeah, I liked teaching Claude why, and found it had sufficient detail to replicate. I would suggest editing the top-level post to remove it as an example.