It’s exciting to see OpenAI acknowledge that pre-training data filtering is part of their safety stack. When it comes to advanced technical content, minimizing the model’s exposure to sensitive material seems pretty intuitive. However, it is difficult to draw strong conclusions about the effectiveness of data filtering from this work, given how few details they share (understandably so). They do not indicate the effort invested, the volume of data removed, or the sophistication of their filtering pipeline. I suspect a company could share far more about this process without divulging trade secrets.
Was it public knowledge that they did data filtering for GPT-4o? I’ve been studying this space and was not aware of this. It’s also interesting that they’re using the “same” filtering pipeline a year later.