My default reaction to this post is pretty strongly negative, because I feel that it doesn’t adequately engage with the opposing point-of-view, and why people who work on evaluations may think that their work is impactful and good. You are not required to do that before posting your opinion, but I think if you had tried slightly more to pass the ITT of people who work on evaluations, this post would read quite differently and perhaps more convincingly.
In any case, to object to some specific points:
Regulations do exist. I don’t understand the objection to evaluations being that they rely on regulations that don’t exist. It may be that you think the regulations are inadequate, but it is simply not the case that “regulations don’t exist.” Work by evaluators was, in fact, often quite instrumental in getting these regulations passed and taken seriously and implemented, as far as I’m aware.
Some examples: the EU AI Act, New York’s RAISE Act, Texas’ Responsible AI Governance Act, and California’s SB 53. South Korea has passed an AI Framework Act. The situation in China is harder to know about in detail. Even in the absence of codified “AI regulation” – for example, in the UK – existing regulators can often apply pre-existing laws to AI. Even if these don’t cover the harms we are most concerned about, they are helpful on the margin, and further progress is being made here (led by UK AISI!).
I’m more sympathetic to your second objection, but I think this mostly depends on the jurisdiction. In the common-law frameworks with which I am most familiar, the burden of proof simply is on the public to make the case to legislators to pass AI regulation. Otherwise it is unregulated, and thus there are no legal requirements specific to AI. So yes, if you want to get AI regulated, you need to convince decision-makers that AI has to be regulated. Naturally, I agree with you that we ought to be significantly more conservative in the development of AI than we currently are, but nevertheless if you want to get regulations, you need to pass them (often in the face of overwhelming lobbying efforts against such regulations by AI and tech firms).
It is hard to draw an analogy here – perhaps a good one would be drug regulation in the US. Before the FDA existed and put the burden of proof on those bringing drugs to market, there were no regulations on selling drugs; people could sell whatever they wanted to. Over time, in the early-to-mid 20th century, more regulations were brought in. Eventually, in 1962, the government gave the FDA much more power and forced drug manufacturers to prove both efficacy and safety before bringing a drug to market. However, the only reason it was able to establish an organization with such teeth and put the onus on drug manufacturers was public outcry over a specific disastrous case, and the fact that the groundwork had already been laid by prior regulations empowering the USDA. We’re trying to speedrun this process with AI, and evaluations are going to form a part of any sufficient regulatory framework. Perhaps you would say that there is no way for us to prove AI systems are safe and we need to pause, and I would be deeply sympathetic. Unfortunately, I’m pessimistic about the prospects of such a suggestion, and so I would rather lay the groundwork for an effective and powerful regulatory regime.
There are two more specific points that I would take exception to:
1) That Goodfire is well-described as “a startup leveraging interpretability for capabilities.” I think this is a part of what Goodfire does – and I wish they weren’t doing quite as much of that as they are – though this is certainly not the full story, and I think this is an unfair descriptor as currently written (i.e. it would lead someone unfamiliar with Goodfire to have false beliefs about them).
2) The criticism of the revolving door culture. This is often the case in regulatory setups (e.g. the SEC, the Treasury, and Wall Street). This makes sense! You want good economists and people who actually understand financial systems to be the ones regulating them, and this is inevitably going to lead to something of a revolving-door culture. It’s not optimal, but this is the world we live in. Expertise is unfortunately rare. I think it is better to criticize specific decisions made by evaluations companies, or decisions they haven’t taken, rather than pointing at associations between people and inferring harmful behavior without evidence (I think your example of people at evaluations orgs being concerned about losing API access if they speak out is a much better criticism here). In general I think this sort of guilt-by-association is not very fruitful.
Thanks for engaging. I have karma-upvoted but disagree-voted.
You wrote a lot, so I’ll focus on your two bounded points.
About Goodfire
I think you are wrong. If I had to guess, you are wrong in the same way that people are misguided about Anthropic.
For comparison, this is the copy at the top of Anthropic’s Research page:
Our research teams investigate the safety, inner workings, and societal impacts of AI models – so that artificial intelligence has a positive impact as it becomes increasingly capable.
This is the copy at the top of Goodfire’s About page:
Goodfire is a research company using interpretability to understand, learn from, and design AI systems. Our mission is to build the next generation of safe and powerful AI—not by scaling alone, but by understanding the intelligence we’re building.
And at the top of their Research page:
Fundamental interpretability research to understand and intentionally design advanced AI systems
They specifically state their wish to design advanced, next-generation AI systems. To the extent that I may have mis-characterised them, it is that people may not realise they are doing more than building capabilities on top of existing systems; what I have described could lead people to think they are just doing RAG or something.
About Revolving Doors
I think it is better to criticize specific decisions made by evaluations companies, or decisions they haven’t taken, rather than pointing at associations between people and inferring harmful behavior without evidence (I think your example of people at evaluations orgs being concerned about losing API access if they speak out is a much better criticism here). In general I think this sort of guilt-by-association is not very fruitful.
I think you are fractally wrong in this paragraph. I’ll try to show it to you.
1) “It is better to [A] than [B]” is a false dilemma. In the article, as you point out, I just do both.
2) “Pointing at associations between people and inferring harmful behavior without evidence” does not describe what I am doing, again as you point out.
3) It is crucial to point at associations between people and to infer harmful behaviour from them, even without confirmation! One should not reject evidence just because it is not “very fruitful” when used by itself. It is obviously relevant that the head of safety of US-AISI, the CEO of Open Phil, and the CEO of Anthropic were all roommates, and it is trivial to infer things from that.
4) I literally wrote “This is not only about having a couple of senior staff from the industry. That in itself can be good! It’s the whole picture that looks bad.”, and then proceeded to list the whole picture.
About Goodfire
So probably this cashes out as more fundamental generator disagreements that aren’t worth hashing out here. Broadly I think it’s okay for a company to say “We are designing and advancing the next generation of AI systems”, and I think to analyze whether they should be bucketed with capabilities labs like OpenAI or Anthropic (which I think are also meaningfully different places), one should look at and critically assess their research output.
Like, if someone believes that interpretability will be both helpful to build better systems and helpful to build safer systems, I think it’s justifiable for them to do the thing that builds better systems in the hopes that those systems are also safer than the next-best thing that would’ve been built (and that’s probably reliant on a bunch of other beliefs where we may differ, as I said before).
About revolving doors
Sorry, I may have been unclear. What I meant is that [A] is good and [B] is bad. I am criticizing [B] here, which is playing the associations game. I think that is generally bad and you should not do it.
I think it describes some of what you are doing, though not all of it, yes.
I am not saying one should reject it as evidence. It’s fine to say it publicly. What I disagree with is the inference that is implied. I do not think it is trivial to infer things from that. I have had roommates with whom I disagree and I would say so publicly. It is, in my opinion, better to apply criticism to actual actions you disagree with, or strategic choices, or whatever, instead of who people were or were not roommates with. Otherwise you get into games like what happened with your point about Goodfire/Apollo, where we have: Apollo is suspect because it was co-founded by two people who then went to Goodfire, and anyone who works at Goodfire is suspect because in some way they think what they’re working on will be useful for AI development (even though I’m pretty sure those people continue to see their work as being heavily motivated by safety, and differentially useful for safety). I think this chain of inferences is bad, and so we should cut it at the root. Most professional criticism should be about what people have actually done, not about who they’ve been associated with.
It looks bad in the same way as other revolving door cultures can look bad to outsiders, but the question should be: is it bad? Is it an actual structural problem? Or is it a PR problem? Or just a feature of the space? And I think for answering these questions, it is better to look at behaviour (which, again, you did do, as I mentioned originally; I just think the association-game stuff isn’t great).
Genuine thanks for your response. I have again karma-upvoted and disagree-voted.
About Goodfire
Broadly, I don’t care much about the moral intents of OpenAI, Anthropic and Goodfire. I think everyone is the hero of their own story.
To the extent that people state their intents, I basically remove all the moral colours and emotional valence. For instance, “we build capabilities for The Good and for Good Reasons” becomes “we build capabilities”.
Another example may be “we care a lot about rationality”. Rationality is the practice of “reason”, which is “thinking correctly”. “Care” is also “we have good feelings for”. So, through this process, it becomes “we have a lot of feelings around the practice of thinking”.
A last example could be “effective altruism”. Here, it’s hard to see what’s left after you go through this process. Something like “we do things connected to others (altruism) that perform highly on measures of our own choice (effective)”.
I strongly recommend adopting this frame of analysis to any of my readers, not only “GenericModel”. If you are this deep in this thread, you most likely need it.
About revolving doors
On 3, I think you are forgetting the basics of LessWrong.
I am not saying one should reject it as evidence. It’s fine to say it publicly. What I disagree with is the inference that is implied. I do not think it is trivial to infer things from that. I have had roommates with whom I disagree and I would say so publicly.
“If [X], then necessarily [Y] is true” is logical entailment. Evidence is “If [X], then [Y] is more likely.” A single counter-example does not disprove it.
If you mean that housemates (by choice, not necessity!) who have worked together are not more likely to be synchronised in their views, then I think you are clearly wrong.
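To make the distinction concrete, here is a minimal Bayes-rule sketch, with numbers that are purely illustrative assumptions on my part (not estimates of anything real):

```python
# Minimal sketch of evidence-as-likelihood-ratio; all numbers are made up
# for illustration, not measurements of anything.
prior = 0.30              # prior P(two people's views are synchronised)
p_e_given_h = 0.10        # assumed P(chose to be roommates | synchronised)
p_e_given_not_h = 0.02    # assumed P(chose to be roommates | not synchronised)

# Bayes' rule: posterior after observing the "roommates" evidence.
posterior = (p_e_given_h * prior) / (
    p_e_given_h * prior + p_e_given_not_h * (1 - prior)
)
print(f"P(synchronised | roommates) = {posterior:.2f}")  # ~0.68
```

Under these made-up numbers, observing that two people chose to live together moves you from 30% to roughly 68%. A single counter-example (roommates who disagree) is entirely consistent with that picture; it only rules out the entailment reading, not the probabilistic one.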
Just so you know, Holden moved from being CEO of Open Philanthropy to Anthropic a year ago. This type of revolving door was entirely predictable from this type of evidence.
I have been polite in the article. Had I included publicly known marriage ties, things would already look worse. Had a journalist included private affairs, the picture would look much worse. This would be entirely valid, probabilistically.
On 4, I think you are forgetting that everything is adversarial.
it is better to look at behaviour
Yes, but people lie, they deceive, they “strategically withhold” information, and so on. Even when they don’t, it also just takes time to document everything, and vet it for publication.
This lack of transparency is why we use other types of evidence. They tend to be less accurate, but also more plentiful, less filtered, and harder to fake.
It looks bad in the same way as other revolving door cultures can look bad to outsiders, but the question should be: is it bad? Is it an actual structural problem?
Yes, and I have listed the adversarial considerations in the article. It is the part that literally starts with “The considerations around independence are structural. Namely, [...]”