I’m currently working as a contractor at Anthropic in order to get employee-level model access as part of a project I’m working on. The project is a model organism of scheming, where I demonstrate scheming arising somewhat naturally with Claude 3 Opus. So far, I’ve done almost all of this project at Redwood Research, but my access to Anthropic models will allow me to redo some of my experiments in better and simpler ways and will enable some exciting additional experiments. I’m very grateful to Anthropic and the Alignment Stress-Testing team for providing this access and supporting this work. I expect that this access and the collaboration with various members of the Alignment Stress-Testing team (primarily Carson Denison and Evan Hubinger so far) will be quite helpful in finishing this project.
I think that this sort of arrangement, in which an outside researcher is able to get employee-level access at some AI lab while not being an employee (while still being subject to confidentiality obligations), is potentially a very good model for safety research, for a few reasons, including (but not limited to):
For some safety research, it’s helpful to have model access in ways that labs don’t provide externally. Giving employee-level access to researchers working at external organizations can allow these researchers to avoid potential conflicts of interest and undue influence from the lab. This might be particularly important for researchers working on RSPs, safety cases, and similar, because these researchers might naturally evolve into third-party evaluators.
Related to undue influence concerns, an unfortunate downside of doing safety research at a lab is that you give the lab the opportunity to control the narrative around the research and use it for their own purposes. This concern seems substantially addressed by getting model access through a lab as an external researcher.
I think this could make it easier to avoid duplicating work between various labs. I’m aware of some duplication that could potentially be avoided by ensuring more work happened at external organizations.
For these and other reasons, I think that giving external researchers employee-level access is a promising approach for ensuring that safety research can proceed quickly and effectively while reducing conflicts of interest and unfortunate concentration of power. I’m excited for future experimentation with this structure and appreciate that Anthropic was willing to try this. I think it would be good if other labs beyond Anthropic experimented with this structure.
(Note that this message was run by the comms team at Anthropic.)
Yay Anthropic. This is the first example I’m aware of where a lab has shared model access with external safety researchers to boost their research (like, not just for evals). I wish the labs did this more.
[Edit: OpenAI shared GPT-4 access with safety researchers including Rachel Freedman before release. OpenAI shared GPT-4 fine-tuning access with academic researchers including Jacob Steinhardt and Daniel Kang in 2023. Yay OpenAI. GPT-4 fine-tuning access is still not public; some widely-respected safety researchers I know recently were wishing for it, and were wishing they could disable content filters.]
OpenAI did this too, with GPT-4 pre-release. It was a small program, though — I think just 5-10 researchers.
I’d be surprised if this was employee-level access. I’m aware of a red-teaming program that gave early API access to specific versions of models, but not anything like employee-level.
It also probably wasn’t employee-level access?
(But still a good step!)
Source?
It was a secretive program — it wasn’t advertised anywhere, and we had to sign an NDA about its existence (which we have since been released from). I got the impression that this was because OpenAI really wanted to keep the existence of GPT-4 under wraps. Anyway, that means I don’t have any proof beyond my word.
Thanks!
To be clear, my question was like “where can I learn more + what should I cite,” not “I don’t believe you.” I’ll cite your comment.
Yay OpenAI.
Would you be able to share the details of your confidentiality agreement?
(I’m a full-time employee at Anthropic.) It seems worth stating for the record that I’m not aware of any contract I’ve signed whose contents I’m not allowed to share. I also don’t believe I’ve signed any non-disparagement agreements. Before joining Anthropic, I confirmed that I wouldn’t be legally restricted from saying things like “I believe that Anthropic behaved recklessly by releasing [model]”.
I think I could share the literal language in the contractor agreement I signed related to confidentiality, though I don’t expect this to be especially interesting, as it is just a standard NDA from my understanding.
I do not have any non-disparagement, non-solicitation, or non-interference obligations.
I’m not currently going to share information about any other policies Anthropic might have related to confidentiality, though I am asking about what Anthropic’s policy is on sharing information related to this.
I would appreciate the literal language and any other info you end up being able to share.
Here is the full section on confidentiality from the contract:
(Anthropic comms was fine with me sharing this.)
Confidential Information.
(a) Protection of Information. Consultant understands that during the Relationship, the Company intends to provide Consultant with certain information, including Confidential Information (as defined below), without which Consultant would not be able to perform Consultant’s duties to the Company. At all times during the term of the Relationship and thereafter, Consultant shall hold in strictest confidence, and not use, except for the benefit of the Company to the extent necessary to perform the Services, and not disclose to any person, firm, corporation or other entity, without written authorization from the Company in each instance, any Confidential Information that Consultant obtains from the Company or otherwise obtains, accesses or creates in connection with, or as a result of, the Services during the term of the Relationship, whether or not during working hours, until such Confidential Information becomes publicly and widely known and made generally available through no wrongful act of Consultant or of others who were under confidentiality obligations as to the item or items involved. Consultant shall not make copies of such Confidential Information except as authorized by the Company or in the ordinary course of the provision of Services.
(b) Confidential Information. Consultant understands that “Confidential Information” means any and all information and physical manifestations thereof not generally known or available outside the Company and information and physical manifestations thereof entrusted to the Company in confidence by third parties, whether or not such information is patentable, copyrightable or otherwise legally protectable. Confidential Information includes, without limitation: (i) Company Inventions (as defined below); and (ii) technical data, trade secrets, know-how, research, product or service ideas or plans, software codes and designs, algorithms, developments, inventions, patent applications, laboratory notebooks, processes, formulas, techniques, biological materials, mask works, engineering designs and drawings, hardware configuration information, agreements with third parties, lists of, or information relating to, employees and consultants of the Company (including, but not limited to, the names, contact information, jobs, compensation, and expertise of such employees and consultants), lists of, or information relating to, suppliers and customers (including, but not limited to, customers of the Company on whom Consultant called or with whom Consultant became acquainted during the Relationship), price lists, pricing methodologies, cost data, market share data, marketing plans, licenses, contract information, business plans, financial forecasts, historical financial data, budgets or other business information disclosed to Consultant by the Company either directly or indirectly, whether in writing, electronically, orally, or by observation.
(c) Third Party Information. Consultant’s agreements in this Section 5 are intended to be for the benefit of the Company and any third party that has entrusted information or physical material to the Company in confidence. During the term of the Relationship and thereafter, Consultant will not improperly use or disclose to the Company any confidential, proprietary or secret information of Consultant’s former clients or any other person, and Consultant will not bring any such information onto the Company’s property or place of business.
(d) Other Rights. This Agreement is intended to supplement, and not to supersede, any rights the Company may have in law or equity with respect to the protection of trade secrets or confidential or proprietary information.
(e) U.S. Defend Trade Secrets Act. Notwithstanding the foregoing, the U.S. Defend Trade Secrets Act of 2016 (“DTSA”) provides that an individual shall not be held criminally or civilly liable under any federal or state trade secret law for the disclosure of a trade secret that is made (i) in confidence to a federal, state, or local government official, either directly or indirectly, or to an attorney; and (ii) solely for the purpose of reporting or investigating a suspected violation of law; or (iii) in a complaint or other document filed in a lawsuit or other proceeding, if such filing is made under seal. In addition, DTSA provides that an individual who files a lawsuit for retaliation by an employer for reporting a suspected violation of law may disclose the trade secret to the attorney of the individual and use the trade secret information in the court proceeding, if the individual (A) files any document containing the trade secret under seal; and (B) does not disclose the trade secret, except pursuant to court order.
This seems fantastic! Kudos to Anthropic.
This project has now been released; I think it went extremely well.
Do you feel like there are any benefits or drawbacks specifically tied to the fact that you’re doing this work as a contractor? (compared to a world where you were not a contractor but Anthropic just gave you model access to run these particular experiments and let Evan/Carson review your docs)
Being a contractor was the most convenient way to make the arrangement.
I would ideally prefer to not be paid by Anthropic[1], but this doesn’t seem that important (as long as the pay isn’t overly large). I asked to be paid as little as possible and I did end up being paid less than would otherwise be the case (and as a contractor I don’t receive equity). I wasn’t able to ensure that I only get paid a token wage (e.g. $1 in total or minimum wage or whatever).
I think the ideal thing would be a more specific legal contract between me and Anthropic (or Redwood and Anthropic), but (again) this doesn’t seem important.
[1] At least for the current primary purpose of this contracting. I do think that it could make sense to be paid for some types of consulting work. I’m not sure what all the concerns are here.
It seems a substantial drawback that it will be more costly for you to criticize Anthropic in the future.
Many of the people / orgs involved in evals research are also important figures in policy debates. With this incentive, Anthropic may gain more ability to control the narrative around AI risks.
As in, if at some point I am currently a contractor with model access (or otherwise have model access via some relationship like this), it will at that point be more costly to criticize Anthropic?
I’m not sure what the confusion is exactly.
If any of the following hold:
- you have a fixed-length contract and you hope to have another contract again in the future,
- you have an indefinite contract and you don’t want them to terminate your relationship, or
- you are some other evals researcher and you hope to gain model access at some point,
then you may refrain from criticizing Anthropic from now on.
Ok, so the concern is: “AI labs may provide model access (or other goods), so people who might want to obtain model access might be incentivized to criticize AI labs less.”
Is that accurate?
Notably, as described this is not specifically a downside of anything I’m arguing for in my comment or a downside of actually being a contractor. (Unless you think me being a contractor will make me more likely to want model access for whatever reason.)
I agree that this is a concern in general with researchers who could benefit from various things that AI labs might provide (such as model access). So, this is a downside of research agendas with a dependence on (e.g.) model access.
I think various approaches to mitigate this concern could be worthwhile. (Though I don’t think this is worth getting into in this comment.)
Yes that’s accurate.
In your comment you say:
“For some safety research, it’s helpful to have model access in ways that labs don’t provide externally. Giving employee-level access to researchers working at external organizations can allow these researchers to avoid potential conflicts of interest and undue influence from the lab. This might be particularly important for researchers working on RSPs, safety cases, and similar, because these researchers might naturally evolve into third-party evaluators.
Related to undue influence concerns, an unfortunate downside of doing safety research at a lab is that you give the lab the opportunity to control the narrative around the research and use it for their own purposes. This concern seems substantially addressed by getting model access through a lab as an external researcher.”
I’m essentially disagreeing with this point. I expect that most of the conflict of interest concerns remain when a big lab is giving access to a smaller org / individual.
From my perspective the main takeaway from your comment was “Anthropic gives internal model access to external safety researchers.” I agree that once you have already updated on this information, the additional information “I am currently receiving access to Anthropic’s internal models” does not change much. (Although I do expect that establishing the precedent / strengthening the relationships / enjoying the luxury of internal model access, will in fact make you more likely to want model access again in the future).
As in, there aren’t substantial reductions in COI from not being an employee and not having equity? I currently disagree.
Yeah that’s the crux I think. Or maybe we agree but are just using “substantial”/”most” differently.
It mostly comes down to intuitions, so I think there probably isn’t a way to resolve the disagreement.
So you asked Anthropic for uncensored model access so you could try to build scheming AIs, and they gave it to you?
To use a biology analogy, isn’t this basically gain-of-function research?
Please read the model organisms for misalignment proposal.