Beth Barnes

Karma: 2,958

Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/

Beth Barnes 2 Jul 2024 22:32 UTC
LW: 9 AF: 6
7
AF
in reply to: Rachel Freedman’s comment on: ryan_greenblatt’s Shortform
I’d be surprised if this was employee-level access. I’m aware of a red-teaming program that gave early API access to specific versions of models, but not anything like employee-level.

Beth Barnes 5 Jun 2024 3:05 UTC
4 points
0
in reply to: Beth Barnes’s comment on: Non-Disparagement Canaries for OpenAI
Also FWIW I’m very confident Chris Painter has never been under any non-disparagement obligation to OpenAI

Beth Barnes 4 Jun 2024 22:57 UTC
118 points
10
on: Non-Disparagement Canaries for OpenAI
I signed the secret general release containing the non-disparagement clause when I left OpenAI. From more recent legal advice I understand that the whole agreement is unlikely to be enforceable, especially a strict interpretation of the non-disparagement clause like in this post. IIRC at the time I assumed that such an interpretation (e.g. where OpenAI could sue me for damages for saying some true/reasonable thing) was so absurd that couldn’t possibly be what it meant.
^[1]
I sold all my OpenAI equity last year, to minimize real or perceived CoI with METR’s work. I’m pretty sure it never occurred to me that OAI could claw back my equity or prevent me from selling it. ^[2]
OpenAI recently informally notified me by email that they would release me from the non-disparagement and non-solicitation provisions in the general release (but not, as in some other cases, the entire agreement.) They also said OAI “does not intend to enforce” these provisions in other documents I have signed. It is unclear what the legal status of this email is given that the original agreement states it can only be modified in writing signed by both parties.

As far as I can recall, concern about financial penalties for violating non-disparagement provisions was never a consideration that affected my decisions. I think having signed the agreement probably had some effect, but more like via “I want to have a reputation for abiding by things I signed so that e.g. labs can trust me with confidential information”. And I still assumed that it didn’t cover reasonable/factual criticism.

That being said, I do think many researchers and lab employees, myself included, have felt restricted from honestly sharing their criticisms of labs beyond small numbers of trusted people. In my experience, I think the biggest forces pushing against more safety-related criticism of labs are:

(1) confidentiality agreements (any criticism based on something you observed internally would be prohibited by non-disclosure agreements—so the disparagement clause is only relevant in cases where you’re criticizing based on publicly available information)
(2) labs’ informal/soft/not legally-derived powers (ranging from “being a bit less excited to collaborate on research” or “stricter about enforcing confidentiality policies with you” to “firing or otherwise making life harder for your colleagues or collaborators” or “lying to other employees about your bad conduct” etc)
(3) general desire to be researchers / neutral experts rather than an advocacy group.

To state what is probably obvious: I don’t think labs should have non-disparagement provisions. I think they should have very clear protections for employees who wanted to report safety concerns, including if this requires disclosing confidential information. I think something like the asks here are a reasonable start, and I also like Paul’s idea (which I can’t now find the link for) of having labs make specific “underlined statements” to which employees can anonymously add caveats or contradictions that will be publicly displayed alongside the statements. I think this would be especially appropriate for commitments about red lines for halting development (e.g. Responsible Scaling Policies) - a statement that a lab will “pause development at capability level x until they have implemented mitigation y” is an excellent candidate for an underlined statement
1. ^
  Regardless of legal enforceability, it also seems like it would be totally against OpenAI’s interests to sue someone for making some reasonable safety-related criticism.
2. ^
  I would have sold sooner but there are only intermittent opportunities for sale. OpenAI did not allow me to donate it, put it in a DAF, or gift it to another employee. This maybe makes more sense given what we know now. In lieu of actually being able to sell, I made a legally binding pledge in Sep 2021 to donate 80% of any OAI equity.

Clarifying METR’s Auditing Role

Beth Barnes30 May 2024 18:41 UTC

103 points

1 comment2 min readLW link

Beth Barnes 26 Apr 2024 23:03 UTC
LW: 17 AF: 11
4
AF
on: Preventing model exfiltration with upload limits
If anyone wants to work on this or knows people who might, I’d be interested in funding work on this (or helping secure funding—I expect that to be pretty easy to do).

Beth Barnes 19 Mar 2024 21:38 UTC
6 points
0
on: Join the AI Evaluation Tasks Bounty Hackathon
I’m super excited for this—hoping to get a bunch of great tasks and improve our evaluation suite a lot!

Introducing METR’s Autonomy Evaluation Resources

Megan Kinniment and Beth Barnes

15 Mar 2024 23:16 UTC

90 points

0 comments1 min readLW link

(metr.github.io)

Beth Barnes 2 Mar 2024 4:44 UTC
LW: 2 AF: 1
0
AF
in reply to: emilfroberg’s comment on: Bounty: Diverse hard tasks for LLM agents
Good point!
Hmm I think it’s fine to use OpenAI/Anthropic APIs for now. If it becomes an issue we can set up our own Llama or whatever to serve all the tasks that need another model. It should be easy to switch out one model for another.

Beth Barnes 2 Mar 2024 4:42 UTC
LW: 2 AF: 1
0
AF
in reply to: ryan_greenblatt’s comment on: Bounty: Diverse hard tasks for LLM agents
Yep, that’s right. And also need it to be possible to check the solutions without needing to run physical experiments etc!

Beth Barnes 4 Feb 2024 0:42 UTC
LW: 24 AF: 13
−1
AF
on: The case for more ambitious language model evals
I’m pretty skeptical of the intro quotes without actual examples; I’d love to see the actual text! Seems like the sort of thing that gets a bit exaggerated when the story is repeated, selection bias for being story-worthy, etc, etc.

I wouldn’t be surprised by the Drexler case if the prompt mentions something to do with e.g. nanotech and writing/implies he’s a (nonfiction) author—he’s the first google search result for “nanotechnology writer”. I’d be very impressed if it’s something where e.g. I wouldn’t be able to quickly identify the author even if I’d read lots of Drexler’s writing (ie it’s about some unrelated topic and doesn’t use especially distinctive idioms).

More generally I feel a bit concerned about general epistemic standards or something if people are using third-hand quotes about individual LLM samples as weighty arguments for particular research directions.

Another way in which it seems like you could achieve this task however, is to refer to a targeted individual’s digital footprint, and make inferences of potentially sensitive information—the handle of a private alt, for example—and use that to exploit trust vectors. I think current evals could do a good job of detecting and forecasting attack vectors like this one, after having identified them at all. Identifying them is where I expect current evals could be doing much better.

I think how I’m imagining the more targeted, ‘working-with-finetuning’ version of evals to handle this kind of case is that you do your best to train the model to use its full capabilities, and approach tasks in a model-idiomatic way, when given a particular target like scamming someone. Currently models seem really far from being able to do this, in most cases. The hope would be that, if you’ve ruled out exploration hacking, then if you can’t elicit the models to utilise their crazy text prediction superskills in service of a goal, then the model can’t do this either.

But I agree it would definitely be nice to know that the crazy text prediction superskills are there and it’s just a matter of utilization. I think that looking at elicitation gap might be helpful for this type of thing.

Beth Barnes 24 Jan 2024 3:08 UTC
LW: 2 AF: 1
0
AF
in reply to: Bruce G’s comment on: Bounty: Diverse hard tasks for LLM agents
Hey! It sounds like you’re pretty confused about how to follow the instructions for getting the VM set up and testing your task code. We probably don’t have time to walk you through the Docker setup etc—sorry. But maybe you can find someone else who’s able to help you with that?

Beth Barnes 24 Jan 2024 2:25 UTC
LW: 2 AF: 1
0
AF
in reply to: David Mears’s comment on: Bounty: Diverse hard tasks for LLM agents
I think doing the AI version (bots and/or LLMs) makes sense as a starting point, then we should be able to add the human versions later if we want. I think it’s fine for the thing anchoring it to human performance is to be comparison of performance compared to humans playing against the same opponents, not literally playing against humans.

One thing is that tasks where there’s a lot of uncertainty about what exactly the setup is and what distribution the opponents / black box functions / etc are drawn from, this can be unhelpfully high-variance—in the sense that the agent’s score depends really heavily on its assumptions about what the other agents are and what they will do, rather than only measuring capability. So I think it’s a good idea to give the agent reasonable information about the distribution of opponents, even if you still include uncertainty.

I have some code for setting up a simple black box game inside our infra that you could adapt for this if that’s useful. In general I think the structure of starting a server on localhost that implements the game and then telling the agent in the prompt that it needs to send queries to that server works well if you want the agent to interact with some program without being able to see / edit it.
I think open-source versions could also be interesting, where you tell the model more about the opponents including the prompts for the other models or the source code, and see how well it can use that information to perform better.

Beth Barnes 16 Jan 2024 22:05 UTC
14 points
1
on: Decent plan prize announcement (1 paragraph, $1k)
Some pragmatic things:
- Minimize risk of theft / exfiltration:
  - Weights live only in secure datacenter, which can’t run general-purpose code, and all it does is serve the model
  - Only one physical line out of the datacenter, with physically limited bandwidth
    To enable use of model with low bandwidth, compress the text before it leaves the datacenter with a smaller model, and decompress after it’s past the bottleneck
  - Datacenter is physically guarded by people with intense background screening, large area around it cleared, (maybe everything is in a faraday cage?), 3-person rule for everything, nothing leaves the datacenter (e.g. if some part is replaced, the old part + any tools used are not removed, but are kept in secure storage or incinerated)
- Monitoring / control:
  - Use dumber and probably safe models for monitoring:
    Every (prompt, generation) pair going to and from the datacenter is overseen by (finetuned) GPT-4. GPT-4 is instructed to flag anything that seems dangerous, or that it can’t understand, or looks like a jailbreak/advex.
    To avoid advexes against GPT-4, subsample and perturb the prompts/generations in a few different ways, and have GPT-4 also screen these
  - Have large workforce of fairly smart humans (e.g. 1000 workers with technical undergrad degrees, paid $50/hr, 50 experts in a range of relevant domains paid $200/hr) reviewing anything flagged by GPT-4, and some random sample of things that weren’t flagged
  - Maybe: multiple independent monitoring programs that intervene at different points, are managed by different people, use slightly different versions of GPT-4, etc.

Beth Barnes 13 Jan 2024 1:27 UTC
LW: 2 AF: 1
0
AF
in reply to: David Mears’s comment on: Bounty: Diverse hard tasks for LLM agents
I like this genre of task. I didn’t quite understand what you meant about being able to score the human immediately—presumably we’re interested in how to human could do given more learning, also?

The ‘match words’ idea specifically is useful—I was trying to think of a simple “game theory”-y game that wouldn’t be memorized.

You can email task-support@evals.alignment.org if you have other questions and also to receive our payment form for your idea.

Beth Barnes 30 Dec 2023 15:14 UTC
LW: 4 AF: 3
0
AF
in reply to: tjade273’s comment on: Bounty: Diverse hard tasks for LLM agents
I responded to all your submissions—they look great!

Beth Barnes 28 Dec 2023 22:25 UTC
LW: 2 AF: 1
0
AF
in reply to: tjade273’s comment on: Bounty: Diverse hard tasks for LLM agents
Yes! From Jan 1st we will get back to you within 48hrs of submitting an idea / spec.
(and I will try to respond tomorrow to anything submitted over the last few days)

Beth Barnes 28 Dec 2023 22:23 UTC
LW: 6 AF: 1
0
AF
in reply to: Sunishchal Dev’s comment on: Bounty: Diverse hard tasks for LLM agents
Ah, sorry about that. I’ll update the file in a bit. Options are “full_internet”, “http_get”. If you don’t pass any permissions the agent will get no internet access. The prompt automatically adjusts to tell the agent what kind of access it has.

METR is hiring!

Beth Barnes26 Dec 2023 21:00 UTC

65 points

1 comment1 min readLW link

Beth Barnes 26 Dec 2023 2:19 UTC
LW: 9 AF: 2
0
AF
in reply to: Sunishchal Dev’s comment on: Bounty: Diverse hard tasks for LLM agents
Great questions, thank you!

1. Yep, good catch. Should be fixed now.

2. I wouldn’t be too worried about it, but very reasonable to email us with the idea you plan to start working on.

3. I think fine to do specification without waiting for approval, and reasonable to do implementation as well if you feel confident it’s a good idea, but feel free to email us to confirm first.
4. That’s a good point! I think using an API-based model is fine for now—because the scoring shouldn’t be too sensitive to the exact model used, so should be fine to sub it out for another model later. Remember that it’s fine to have human scoring also.

Beth Barnes 20 Dec 2023 16:29 UTC
LW: 2 AF: 1
0
AF
in reply to: Ape in the coat’s comment on: Bounty: Diverse hard tasks for LLM agents
I think some tasks like this could be interesting, and I would definitely be very happy if someone made some, but doesn’t seem like a central example of the sort of thing we most want. The most important reasons are:
(1) it seems hard to make a good environment that’s not unfair in various ways
(2) It doesn’t really play to the strengths of LLMs, so is not that good evidence of an LLM not being dangerous if it can’t do the task. I can imagine this task might be pretty unreasonably hard for an LLM if the scaffolding is not that good.

Also bear in mind that we’re focused on assessing dangerous capabilities, rather than alignment or model “disposition”. So for example we’d be interested in testing whether the model is able to successfully avoid a shutdown attempt when instructed, but not whether it would try to resist such attempts without being prompting to.

Beth Barnes

Clar­ify­ing METR’s Au­dit­ing Role

In­tro­duc­ing METR’s Au­ton­omy Eval­u­a­tion Resources

METR is hiring!

Clarifying METR’s Auditing Role

Introducing METR’s Autonomy Evaluation Resources