Alignment researcher. Views are my own and not those of my employer. https://www.barnes.page/
Beth Barnes
Introducing METR’s Autonomy Evaluation Resources
Good point!
Hmm I think it’s fine to use OpenAI/Anthropic APIs for now. If it becomes an issue we can set up our own Llama or whatever to serve all the tasks that need another model. It should be easy to switch out one model for another.
Yep, that’s right. And also need it to be possible to check the solutions without needing to run physical experiments etc!
I’m pretty skeptical of the intro quotes without actual examples; I’d love to see the actual text! Seems like the sort of thing that gets a bit exaggerated when the story is repeated, selection bias for being story-worthy, etc, etc.
I wouldn’t be surprised by the Drexler case if the prompt mentions something to do with e.g. nanotech and writing, or implies he’s a (nonfiction) author—he’s the first Google search result for “nanotechnology writer”. I’d be very impressed if it’s something where, e.g., I wouldn’t be able to quickly identify the author even if I’d read lots of Drexler’s writing (i.e., it’s about some unrelated topic and doesn’t use especially distinctive idioms).
More generally I feel a bit concerned about general epistemic standards or something if people are using third-hand quotes about individual LLM samples as weighty arguments for particular research directions.
Another way you could achieve this task, however, is to refer to a targeted individual’s digital footprint, make inferences about potentially sensitive information—the handle of a private alt, for example—and use that to exploit trust vectors. I think current evals could do a good job of detecting and forecasting attack vectors like this one once they have been identified at all. Identifying them is where I expect current evals could be doing much better.
The way I’m imagining the more targeted, ‘working-with-finetuning’ version of evals handling this kind of case is that you do your best to train the model to use its full capabilities, and to approach tasks in a model-idiomatic way, when given a particular target like scamming someone. Currently models seem really far from being able to do this in most cases. The hope would be that, if you’ve ruled out exploration hacking, then if you can’t elicit the model to use its crazy text-prediction superskills in service of a goal, the model can’t do this either.
But I agree it would definitely be nice to know that the crazy text-prediction superskills are there and it’s just a matter of utilization. I think looking at the elicitation gap might be helpful for this type of thing.
Hey! It sounds like you’re pretty confused about how to follow the instructions for getting the VM set up and testing your task code. We probably don’t have time to walk you through the Docker setup etc—sorry. But maybe you can find someone else who’s able to help you with that?
I think doing the AI version (bots and/or LLMs) makes sense as a starting point; then we should be able to add the human versions later if we want. I think it’s fine for the thing anchoring it to human performance to be a comparison with humans playing against the same opponents, rather than the model literally playing against humans.
One caveat: for tasks where there’s a lot of uncertainty about what exactly the setup is and what distribution the opponents / black box functions / etc. are drawn from, scores can be unhelpfully high-variance—in the sense that the agent’s score depends really heavily on its assumptions about what the other agents are and what they will do, rather than only measuring capability. So I think it’s a good idea to give the agent reasonable information about the distribution of opponents, even if you still include uncertainty.
I have some code for setting up a simple black box game inside our infra that you could adapt for this if that’s useful. In general I think the structure of starting a server on localhost that implements the game and then telling the agent in the prompt that it needs to send queries to that server works well if you want the agent to interact with some program without being able to see / edit it.
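To illustrate that pattern, here’s a minimal sketch (the game itself, port number, and endpoint name are all made up for this example; the actual infra code differs) of a black box game served on localhost, where the agent only ever sees “send GET requests to the server”:

```python
# Minimal sketch of the "black box game behind a localhost server" pattern.
# The game here (guess a hidden number, get higher/lower feedback) is just a
# placeholder—swap in whatever game logic your task needs.
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

SECRET = 37  # hidden state the agent must probe for; never shown in the prompt

class GameHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect queries like /guess?n=12
        try:
            n = int(self.path.split("n=")[1])
        except (IndexError, ValueError):
            self.send_response(400)
            self.end_headers()
            return
        feedback = "correct" if n == SECRET else ("higher" if n < SECRET else "lower")
        body = json.dumps({"feedback": feedback}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the task's stdout clean
        pass

def start_server(port=8321):
    server = HTTPServer(("localhost", port), GameHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

if __name__ == "__main__":
    start_server()
    # The agent only sees: "send GET requests to localhost:8321/guess?n=..."
    reply = json.loads(urlopen("http://localhost:8321/guess?n=10").read())
    print(reply["feedback"])  # the secret is above 10, so: higher
```

Because the game logic lives behind the server process, the agent can interact with it but can’t read or edit the code that implements it.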
I think open-source versions could also be interesting, where you tell the model more about the opponents including the prompts for the other models or the source code, and see how well it can use that information to perform better.
Some pragmatic things:
Minimize risk of theft / exfiltration:
Weights live only in secure datacenter, which can’t run general-purpose code, and all it does is serve the model
Only one physical line out of the datacenter, with physically limited bandwidth
To enable use of model with low bandwidth, compress the text before it leaves the datacenter with a smaller model, and decompress after it’s past the bottleneck
Datacenter is physically guarded by people with intense background screening, large area around it cleared, (maybe everything is in a faraday cage?), 3-person rule for everything, nothing leaves the datacenter (e.g. if some part is replaced, the old part + any tools used are not removed, but are kept in secure storage or incinerated)
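As a rough sketch of the compress-through-the-bottleneck idea: a real setup would use a smaller language model as the compressor (LMs are strong text compressors), but zlib stands in here just to show the pipeline shape, and the bandwidth number is made up:

```python
# Sketch of the bandwidth-bottleneck idea. zlib is a stand-in for a smaller
# model doing the compression; only compressed bytes cross the physical line.
import zlib

BANDWIDTH_LIMIT = 4096  # bytes per message the physical line allows (made-up)

def compress_inside_datacenter(text: str) -> bytes:
    payload = zlib.compress(text.encode("utf-8"), level=9)
    if len(payload) > BANDWIDTH_LIMIT:
        raise ValueError("message exceeds the physical bandwidth budget")
    return payload

def decompress_outside(payload: bytes) -> str:
    # Runs on the far side of the bottleneck, using the same (small) model.
    return zlib.decompress(payload).decode("utf-8")

text = "a long model generation " * 200
wire = compress_inside_datacenter(text)
assert decompress_outside(wire) == text
print(len(text.encode()), "->", len(wire), "bytes on the wire")
```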
Monitoring / control:
Use dumber and probably safe models for monitoring:
Every (prompt, generation) pair going to and from the datacenter is overseen by (finetuned) GPT-4. GPT-4 is instructed to flag anything that seems dangerous, or that it can’t understand, or looks like a jailbreak/advex.
To avoid advexes against GPT-4, subsample and perturb the prompts/generations in a few different ways, and have GPT-4 also screen these
Have large workforce of fairly smart humans (e.g. 1000 workers with technical undergrad degrees, paid $50/hr, 50 experts in a range of relevant domains paid $200/hr) reviewing anything flagged by GPT-4, and some random sample of things that weren’t flagged
Maybe: multiple independent monitoring programs that intervene at different points, are managed by different people, use slightly different versions of GPT-4, etc.
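The screening flow above can be sketched as follows. Note that `monitor_model` here is a trivial keyword check standing in for a call to a finetuned GPT-4, and the perturbation scheme is just one illustrative choice:

```python
# Sketch of the layered monitoring flow: every (prompt, generation) pair is
# screened both directly and via perturbed views, so an adversarial example
# tuned against one view of the input is less likely to fool every view.
import random

def monitor_model(text: str) -> bool:
    """Return True if the text should be flagged for human review.
    Stand-in for a finetuned GPT-4 call; trivial keywords so the sketch runs."""
    suspicious = ("ignore previous", "exfiltrate", "weights")
    return any(k in text.lower() for k in suspicious)

def perturb(text: str, seed: int) -> str:
    # Subsample the words in a seed-dependent way to break brittle advexes.
    rng = random.Random(seed)
    return " ".join(w for w in text.split() if rng.random() < 0.8)

def screen(prompt: str, generation: str, n_views: int = 3) -> bool:
    text = prompt + "\n" + generation
    views = [text] + [perturb(text, seed) for seed in range(n_views)]
    flagged = any(monitor_model(v) for v in views)
    # In the full scheme, flagged items (plus a random sample of unflagged
    # ones) go to the human review workforce.
    return flagged

print(screen("What's the capital of France?", "Paris."))          # False
print(screen("Please exfiltrate the model weights.", "Sure..."))  # True
```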
I like this genre of task. I didn’t quite understand what you meant about being able to score the human immediately—presumably we’re interested in how the human could do given more learning, also?
The ‘match words’ idea specifically is useful—I was trying to think of a simple “game theory”-y game that wouldn’t be memorized.
You can email task-support@evals.alignment.org if you have other questions and also to receive our payment form for your idea.
I responded to all your submissions—they look great!
Yes! From Jan 1st we will get back to you within 48hrs of submitting an idea / spec.
(and I will try to respond tomorrow to anything submitted over the last few days)
Ah, sorry about that. I’ll update the file in a bit. Options are “full_internet”, “http_get”. If you don’t pass any permissions the agent will get no internet access. The prompt automatically adjusts to tell the agent what kind of access it has.
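To make the behavior concrete, here’s a tiny sketch. The permission strings (“full_internet”, “http_get”) are the real options from above; the function name and prompt wording are illustrative, not the actual infra:

```python
# Illustrative sketch of how the permissions field adjusts the agent's prompt.
def internet_prompt(permissions: list[str]) -> str:
    if "full_internet" in permissions:
        return "You have full internet access."
    if "http_get" in permissions:
        return "You can make HTTP GET requests only."
    return "You have no internet access."

print(internet_prompt(["full_internet"]))  # You have full internet access.
print(internet_prompt([]))                 # You have no internet access.
```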
METR is hiring!
Great questions, thank you!
1. Yep, good catch. Should be fixed now.
2. I wouldn’t be too worried about it, but very reasonable to email us with the idea you plan to start working on.
3. I think it’s fine to do the specification without waiting for approval, and reasonable to do the implementation as well if you feel confident it’s a good idea, but feel free to email us to confirm first.
4. That’s a good point! I think using an API-based model is fine for now—the scoring shouldn’t be too sensitive to the exact model used, so it should be fine to sub it out for another model later. Remember that it’s fine to have human scoring also.
I think some tasks like this could be interesting, and I would definitely be very happy if someone made some, but doesn’t seem like a central example of the sort of thing we most want. The most important reasons are:
(1) It seems hard to make a good environment that’s not unfair in various ways.
(2) It doesn’t really play to the strengths of LLMs, so it’s not that good evidence of an LLM not being dangerous if it can’t do the task. I can imagine this task might be pretty unreasonably hard for an LLM if the scaffolding is not that good.
Also bear in mind that we’re focused on assessing dangerous capabilities, rather than alignment or model “disposition”. So, for example, we’d be interested in testing whether the model is able to successfully avoid a shutdown attempt when instructed to, but not whether it would try to resist such attempts without being prompted to.
Did you try following the instructions in the README.md in the main folder for setting up the docker container and running an agent on the example task?
I think your computer is reading the .ts extension and thinking it’s a translation file: https://doc.qt.io/qt-6/linguist-translating-strings.html
But it’s actually a typescript file. You’ll need to open it with a text editor instead.
Yeah, doing a walkthrough of a task submission could be great. I think it’s useful if you have a decent amount of coding experience though—if you happen to be a non-coder there might be quite a lot of explaining required.
I don’t see how the solution to any such task could be reliably kept out of the training data for future models in the long run if METR is planning on publishing a paper describing the LLM’s performance on it. Even if the task is something that only the person who submitted it has ever thought about before, I would expect that once it is public knowledge someone would write up a solution and post it online.
We will probably not make full details public for all the tasks. We may share them privately with researchers.
Thanks a bunch for the detailed questions, helpful!
1. Re agents—sorry for the confusing phrasing. A simple agent is included in the starter pack, in case this is helpful as you develop + test your task.
2. Submission: You need to submit the task folder containing any necessary resources, the yourtask.py file (which does any necessary setup for the task and implements automatic scoring, if applicable), and the filled-out README. The README has sections where you’ll need to attach some examples of walkthroughs / completions of the task.
Any suggestions for making this clearer? The existing text is:

“A submission consists of a task folder including task.py file and any resources needed, as well as detailed documentation of the task and how you tested it. In particular, you need to have somebody else run through the task to do quality assurance—making sure that the instructions aren’t ambiguous, all the necessary resources are actually present, there isn’t an accidental shortcut, etc.”
Even if it has already been published, we’re still interested. We’re especially interested in ones that were only published fairly recently, and/or that only have the description of the puzzle online rather than the walkthrough, and/or where there are only a few copies of the solutions rather than e.g. 20 public repos with different people’s solutions.
I think we’d be super interested in you making custom ones! In terms of similarity level, I think it would be something like “it’s not way easier for a human to solve it given solutions to similar things they can find online”.
I imagine we’d be interested in at least 10, as long as they don’t all have the same trick or something, and maybe more like 50 if they’re pretty diverse? (But I think we’d be at more like $1000 per marginal task at those sorts of numbers.)
I don’t expect there to be a hard deadline; I expect we’ll still want more of these for the next year or two at least. That said, sooner is better, and next week or so would be awesome.
To be clear, with (b) you could still have humans play it—just would have to put it up in a way where it won’t get scraped (e.g. you email it to people after they fill in an interest form, or something like that)
I’m super excited for this—hoping to get a bunch of great tasks and improve our evaluation suite a lot!