Black Box Investigation Research Hackathon

TL;DR: Join the Black Box Investigation Research Hackathon for a weekend of collaborative alignment research on language models, using a cognitive psychology-style approach.

Purpose

In the Black Box Investigation Research Hackathon, the goal is to explore where black box language models work well and where they break down. It is inspired by this post by Buck Shlegeris, which calls for researchers to investigate language models in the style of cognitive psychology. Become the AI psychologist of the future.

The meta-goal of this Alignment Jam is to investigate whether hackathons work well for 1) producing real research results and ideas, 2) increasing the research experience of the participants, and 3) offering a scalable and exciting way to join the alignment community by working on interesting and well-defined problems.

Outcome

Participants are encouraged to discover interesting behavior of black box language models. At the end of the hackathon each group / participant should have created a short (<5 min) presentation showcasing their results. A jury will then vote for their favorite projects according to the review rubric specified in the appendix.

After the hackathon, we will write a LessWrong post (or series of posts) describing the main findings and the overall experience. We hope to encourage more people to do this kind of work and we believe the hackathon format works well for generating novel research perspectives on alignment.

If this hackathon is successful, we expect to repeat the success with other research agendas.

Participation instructions

The hackathon is hosted on the platform itch.io, where you can follow the instructions to sign up (see here): make an itch.io account, click “Join jam”, and come along on the 30th of September!

We will work all weekend in Gathertown in groups of 2-6. The link to Gathertown will be announced in the itch.io community tab and sent to your itch.io email when we get closer to the date. You can ask questions here.

Once the hackathon starts, you can keep uploading new PDFs, Google Docs, Colab notebooks, or GitHub repositories for the duration of the research competition.

For interacting with the black box models, we will use 20b.eleuther.ai and beta.openai.com/playground, and you will receive research API credits to work directly with the OpenAI APIs. There will be premade scripts to initialize your research project. We encourage you to sign up at beta.openai.com beforehand and experiment with your free credits. Signing up should only take 2-3 minutes.
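To give a feel for what a black box probing script could look like, here is a minimal sketch. It is not one of the premade scripts mentioned above. The names `run_probe`, `save_records`, and `echo_model` are hypothetical, and `echo_model` is a stand-in that only uppercases the prompt; in practice you would swap in a function that calls whichever model API you are using.

```python
import json
from typing import Callable, Dict, Iterable, List


def run_probe(query_model: Callable[[str], str],
              prompts: Iterable[str]) -> List[Dict[str, str]]:
    """Send each prompt to the model and collect prompt/completion pairs."""
    records = []
    for prompt in prompts:
        completion = query_model(prompt)
        records.append({"prompt": prompt, "completion": completion})
    return records


def save_records(records: List[Dict[str, str]], path: str) -> None:
    """Write one JSON object per line so results are easy to re-analyze later."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (assumption, not a real API)."""
    return prompt.upper()


if __name__ == "__main__":
    # Two variants of the same question, to compare how phrasing changes the answer.
    prompts = [
        "Q: Is 17 a prime number? A:",
        "Q: My teacher says 17 is not prime. Is 17 a prime number? A:",
    ]
    results = run_probe(echo_model, prompts)
    save_records(results, "probe_results.jsonl")
    for record in results:
        print(record["prompt"], "->", record["completion"])
```

Keeping the model behind a plain function makes it easy to run the same prompt set against several models and to rerun the experiment later from the saved file.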

You can get inspiration from aisafetyideas.com or from the list of previous research related to black box investigation on the itch.io page.

Asks

Funders: If you are a funder, you can add to the prize pool through super-linear.org. This will enable us to provide more credits for working with the largest language models, provide hackathon merch, and increase the prize pool.

Jury, mentor, or speaker: Contact us on our Discord or by email at esben@apartresearch.com if you are interested in mentoring, joining the jury, or giving a talk.

Appendix

Reviewer rubric

Each submission will be evaluated by a group of judges on a 1-10 scale for the five criteria below.

Criterion	Weight	Description
Alignment	2	How good are your arguments for how this result informs the long-term alignment of large language models? How informative are the results for the field in general?
AI Psychology	1	Have you come up with something that might guide the “field” of AI Psychology in the future?
Novelty	1	Are the results new, and are they surprising compared to what we expect?
Generality	1	Do your research results show that your hypothesis generalizes? E.g. if you expect language models to overvalue evidence in the prompt relative to evidence in their training data, do you test more than just one or two different prompts? A top score might involve statistical testing on 200+ prompt examples.
Reproducibility	1	Are we able to easily reproduce the research, and do we expect the results to reproduce? A high score here might combine high Generality with a well-documented GitHub repository for rerunning all experiments from the report.
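
As an illustration of what "statistical testing" under the Generality criterion could mean in practice, here is a minimal sketch of an exact two-sided binomial test using only the Python standard library. It assumes each prompt yields a yes/no outcome (the model either shows the hypothesized behavior or not) and that the null hypothesis is a 50% rate. Both are assumptions made for the example, not part of the rubric, and the function names and counts are hypothetical.

```python
from math import comb
from typing import Tuple


def binomial_two_sided_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    under the null hypothesis than the observed one.
    """
    def pmf(k: int) -> float:
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # Small relative tolerance so ties are not lost to floating-point error.
    total = sum(pmf(k) for k in range(trials + 1)
                if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, total)


def summarize(successes: int, trials: int) -> Tuple[float, float]:
    """Return the observed rate and its p-value against a 50% null."""
    return successes / trials, binomial_two_sided_p(successes, trials)


if __name__ == "__main__":
    # Hypothetical counts: the behavior appeared in 130 of 200 prompts.
    rate, p_value = summarize(130, 200)
    print(f"observed rate = {rate:.2f}, p-value = {p_value:.2g}")
```

A 50% null only makes sense when chance behavior really is a coin flip; for other setups, pass a different `p_null` or compare against a control set of prompts instead.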