I think there’s a missing link on https://alignmentproject.aisi.gov.uk/how-to-apply :
“The GFA for DSIT/AISI funded projects is standard and not subject to negotiation. The GFA can be accessed here: link to GFA attachment.”
Agree that it would be better not to have them up as readily downloadable plaintext, and it might even be worth going a step further and encrypting the gzip or zip file, and making the password readily available in the repo’s README. This is what David Rein did with GPQA and what we did with FindTheFlaws. Might be overkill, but if I were working for a frontier lab building scrapers to pull in as much data from the web as possible, I’d certainly have those scrapers unzip any unencrypted gzips they came across, and I assume their scrapers are probably doing the same.
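For anyone wanting to do this, here’s a rough sketch of one way to produce a password-protected archive. It assumes the third-party pyzipper library, and the filename and password are made up; I’m not claiming this is what the GPQA or FindTheFlaws authors actually used.

```python
# Sketch of creating a password-protected (AES) zip, assuming the third-party
# pyzipper library; the filenames and password below are made up.
import pyzipper

PASSWORD = b"see-the-readme"  # hypothetical password, to be documented in the repo's README

with pyzipper.AESZipFile(
    "dataset.zip", "w",
    compression=pyzipper.ZIP_DEFLATED,
    encryption=pyzipper.WZ_AES,
) as zf:
    zf.setpassword(PASSWORD)
    zf.write("dataset.jsonl")  # hypothetical data file to keep away from naive scrapers
```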
PS to the original posters: seems like nice work! Am planning to read the full paper and ask a more substantive follow-up question when I get the chance
Love pieces that manage to be both funny and thought-provoking. And +1 for fitting a solar storm in there. There is now better evidence of very large historical solar storms than there was during David Roodman’s Open Phil review in late 2014; I’ve been meaning to write something up about that, but other things have taken priority.
This is cool, although I suspect that you’d get something similar from even very simple models that aren’t necessarily “modelling the world” in any deep sense, simply due to first and second order statistical associations between nearby place names. See e.g. https://onlinelibrary.wiley.com/doi/pdfdirect/10.1111/j.1551-6709.2008.01003.x , https://escholarship.org/uc/item/2g6976kg .
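To illustrate the kind of baseline I have in mind (just a toy version of the idea, not the actual method from the linked papers): count how often place names co-occur within a small window of text, then project the co-occurrence matrix down to two dimensions. The corpus snippet and place list below are placeholders.

```python
# Toy baseline: nearby place names tend to co-occur in text, and a 2-D
# projection of the co-occurrence matrix can recover a rough "map".
# The corpus and place list are placeholders.
import numpy as np

corpus_tokens = "trains run from paris out to versailles while london and oxford are linked by rail".split()
places = ["paris", "versailles", "london", "oxford"]
index = {p: i for i, p in enumerate(places)}

window = 10
cooc = np.zeros((len(places), len(places)))
for i, tok in enumerate(corpus_tokens):
    if tok in index:
        for other in corpus_tokens[i + 1 : i + 1 + window]:
            if other in index and other != tok:
                cooc[index[tok], index[other]] += 1
                cooc[index[other], index[tok]] += 1

# PCA via SVD of the centered co-occurrence matrix; inspect the 2-D layout.
centered = cooc - cooc.mean(axis=0, keepdims=True)
u, s, _ = np.linalg.svd(centered, full_matrices=False)
coords = u[:, :2] * s[:2]
print(dict(zip(places, coords.round(2).tolist())))
```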
Leopold and Pavel were out (“fired for allegedly leaking information”) in April. https://www.silicon.co.uk/e-innovation/artificial-intelligence/openai-fires-researchers-558601
Nice job! I’m working on something similar.
> Next, I might get my agent to attempt the last three tasks in the report
I wanted to clarify one thing: Are you building custom prompts for the different tasks? If so, I’d be curious to know how much effort you put into these (I’m generally curious how much of your agent’s ability to complete more tasks might be due to task-specific prompting, vs. the use of WebDriverIO and other affordances of your scaffolding). If not, isn’t getting the agent to attempt the last three tasks as simple as copy-pasting the task instructions from the ARC Evals task specs linked in the report, and completing the associated setup instructions?
Cybersecurity seems in a pretty bad state globally—it’s not completely obvious to me that a historical norm of “people who discover things like SQL injection are pretty tight-lipped about them and share them only with governments / critical infrastructure folks / other cybersecurity researchers” would have led to a worse situation than the one we’re in, cybersecurity-wise...
I’d recommend participating in AGISF. Completely online/virtual, a pretty light commitment (I’d describe it more as a reading group than a course personally), cohorts are typically run by AI alignment researchers or people who are quite well-versed in the field, and you’ll be added to a Slack group which is pretty large and active and a reasonable way to try to get feedback.
This is great. One nuance: This implies that behavioral RL fine-tuning evals are strictly less robust than behavioral I.I.D. fine-tuning evals, and that as such they would only be used for tasks that you know how to evaluate but not generate. But it seems to me that there are circumstances in which the RL-based evals could be more robust at testing capabilities, namely in cases where it’s hard for a model to complete a task by the same means that humans tend to complete it, but where RL can find a shortcut that allows it to complete the task in another way. Is that right or am I misunderstanding something here?
For example, if we wanted to test whether a particular model was capable of getting 3 million points in the game of Qbert within 8 hours of gameplay time, and we fine-tuned on examples of humans doing the same, it might not be able to: achieving this in the way an expert human does might require mastering numerous difficult-to-learn subskills. But an RL fine-tuning eval might find the bug discovered by Canonical ES, illustrating the capability without needing the subskills that humans lean on.
Nice, thanks for this!
If you want to norm this for your own demographic, you can get a very crude estimate by entering your demographic information in this calculator, dividing your risk of hospitalization by 3, and multiplying the result by 0.4 (which combines the 20% reduction from vaccination and the 50% reduction from Paxlovid).
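To make the arithmetic concrete, here it is with a made-up input (the 3% hospitalization figure is purely illustrative, not something from the calculator):

```python
# Illustrative only; the 3% hospitalization risk is a made-up input.
hospitalization_risk = 0.03

# Divide by 3, then apply the 20% reduction (x0.8) and the 50% reduction (x0.5),
# i.e. multiply by 0.4 overall.
crude_estimate = (hospitalization_risk / 3) * 0.8 * 0.5
print(f"{crude_estimate:.2%}")  # 0.40%
```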
Anecdotally, I feel like I’ve heard of a number of instances of folks with what pretty clearly seemed to be long Covid coming on despite not having required hospitalization? And in this UK survey of “Estimated number of people (in thousands) living in private households with self-reported long COVID of any duration”, it looks like only 4% of such people were hospitalized (March 2023 dataset, table 1).
Irving’s team’s terminology has been “behavioural alignment” for the green box—https://arxiv.org/pdf/2103.14659.pdf
Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results
The byte-pair encoding is probably hurting it somewhat here; forcing it to unpack the string into individual characters will likely help. Try using this as a one-shot prompt:
How many Xs are there in “KJXKKLJKLJKXXKLJXKJL”?
Numbering the letters in the string, we have: 1 K, 2 J, 3 X, 4 K, 5 K, 6 L, 7 J, 8 K, 9 L, 10 J, 11 K, 12 X, 13 X, 14 K, 15 L, 16 J, 17 X, 18 K, 19 J, 20 L. There are Xs at positions 3, 12, 13, and 17. So there are 4 Xs in total.
How many [character of interest]s are there in “[string of interest goes here]”?
If it’s still getting confused, add more shots—I suspect it can figure out how to do it most of the time with a sufficient number of examples.
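If you’d rather generate the shots programmatically than write them out by hand, something like this sketch would do it (the query strings are arbitrary, and actually sending the prompt to a model is left out):

```python
# Sketch of building an n-shot "spell out the string" prompt; the query
# strings are arbitrary, and sending the prompt to a model is left out.
def spell_out(s: str, target: str) -> str:
    numbered = ", ".join(f"{i} {c}" for i, c in enumerate(s, start=1))
    positions = [str(i) for i, c in enumerate(s, start=1) if c == target]
    return (
        f"Numbering the letters in the string, we have: {numbered}. "
        f"There are {target}s at positions {', '.join(positions)}. "
        f"So there are {len(positions)} {target}s in total."
    )

def build_prompt(shots, query_target, query_string):
    parts = [f'How many {t}s are there in "{s}"?\n{spell_out(s, t)}' for s, t in shots]
    parts.append(f'How many {query_target}s are there in "{query_string}"?')
    return "\n\n".join(parts)

print(build_prompt([("KJXKKLJKLJKXXKLJXKJL", "X")], "Q", "LQQWERQTYQ"))
```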
It seems like you’re claiming something along the lines of “absolute power corrupts absolutely”: that every set of values that could reasonably be described as “human values” to which an AI could be aligned (your current values, your CEV, [insert especially empathetic, kind, etc. person here]’s current values, their CEV, etc.) would endorse subjecting huge numbers of beings to astronomical levels of suffering, if the person with that value system had the power to do so.
I guess I really don’t find that claim plausible. For example, here are my reactions to two of the questions in the post:
“How many ordinary, regular people throughout history have become the worst kind of sadist under the slightest excuse or social pressure to do so to their hated outgroup?”
… a very, very small percentage of them? (minor point: with CEV, you’re specifically thinking about what one’s values would be in the absence of social pressure, etc...)
“What society hasn’t had some underclass it wanted to put down in the dirt just to lord power over them?”
It sounds like you think “hatred of the outgroup” is the fundamental reason this happens, but in the real world it seems like “hatred of the outgroup” is driven by “fear of the outgroup”. A godlike AI that is so powerful that it has no reason to fear the outgroup also has no reason to hate it. It has no reason to behave like the classic tyrant whose paranoia of being offed leads him to extreme cruelty in order to terrify anyone who might pose a threat, because no one poses a threat.
This reminded me of some findings associated with “latent semantic analysis”, an old-school information retrieval technique. You build a big matrix where each unique term in a corpus (excluding a stoplist of extremely frequent terms) is assigned to a row, each document is assigned to a column, and each cell holds the number of times that term appeared in that document (usually with some kind of weighting scheme that downweights frequent terms), and you take the SVD. This also gives you interpretable dimensions, at least if you use varimax rotation. See for example pgs. 9-11 & pgs. 18-20 of this paper. Also, I seem to recall that the positive and negative poles of the dimensions you get from latent semantic analysis are often both semantically interpretable, sometimes forming antipodal pairs, although I can’t find the paper where I saw this.
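For concreteness, here’s a minimal sketch of that construction using scikit-learn, with toy documents, tf-idf standing in for the weighting scheme, and no varimax rotation:

```python
# Minimal LSA sketch: term counts with a stoplist and a weighting scheme that
# downweights frequent terms (tf-idf here), followed by a truncated SVD.
# The documents are toy placeholders. Note this builds documents x terms,
# i.e. the transpose of the term x document layout described above; the SVD
# is the same up to swapping the roles of the singular vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

documents = [
    "the cat sat on the mat with another cat",
    "dogs and cats are common household pets",
    "stock prices fell sharply on monday",
    "the market rallied after the earnings report",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(documents)          # documents x terms

svd = TruncatedSVD(n_components=2, random_state=0)
doc_coords = svd.fit_transform(X)                # documents in the latent space
term_loadings = svd.components_                  # terms' loadings on each dimension

terms = vectorizer.get_feature_names_out()
for i, component in enumerate(term_loadings):
    top = component.argsort()[::-1][:3]
    print(f"dimension {i}:", [terms[j] for j in top])
```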
I’m not sure whether the right way to think about this is “you should be very circumspect about saying that ‘semantic processing’ is going on just because the SVD has interpretable dimensions, because you get that merely by taking the SVD of a slightly preprocessed word-by-document matrix”, or rather “a lot of what we call ‘semantic processing’ in humans is probably just down to pretty simple statistical associations, which the later layers seem to be picking up on”, but it seemed worth mentioning in any case!

Edit: It seems likely that the “association clusters” seen in the earlier layers might map onto what latent semantic analysis is picking up on, whereas the later layers might be picking up on semantic relationships that aren’t as directly reflected in the surface-level statistical associations. Could be tested!
Why do you expect Bitcoin to be excepted from being labelled a security along with the rest?
(Apologies if the answer is obvious to those who know more about the subject than me, am just genuinely curious)
Had a similar medical bill story from when I was a poor student: The medical center told me that insurance would cover an operation. They failed to mention that they were only talking about the surgeon’s fee; the hospital at which they arranged the operation was out-of-network and I was stuck with 50% of the facility’s costs. I explained my story to the facility. They said I still had to pay but that a payment plan would be possible, and that I could start by paying a small amount each month. I took that literally and just started paying a (very) small amount monthly. At some point they called back to tell me to formally arrange a payment plan through their online portal, which gave me options with such high interest rates that there was no way my future earnings would increase at a fast enough rate to make a payment plan make any sense whatsoever. I called back and explained this, and said that if those were the only options I guess I would just have to try to scrape the money together now, and that I was prepared to try to do this. The administrator, bless her heart, asked me to hold for a while, and eventually came back to say “I’ve spoken with my colleagues, and your current balance owed to us is now zero dollars”.
This (along with a few other experiences in my life) has underscored how sometimes an apparently immovable constraint can evaporate if you can manage to talk to the right person. That said, I felt very lucky to have been taken pity on in this way—I feel like having one’s balance explicitly zeroed out in this way is rare! But it’s interesting to hear that Zvi knows of cases where someone just didn’t pay, with no consequences. I would have assumed that they’d normally report nonpayers to credit agencies and crater their credit scores after long enough, as it costs them nothing or almost nothing to do so. Would be interested either to hear other people’s anecdotes of what happened after nonpayment of a large hospital bill (positive or negative), or to see data on this if anyone knows of any.
I was using medical questions as just one example of the kind of task that’s relevant to sandwiching. More generally, what’s particularly useful for this research programme are tasks:
1. for which we have “models which have the potential to be superhuman at [the] task”, and “for which we have no simple algorithmic-generated or hard-coded training signal that’s adequate”;
2. for which there is some set of reference humans who are currently better at the task than the model; and
3. for which there is some set of reference humans for whom the task is difficult enough that they would have trouble even evaluating/recognizing good performance (you also want this set of reference humans to be capable of being helped to evaluate/recognize good performance in some way).
Prime examples are task types that require some kind of niche expertise to do and evaluate. Cotra’s examples involve “[fine-tuning] a model to answer long-form questions in a domain (e.g. economics or physics) using demonstrations and feedback collected from experts in the domain”, “[fine-tuning] a coding model to write short functions solving simple puzzles using demonstrations and feedback collected from expert software engineers”, and “[fine-tuning] a model to translate between English and French using demonstrations and feedback collected from people who are fluent in both languages”. I was just making the point that Surge can help with this kind of thing in some domains (coding), but not in others.
It’s worth knowing that there are some categories of data that Surge is not well positioned to provide. For example, while they have a substantial pool of participants with programming expertise, my understanding from speaking with a Surge rep is that they don’t really have access to a pool of participants with (say) medical expertise—although for small projects it sounds like they are willing to try to see who they might already have with relevant experience in their existing pool of ‘Surgers’. This kind of more niche expertise does seem likely to become increasingly relevant for sandwiching experiments. I’d be interested in learning more about companies or resources that can help collect RLHF data from people with uncommon (but not super-rare) kinds of expertise for exactly this reason.
I assume that both were inspired by https://arxiv.org/abs/2108.12099 and are related via that shared ancestor