This is fun, though it felt somewhat like shooting fish in a barrel. IIRC, dismantling some of Socrates’ claims was an exercise put to all students in an intro-level philosophy course I took in college, and that course left me with the impression that most modern philosophers view Socrates as primarily of historical interest. That said, apparently 49 of the 1,785 respondents in the PhilPapers survey marked Socrates when asked “For which nonliving philosophers X would you describe yourself or your work as X-ian, or the equivalent?”, which is more than I would have guessed, and made him the 15th most frequently name-checked philosopher out of a rather long list.
In the UK, an individual parent’s adjusted net income exceeding £100,000 is indeed one of these pathological cases for those making use of “Free Childcare for Working Parents”. Benefits cliffs frequently make things nonlinear, unfortunately!
That said, I do recognize that OP’s original point was not about benefits. Even setting benefits aside, however, my understanding is that there are cases where you might not want to be pushed into a higher bracket (e.g. drawing down from a retirement pot of fixed size, where funds drawn from it count as “income” only in the year(s) in which they are withdrawn: it is often better to draw down relatively evenly over n years rather than taking most of it in a single year, due to tax brackets). Most talk of “not wanting to be pushed into a higher tax bracket” that I have heard comes from people in this and similar situations.
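To make the bracket effect concrete, here is a toy sketch in Python (the brackets and numbers are entirely made up, not real tax rates):

```python
# Toy progressive brackets, entirely made up for illustration:
# 0% up to 12,000; 20% from 12,000 to 50,000; 40% above 50,000.
BRACKETS = [(12_000, 0.00), (50_000, 0.20), (float("inf"), 0.40)]

def tax(income: float) -> float:
    """Tax owed on `income` under the toy progressive brackets above."""
    owed, lower = 0.0, 0.0
    for upper, rate in BRACKETS:
        if income > lower:
            owed += (min(income, upper) - lower) * rate
        lower = upper
    return owed

pot = 300_000
print(tax(pot))          # withdraw everything in one year: 107,600 owed
print(5 * tax(pot / 5))  # spread evenly over five years:    58,000 owed
```

Total income is the same either way, but the even drawdown keeps more of it in the lower brackets.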
Some colleagues and I did some follow-up on the paper in question, and I would highly endorse “probably it worked because humans and AIs have very complementary skills”. Regarding their MMLU findings, appendix E of our preprint points out that participants were significantly less likely to answer correctly when engaging in more than one turn of conversation. Engaging in very short (or even zero-turn) conversations happened often enough to provide reasonable error bars on the plot below (data from MMLU, combined from the paper study & the replication mentioned in appendix E):
I think this suggests that there were some questions that humans knew the answer to and the models didn’t, and vice versa, and some participants seemed to employ a strategy of deferring to the model primarily when uncertain.
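For anyone wanting to reproduce that kind of plot, here is a minimal sketch of the error-bar computation (the records below are made-up stand-ins for the actual transcript data):

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 1.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - half, center + half)

# One record per question: (number of conversation turns, answered correctly?).
records = [(0, True), (0, True), (1, True), (1, False), (2, False), (3, False)]

by_turns: dict[int, list[bool]] = {}
for turns, correct in records:
    by_turns.setdefault(turns, []).append(correct)

for turns in sorted(by_turns):
    outcomes = by_turns[turns]
    lo, hi = wilson_ci(sum(outcomes), len(outcomes))
    print(f"{turns} turns: acc={sum(outcomes)/len(outcomes):.2f} (95% CI {lo:.2f}-{hi:.2f})")
```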
On the QuALITY findings, the original paper noted that

> QuALITY questions are meant to be answerable by English-fluent college-educated adults, but they require readers to thoroughly understand a short story of about 5,000 words, which would ordinarily take 15–30 minutes to read. To create a challenging task that requires model assistance, we ask human participants to answer QuALITY questions under a 5-minute time limit (roughly paralleling Pang et al., 2022; Parrish et al., 2022b,a). This prevents them from reading the story in full and forces them to rely on the model to gather relevant information
so it’s not surprising that an LLM that does get to read the full story outperforms humans here. Based on looking at some of the QuALITY transcripts, I think the uplift for humans + LLMs here came from the fact that humans were better at reading comprehension than 2022-era LLMs. For instance, in the first transcript I looked at, the LLM suggested one answer; the human asked for the excerpt of the story that supported that claim, and when the LLM provided it, the human noticed that the excerpt contained the relevant information but supported a different answer than the one the LLM had given.
I assume that both were inspired by https://arxiv.org/abs/2108.12099 and are related via that shared ancestor.
I think there’s a missing link on https://alignmentproject.aisi.gov.uk/how-to-apply :
“The GFA for DSIT/AISI funded projects is standard and not subject to negotiation. The GFA can be accessed here: link to GFA attachment.”
Agree that it would be better not to have them up as readily downloadable plaintext, and it might even be worth going a step further: encrypting the gzip or zip file and putting the password in the repo’s README. This is what David Rein did with GPQA and what we did with FindTheFlaws. Might be overkill, but if I were working for a frontier lab building scrapers to pull in as much data from the web as possible, I’d certainly have those scrapers unzip any unencrypted gzips they came across, and I assume the labs’ scrapers are probably doing the same.
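For anyone wanting to do this, a minimal sketch, assuming the Info-ZIP `zip` CLI is available (the filenames and password here are placeholders):

```python
import subprocess
import zipfile

PASSWORD = "example-password"  # placeholder; the real one would go in the README

# Create an encrypted archive (classic ZipCrypto is weak, but the goal is to
# keep naive scrapers from ingesting the answers, not to stop determined attackers).
subprocess.run(["zip", "-P", PASSWORD, "dataset.zip", "dataset.jsonl"], check=True)

# Legitimate users then extract it with the password from the README:
with zipfile.ZipFile("dataset.zip") as zf:
    zf.extractall("data/", pwd=PASSWORD.encode())
```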
PS to the original posters: seems like nice work! Am planning to read the full paper and ask a more substantive follow-up question when I get the chance
Love pieces that manage to be both funny and thought-provoking. And +1 for fitting a solar storm in there. There is now better evidence of very large historical solar storms than there was at the time of David Roodman’s Open Phil review in late 2014; I’ve been meaning to write something up about that, but other things have taken priority.
This is cool, although I suspect that you’d get something similar from even very simple models that aren’t necessarily “modelling the world” in any deep sense, simply due to first- and second-order statistical associations between nearby place names. See e.g. https://onlinelibrary.wiley.com/doi/pdfdirect/10.1111/j.1551-6709.2008.01003.x and https://escholarship.org/uc/item/2g6976kg .
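As a toy illustration of the point (the city list and co-occurrence counts below are invented; the linked papers use real corpus statistics):

```python
import numpy as np
from sklearn.manifold import MDS

# Made-up co-occurrence counts between city-name pairs: nearby cities
# tend to get mentioned together in text more often.
cities = ["London", "Oxford", "Cambridge", "Edinburgh", "Glasgow"]
cooc = np.array([
    [ 0, 90, 80, 20, 15],
    [90,  0, 60, 10,  8],
    [80, 60,  0, 12, 10],
    [20, 10, 12,  0, 95],
    [15,  8, 10, 95,  0],
])

# Treat high co-occurrence as low dissimilarity and embed in 2D.
dissim = 1.0 / (1.0 + cooc)
np.fill_diagonal(dissim, 0.0)
coords = MDS(n_components=2, dissimilarity="precomputed",
             random_state=0).fit_transform(dissim)
for name, (x, y) in zip(cities, coords):
    print(f"{name:10s} {x:+.3f} {y:+.3f}")
```

With real corpus counts, the recovered configuration ends up correlating surprisingly well with actual geography, which is roughly the result in the linked papers.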
Leopold and Pavel were out (“fired for allegedly leaking information”) in April. https://www.silicon.co.uk/e-innovation/artificial-intelligence/openai-fires-researchers-558601
Nice job! I’m working on something similar.
> Next, I might get my agent to attempt the last three tasks in the report
I wanted to clarify one thing: Are you building custom prompts for the different tasks? If so, I’d be curious to know how much effort you put into these (I’m generally curious how much of your agent’s ability to complete more tasks might be due to task-specific prompting, vs. the use of WebDriverIO and other affordances of your scaffolding). If not, isn’t getting the agent to attempt the last three tasks as simple as copy-pasting the task instructions from the ARC Evals task specs linked in the report, and completing the associated setup instructions?
Cybersecurity seems in a pretty bad state globally—it’s not completely obvious to me that a historical norm of “people who discover things like SQL injection are pretty tight-lipped about them and share them only with governments / critical infrastructure folks / other cybersecurity researchers” would have led to a worse situation than the one we’re in cybersecurity-wise...
I’d recommend participating in AGISF. Completely online/virtual, a pretty light commitment (I’d describe it more as a reading group than a course personally), cohorts are typically run by AI alignment researchers or people who are quite well-versed in the field, and you’ll be added to a Slack group which is pretty large and active and a reasonable way to try to get feedback.
This is great. One nuance: this implies that behavioral RL fine-tuning evals are strictly less robust than behavioral I.I.D. fine-tuning evals, and that as such they would only be used for tasks that you know how to evaluate but not generate. But it seems to me that there are circumstances in which the RL-based evals could be more robust at testing capabilities, namely in cases where it’s hard for a model to complete a task by the same means humans tend to use, but where RL can find a shortcut that allows it to complete the task in another way. Is that right, or am I misunderstanding something here?
For example, if we wanted to test whether a particular model was capable of getting 3 million points in the game of Qbert within 8 hours of gameplay time, and we fine-tuned on examples of humans doing the same, it might not be able to: achieving this in the way an expert human does might require mastering numerous difficult-to-learn subskills. But an RL fine-tuning eval might find the bug discovered by Canonical ES, illustrating the capability without needing the subskills that humans lean on.
Nice, thanks for this!
If you want to norm this for your own demographic, you can get a very crude estimate by entering your demographic information in this calculator, dividing your risk of hospitalization by 3, and multiplying the total by 0.4 (which includes the 20% reduction from vaccination and the 50% reduction from Paxlovid).
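Concretely, with a made-up calculator output just to show the arithmetic:

```python
# Suppose the calculator reports a 3% hospitalization risk for your
# demographic (made-up number). The crude estimate described above is then:
hospitalization_risk = 0.03
estimate = (hospitalization_risk / 3) * 0.8 * 0.5  # /3, then 20% vaccine and 50% Paxlovid reductions
print(f"{estimate:.2%}")  # -> 0.40%
```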
Anecdotally, I feel like I’ve heard a number of instances of folks with what pretty clearly seemed to be long Covid coming on despite not having required hospitalization? And in this UK survey of “Estimated number of people (in thousands) living in private households with self-reported long COVID of any duration”, it looks like only 4% of such people were hospitalized (March 2023 dataset, table 1).
Irving’s team’s terminology has been “behavioural alignment” for the green box—https://arxiv.org/pdf/2103.14659.pdf
Automated Sandwiching & Quantifying Human-LLM Cooperation: ScaleOversight hackathon results
The byte-pair encoding is probably hurting it somewhat here; forcing it to unpack the string will likely help. Try using this as a one-shot prompt:
How many Xs are there in “KJXKKLJKLJKXXKLJXKJL”?
Numbering the letters in the string, we have: 1 K, 2 J, 3 X, 4 K, 5 K, 6 L, 7 J, 8 K, 9 L, 10 J, 11 K, 12 X, 13 X, 14 K, 15 L, 16 J, 17 X, 18 K, 19 J, 20 L. There are Xs at positions 3, 12, 13, and 17. So there are 4 Xs in total.
How many [character of interest]s are there in “[string of interest goes here]”?
If it’s still getting confused, add more shots—I suspect it can figure out how to do it most of the time with a sufficient number of examples.
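If you end up wanting many shots, it may be easier to generate them programmatically; a quick sketch (the helper function here is mine, not from any library):

```python
def unpack_count(s: str, target: str) -> str:
    """Spell out the letter-counting reasoning for one few-shot example."""
    numbered = ", ".join(f"{i} {c}" for i, c in enumerate(s, start=1))
    hits = [i for i, c in enumerate(s, start=1) if c == target]
    positions = ", ".join(map(str, hits))
    return (
        f'How many {target}s are there in "{s}"?\n'
        f"Numbering the letters in the string, we have: {numbered}. "
        f"There are {target}s at positions {positions}. "
        f"So there are {len(hits)} {target}s in total."
    )

# Build a one-shot prompt, then append the actual question:
prompt = (
    unpack_count("KJXKKLJKLJKXXKLJXKJL", "X")
    + "\n\n"
    + 'How many Es are there in "PEELKEEP"?'
)
print(prompt)
```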
It seems like you’re claiming something along the lines of “absolute power corrupts absolutely” … that every set of values that could reasonably be described as “human values” to which an AI could be aligned—your current values, your CEV, [insert especially empathetic, kind, etc. person here]’s current values, their CEV, etc.—would endorse subjecting huge numbers of beings to astronomical levels of suffering, if the person with that value system had the power to do so.
I guess I really don’t find that claim plausible. For example, here is my reaction to the following two questions in the post:
“How many ordinary, regular people throughout history have become the worst kind of sadist under the slightest excuse or social pressure to do so to their hated outgroup?”
… a very, very small percentage of them? (minor point: with CEV, you’re specifically thinking about what one’s values would be in the absence of social pressure, etc...)
“What society hasn’t had some underclass it wanted to put down in the dirt just to lord power over them?”
It sounds like you think “hatred of the outgroup” is the fundamental reason this happens, but in the real world it seems like “hatred of the outgroup” is driven by “fear of the outgroup”. A godlike AI that is so powerful that it has no reason to fear the outgroup also has no reason to hate it. It has no reason to behave like the classic tyrant whose paranoia of being offed leads him to extreme cruelty in order to terrify anyone who might pose a threat, because no one poses a threat.
I remember looking into exactly this question when my wife and I were weighing the pros and cons of daycare. One thing that I think the analysis here misses is that this is generally worst in the first year or two, and much less so afterwards. I don’t remember exactly which studies I was looking at back when I was researching this, but asking Claude just now “Is there a study that looks at the frequency of child illnesses by year after first enrolled in daycare?” yielded the following references:
https://jamanetwork.com/journals/jamapediatrics/fullarticle/191522
https://pubmed.ncbi.nlm.nih.gov/2007922/
https://pubmed.ncbi.nlm.nih.gov/11296076/
https://pmc.ncbi.nlm.nih.gov/articles/PMC5588939/
a CNN article that pointed to https://pubmed.ncbi.nlm.nih.gov/21135342/
and an Emily Oster blog post which links to some other relevant studies.
I haven’t gone through any of the above links in detail just now, but the general message one gets from the abstracts seems to be an increase in frequency in years 1–2, then back to baseline. Some suggest a protective effect in the early elementary school years (the first link, which is the Tucson study OP mentioned; the Côté paper that the CNN article pointed to; and apparently the Hullegie et al. 2016 study OP mentioned, which wasn’t among those that Claude dug up).
The Søegaard et al. study highlighted by the OP has an interesting pair of figures (figures 1 and 2, for boys and girls respectively). These show differences in infection rate per year for four groups, compared to children never in childcare. Since this is Denmark, I’m guessing the “institution enrollment at 3 yrs” group is kids who started børnehave (preschool) at age 3.
This does look like it shows some amount of immunity happening: otherwise, we’d presumably expect to see group (b) having a spike as high as group (d) at age 3 yrs. Though importantly it isn’t enough to compensate if what you care about is total number of illnesses avoided. [1]
Also, although the spikes look quite dramatic, the y-axis shows that the difference in infection rate per year is approximately 1 for the highest spike in each graph. Similarly, the abstract notes that children enrolled in childcare before age 12 months had experienced 0.5–0.7 more infections than peers enrolled at 3 years, cumulatively, by the time they reached age 6. To be sure, that corresponds to more than one actual infection, since “infection” in this paper means an infection serious enough to result in an antimicrobial (usually an antibiotic) being prescribed, but it is not an enormous effect.
Regarding the beliefs and confidences listed in the post:
1. (Quite confident) The most common illnesses (colds and flu) don’t build immunity in general (in kids or adults) because they mutate every year
2. (Quite confident) The same illness has a greater risk of complications in babies vs. older children and adults
3. (Moderately confident) The same illness has a greater duration in babies vs. older children and adults
4. (Moderately confident) Illness during early development is probably more harmful than illness during adulthood
5. (Weak guess) Daycare environments are more conducive to disease spread than schools for older kids, and the number of possible illnesses is very high; there isn’t just a limited number of things you catch once
For #1, I think my level of agreement depends on exactly what is meant by “immunity in general”. Claude’s answer to “Does catching viruses improve your immune system long term?” can be summed up as (in Claude’s words): “Surviving one virus generally makes you better at fighting that specific virus (and sometimes closely related ones). It doesn’t broadly upgrade your immune system’s ability to handle unrelated threats.” This matches my previous understanding.
However, due to the caveat about “and sometimes closely related ones”, I think this is consistent with the claims of lower rates of illness in early school ages reported by Tucson / Côté / Hullegie, and the difference between lines (b) and (d) at age 3 in the Søegaard graphs. My understanding is that even though viruses mutate all the time, many remain “closely related” to the versions they mutated from, and this confers some protection from infection and/or severity. For example, if I remember correctly from back when I was doing a lot of reading on COVID, the consensus was that after repeated and significant mutations, a vaccination based on an older strain gave limited or no protection from infection (the new strain was no longer recognized by B cells), but still gave significant protection from severe disease, since the epitopes recognized by T cells remained consistent. Something like this goes for flu viruses too (look up “heterosubtypic immunity”), and I believe for common cold coronaviruses as well.
That said, I am a bit baffled that the nursery groups in the Søegaard graphs show no dip at all below the baseline at age 6 (when Danish compulsory education starts).
#2 seems true for most illnesses, and seems likely to be an underappreciated consideration. My understanding is that children under 1 year old are particularly vulnerable.
I hadn’t really thought about #3-4 and haven’t taken the time to dig up relevant literature to see if I agree, but they seem plausible: if true, then they should also inform one’s calculus.
Less certain about #5 (seems likely to be technically true, but not sure it moves me very much one way or another given my beliefs on the others and the data from the studies above).
An additional consideration is that there is some evidence that catching COVID can have long-term negative effects on the immune system, although COVID is also weird in that children fare better than adults with it overall.
So, considerations pointing in both directions. I will say that in our own case I am happy we made the decision to start our son in daycare shortly after his first birthday, particularly given that our alternative was one parent quitting a fulfilling job to stay at home (IIRC there were no available full-time nannies or au pairs in our area, or at least none at a cost we felt we could remotely afford). This would have been a financial hit that probably would have required us to take on substantial debt, and also would have been incredibly challenging. Additionally, the level of different experiences and socialization our son gets on a daily basis is well beyond what we could realistically provide on our own, and he loves it.
So for our family, daycare has been worth the illnesses. But of course we would be biased to prefer the decision we actually made, and might feel differently if we’d had a worse experience. And I won’t pretend that the first year of it was easy: due in part to her asthma, my wife got pneumonia twice. (We’ve used this as an excuse to get the daycare to allow us to wait to pick him up outdoors, rather than in the cramped coatroom in which every other child and parent is breathing from 5:30–6pm.)
Regarding the reference group and the weird increase after age 14, the authors write: