Yes; edited to clarify; ty.
D&D.Sci: Serial Healers
I hereby voice strong approval of the meta-level approaches on display (being willing to do unpopular and awkward things to curate our walled garden, noticing that this particular decision is worth justifying in detail, spending several thousand words explaining everything out in the open, taking individual responsibility for making the call, and actively encouraging (!) anyone who leaves LW in protest or frustration to do so loudly), coupled with weak disapproval of the object-level action (all the complicating and extenuating factors still don’t make me comfortable with “we banned this person from the rationality forum for being annoyingly critical”).
I like the gradual increase in typos and distortions as the viewer’s eyes go down the picture.
With this concept in mind, the entire rationalist project seems like the grayspace to a whitespace that hasn’t been created yet.
This is a very neatly-executed and polished resource. I’m a little leery of the premise—the real world doesn’t announce “this is a Sunk Cost Fallacy problem” before putting you in a Sunk Cost Fallacy situation, and the “learn to identify biases” approach has been done before by a bunch of other people (CFAR and https://yourbias.is/ are the ones which immediately jump to mind) - but given you’re doing what you’re doing I think you’ve done it about as well as it could plausibly be done (especially w.r.t. actually getting people to relive the canonical experiments). Strong-upvoted.
Details of evals and benchmarks are typically kept secret so they can be reused in future; this makes a lot of performance-measuring analysis impossible to replicate.
There’s an upside of conventional education which no-one on any side of any debate ever seems to bring up, but which was a major benefit (possibly the major benefit) of my post-primary studies. Namely: it lets students discover what they have a natural aptitude for (or lack thereof) relative to a representative peer group. The most valuable things I learned in my Engineering courses at university were:
1. I’m pretty mediocre at Engineering, especially sub-subjects which aren’t strictly Structural and/or Mechanical.
2. In particular, I’m significantly worse than the average would-be Engineer at working with electronics, so I should drop my plan to specialize in that field.
3. Conversely, I’m significantly better than the average would-be Engineer at work involving code, money, probability, simulation and inference. (Learning this is a large part of why I eventually left Engineering for Data Science and Finance.)
Findings like this aren’t perfect since they can give false negatives (a poorly-taught course can lead someone to conclude they don’t like the subject when they actually don’t like the teacher) and false positives (my initial showing at coding courses made me think I had a genius-level talent for programming; turns out I actually just have a genius-level talent for beginner programming, and much of my lead evaporates when doing anything nontrivial); also, there’s no hard reason they can’t also be found in the workforce (if people joined companies at the same time as two dozen similarly-aged and -educated peers with whom they were frequently compared, they’d get this benefit while being paid[1]). Still, I think it’s an undervalued and underdiscussed output of our current educational paradigm.
[1] Soldiers get this, but I’m not thinking of that kind of company.
Curated posts get their dates reset to the time they were curated. I don’t know why.
Your question cuts to the heart of our movement: it’s been over two decades and (afaik, afaict) we still don’t have a robust & general way to test whether people are getting more rational.
D&D.Sci: The Choosing Ones [Answerkey and Ruleset]
Typo in title: “SOLELY” has more Ls in it.
ETA: fixed now.
D&D.Sci: The Choosing Ones
This seems to imply there are better and worse times of day to try and get immediate feedback. If so, what are they?
Obligatory shill/reminder for any teacher reading this that if they want a never-before-seen educational data science challenge which can’t be solo’d by current-gen AI (and is incidentally super easy to mark) they can just DM me; I might want some kind of compensation if it needs to be extensively customized and/or never released to the public, but just sharing scenarios a few months before I post them on LW is something I’d absolutely do for the love of the game. (and if they want any other kind of challenge then uh good luck with that)
I’m not sure exactly what quantity you are calculating when you refer to the singularity date. Is this the extrapolated date for 50% success at 1 month (167hr) tasks?
. . . not quite: I’d forgotten that your threshold was a man-month, instead of a month of clock time. I’ll redo things with the task length being a month of work for people who do need to eat/sleep/etc: luckily this doesn’t change results much, since 730 hours and 167 hours are right next door on a log(t) scale.
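For the curious, here’s the arithmetic behind that claim as a rough sketch; the current horizon and doubling time below are placeholders for illustration, not the values from my actual fit:

```python
import numpy as np

# Illustrative only: neither number below is from my actual analysis.
current_horizon_hours = 0.5   # hypothetical 50%-success horizon today
doubling_time_months = 7.0    # hypothetical horizon doubling time

def months_until(threshold_hours):
    """Months until a log-linear horizon trend reaches threshold_hours."""
    doublings_needed = np.log2(threshold_hours / current_horizon_hours)
    return doublings_needed * doubling_time_months

man_month, clock_month = 167, 730
gap = np.log2(clock_month / man_month)                       # ~2.1 doublings apart
delta = months_until(clock_month) - months_until(man_month)  # ~2 doubling-times later
print(gap, delta)
```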
SWAA fits most naturally into the ‘fully_private’ category in the HCAST parlance
Your diagnosis was on the money. Filtering for the union of fully_private HCAST tasks and SWAA tasks (while keeping the three models which caused crashes without SWAAs) does still make forecasts more optimistic, but only nets half an extra year for the every-model model, and two extra years for the since-4o model.
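(For clarity, the filter was essentially the following; the file path and column names are stand-ins I’m using for illustration, not necessarily the labels in the real data.)

```python
import pandas as pd

# Sketch of the task filter described above; "tasks.csv", "suite" and
# "privacy" are placeholder names, not the actual file/field names.
tasks = pd.read_csv("tasks.csv")

keep = (
    ((tasks["suite"] == "HCAST") & (tasks["privacy"] == "fully_private"))
    | (tasks["suite"] == "SWAA")
)
filtered_tasks = tasks[keep]
```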
I’ll edit the OP appropriately; thank you for your help. (In retrospect, I probably should have run the numbery stuff past METR before posting, instead of just my qualitative concerns; I figured that if I was successfully reproducing the headline results I would be getting everything else right, but it would still have made sense to get a second opinion.)
Potential biases:
The aforementioned fact that privacy levels are upper bounds on how private things might be: afaik, METR has no airtight way to know which fully_private tasks were leaked or plagiarized or had extremely similar doppelganger tasks coincidentally created by unrelated third parties, or which public_problems had solutions posted somewhere they couldn’t see but which nevertheless made it into the training data, or whether some public_solutions had really good summaries written down somewhere online.
The way later models’ training sets can only ever contain more benchmark tasks than earlier models’ did. (I reiterate that at least some of these tasks were first created around the apparent gradient discontinuity starting with GPT-4o.)
The fact they’re using non-fully_private challenges at all, and therefore testing (in part) the ability of LLMs to solve problems present in (some of their) training datasets. (I get that this isn’t necessarily reassuring as some problems we’d find it scary for AI to solve might also have specific solutions (or hints, or analogies) in the training data.)
The fact they’re using preternaturally clean code-y tasks to measure competence. (I get that this isn’t necessarily reassuring as AI development is arguably a preternaturally clean code-y task.)
The way Baselined tasks tend to be easier and Baselining seems (definitely slightly) biased towards making them look easier still, while Estimated tasks tend to be harder and Estimation seems (potentially greatly) biased towards making them look harder still: the combined effect would be to make progress gradients look artificially steep in analyses where Baselined and Estimated tasks both matter. (To my surprise, Estimated tasks didn’t make much difference to (my reconstruction of) the analysis, due (possibly/plausibly/partially) to current task horizons being under an hour; but if someone used HCAST to evaluate more capable future models without doing another round of Baselining . . .)
Possibly some other stuff I forgot, idk. (All I can tell you is I don’t remember thinking “this seems like it’s potentially biased against reaching scary conclusions” at any point when reading the papers.)
Notes on the Long Tasks METR paper, from a HCAST task contributor
Looked into it more and you’re right: conventional symbolic regression libraries don’t seem to have the “calculate a quantity then use that as a new variable going forward” behavior I’d have needed to get Total Value and then decide&apply a tax rate based on that. I . . . probably should have coded up a proof-of-concept before impugning everyone including myself.
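For concreteness, here’s the kind of two-stage structure I mean (toy column names and a made-up tax schedule, purely to show the shape a symbolic regressor would have needed to discover):

```python
import numpy as np
import pandas as pd

# Toy illustration: the columns and the tax schedule are invented for this
# example, not taken from the actual scenario's ruleset.
df = pd.DataFrame({"item_a": [10, 200, 35], "item_b": [5, 120, 60]})

# Stage 1: compute an intermediate quantity...
df["total_value"] = df["item_a"] + df["item_b"]

# Stage 2: ...then choose and apply a tax rate as a function of it.
df["tax_rate"] = np.where(df["total_value"] > 100, 0.20, 0.05)
df["tax_paid"] = df["total_value"] * df["tax_rate"]
```

As far as I can tell, off-the-shelf symbolic regression searches for a single closed-form expression of the raw inputs, so it has no natural way to name total_value in stage 1 and reuse it in stage 2.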
You get scored based on the number of mages you correctly accuse; and a valid accusation requires you to specify at least one kind of illegal healing they’ve done. (So if you’ve already got Exampellius the Explanatory for healing Chucklepox, you don’t get anything extra for identifying that he’s also healed several cases of Rumblepox.)
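A sketch of that scoring rule in Python (the data structures are my own choice, just to pin down the logic):

```python
def score(accusations, true_healings):
    """Count correctly-accused mages.

    accusations: dict mapping mage name -> set of diseases you accuse them
                 of illegally healing.
    true_healings: dict mapping mage name -> set of diseases they actually
                   healed illegally.
    A mage counts iff at least one accused disease is genuinely theirs;
    naming extra correct diseases for the same mage adds nothing.
    """
    return sum(
        1
        for mage, diseases in accusations.items()
        if diseases & true_healings.get(mage, set())
    )

# e.g. accusing Exampellius of both Chucklepox and Rumblepox still scores 1.
```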