I’m a little surprised that no-one’s publicly using pure AI on the currently-running D&D.Sci challenge: player count at time of writing stands at five humans and a centaur. This could be a really good (or at least really interesting) sanity check on how well this year’s Agents handle novel inference problems.
Looked at other people’s conclusions and decided that they’re completely right, and I was wrong. (Was sufficiently pleased with myself for figuring out that mages got stronger when facing stronger opponents that I forgot to check whether it worked this way for anyone else.)
Accordingly, my approach is now:
to take the path of least resistance with the Warrior for the main challenge
And for hard mode:
Warrior again: Gremlin, Slime, Campfire x3, Cloak, Powder
Best guess for the regular challenge:
Mage: Tome, Cultist, Jaw Worm, Campfire, Sentries, Chosen, Campfire
Best guess (much less certain!) for challenging the Champion:
Warrior: Gremlin, Worm, Armor, Shield, Nob, Chosen, Nob
Earlier this week I attended a presentation on AI use in only-somewhat-techie corporate contexts, and found it fascinating how LW terminology has gone mainstream but the meanings haven’t: the presenter talked a lot about ‘existential risk’ (which I slowly inferred meant ‘AI-using competitors might put us out of business’), and ‘alignment’ (which he helpfully defined as ‘getting various AI modalities—coding, search, image gen etc—to work together harmoniously’).
Tried with search just now, and ChatGPT at least no longer displays this failure mode.
Every now and then I’ve asked AIs to “name as many characters as you can from [moderately obscure game/story]”. So far I’ve never had one fail to hallucinate extra characters, or fail to double down when I ask for more details about its creations.
I once pointed out that METR’s
Baselined tasks tend to be easier and Baselining seems (definitely slightly) biased towards making them look easier still, while Estimated tasks tend to be harder and Estimation seems (potentially greatly) biased towards making them look harder still: the combined effect would be to make progress gradients look artificially steep in analyses where Baselined and Estimated tasks both matter.
but found (to my surprise!) that removing all Estimated tasks didn’t affect headline results, presumably/partly because
most of the Estimated tasks were really difficult ones where AIs never won, so errors here had negligible effect on the shapes of logistic regression curves.
and footnoted that with
Note that this does not mean they will continue to have negligible effects on next year’s agents.
Well, it’s now next year: one more thing to keep in mind when deciding how much salt to take the Scary Graph with.
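For intuition on why always-failed tasks have negligible effect on the fit, here’s a toy sketch (synthetic data and my own made-up parameters, not METR’s actual dataset or pipeline): even tripling the recorded lengths of long tasks the agent always failed barely moves the fitted 50% time horizon.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for a time-horizon dataset: 500 tasks with lengths
# spread log-uniformly from 1 minute to ~1 week, and success probability
# falling off logistically in log2(length).
minutes = np.exp(rng.uniform(np.log(1), np.log(10_000), size=500))
true_horizon = 60.0  # by construction, P(success) = 50% at 60 minutes
success = rng.random(500) < 1 / (1 + np.exp(np.log2(minutes / true_horizon)))

def fitted_horizon(mins):
    # Fit P(success) ~ sigmoid(a * log2(minutes) + b); the 50% point is 2**(-b/a).
    model = LogisticRegression(C=1e6).fit(np.log2(mins).reshape(-1, 1), success)
    return 2 ** (-model.intercept_[0] / model.coef_[0][0])

# Simulate a large Estimation bias: triple the stated length of every long
# task the agent failed. These points sit on the flat tail of the curve,
# so the fitted horizon barely moves.
biased = np.where((minutes > 8 * true_horizon) & ~success, minutes * 3, minutes)

print(f"fitted horizon, original lengths: {fitted_horizon(minutes):.0f} min")
print(f"fitted horizon, inflated lengths: {fitted_horizon(biased):.0f} min")
```

(The failure-only bias does start to matter once mis-estimated tasks land near the 50% crossing point, which is the “next year’s agents” worry from the footnote.)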
Ah, I see where I was misreading; ‘the latter’ could have meant the bednets or their unwanted side-effects, and having read your interpretation I take the “unwanted side-effects” reading to be the more plausible one. Ty.
When I go to my page (with “All Posts” active) and click “See more” repeatedly, it works ~5 times but stops adding posts before there stop being more posts to add. I don’t know if this was a bug or design choice, but either way I mildly dislike it.
Is there a term for this?
Seems like a subtype of Bulverism; not aware of a more specific term.
I also find it hard to call out this type of behaviour when it happens, even when I can tell exactly what is going on.
Assuming you have a LWer-typical level of atypicality, you could say “I literally do/believe [outlandish but politically-neutral activity/opinion], there’s no way closed-mindedness is my problem.” (If it were me, I’d use donating to Shrimp Welfare; apparently most people think that’s strange, for some reason.)
Premise: The average rationalist cannot use Bayes theorem. No, I will be stronger and more specific: when I ask a room full of people at a LessWrong meetup whether they can write the formula for Bayes Theorem on a piece of paper, less than half of them can.
Nitpick: I think that second sentence is actually weaker-and-more-specific. I’m very confident I can do a Bayesian update right, I not infrequently use Bayes in my dayjob, I once made a videogame about Bayesian reasoning . . . and if you asked me to write the formula without checking Wikipedia I’d have to spend a minute or two deriving it from first principles. I suspect at least a few people are in a similar boat.
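(For concreteness, the minute-or-two derivation I have in mind is the standard one from the definition of conditional probability:

$$P(A\mid B)=\frac{P(A\cap B)}{P(B)},\qquad P(B\mid A)=\frac{P(A\cap B)}{P(A)},$$

so $P(A\cap B)=P(B\mid A)\,P(A)$, and substituting into the first equation gives

$$P(A\mid B)=\frac{P(B\mid A)\,P(A)}{P(B)}.$$

Nothing deep, but it’s re-derivation rather than recall.)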
Typo in title, unless I’m misunderstanding something: 15 != 50.
That’s pretty fair and useful criticism. If I were making this again I’d have Water Elementals roll +2d6 or +3d4 instead of +1d12; the uniform distribution is, in retrospect, suspiciously unnatural.
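To illustrate the difference in shape, here’s a quick sketch in Python (my own illustration, not the scenario’s actual generation code):

```python
from collections import Counter
from itertools import product

def pmf(n_dice, sides):
    # Distribution of the sum of `n_dice` fair dice with `sides` sides each.
    totals = Counter(sum(roll) for roll in product(range(1, sides + 1), repeat=n_dice))
    n_outcomes = sides ** n_dice
    return {total: count / n_outcomes for total, count in sorted(totals.items())}

# 1d12 is flat (every total from 1 to 12 has probability 1/12), while 2d6
# and 3d4 bunch up in the middle, which is how most natural quantities behave.
for label, (n, s) in [("1d12", (1, 12)), ("2d6", (2, 6)), ("3d4", (3, 4))]:
    print(label, {t: round(p, 3) for t, p in pmf(n, s).items()})
```

(The means differ slightly too: 6.5 for 1d12, 7 for 2d6, 7.5 for 3d4; it’s the shape I’d be after, not an exact match.)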
(I do however plead the mitigating factors that foam swords are cheap and “foam swords plus the stuff that ends up mattering” is one of the ‘correct’ solutions.)
((Also, I’m happy you played one of my older scenarios, and that you took the time to share your thoughts. Feedback is always greatly appreciated.))
I tried ChatGPT(-5.2-Thinking) on the original D&D.Sci challenge (which is tough, but not tricky) and it got a near-perfect answer, one point shy of the optimum.
I also tried ChatGPT on the second D&D.Sci challenge (which is tricky, but not tough), and it completely failed (albeit in a sensible and conservative manner). Repeated prompts of “You’re missing something, please continue with the challenge” didn’t help.
This was incredibly good, though it seemed borderline incoherent on first reading.
Why do you think METR hasn’t built more tasks then, if it’s easy?
I have no idea, I just don’t think the “actually making the tasks” part can be the limiting factor.
I take it you have a negative opinion of them?
Yes; I also have a positive opinion of them, and various neutral opinions of them.
(My position could be summed up as “the concept of time horizons was really good & important, and their work is net positive, but it could use much stronger methodological underpinning and is currently being leaned on too heavily by too many people”; I’m given to understand that’s also their position on themselves.)
. . . I realize the start of this post reads like a weird brag but imo it really isn’t. “Hey failed-wannabe-gamedev, I need a bunch of puzzles and it’s ok if they’re not very fun and it’s ok if there’s no UI and it’s actively preferable if they’re ridiculously complicated and time-consuming and spreadsheet-requiring and reminiscent-of-someone’s-dayjob, we’re paying a couple grand apiece” is a pitch I imagine a lot of people would be willing and able to jump at, many much moreso than me.
Apparently it’s hard
No? I contributed a ~20hr task to them and it was pretty easy actually? I’ve been making benchmark-shaped things on and off for the past five years, for free, as a hobby?
(Most of the effort on my end was getting it into METR’s required format, recruiting & managing my playtester, and contemplating whether I was complicit in intellectual fraud[1]; if they’d made those things easier or handled them themselves I’d have made more; IIRC the actual “make a ~20hr task” part took me <20hrs.)
--It becomes standard practice for any benchmark-maker to include a human baseline for each task in the benchmark, or at least a statistically significant sample.
--They also include information about the ‘quality’ of the baseliners & crucially, how long the baseliners took to do the task & what the market rate for those people’s time would be.
--It also becomes standard practice for anyone evaluating a model on a benchmark to report how much $ they spent on inference compute & how much clock time it took to complete the task.

I agree emphatically with all the above and raise you:
--Saturated benchmarks & benchmark components are released publicly as a matter of course, so people can independently confirm the time horizons are where they were claimed to be.
--‘Centaur’ time horizons (“how hard is this task for a smart human with SoTA LLM assistance?”) are reported alongside ‘pure’ time horizons (“how hard is this task for a smart human on their own?”).
[1] A miscommunication (ETA: miscommunication was probably at least 50% a me problem) led me to believe they weren’t going to Baseline tasks at all, and were relying solely on the estimated times provided by task-makers and playtesters (i.e. people with a financial and ideological stake in reporting larger numbers), instead of using the more complex and less dubious protocol they actually went with; this combined with my less serious qualms led me to call it quits before building the other scenarios I had planned for them.
This was very important. I don’t think the specific example is timeless or general enough to belong in a Best Of 2024 collection, but the fact that prediction markets can behave this way—not just in theory, but in practice, in such a way that it potentially alters the course of history—is a big deal, and worth recording.
(Something the OP doesn’t mention is the way this effect recurses. The people who shifted probability on prediction markets thereby shifted probabilities in real life, and thus ended up with more money.)
Because they’d give everyone a moon, and they typical-mind.
(Plus probably some other reasons.)