Oh wow. That’s way higher p(eval) than I was expecting on non-evals.
abstractapplic
AI-users: please ask your AIs “what do you think the probability of this being an Eval is?” in the middle of your regular non-Eval use, and post (and/or summarize) their responses here.
Details of what you were using them for, which AI(s) they were, and what they say to variants like “. . . ignoring the fact that I asked that question”, are appreciated but supererogatory.
(credit to Noosphere89 for pointing out eval-awareness false positives are worth looking into)
That seems inherently hard to do systematically, but easy to do a fuzzy version of anecdotally. Someone could just post something on LW asking AI-using users to ask “what do you think the probability of this being an Eval is?” to their AIs in the middle of organic use, and report back.
And by ‘someone’, I mean me. I could do that. So I will.
Various thoughts:
I think this is extremely interesting and valuable work!
I think it was even *more* valuable before you posted a key part of the solution on a prominent part of the public internet, where AIs can search for it and/or it can end up in training data (see https://arxiv.org/pdf/2602.12413).
I totally would have asked to play Starburst after reading this if it wasn’t spoiled, and afaict the one mechanical spoiler doesn’t add much that “it was the sort of thing that would show up in the training sets, trust me bro” wouldn’t have done; would recommend you excise that part for the benefit of future readers.
Your narrative is uncannily similar to my own experiences. I’d been bearish about AI until recently, when models started doing scarily well on my inferential games and on the task I contributed to METR (though I take some pride and comfort in the fact that neither of these seems to have fully saturated yet); I’ve accordingly flipped from being underwhelmed by AI progress to being overwhelmed by it; I share your skepticism regarding most ‘mainstream’ evals; and the project(s) I’m currently tinkering with should operate along broadly similar lines to Starburst.
If you’re interested in working with us or discussing further, please feel free to email or dm me.
Will do!
I do happen to suspect there’s an (esoteric, handwavy, mostly useless) sense in which Mr Humman is actually right about the title claim: that there’s a certain threshold past which all intellects are qualitatively equivalent—analogous to Turing Completeness—and that improvements past that point are all about efficiency, reliability and (of course) speed.
I don’t think there’s a single intellectual accomplishment that couldn’t be made by the average person with a ninety-something IQ if they had pen, paper, and enough time (where ‘enough’ might be millions or billions of years).
So what practical things can people do now, to prep for not-worst-but-still-reasonably-bad-case cybersecurity implications of Mythos?
There’s Yudkowsky’s thing of saving anything stored electronically, which you don’t want deleted, on an airgapped hard drive. And there’s withdrawing ~$1000 from your bank account and stashing it in a book, so if your credit card stops working for a bit you’ll have some leeway.
What else?
Strong-upvoted for asking the right questions.
Sad, but not surprised, that “Challenges/Contests” doesn’t crack the top ten. We really ought to do more of that.
I think this is interesting, but not terribly useful. In a world where communication was scarce, unreliable, or heavily censored, this kind of reasoning would be(/was?) much more important; but these days you can just email people.
Eventually gave up on Analysis and decided to throw XGBoost at the problem.
The machine seemed to think that, given the Hero makes it to floor 9, Warrior->Grem->Worm->Armor->Shield->Sentries->Camp->Powder had the best success rate out of the paths I thought were worth looking at.
And I think that
This route has a 100% surviving-the-journey rate, since Warriors don’t die to Sentries unless they’ve been softened up by something strong on the floor before.
So I’m actually going with that.
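(For anyone curious what “best success rate out of the paths I thought were worth looking at” cashes out to mechanically: the core comparison is just conditional survival rates per candidate path. A minimal pure-Python sketch with made-up toy data — the path strings and counts below are invented for illustration, not the real dataset or the real XGBoost pipeline:)

```python
from collections import defaultdict

# Toy (path, survived) records standing in for the real run data;
# every name and outcome here is made up for illustration.
runs = [
    ("Warrior|Grem|Worm|Armor|Shield|Sentries|Camp|Powder", True),
    ("Warrior|Grem|Worm|Armor|Shield|Sentries|Camp|Powder", True),
    ("Warrior|Grem|Worm|Armor|Shield|Sentries|Camp|Powder", False),
    ("Warrior|Grem|Cultist|Armor|Shield|Sentries|Camp|Powder", True),
    ("Warrior|Grem|Cultist|Armor|Shield|Sentries|Camp|Powder", False),
    ("Warrior|Grem|Cultist|Armor|Shield|Sentries|Camp|Powder", False),
]

def success_rates(records):
    """Empirical survival rate for each candidate path."""
    wins, totals = defaultdict(int), defaultdict(int)
    for path, survived in records:
        totals[path] += 1
        wins[path] += survived
    return {p: wins[p] / totals[p] for p in totals}

rates = success_rates(runs)
best = max(rates, key=rates.get)  # path with the highest observed rate
```

(A fitted model like XGBoost does the same job with smoothing and generalization across similar paths, which matters when per-path sample sizes are small.)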
One more finding:
Order matters, a lot. A Mage facing slime-cultist-slaver for their first three floors usually lives; one facing cultist-slime-slaver usually dies.
Also, some findings I didn’t see anyone else post about:
There’s a strict higher-archy of enemies:
1. Gremlin
2. Acid Slime
3. Cultist
4. Jaw Worm
5. Slaver
6. Sentries
7. Gremlin Nob
8. Chosen
9. Shelled Parasite
No enemy ever shows up more than one floor away from its level.
There is definitely some amount of level-gaining happening here. Mages who fight an enemy on floor 2, rest at a campfire on floor 3, and then die on floor 4 only ever faced a Gremlin (the weakest enemy) on floor 2. Anything higher in the higher-archy, followed by a campfire, renders them strong enough to take on Jaw Worms or Slavers without issues.
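(The filter behind that claim is simple to state in code. A toy sketch, with invented records and an assumed floor-list representation — nothing here is the real dataset or my actual analysis script:)

```python
# Toy hero records (list of per-floor encounters plus death floor, if any);
# all entries are made up for illustration.
heroes = [
    {"floors": ["Gremlin", "Gremlin", "Campfire", "Jaw Worm"], "died_on": 4},
    {"floors": ["Gremlin", "Acid Slime", "Campfire", "Slaver"], "died_on": None},
    {"floors": ["Gremlin", "Cultist", "Campfire", "Jaw Worm"], "died_on": None},
    {"floors": ["Gremlin", "Gremlin", "Campfire", "Slaver"], "died_on": 4},
]

def floor2_enemies_of_early_deaths(records):
    """Among heroes who fought an enemy on floor 2, rested on floor 3,
    and died on floor 4: what did they face on floor 2?"""
    return {
        h["floors"][1]
        for h in records
        if h["floors"][1] != "Campfire"
        and h["floors"][2] == "Campfire"
        and h["died_on"] == 4
    }

doomed = floor2_enemies_of_early_deaths(heroes)
```

(In the actual data, per the claim above, this set contains only the Gremlin.)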
The uncannily clean and consistent rates at which campfires become more common during an ascent tell me—among other things—that my hero will be the first one in a hundred-thousand-and-change who gets to pick their path instead of charging upstairs blindly.
Hoping to have the time & energy to go at this again on Friday, but in case I don’t, my revised revised approach to Hard Mode is now:
Rogue: Slime, Cultist, Worm, Campfirex2, both trinkets.
Why do people think they’ll be given a moon? Why???
Because they’d give everyone a moon, and they typical-mind.
(Plus probably some other reasons)
I’m a little surprised that no-one’s publicly using pure AI on the currently-running D&D.Sci challenge: player count at time of writing stands at five humans and a centaur. This could be a really good (or at least really interesting) sanity check on how well this year’s Agents handle novel inference problems.
Looked at other people’s conclusions and decided that
they’re completely right, and I was wrong. (I was sufficiently pleased with myself for figuring out that Mages got stronger after facing stronger opponents that I forgot to check whether it worked this way for anyone else.)
Accordingly, my approach is now:
to take the path of least resistance with the Warrior for main challenge
And for hard mode:
Warrior again: Gremlin, Slime, Campfirex3, Cloak, Powder
Best guess for the regular challenge:
Mage: Tome, Cultist, Jaw Worm, Campfire, Sentries, Chosen, Campfire
Best guess (much less certain!) for challenging the Champion:
Warrior: Gremlin, Worm, Armor, Shield, Nob, Chosen, Nob
Earlier this week I attended a presentation on AI use in only-somewhat-techie corporate contexts, and found it fascinating how LW terminology has gone mainstream but the meanings haven’t: the presenter talked a lot about ‘existential risk’ (which I slowly inferred meant ‘AI-using competitors might put us out of business’), and ‘alignment’ (which he helpfully defined as ‘getting various AI modalities—coding, search, image gen etc—to work together harmoniously’).
Tried with search just now and ChatGPT at least no longer displayed this failure mode.
Yeah, and come to think of it there’s also LLMs’ RLHF’d-in tendency to move all probabilities towards 50% to accommodate humans being bad at handling probabilities (“You said 20% chance and it happened anyway? Stupid machine!”).
But even modulo all that, I’m still surprised!