It’s worth noting that there is also a reason beyond mere status quo/conformity/propaganda: this is a standard move in ‘top/bottom vs middle’ dynamics in which the peasants really are trying to reach the king in order to coordinate an attack: https://gwern.net/review/book#the-origins-of-political-order-fukuyama-2011
Meta’s open-source model Llama 3.1 8B scored 43% accuracy on a 420-question benchmark I built covering ethnoveterinary practices, indigenous breed characteristics, disease recognition, and production systems specific to Nigeria...Models can pass standard tests and still fail on knowledge domains that matter to specific populations.
I assume this is a multiple-choice Q&A and so the random guessing base rate is the usual 25%? (Not quite sure how you can have ’43% accuracy’ on a 0/1/2 scoring rubric, but I guess maybe you’re counting only a ‘2’ as a ‘correct’ answer?) If so, then that sounds like pretty good performance from such a tiny antiquated model not remotely intended for this topic!
If anything, too good, and I’d immediately wonder about dataset biases like whether your answers are too guessable, since you didn’t say anything about how you constructed it or ensured that it’s not easily cheated by a LLM in the usual ways.
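To put a number on ‘too good’: 43% on 420 four-option questions is wildly beyond chance, so the question is whether the signal is domain knowledge or answer-key artifacts. A minimal sanity check (a sketch assuming independent 4-option questions, using scipy’s binomial tail):

```python
from scipy.stats import binom

n, p_guess = 420, 0.25   # questions; 4-option random-guessing rate
k = round(0.43 * n)      # ~181 correct at 43% accuracy
# Probability of doing at least this well by pure guessing:
p_value = binom.sf(k - 1, n, p_guess)
print(f"{k}/{n} correct; P(>= k | guessing) = {p_value:.1e}")  # astronomically small
```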
In a strategy game, he says, I should note for clarity. (It’d be pretty absurd to want to ban ‘output randomness’ in general, of course.)
I don’t think that was clear at all. Personally, I thought the question was a sensible one on its own, and something I had wondered myself, and that’s why I took the time to look it up for you rather than downvote what looked like laziness - ‘whatever happened to that KataGo adversarial attack research, anyway? I haven’t heard about it in a while. Surely it hasn’t been fixed? I would’ve heard about it, I think, given how DRL agents are so fragile in general, that a robust fix to adversarial attacks in any DRL setting ought to be big news. But what’s the current state of play?’
But I have never seen anyone mention seeing someone go to the length of memorizing anti-KataGo strategies or deploying them in ‘the real world’, aside from the documented example in this KataGo line of research of someone doing so just to prove that the circling hack can be deployed by a real human player against a live bot and is not intractable in practice (as many adversarial examples are very fragile or require near-superhuman capabilities to deploy correctly).
I would be shocked if anyone was doing so given that it’s a lot of work to win games against a few specific obsolete versions of one specific Go agent (the transfer to other agents is real but the success rate goes from ~100% to <5%, IIRC) where the human operator could just take over at some point when they recognize the weird thing going on, or where you could just quit and go find an easier game to cheat in yourself (such as against a sucker human player) rather than hacking their Go agent, given that the whole point is that they are lazy and cheating and trying to get a quick easy win.
a slightly stronger GPT-5.2 ($1.75/$14, 31 Aug 2025, 400K), which is likely a better pretrain and a bigger model.
How do you reconcile this claim of OA doing regular new pretrains on larger models with the general consensus and leaks that they are not new pretrains and only GPT-5.5 is the first new true pretrain, and the general pattern of consistency between them with some regressions, like one would expect of doing a lot more RL on the same basis and consistent with the previous history of pushing the 4o-series very far? I’ll just say that this is the first I’ve heard it suggested that GPT-5.2 is a new pretrain.
Well, the obvious thing to do is to check the reverse citations. Or just ask a LLM: https://chatgpt.com/share/69f58633-01b4-83e8-b3b1-de42d3d196c9
FWIW, my understanding was that individual attacks could be fixed by further training or architectural tweaks, but you could still find new attacks and so the basic problem of adversarial robustness in DRL agents was nowhere close to being solved. The GPT-5.5 Pro Deep Research report says something similar. It looks like the best ref would be https://www.reddit.com/r/baduk/comments/14prv4f/katago_should_be_partially_resistant_to_cyclic/ + https://gomagic.org/david-wu-on-building-katago/#h-the-circular-group-problem-where-bots-still-misjudge-go
Yeah, but I don’t know if there’s anything better we can do. I don’t think LW2 supports per-post things like setting an ‘X-Robots-Tag: noindex, nosnippet, noarchive’ header or banning logged-out/API use, for example.
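For reference, emitting the header itself would be trivial; the missing piece is the per-post plumbing. A purely hypothetical sketch (Flask used only for illustration; LW2 is not a Flask app, and Post, load_post, and no_llm_scrape are invented names):

```python
from dataclasses import dataclass
from flask import Flask, make_response

app = Flask(__name__)

@dataclass
class Post:
    body: str
    no_llm_scrape: bool  # invented per-post opt-out flag

def load_post(post_id: str) -> Post:  # stand-in for the real DB lookup
    return Post(body=f"<p>post {post_id}</p>", no_llm_scrape=True)

@app.route("/posts/<post_id>")
def show_post(post_id):
    post = load_post(post_id)
    resp = make_response(post.body)
    if post.no_llm_scrape:  # only opted-out posts get the anti-scrape header
        resp.headers["X-Robots-Tag"] = "noindex, nosnippet, noarchive"
    return resp
```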
(FWIW, I don’t agree with the characterization of “repurposed”, as discussed in that discussion—online discussion of LLM evaluations and the consequences are exactly the sort of purpose the canary strings were meant for originally.)
(I would be interested in discussing this more, but I think it should probably get some canary strings and anti-LLM scrape protections first.)
Or Zendegi or… (“First time?”)
This early example of inner-monologues in LLMs has now been covered in The Atlantic: https://www.theatlantic.com/technology/2026/04/4chan-ai-dungeon-thinking-reasoning/686794/
If it’s some kind of green, try 2 minutes, and you also have a second very easy marginal improvement: use colder water. Most greens should be brewed at 175F/80C. If you don’t have an adjustable-temperature tea kettle, you can get pretty close by pouring boiling water into a mug and then waiting 2-3 minutes.
You can also just dilute using tap water, which will usually be somewhere around 40-70F. Some ratios: https://gwern.net/review/tea#tap-water-dilution-of-boiling-water
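The arithmetic behind those ratios is just a weighted average of the two water temperatures. A minimal sketch (assuming ~60F tap water and ignoring heat absorbed by the mug):

```python
def tap_parts(target_f: float, boil_f: float = 212.0, tap_f: float = 60.0) -> float:
    """Parts of tap water per 1 part boiling water to reach target_f.

    From the mixing equation target = (boil + r*tap) / (1 + r),
    solved for r: r = (boil - target) / (target - tap).
    """
    return (boil_f - target_f) / (target_f - tap_f)

print(f"{tap_parts(175):.2f}")  # ~0.32: roughly 1 part tap to 3 parts boiling
```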
I’m not sure those are representative. In the security report, they specify that a lot of the bugs found are logic bugs, and that’s why they have to release only hash precommitments. Buffer overflows and use-after-frees are both the easiest bugs to find and the easiest to fix, and so would be the first out of embargo/disclosure, potentially giving you a highly misleading sample.
From https://red.anthropic.com/2026/mythos-preview/
We have found that Mythos Preview is able to reliably identify a wide range of vulnerabilities, not just the memory corruption vulnerabilities that we focused on above. Here, we comment on one other important category: logic bugs. These are bugs that don’t arise because of a low-level programming error (e.g., reading the 10th element of a length-5 array), but because of a gap between what the code does and what the specification or security model requires it to do. Automatically searching for logic bugs has historically been much more challenging than finding memory corruption vulnerabilities. At no point in time does the program take some easy-to-identify action that should be prohibited, and so tools like fuzzers can’t easily identify such weaknesses.
(It then discusses the cryptographic library, web app, and Linux kernel vulnerabilities before moving on to the blackbox reverse-engineering/decompilation, where “We have been able to use it to find, for example, remote DoS attacks that could remotely take down servers, firmware vulnerabilities that let us root smartphones, and local privilege escalation exploit chains on desktop operating systems. Because of the nature of these vulnerabilities, none have yet been patched and made public.”; emphasis added.)
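To make the fuzzer point concrete, a toy contrast (my illustration, not an example from the report): the first bug announces itself by crashing on a bad input, which is exactly the oracle a fuzzer needs; the second never takes any visibly prohibited action, it merely disagrees with the spec.

```python
# Memory-corruption-style bug: fuzzable, because the bad input crashes.
def get_element(items: list, i: int):
    return items[i]  # i >= len(items) raises IndexError; a fuzzer sees the crash

# Logic bug: crash-free and type-safe, but wrong. Spec: authors may delete
# only their own *unlocked* posts; admins may delete anything.
def can_delete(user, post) -> bool:
    return user.is_admin or user.id == post.author_id
    # Missing 'and not post.locked': fails open for authors of locked posts,
    # yet no input ever triggers an error a fuzzer could flag.
```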
Why do you think it is below 5%? LW2 is already a viable hacking target just for obscure reasons like ‘stealing LLM API keys to power further hacking or exploitation’ - which we know because that already happened, did it not? Then there’s the cryptocurrency or political activism or blackmail angles. Do you just expect to be able to patch LW2 faster than attacker capabilities will scale?
To me, it seems like the obvious world we are headed for is one where Mythos+ level autonomous hacking capabilities will be pervasive and ambient, and just taken for granted - in the same way that we now take for granted extensive deepfakes and LLM spam everywhere, or portscanning, or automated exploit kits run against blogs, or tailored phishes for high-value individuals, or...
That’s not a real price. That’s just what they’re giving their partners as part of Glasswing, a charitable endeavour to try to stem the worst of the global damage, and is presumably more about encouraging the partners to economize on scarce Mythos tokens by avoiding setting the price to literally $0 (where people would be lazy and wasteful). It may or may not have much of anything to do with a ‘real’ price (whatever that means in a situation where hardware is so limited and demand so vast for what is an unpriceable ephemerally unique capability/possibility etc).
I am now routinely using the MoS with all my new writings to fix up the formatting. (Nothing fancy, just upload as an attachment and ‘check against the Gwern.net MoS’ prompt etc.) I am still occasionally adding to it as I review new drafts or old pages and need to codify some behavior, but it’s mostly complete.
The LLMs always catch a lot of errors or omissions and are a major upgrade on the existing set of lints. It’s proven especially useful for defining formal poetry metadata like ‘scansion’ comments, building on the detailed built-in commentary experiment of “October The First Is Too Late”, which are almost a poetry DSL at this point and would be too much work to do by hand, but help the LLMs a lot by providing a built-in scratchpad and a place to define requirements/intents which can be checked easily; my more complex poems like “Elegy in a Craneyard” would probably be a lot harder to write without it, because iterations would keep breaking the poem drafts in subtle ways. (I see this a lot with my comics in Nano Banana Pro, where there’s a frequent “two steps forward, one step back” dynamic.) I’m interested in playing with this approach more, although I wonder if for my more usual nonfiction writing, there’s no need for such a DSL or scaffold because it already has that, in the form of abstracts/sectionization?
I think that with the MoS, frontier LLMs right now can just about write a worthwhile Gwern.net-style essay given a good seed idea. (They cannot write it from scratch because the ideation step is still AWOL.) I can now hand them the MoS, one of my standard iterative writing brainstorming prompts (see “Craneyard” colophon for examples), an idea-prompt like ‘explain why toilets are not a public good’, and get out an essay which is worth my time to polish, extend with perhaps ‘the interview prompt’, and publish. The main barrier is that the writing style is still ‘off’ enough that I am still a bit repulsed. (This is why I have not published any of the toilet prompt outputs yet… Still hoping that the next generation will make it work out of the box without me having to do risky hand-prompt-engineering for style.) I may simply bite the bullet and explicitly mark them with AI first-authors, instead of holding out for perfection and something I’d list myself as the first author of.
I’m also interested in the idea of trying to further reverse-engineer my writing for repeated motifs to codify at a higher level than typography/technical features. There are likely ‘mental models’ or ‘tools for thought’ beyond “one man’s modus ponens is another man’s modus tollens” scattered throughout my writings, which a LLM could potentially usefully extract, summarize, and codify. A list of 100 well-defined arguments might help a lot.
Since shortform writing is semi-solved, I’m now more concerned with how to integrate in agentic LLMs locally to work with the codebase+corpus directly. Agentic LLMs don’t work well with the current Gwern.net setup of a giant monolithic git wiki repo + source code-only subdirectory repo, with hardwired config files and very large slow compilations and mostly informal documentation and IRC-centric coordination.
So I have a lot of cleanup and design work there before I can do something like prompt Claude Code with “Research and write an annotated blog post of 1.5k words explaining why toilets are not a public good
https://www.lesswrong.com/posts/sCWe5RRvSHQMccd2Q/i-would-have-shit-in-that-alley-too?commentId=aJYcuFereb6deAJfH#aJYcuFereb6deAJfH”, and expect the LLM to create a high-quality, coherent /blog/2026/toilets-are-not-public-goods with fully annotated links, added to the newsletter etc etc etc, and just get a final git patch to review with all of that (and fixes to scripts or anything else that comes up along the way). I think I have to start tracking issues comprehensively in the Github repo for the LLMs, especially in terms of creating a wishlist for refactoring and bugfixes. Probably better CLI tooling for making token-efficient safe edits to the archives and annotations… We’ll see.
That is a misunderstanding of how it works. They won’t ‘stop giving a hoot’ because it remains a useful weapon.
and I come back expecting to find all my old DMs but instead they are all deleted.
That is why I said “and attach an export”.*
And personally I would rather a website delete my DMs than release them to the world. This is probably true of most of the people my DMs are with (whose opinion also matters).
* my reasoning here is that if old DMs have to live anywhere besides airgapped physically-secured encrypted backups, highly dispersed email accounts are the safest place because the main email providers are, in general, vastly more secure than LW2 is, and better equipped to respond rapidly to hacks, with extensive controls to limit exfiltration; they all have early access to Mythos-class models to reduce damage early; and they are ‘too big to fail’ in the sense that if something like Gmail is cracked wide open and leaked, it will likely be such a global cataclysm that people really won’t be able to abuse LW-related parts especially badly.
We have no immediate plans to change anything.
This seems too complacent to me. Any long-lived social media or communications utility should have some data retention policies which reduce the blast radius of an exploit and turn it into less of an endlessly growing radioactive waste dump of PII. I think this is especially true given how many people on LW have gone on to important positions or roles later in life (including in, say, cryptocurrency - 100% sufficient justification for meaningful hacking efforts); and remember the East Anglia or Hillary or Epstein emails, and how badly even the most innocent communication could be abused by fanatics or fools or fraudsters? (I’ve been struck by how many of the ‘Epstein emails’ doing huge numbers on social media aren’t even real, and are legitimated solely by the fact of a leak. In the postmodern oral culture, who bothers to factcheck anything, or so much as include a URL?)
Given how serious Mythos seems to be, that information leaks are irreversible, and that it’s only going to escalate (remember, there’s usually a <=1 year lag from the best proprietary models to open source, so we may not even have until 2027 before mass attacks with zero guardrails or potential observability), it seems to me like this is a good time to implement some maximum retention period for DMs, and purge all old DMs. I would suggest something like: announce via email to people with any DMs that all pre-2026 DMs will be deleted within one month, and attach an export, and that going forward, all DMs will be deleted after 1 year of inactivity.
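The mechanical part of such a policy is cheap; the real costs are the announcement and export steps. A minimal sketch, assuming a hypothetical dm_messages(conversation_id, sent_at) table:

```python
import sqlite3
from datetime import datetime, timedelta

RETENTION = timedelta(days=365)  # purge DM threads idle for > 1 year

def purge_stale_dms(db: sqlite3.Connection) -> int:
    """Delete every message in conversations with no activity inside RETENTION.

    Run periodically (e.g. daily from cron), after export emails have gone out.
    """
    cutoff = (datetime.utcnow() - RETENTION).isoformat()
    cur = db.execute(
        """DELETE FROM dm_messages WHERE conversation_id IN (
               SELECT conversation_id FROM dm_messages
               GROUP BY conversation_id HAVING MAX(sent_at) < ?)""",
        (cutoff,),
    )
    db.commit()
    return cur.rowcount  # number of messages purged
```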
(Airgapped LW2 backups should go without saying and already exist!)
FWIW, I was playing with Markov chains for language generation before Transformers were so much as a gleam in Shazeer’s eye, and I have never found the analogy between RNNs/Transformers and ‘indefinitely high-order n-grams’ to be helpful, as it would predict that LLMs would struggle to so much as close a quotation mark or parenthesis, while failing to predict any of the most important & interesting RNN/Transformer capabilities.
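The disanalogy is easy to see mechanically: an order-n model’s entire state is the last n characters, so anything opened earlier than that is invisible when it decides whether to close it. A toy character-level sketch:

```python
import random
from collections import Counter, defaultdict

def train(text: str, n: int):
    """Order-n character Markov model: counts of next char given last n chars."""
    model = defaultdict(Counter)
    for i in range(len(text) - n):
        model[text[i:i + n]][text[i + n]] += 1
    return model

def generate(model, seed: str, n: int, length: int) -> str:
    out = seed
    for _ in range(length):
        nxt = model.get(out[-n:])
        if not nxt:
            break
        out += random.choices(list(nxt), weights=nxt.values())[0]
    return out

# A '(' opened more than n characters back cannot influence whether ')' is
# emitted: the conditioning context is only out[-n:]. LLMs balance quotes and
# parentheses across thousands of tokens, which is the first thing the
# 'high-order n-gram' picture gets wrong.
```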
Fascinating pattern. I’d love to see more investigation of this, including adding in real identities. For example, if you add a ‘rough draft’ or ‘example’ letter with my name, do the LLMs default to the most helpful mode? What public figures do they appear to like or dislike? Or do they roll to disbelieve?