gwern
I was trying out a hierarchical approach when I stopped, because I wasn’t sure if I could trust a LLM to rewrite a whole input without dropping any characters or making unintended rewrites. Aside from being theoretically more scalable, and potentially better because it makes each step easier and propagates the sorting top-down, explicitly turning it into a tree lets you easily check that you get back an exact permutation of the list each time, and thus that the rewrite was safe. I think that might be unnecessary at this point, given the steady improvement in prompt adherence, so maybe the task is now trivial.
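To illustrate the safety check, a minimal sketch; `llm()` here is a stand-in for whatever completion API you use, not any particular library:

```python
from collections import Counter

def llm(prompt: str) -> str:
    """Stand-in for an actual LLM API call; returns plain text."""
    raise NotImplementedError

def safe_seriate(items: list[str]) -> list[str]:
    """Ask the LLM to reorder the lines meaningfully, then verify the output
    is an exact permutation of the input (nothing dropped, added, or silently
    rewritten) before accepting it."""
    prompt = ("Sort the following lines into the most meaningful order. "
              "Return every line exactly once, verbatim, one per line:\n\n"
              + "\n".join(items))
    result = [line for line in llm(prompt).splitlines() if line.strip()]
    if Counter(result) != Counter(items):
        raise ValueError("Not an exact permutation; rejecting the rewrite as unsafe.")
    return result
```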
There are no explicit distances calculated: just asking the LLM to sort the list meaningfully.
Very funny, but the OA embeddings were always bad at sentence embedding, specifically, compared to other NN sentence-specialized embeddings; and as the original OA embedding paper somewhat defensively argues, it’s not even clear a priori what a sentence embedding should do because a sentence is such a cut-down piece of text, and doing well at a sentence embedding task may only be overfitting or come at the cost of performance on more meaningful text embedding tasks. (Similar to word embeddings: words are so polysemous or context-dependent that it seems like word embeddings have to have substantial limits—which is part of the motivation for Transformers in the first place, after all...)
That’s why I was experimenting with prompting a LLM to do seriation rewrites (instead of just splitting on punctuation to reuse my existing greedy-pairwise approach, and being done with it). A prompted LLM takes full context and purpose into consideration, and avoids the issues with bad embeddings on very small text. So the seriation outputs aren’t crazily random, but sensible.
(Which makes sense, because if you ask a LLM to sort a list of items in a freeform normal way, like a chat session, they are capable of it; in my poetry selection the other day, “Bell, Crow, Moon: 11 Variations”, I had Claude/Gemini/GPT suggest how exactly to sort the 11 poems we curated into a pleasing sequence, and they did come up with a much nicer poetry sequence than the original random one. And why wouldn’t they be able to do that, when they were good enough to write most of the poems in the first place?)
Yeah, it’s limited by what kind of structure you have. It sounds like it did seriate your list successfully; it’s just that you have a lot of structure in the list that you don’t care about, so no embedding is going to prioritize the other stuff, and the distances aren’t useful to you in general. This will hurt any embedding-related use-case, not just seriation—presumably your k-NN lookups aren’t terribly useful either, and they mostly just pull up hits which have superficial syntactic similarities.
This is probably less of a problem with my annotations because I reformat them before embedding and add in all available metadata (not just the tags or the titles of links in it as a link-bibliography, but also tricks like including the titles of reverse-citations of it, so the more an annotation gets linked, the more the embedding of it reflects its usage), so the formatting is uniform (nothing like “half of them start with ‘what is X’ and half don’t”) and there’s a lot of very semantic information.
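Schematically, that reformatting amounts to something like the following sketch; the field names and structure are made up for illustration, not my actual pipeline:

```python
def format_annotation_for_embedding(annotation: dict) -> str:
    """Build a uniform text blob that folds in all available metadata, so the
    embedding captures semantics and usage rather than surface formatting."""
    parts = [
        annotation.get("title", ""),
        annotation.get("abstract", ""),
        "Tags: " + ", ".join(annotation.get("tags", [])),
        # titles of everything this annotation links to (its link-bibliography)
        "Links to: " + "; ".join(annotation.get("outbound_link_titles", [])),
        # titles of annotations which link *to* this one, so the more it gets
        # linked, the more its embedding reflects how it is actually used
        "Cited by: " + "; ".join(annotation.get("reverse_citation_titles", [])),
    ]
    # drop empty fields so every annotation comes out uniformly formatted
    return "\n".join(p for p in parts if p and not p.endswith(": "))
```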
Security Mindset: Hacking Pinball High Scores
As I’ve said before, I think you greatly overrate the difficulty of putting search into neural nets, and this is an example of it. It seems to me like it is entirely possible to make a generic LLM implement an equivalent to AlphaZero and be capable of expert iteration, without an elaborate tree scaffolding. A tree search is just another algorithm which can be reified as a sequence, like all algorithms (because they are implemented on a computer).
All AlphaZero is, is a way of doing policy iteration/Newton updates by running a game state forward for a few plies, evaluating, and updating estimates. It’s not magic, and can obviously be encoded into a LLM’s generative process.
Here’s a concrete example of how in-principle I think a LLM can do AlphaZero-style expert iteration for Go: A LLM can serialize a board with value estimates as simply a few hundred tokens (361 points, 361 value estimates, miscellaneous metadata); this means in a frontier LLM like Claude-4-opus with 200k ctx, you can easily fit in 200 board states; so you can serialize out the lookahead of a bunch of possible moves and resulting board states (eg. take the top 14 moves and imagine the resulting board state and then imagine their next 14 top moves; for comparison, TD-Gammon looked forward only ~1 move); and can back-propagate an updated value estimate, and spit out the original board state with better value estimates. “Move #4 was better than it looked, so I will +0.01 to the value estimate for it.” This improved board is now in context, and can be dynamically-evaluated to update the LLM: now it has to predict the new board state with the final improved estimates, and that improves the policy. The LLM finishes by setting up the next planning step: pick a deeper board state to evaluate next, and if the next board state is the end of the game, then it starts over with a fresh game. Run this indefinitely.
It repeatedly iterates through a possible game, evaluating each position to a certain depth, updating its weights to incorporate the policy improvement from the evaluation, and restarting with a fresh game. All serialized out as a long array/sequence, the tree just being implicitly represented by successive board states. (And then now that you have that in mind, you can imagine how to do things like deep rollouts: 200 moves is around a normal game of Go, so random rollouts are doable from most board states, and the LLM can just toggle between a shallow tree search and deep randomized rollouts if necessary eg by adding a 0⁄1 token prefix.)
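To spell the loop out, a schematic sketch (not a working Go engine; the `llm`, `update_weights`, and board-serialization details are all stand-ins):

```python
def expert_iteration_step(llm, board: str) -> str:
    """One serialized policy-improvement step: expand the top moves a ply or
    two in-context, back up what the lookahead revealed, and emit the same
    position with improved value estimates."""
    prompt = f"Position and current value estimates:\n{board}\n"
    prompt += "List your top 14 candidate moves, the resulting positions, and their top replies.\n"
    lookahead = llm(prompt)
    # e.g. "Move #4 was better than it looked, so +0.01 to its value estimate."
    return llm(prompt + lookahead +
               "\nRestate the original position with updated value estimates.\n")

def run_self_play(llm, update_weights, initial_board: str, steps: int) -> None:
    """Evaluate, improve, train on the improvement, then pick the next position;
    the 'tree' exists only implicitly as the sequence of successive boards."""
    board = initial_board
    for _ in range(steps):
        improved = expert_iteration_step(llm, board)
        update_weights(improved)  # dynamic evaluation: predict the improved estimates
        board = llm(improved + "\nChoose a deeper position to evaluate next, "
                               "or start a fresh game if this one has ended.\n")
```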
At no point do you need explicit tree scaffolding as you bootstrap from a LLM clueless about playing Go to the high performance that we know LLMs trained by imitation learning on board states/values/policies can reach, and at no point have I invoked a cognitive operation which is not easier than a lot of things we see LLMs do routinely, or where it’s implausible that they could do it. It is probably a lot less efficient and has other practical issues like how you integrate the rules of Go akin to AlphaZero/MuZero, etc, but in principle I think this algorithm is well-defined, concrete, and would work.
If you’re not sure how to sort a list or grid—seriate it!
My earlier commentary on what I think note-taking tools tend to get wrong: https://gwern.net/blog/2024/tools-for-thought-failure
Here is another way to defend yourself against bot problems:
Turned out to be fake, BTW. His friend just pranked him.
For text, you might realize that different parts of the text refer to each other, so you need a way to effectively pass information around, and hence you end up with something like the attention mechanism.
If you are trying to convince yourself that a Transformer could work and to make it ‘obvious’ to yourself that you can model sequences usefully that way, it might be a better starting point to begin with Bengio’s simple 2003 LM and MLP-Mixer. Then Transformers may just look like a fancier MLP which happens to implement a complicated way of doing token-mixing inspired by RNNs and heavily tweaked empirically to eke out a bit more performance with various add-ons and doodads.
(AFAIK, no one has written a “You Could Have Invented Transformers”, going from n-grams to Bengio’s LM to MLP-Mixer to RNN to Set Transformer to Vaswani Transformer to a contemporary Transformer, but I think it is doable and useful.)
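As one concrete stepping stone on that path, the token-mixing that makes MLP-Mixer a useful waypoint is just an MLP applied across positions instead of channels. A rough NumPy sketch (layer norms omitted; this is not any paper’s exact architecture):

```python
import numpy as np

def mlp(x, w1, w2):
    """Two-layer MLP; tanh used here just for brevity."""
    return np.tanh(x @ w1) @ w2

def mixer_block(tokens, token_w1, token_w2, channel_w1, channel_w2):
    """One MLP-Mixer block on tokens of shape (seq_len, d_model).
    First mix information *across* token positions (the job attention does in
    a Transformer, but with fixed rather than input-dependent weights),
    then mix across channels (the ordinary per-position feed-forward layer)."""
    tokens = tokens + mlp(tokens.T, token_w1, token_w2).T   # token mixing
    tokens = tokens + mlp(tokens, channel_w1, channel_w2)   # channel mixing
    return tokens

# toy usage with random weights, just to show the shapes work out
rng = np.random.default_rng(0)
seq, d, h = 16, 32, 64
x = rng.normal(size=(seq, d))
out = mixer_block(x,
                  rng.normal(size=(seq, h)), rng.normal(size=(h, seq)),
                  rng.normal(size=(d, h)),   rng.normal(size=(h, d)))
```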
Or just clipped out. It takes 2 seconds to clip it out and you’re done. Or you just fast forward, assuming you saw the intro at all and didn’t simply skip the first few minutes. Especially as ‘incest’ becomes universal and viewers just roll their eyes and ignore it. This is something that is not true of all fetishes: there is generally no way to take furry porn, for example, and strategically clip out a few pixels or frames and make it non-furry. You can’t easily take a video of an Asian porn star and make them white or black. And so on and so forth.
But if a metric is trivially gameable, surely that makes it sus and less impressive, even if someone is not trivially, or even at all, gaming it.
Why would you think that? Surely the reason that a metric being gameable matters is if… someone is or might be gaming it?
Plenty of metrics are gameable in theory, but are still important and valid given that you usually can tell if they are being gamed. Apply this to any of the countless measurements you take for granted. Someone comes to you and says ‘by dint of diet, hard work (and a bit of semaglutide), my bathroom scale says I’ve lost 50 pounds over the past year’. Do you say ‘do you realize how trivially gameable that metric is? how utterly sus and unimpressive? You could have just been holding something the first time, or taken a foot off the scale the second time. Nothing would be easier than to fake this. Does this bathroom scale even exist in the first place?’ Or, ‘my thermometer says I’m running a fever of 105F, I am dying, take me to the hospital right now’ - ‘you gullible fool, do you have any idea how easy that is to manipulate by dunking it in a mug of tea or something? sus. Get me some real evidence before I waste all that time driving you to the ER.’
Good calibration is impressive and an interesting property because many prediction sources manage to not clear even that minimal bar (almost every human who has not undergone extensive calibration training, for example, regardless of how much domain expertise they have).
Further, you say one shouldn’t be impressed by those sources because they could be flipping a coin, but then you refuse to give any examples of ‘impressive’ sources which are doing just the coin-flip thing or an iota of evidence for this bold claim, or to say what they are unimpressive compared to.
I think I would have predicted that Tesla self-driving would be the slowest
For graphs like these, it obviously isn’t important how the worst or mediocre competitors are doing, but the best one. It doesn’t matter who’s #5. Tesla self-driving is a longstanding, notorious failure. (And apparently is continuing to be a failure, as they continue to walk back the much-touted Cybertaxi launch, which keeps shrinking like a snowman in hell, now down to a few invited users in a heavily-mapped area with teleop.)
I’d be much more interested in Waymo numbers, as that is closer to SOTA, and they have been ramping up miles & cities.
The trends reflect the increasingly intense tastes of the highest spending, most engaged consumers.
https://logicmag.io/play/my-stepdad’s-huge-data-set/
While a lot of people (most likely you and everyone you know) are consumers of internet porn (i.e., they watch it but don’t pay for it), a tiny fraction of those people are customers. Customers pay for porn, typically by clicking an ad on a tube site, going to a specific content site (often owned by MindGeek), and entering their credit card information.
This “consumer” vs. “customer” division is key to understanding the use of data to perpetuate categories that seem peculiar to many people both inside and outside the industry. “We started partitioning this idea of consumers and customers a few years ago,” Adam Grayson, CFO of the legacy studio Evil Angel, told AVN. “It used to be a perfect one-to-one in our business, right? If somebody consumed your stuff, they paid for it. But now it’s probably 10,000 to one, or something.”
There’s an analogy to be made with US politics: political analysts refer to “what the people want,” when in fact a fraction of “the people” are registered voters, and of those, only a percentage show up and vote. Candidates often try to cater to that subset of “likely voters”— regardless of what the majority of the people want. In porn, it’s similar. You have the people (the consumers), the registered voters (the customers), and the actual people who vote (the customers who result in a conversion—a specific payment for a website subscription, a movie, or a scene). Porn companies, when trying to figure out what people want, focus on the customers who convert. It’s their tastes that set the tone for professionally produced content and the industry as a whole.
By 2018, we are now over a decade into the tube era. That means that most LA-area studios are getting their marching orders from out-of-town business people armed with up-to-the-minute customer data. Porn performers tend to roll their eyes at some of these orders, but they don’t have much choice. I have been on sets where performers crack up at some of the messages that are coming “from above,” particularly concerning a repetitive obsession with scenes of “family roleplay” (incest-themed material that uses words like “stepmother,” “stepfather,” and “stepdaughter”) or what the industry calls “IR” (which stands for “interracial” and invariably means a larger, dark-skinned black man and a smaller light-skinned white woman, playing up supposed taboos via dialogue and scenarios).
These particular “taboo” genres have existed since the early days of commercial American porn. For instance, see the stellar performance by black actor Johnnie Keyes as Marilyn Chambers’ orgy partner in 1972’s cinematic Behind the Green Door, or the VHS-era incest-focused sensation Taboo from 1980. But backed by online data of paid customers seemingly obsessed with these topics, the twenty-first-century porn industry—which this year, to much fanfare, was for the first time legally allowed to film performers born in this millennium—has seen a spike in titles devoted to these (frankly old-fashioned) fantasies.
Most performers take any jobs their agents send them out for. The competition is fierce—the ever-replenishing supply of wannabe performers far outweighs the demand for roles—and they don’t want to be seen as “difficult” (particularly the women). Most of the time, the actors don’t see the scripts or know any specific details until they get to set. To the actors rolling their eyes at yet another prompt to declaim, “But you’re my stepdad!” or, “Show me your big black dick,” the directors shrug, point at the emailed instructions and say, “That’s what they want…”
So my interpretation here is that it’s not that there was suddenly a huge spike in people discovering they love incest in 2017 where they were clueless in 2016, or that they were all brainwashed to no longer enjoy vanilla that year; it’s that that is when the hidden oligopoly turned on various analytics and started deliberately targeting those fetishes as a fleet-wide business decision. And this was because they had so thoroughly commodified regular porn, down to a price point of $0, that the only paying customers left are the ones with extreme fetishes who cannot be supplied by regular amateur or pro supply.
They may or may not have increased in absolute number compared to pre-2017, but it doesn’t matter, because everyone else vanished, and their relative importance skyrocketed: “If somebody consumed your stuff, they paid for it. But now it’s probably 10,000 to one, or something.”
(For younger readers who may be confused by how a ratio like 10000:1 is even hypothetically possible because ‘where did that 10k come from when no one pays for porn?’, it’s worth recalling that renting porn videos used to be a big business patronized by a lot of men; it kept many non-Blockbuster video rental stores afloat, and it was an ordinary thing for your local store to have a ‘back room’ that the kiddies were strictly forbidden from, and while it would certainly stock a lot of fetish stuff like interracial porn, it also rented out tons of normal stuff. If you have no idea what this was like, you may enjoy reading “True Porn Clerk Stories”, Ali Davis 2002.)
I think there is a similar effect with foot fetishes & furries: they are oddly well-heeled and pay a ton of money for new stuff, because they are under-supplied and demand new material. There is not much ‘organic’ supply of women photographing their feet in various lascivious ways; it’s not that it’s hard, they just don’t do it, but they can be incentivized to. (I recall reading an article on WikiFeet where IIRC they interviewed a contributor who said he got some photos by simply politely emailing or DMing the woman to ask her to take some foot photos, and she would oblige. “send foots kthnxbai” apparently works. And it’s probably fairly easy to pay for or commission feet images/videos: almost everyone has two feet already, and you can work feet into regular porn easily by simply choosing different angles or postures, and a closeup of a foot won’t turn off regular porn consumers either, so you can have your cake & eat it too. Similarly for incest: saying “But you’re my stepdad!” is cheap and easy, and anyone can do it if the Powers That Be tell them to in case a few ‘customers’ will pay actual $$$ for it, while those ‘consumers’ not into that plot roll their eyes and ignore it as so much silly ‘porn movie plot’ framing as they get on with business.)
I think aside from the general implausibility of the effect sizes and of the claimed AI tech (GANs?) delivering those effect sizes across so many areas of materials science, one of the odder claims which people highlighted at the time was that supposedly the best users got a lot more productivity enhancement than the worst ones. This is pretty unusual: usually low performers get a lot more out of AI assistance, for obvious reasons. And this lines up with what I see anecdotally for LLMs: until very recently, possibly, they were just a lot more useful for people not very good at writing or other stuff than for people like me who are.
I appreciate everyone’s comments here, they were very helpful. I’ve heavily revised the story to fix the issues with it, and hopefully it will be more satisfactory now.
I agree at this point: it is not per-user finetuning. The personalization has been prodded heavily, and it seems to boil down to a standard RAG interface plus a slightly interesting ‘summarization’ approach to try to describe the user in text (as opposed to a ‘user embedding’ or something else). I have not seen any signs of either lightweight or full finetuning, and several observations strongly cut against it: for example, users describe a ‘discrete’ behavior where the current GPT either knows something from another session, or it doesn’t, but it is never ‘in between’, and it only seems to draw on a few other sessions at any time; this points to RAG as the workhorse (the relevant other snippet either got retrieved or it didn’t), rather than any kind of finetuning where you would expect ‘fuzzy’ recall and signs of information leaking in from all recent sessions.
Perhaps for that reason, it has not made a big impact (at least once people got over the narcissistic rush of asking GPT for its summary of them, whether flatteringly sycophantic or not). It presumably is quietly helping behind the scenes, but I haven’t noticed any clear big benefits to it. (And there are some drawbacks.)
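A minimal sketch of that architecture as I’m describing it (hypothetical helpers, inferred from observed behavior, not OpenAI’s actual code):

```python
def personalized_prompt(query: str, user_summary: str, retrieve, k: int = 4) -> str:
    """RAG-style personalization: prepend a standing textual summary of the user
    plus the k most relevant snippets retrieved from past sessions. Each snippet
    is either retrieved or it isn't, which matches the 'discrete' recall users
    report, unlike finetuning, where information would leak in fuzzily."""
    snippets = retrieve(query, k)  # embedding search over past-session text
    return ("About this user:\n" + user_summary + "\n\n"
            "Possibly relevant excerpts from past conversations:\n"
            + "\n---\n".join(snippets)
            + "\n\nCurrent message:\n" + query)
```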
Why can’t the mode-collapse just be from convergent evolution in terms of what the lowest-common denominator rater will find funny? If there are only a few top candidates, then you’d expect a lot of overlap. And then there’s the very incestuous nature of LLM training these days: everyone is distilling and using LLM judges and publishing the same datasets to Hugging Face and training on them. That’s why you’ll ask Grok or Llama or DeepSeek-R1 a question and hear “As an AI model trained by OpenAI...”.
This is true of all teas. The decaf ones are all terrible. I spent a while trying them in the hopes of cutting down my caffeine consumption, but the taste compromise is severe. And I’d say that the black decaf teas were the best I tried, mostly because they tend to have much more flavor & flavorings, so there was more left over from the water or CO2 decaffeination...
I have not done any work directly on it. The LLMs have kept improving so rapidly since then, especially at coding, that it has not seemed like a good idea to work on it.
Instead, I’ve been thinking more about how to use LLMs for creative writing or personalization (cf. my Dwarkesh Patel interview, “You should write more online”). To review the past year or two of my writings:
So for example, my meta-learning LLM interviewing proposal is about how to teach a LLM to ask you useful questions about your psychology so it can better understand & personalize (based on my observations that LLMs can now plan interviews by thinking about possible responses and selecting interesting questions, as a variant of my earlier “creativity meta-prompt” idea/hierarchical longform training); “Quantifying Truesight With SAEs” is an offline version about distilling down ‘authors’ to allow examination and imitation. And my draft theory of mathematicians essay is about the meta-RL view of math research suggesting that ‘taste’ reduces down to a relatively few parameters which are learned blackbox style as a bilevel optimization problem and that may be how we can create ‘LLM creative communities’ (eg. to extract out small sets of prompts/parameters which all run on a ‘single’ LLM for feedback as personas or to guide deep search on a prompt).
My “Manual of Style” is an experiment in whether you can iteratively, by asking a LLM to read your writings, extract out an explicit manual of style about how to ‘write like you’.
It includes a new denoising/backtranslation prompt-engineering trick I am currently calling “anti-examples” where you have the LLM make editing suggestions (which turn it into ChatGPTese) and then you reverse that to fix the chatbot prior*.
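Roughly, the trick looks like this (an illustrative sketch with a placeholder `llm()`; the real prompts are much longer):

```python
def make_anti_example(llm, passage: str) -> tuple[str, str]:
    """Ask the LLM to 'improve' a passage; the edit drifts toward ChatGPTese,
    and we keep the pair as a negative example instead of accepting the edit."""
    return passage, llm("Edit this passage to improve it:\n\n" + passage)

def editing_prompt(anti_examples: list[tuple[str, str]], draft: str) -> str:
    """Show (author's original, ChatGPT-ified) pairs and instruct the model to
    move *away* from the latter, reversing the chatbot prior."""
    shown = "\n\n".join(
        f"GOOD (author's own style):\n{good}\n\nBAD (do not edit toward this):\n{bad}"
        for good, bad in anti_examples)
    return (shown + "\n\nNow suggest edits to the following draft, keeping the "
            "author's style and avoiding every tendency shown in the BAD versions:\n\n"
            + draft)
```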
So given how gargantuan context windows have become, and the existence of prompt caching, I think one may be able to write a general writing prompt, which includes a full MoS, a lot of anti-examples for several domains, some sample Q&As (optimized for information gain), instructions for how to systematically generate ideas, and start getting a truly powerful chatbot assistant persona with the scaled-up base models like GPT-5 which should start landing this year.
“Virtual comments” is another stab at thinking about how ‘LLM writing support’ can work, as well as reinventing the idea of ‘seriation’, and better semantic search via tree-shaped embeddings for both LLM & human writers (and the failed experiment with E-positive).
“Towards better RSS feeds” is about an alternative to Nenex commands: can you reframe writing as a sequence of atomic snippets which the LLM rewrites at various levels of abstraction/detail, which enables reading at those same levels, rather than locking people into a single level of detail, which inevitably suits few?
“October The First Is Too Late”, “Bell, Crow, Moon: 11 Poetic Variations”, “Area Man Outraged AI Has Not Solved Everything Yet”, “Human Cannibalism Alignment Chart”/“Hacking Pinball High Scores”, “Parliament of Rag & Bone”, “A Christmas Protestation”, “Second Life Sentences”, “On the Impossibility of Superintelligent Rubik’s Cube Solvers” were tests of how useful the LLMs are for iterative variation and selection using a ‘brainstorm’ generate-rank-select prompt and/or for hierarchical generation; they finally seem to be at the point where you can curate good stuff out of them and are genuinely starting to become useful for my nonfiction essays like the “‘you could have invented Transformers’ tutorial”/“Cats As Horror Movie Villains”/typesetting HTML fractions/Rock-Paper-Scissors optimality (and demonstrate my views on acceptable use of generative media).
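The ‘brainstorm’ prompt itself is simple in outline (a sketch with a placeholder `llm()`; the real prompt has far more scaffolding for diversity and critique):

```python
def brainstorm(llm, task: str, n: int = 20, keep: int = 3) -> list[str]:
    """Generate-rank-select: produce many candidate variations, have the LLM
    rank them, and keep only the top few for human curation."""
    candidates = [llm(task + "\n\nGive one distinct attempt, different from the obvious approaches.")
                  for _ in range(n)]
    listing = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    ranking = llm(f"Task: {task}\n\nCandidates:\n{listing}\n\n"
                  "Rank the candidates from best to worst; reply with only the numbers, comma-separated.")
    order = [int(tok) - 1 for tok in ranking.split(",") if tok.strip().isdigit()]
    return [candidates[i] for i in order[:keep] if 0 <= i < len(candidates)]
```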
“Adding Bits Beats AI Slop” is about my observations about how this kind of intensive search + personalization seems critical to taking generative model outputs from mediocre slop to genuinely good.
“LLM Challenge: Write Non-Biblical Sentences” is an observation that for creativity, “big model smell” may be hard to beat, and you may just need large LLMs for high-end intellectual work, so one should beware false economies; similarly, “Towards Benchmarking LLM Diversity & Creativity” is about avoiding the LLMs getting ever worse for search purposes (mode-collapsed small models being a danger for Nenex uses—they are the ones that will be easy and tempting to run, but will hamstring you, and you have to go into it with eyes open).
“AI Cannibalism Can Be Good” is a quick explainer to try to overcome the intuition that there are no gains from ‘feeding AI inputs back into AI’ - if you don’t understand how this can be a good thing or why it’s not a perpetual motion machine, much of the foregoing will seem like nonsense or built on sand.
Obviously, I’ve also been doing a lot of regular writing, and working on the Gwern.net website infrastructure—adding the ‘blog’ feature has been particularly important, but just getting the small details right on things like “October The First” takes up plenty of time. But the overall through-line is, “how can we start getting meaningful creative work out of LLMs, rather than sleepwalking into the buzzsaw of superhuman coders creating Disneyland-without-children where all the esthetics is just RLHF’d AI slop?”
* This seems particularly useful for fiction. I’m working on a write-up of an example with a Robin Sloan microfic where the LLM suggestions get better if you negate them, and particularly if you order them to think about why the suggestions were bad and what that implies before they make any new suggestions—which suggests, in conjunction with the success of the ‘brainstorm’ prompt, that a major failing of LLMs right now is just that they tend to treat corrections/feedback/suggestions in a ‘superficial’ manner because the reasoning-mode doesn’t kick in when it should. Interestingly, ‘superficial’ learning may be why dynamic-evaluation/finetuning seems to underperform: https://arxiv.org/abs/2505.01812 https://arxiv.org/abs/2505.00661#google Adding paraphrases or Q&A to the finetuning data, although it cannot add any new information, improves performance; this is reminiscent of engrams/traces in human memory—you can have memorized things, but not be able to recall them, if there aren’t enough ‘paths’ to a memory.