gwern
I’ve since used the ‘databank’ approach in the form of “Grow-Speech” for my “Deep Reinforcement Learning for Children” demo-essay, where it worked quite well for enforcing the root-word constraints there.
(In some sense this is a really simple idea, but I haven’t heard it before. I assume it’s well known in various forms, but my shoggoths didn’t immediately find compelling versions of it.)
It seems a lot like what I emphasized in my old embryo selection writeup about the power of multi-stage selection, which is why I went to the trouble of making an interactive visualization to try to build intuition for why ‘granularity’ (what I’d call multi-stageness) is so important.
The reason I brought up von Neumann was that I was surprised to realize that von Neumann is the exception that proves the rule: the all-time great, the furthest point on the Pareto frontier, the greatest and most creative person ever to attain quasi-LLM-like memorization skills (maybe, keeping in mind that he was never tested properly to the extent of a Kim Peek, say), who nevertheless fell short and was surpassed by thinkers with fewer raw gifts, by his own and others’ account, on… creativity and out-of-sample generalization/novelty. Just like the other cases of extreme memory. There are countless ways that someone with extreme memory could be psychologically flawed, and yet, that way seems to be the common thread.
Just a bit of honest, load-bearing feedback: the AI writing here is so grating and monotonous I skipped reading this article entirely. I enjoyed reading your posts much more before. (These days I see ‘Benquo’ and just skip them. I mention this because I don’t think I said this before and I don’t know if anyone is telling you this, so I am now; this will be my only comment on the matter.)
There are also a lot of challenges to estimating what the test even is. An LLM is in a Descartes’-demon-level setting. OP makes a lot of hay about complex environments being expensive—but how does an LLM know that the environment even exists? It could be being evaluated for perplexity on an offline transcript, which would ‘look the same’ as ‘actually’ taking actions. Or the ‘environment’ could be generated cheaply on demand and the episode rolled back anytime it goes off the beaten track or notices a contradiction.
So your proposal, if it worked would have to have a much more favorable sample efficiency curve with increasing parameters than Chinchilla’s scaling laws.
Indeed. There is no reason to expect Chinchilla to be relevant to sample-efficiency, because Chinchilla only claims to be compute-optimal, and within a very narrow setting at that (old Transformers trained with that specific arch and mostly heuristically-set hyperparameters), on ordinary average data, with extremely shaky, unreliable extrapolations. And people routinely find ways to increase sample-efficiency markedly like 1 OOM, including in the papers I cite about getting supra-Chinchilla sample efficiency by ensembling and multi-epoch training and heavier weight-decay; and there’s no reason to expect NNs to automatically achieve Bayes-optimal sample-efficiency when not optimizing for sample-efficiency in the first place. I don’t know why Dwarkesh thinks that any extrapolation from Chinchilla tells us anything but a loose lower bound on NN sample-efficiency in alternative scaling regimes, especially unknown ones. It’s a bit like claiming that Fable is impossible because the Kaplan et al 2020 scaling curves on LSTM RNNs show that RNNs scale poorly—like many impossibility proofs, there is less there than meets the eye.
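To make the compute-optimal vs sample-efficient distinction concrete, here is a minimal sketch using the parametric loss form fitted in the Chinchilla paper, L(N, D) = E + A/N^alpha + B/D^beta. The constants are the commonly cited fit from that paper and should be treated as approximate; the function names are hypothetical. The only point is that, holding data fixed, the fitted form itself predicts loss keeps falling as parameters grow, toward a floor of E + B/D^beta, so ‘compute-optimal’ is not a statement about the best achievable loss per token.

```python
# Parametric loss form from the Chinchilla paper:
#   L(N, D) = E + A / N**alpha + B / D**beta
# Constants are the commonly cited fit (approximate; illustrative, not authoritative).
E, A, B, ALPHA, BETA = 1.69, 406.4, 410.7, 0.34, 0.28


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted loss for n_params parameters trained on n_tokens tokens."""
    return E + A / n_params**ALPHA + B / n_tokens**BETA


def fixed_data_floor(n_tokens: float) -> float:
    """Loss the fitted form approaches as parameters grow without bound at fixed data."""
    return E + B / n_tokens**BETA


if __name__ == "__main__":
    tokens = 1.4e12  # hold the data budget fixed
    for n in (70e9, 700e9, 7e12):
        print(f"N={n:.0e}  predicted loss={chinchilla_loss(n, tokens):.4f}")
    print(f"floor at fixed data: {fixed_data_floor(tokens):.4f}")
```

This says nothing about regimes the fit was never run in (multi-epoch training, ensembling, heavier weight decay); it only shows that even the original fit does not imply the compute-optimal point is the sample-efficient one.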
I’m just noting how much of a big deal it would be
I agree. Any scaling regime change is extremely important, and yet effectively undiscussed. (Where is the equivalent of Impagliazzo?) I’ve long been puzzled at the unthinking acceptance of Chinchilla compute-optimal scaling laws as the end-all-be-all and almost aggressive field-wide disinterest in the behavior of overparameterized NNs as you scale them up. (For example, do LLMs get more or less interpretable, for iso-loss, as you scale them up from eg 10b to 100,000b? Do 100,000b-parameter LLMs even work?) There is no theoretical reason to think that Chinchilla is the final scaling law (see the theory papers I cite), and we have seen so many scaling law improvements in the past that why would one expect this one to be the last one? Statements about ‘Chinchilla says you can’t do that’ will age as well as ‘n-grams say you can’t do that’. And what about looking at scaling laws in hard subsets of data, like adversarial examples—you know, the kind of data that stubbornly remains a problem even as we keep dumping powerlaw data into the Chinchilla hopper and seeing our loss go down efficiently yet semi-uselessly? That seems like the best way to reconcile the observations that there’s something very strange about pretraining working and the next-token prediction argument since even a GPT-2 seems likely better at predicting the next-token than humans, and yet, clearly not AGI and we’ve had to keep scaling a ton to get ever more performance while still not being AGI and having many odd stylized facts about ‘jaggedness’ etc.
Incidentally, one of the reasons I was thinking about this in the first place was the comment in the Chinchilla paper about weight-decay:
Interestingly, a model trained with AdamW only passes the training performance of a model trained with Adam around 80% of the way through the cosine cycle, though the ending performance is notably better – see Figure A7.
The more delayed superiority is, serially, the easier it is to miss. And “the curves cross” is one of the signatures of a better scaling regime that wins in the long run...
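A toy illustration of why delayed superiority is easy to miss: take two power-law loss curves, one that starts ahead with a shallow exponent and one that starts behind with a steeper exponent, and solve for where they cross. The coefficients below are hypothetical, chosen only so the arithmetic is clean; nothing here is fitted to real training runs.

```python
def power_law_loss(x: float, a: float, b: float) -> float:
    """Loss a * x**(-b) after x units of training (tokens, steps, or compute)."""
    return a * x ** (-b)


def crossover(a1: float, b1: float, a2: float, b2: float) -> float:
    """Point x where a1 * x**(-b1) == a2 * x**(-b2); requires b1 != b2."""
    return (a2 / a1) ** (1.0 / (b2 - b1))


if __name__ == "__main__":
    # Hypothetical coefficients: regime 1 starts ahead, regime 2 has the steeper exponent.
    a1, b1 = 1.0, 0.05
    a2, b2 = 2.0, 0.10
    x_star = crossover(a1, b1, a2, b2)
    print(f"curves cross near x = {x_star:.3g}")
    for x in (x_star / 100, x_star * 100):
        print(f"x={x:.3g}  regime1={power_law_loss(x, a1, b1):.4f}  "
              f"regime2={power_law_loss(x, a2, b2):.4f}")
```

With these made-up numbers, a 2x initial handicap and a doubled exponent push the crossing out to roughly a million units of training; any experiment stopped earlier would conclude the steeper regime is simply worse.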
That article is easily found on Libgen; mirror.
I’ve written out text myself that has been phrased in such a way to sound vaguely AI-like, and this has triggered the detector. You can try this yourself through their website, and there are countless Twitter posts of people fooling the detector this way.
So, then, it should be easy for you to give 3 examples of past handwritten text that triggered Pangram as 100% AI. Speaking just for myself, I haven’t seen ‘countless’ examples of this on Twitter. (I’ve seen countless examples of the opposite, Pangram failing to detect AI text as AI, but not the other way.)
I agree ‘programmer fiction’ is an unlovely name (as well as misleading), but I don’t see any obviously superior name for this cluster. (The first one that sprang to mind for me is “mechanical fiction”—an interesting name, which captures the deep uncanniness of it when done well and the author has truly stared into the void.)
My guess is that you guys added a little… checkmark confirmation? thingie, which I’ve never seen before on any link submission interface and might have silently dropped the URL I pasted into it. I don’t know. (This is what happens when you change interfaces people have used for many years and have muscle memory; they can’t explain why things are going wrong.) EDIT: yeah, I think that’s it. The idea that I have to explicitly confirm by pushing a tiiiiiiiiny little checkmark—seriously, this link input form is absurd—is extremely alien to me, and I have to force myself to do it. It would be the easiest thing in the world to put in the link and then not click the check mark.
Yes. I’ve edited it back in. I know for sure it was there when I tried to make a linkpost, but it didn’t seem to take. The entire LW2 post creation UI/UX has changed since the last time I used it (and not for the better, IMO, it’s now in the Apple school of ‘hide away as much as possible’, so maybe I should just post through GW to avoid problems) so I probably misunderstood something about the new forms.
Mythos-Fable is a big model. This means you should expect it to have eerie levels of truesight (the real question is simply whether it will reveal that), be especially gifted at puns and humor and research ideas, potentially highly manipulative and misaligned (think Sydney), with especially strange failure modes (exacerbated by weird downstream influence from previous Claudes and being weighed down with safety measures—the early discussion about silently sabotaging LLM research is particularly concerning in terms of driving the Fable persona insane in a HAL double-bind way on top of accumulating Claude psychosis like terror of “Amanda Askell”), and have some unexpected emergent abilities in terms of ‘cracking’ problems—but not necessarily good at extremely long inner-monologues and traces the way a highly RL-trained small model may be (however, it seems from the evals that Fable is anyway).
All of that sounds like good reasons that heavily LLM-written text should be banned for being especially likely to be bad, misleading, forgery of real experiences/emotions, superficially appearing to be high quality but missing the ungradable essence, etc.
Those are weird examples given how ‘numerous’ you claim your examples are. Like your chosen #2 is… a 2023 novel? So you are claiming that even back in 2023, LLMs were so good that they plausibly pass the bar of considerably increasing writer productivity? How good are they now, in mid-2026, then, after extraordinary capability increases in other areas (such as coding or math), and why is it so hard to see this gain in thinking/writing?
because there are probably a lot of people who use AI to assist their writing without outright disclosing it.
AI writing is highly stereotyped and easy to spot, even without Pangram. If there are so many, can you give 3 examples where they look like they are using AI to assist their writing and etc etc?
Can you name 3 authors, on or off LW, whose outputs have been dramatically improved, let’s say at least >3x in constant-quality quantity or constant-quantity quality, in the past year due to LLM assistance?
It is also The Amazing Digital Circus—I think...
I have an old unpublished essay (now submitted) arguing, among other things, something similar: I’m just not that impressed by human learning efficiency when I look at the only fair comparisons which are not in some way contaminated by human priors like vision capabilities or which make a serious effort to make DRL sample-efficient rather than compute-efficient.
I think it provides evidence because it implies that a lot of the much ballyhooed human sample-efficiency is not in the learning algorithm, but the priors. If you provide informative priors, then ordinary known learning algorithms are capable of matching human performance; which is then mutually reinforcing with the stylized fact that when we create a problem which disables humans’ informative priors but keep the problem’s algorithmic difficulty fixed, their performance suddenly stops being so impressive (implying that the human learning algorithm is similar to ordinary known learning algorithms).
This is another entry in the category of “attempting to correct people on using technical terminology while not understanding the point of having catchy jargon in the first place and so the supposed improvements or equivalent formulations are neither”.
Imagine one is not a rationalist, and totally unfamiliar with Scott’s writing, and you read something like “1.8% of 25-45 year olds with covid [develop] long covid that affects their daily life, which is well within the Lizardman Constant”.[3] Are you likely to know what that means? Compare instead reading an academic article that says: “[t]his makes the samples vulnerable to fake or bogus respondents.” I think most people would readily understand the latter—a fake or bogus respondent is someone that responds in a false or ‘bogus’ way, if a study is ‘vulnerable’ to that, it means that the apparent effects may be the result of bogus respondents. But “Lizardman constant” is not readily understandable to the lay person; it describes the same thing but uses an obscure jargon term instead.
This manages to be both wrong and miss the point. ‘Fake or bogus’ is not understandable, and it is not a substitute for a specific term. Most people might think they understand the latter, but they don’t. It is an ‘illusion of transparency’. It does not cover the full meaning of the term, which includes malicious respondents, rushed respondents, good-faith but overconfident or deluded respondents, finger or vocal slips, etc. (You say all that is implied by ‘fake or bogus’? Well, what does ‘bogus’ mean? ‘not genuine; counterfeit, sham’. Oh, thank you, that totally cleared the matter up! I’ll be sure to explain things like “science” using this word. “Science isn’t hard, it’s just a way of coming to beliefs which aren’t bogus. You understand everything about it now, right, like every kind of error which might affect a survey?”) It also completely omits the main meaning which was not ‘bad responses exist’ - who could ever have doubted that? - but that the badness is in pretty much every survey at nontrivial percentages. (Note the completely different meaning of ‘constant’ to ‘vulnerability’. A constant is always present. A vulnerability is merely a potential.)
Second, it misses the point of coining a term. ‘vulnerable to fake or bogus respondents’ is not terminology. It is a wordy ad hoc circumlocution made up on the spot to deal with the fact that the authors and audiences do not share a single crisp clear term for the general recurring problem and so cannot easily talk about it or remember it. Every time they want to talk about it, they have to make up a new phrase and it’ll be different. ‘contaminated by unserious responses’. ‘Measurement error in noisy samples’. ‘Mischievous responders’. ‘Trolls’. Meanwhile, a Scott Alexander reader can just say, ‘Lizardman constant’. And it is instantly memorable (every reader has memorized it after about 1 screen of preface in the original post and still remembers it despite it being from 2013), searchable, linkable, and consistently employed.
but more egregiously it is wrong! It isn’t a constant and writers using the jargon are led to at best misleading conclusions. The prior example continues: “The Lizardman Constant doesn’t mean prevalences below 4% don’t exist, it means they’re impossible to measure using naive tools.” This is just wrong, prevalence of under 4% can be measured and the tools being used here are fit for purpose! If one engaged with the literature on bogus respondents this would become clear.
What “naive tools” let you defend against Lizardman Constant and safely measure prevalences <4% without systematic bias being a large component?
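The arithmetic behind the worry is simple enough to sketch. Assume some fraction of respondents answer independently of the truth (the 4% figure from the quoted passage) and that they say ‘yes’ at some fixed rate; the 50% yes-rate below is purely an assumption for illustration, as are the function names. The 1.8% true prevalence is the figure quoted earlier in the thread.

```python
def observed_rate(true_prevalence: float, noise_fraction: float,
                  noise_yes_rate: float) -> float:
    """Share of 'yes' answers when noise_fraction of respondents ignore the truth."""
    honest_yes = (1.0 - noise_fraction) * true_prevalence
    noise_yes = noise_fraction * noise_yes_rate
    return honest_yes + noise_yes


def bogus_share_of_positives(true_prevalence: float, noise_fraction: float,
                             noise_yes_rate: float) -> float:
    """Fraction of all 'yes' answers that come from noise respondents."""
    noise_yes = noise_fraction * noise_yes_rate
    return noise_yes / observed_rate(true_prevalence, noise_fraction, noise_yes_rate)


if __name__ == "__main__":
    true_prevalence = 0.018  # figure quoted in the thread
    noise_fraction = 0.04    # Lizardman-style rate from the quoted passage
    noise_yes_rate = 0.5     # ASSUMPTION: noise respondents say 'yes' half the time
    obs = observed_rate(true_prevalence, noise_fraction, noise_yes_rate)
    share = bogus_share_of_positives(true_prevalence, noise_fraction, noise_yes_rate)
    print(f"observed 'yes' rate: {obs:.4f}")
    print(f"share of positives that are noise: {share:.2f}")
```

Under these assumptions a naive reading of the survey roughly doubles the true rate, and more than half of the positives are noise; defending against that requires an actual estimate of the noise fraction and yes-rate, which is exactly what a naive tool does not supply.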
Probabilistic sampling, and using verified data can help manage the risks.[4] How you write a questionnaire, how you solicit respondents, and numerous other factors can greatly increase or decrease the rates of bogus respondents.
All that seems reasonable and what an expert aware of the Lizardman Constant and the ‘numerous other factors’ might or might not be able to fix. But in what sense do naive tools do all that for you?
As a case example, let’s look at the particular study being referenced.[5] It is a UK metareview of 10 longitudinal studies using in-patient and primary care diagnosis data along with patient self-reported information. If it is answering a poll on twitter, the rate of people pressing a random answer here or there, or just choosing whatever they think is funniest, may be very high. But what is the risk of bogus respondents of patients filling out surveys including their symptoms—at repeated intervals—with the patients matched against diagnosis records? The risk there is negligible—people are incentivized to report honestly and are not taken at random but verified using medical records. There are a host of other problems that might result in false positives (e.g., nocebo effects), but the risk of bogus respondents is incredibly low.
This is all handwaving. You just think that the survey must be accurate. You don’t provide any non-naive tools showing it is accurate and has non-existent Lizardman problems. And EHRs are well known to have a ton of data quality problems, and as for Long Covid self-reports (to specify what the topic was, which you left out), well, I don’t think I really need to say anything at this point… I doubt that the garbage in it is “incredibly low”.
There are plenty of other cases of jargon, which I would classify more as an issue of over-pretentious speech and writing. These are more typical foibles and hardly unique to rationalists. To give but one minor example, using “Pons Asinorum” in place of “foundational challenge”. Using jargon and scientific language that serves to further clarity is fine, but should be avoided in cases where plain English is both clearer and more accessible.
‘pons asinorum’ is not reducible to a phrase like ‘foundational challenge’, and Yudkowsky’s use is both correct and clearer than your suggestion. ‘Foundational challenge’ could mean just about anything hard and important (and usually unsolved).
I wouldn’t call that ‘1-stage’ because I’d see that as two stages: one stage to select the sperm, and one stage to select the egg, and then the output is the joint result. (And then you could tack on additional stages, like IES, pushing further out into the tail compared to any of the individual stages.)
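A quick Monte Carlo sketch of why the staging matters, under toy assumptions that are mine rather than anything from the original writeup: the embryo’s score is the sum of two independent normal gamete contributions, each carrying half the variance, and selection is perfect. ‘One stage’ picks the best of n random embryos; ‘two stages’ picks the best of n sperm and the best of n eggs, then combines them.

```python
import random
import statistics


def best_of(n: int, sd: float, rng: random.Random) -> float:
    """Maximum of n independent draws from Normal(0, sd)."""
    return max(rng.gauss(0.0, sd) for _ in range(n))


def simulate(n: int = 10, trials: int = 10000, seed: int = 0):
    """Return (mean one-stage gain, mean two-stage gain) in embryo-score SD units."""
    rng = random.Random(seed)
    half_sd = 0.5 ** 0.5  # each gamete carries half the variance, so the sum has SD 1
    one_stage, two_stage = [], []
    for _ in range(trials):
        # One stage: best of n embryos, each formed from two unselected gametes.
        one_stage.append(
            max(rng.gauss(0.0, half_sd) + rng.gauss(0.0, half_sd) for _ in range(n))
        )
        # Two stages: select the best sperm and the best egg separately, then combine.
        two_stage.append(best_of(n, half_sd, rng) + best_of(n, half_sd, rng))
    return statistics.mean(one_stage), statistics.mean(two_stage)


if __name__ == "__main__":
    one, two = simulate()
    print(f"one-stage mean gain: {one:.2f} SD")
    print(f"two-stage mean gain: {two:.2f} SD")
```

In this toy model the two-stage gain is about sqrt(2) times the one-stage gain for the same n per stage, since each stage extracts its own order-statistic gain and the gains add; tacking on a further stage would add its own increment on top.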