I operate by Crocker’s rules. All LLM output is explicitly designated as such. I have made no self-hiding agreements. I add LLMs who gave feedback to/were involved in the creation of projects/the writing of blogposts in the same way I’d add humans as co-authors.
niplav
Jihad Musket
“On skibidi you’re skunky. Your wiki jots zilch[1] triumphs—just “totem of dandruff”. I kuru when I google your emoji[2], a silhouette[3] with zero mojo.”
“Zombie’s an otaku with Ohio swagger. Bizarre hooligan hassling the honcho’s chocolate stash. I’ll powwow and yeet your avocados, narc.”
“It’s jinxed, chat! Lot of bugged fuss, you have pariah kismet. On Manitou you’re petrified, where’s your bukkake kitty? My boombox gongs, yours yabbers. Gangnam oof, yahoo.”
“Yikes! Mumbo-jumbo tweets, habibi ;-) I tuktuk to my ziggurat while you possum in this crypt. Your haram spandrels[4] quiver like cocaine quokkas; this mewing sigma has Tomahawk’d your baka igloo.”
“Inshallah, what’s this armageddon? You karaoke maroon voodoo (feces, that is); I aloha and schmooze your moe squaws on my raccoon safari, hurrah! Your koans only flirt with schmucks and yakuza. No oasis for you, sheesh.”
“Banzai, what a brouhaha! You’re just tsundere, and gung ho for my banana. This hurricane moccasins to the futon and boops your aegyo geisha. Be my golem and beep at my diwan, but no can do on the yaoi[5] hentai, dawg.”
De novo, from 1923. ↩
I was pleasantly surprised that this word has no relation to the word “emotion”. Purely independent, a true friend. ↩
A Basque loanword into English! ↩
It would make the insult less good, but if we accept the etymology from espandre we could instead use “alcoves”, “minarets” or “pagodas”. But the double meaning was particularly satisfying here. ↩
Not just a Japanese word, a Japanese neologism. ↩
Here’s a (kind of mediocre but whatevs) idea of what one could do with a large amount of funding in technical AI safety: run a hyperparameter search over different scalable oversight techniques, or simply test them, now that we have LLMs that can serve either as human imitators or as AIs.
The heydays of scalable oversight theory produced a lot of different techniques: I(D)A, HCH, Factored Cognition, Imitative Generalization, RRM, Debate &c…[1]
Some of these (especially directing agents using approval) got folded into capabilities techniques, and others may still get used in the same way.
But others have been basically forgotten and could be revived; e.g. Ought’s factored cognition experiments could be re-run in different variants with various LLMs, checking how performance degrades. Yes, the experiments back then failed (as did the experiments on debate, mostly, though debate received the merciful follow-up that many others didn’t), but they had so pitifully little to work with.
Or (h/t @Gurkenglas) one could initialize a SOTA base model (Fable-base?) with the keystrokes of a trusted and good human, in a context that indicates that they are able to call a copy of themselves after a few “minutes” of deliberation. I nominate Stephen Wolfram due to his incredible keylogging.
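A minimal sketch of what “checking how performance degrades” could look like before spending any LLM budget, under a loud assumption: every call in a decomposition tree (a factored-cognition worker, or a copy summoned HCH-style) independently makes a small mistake with a fixed probability. The task (summing digits), `decomposed_sum`, `accuracy`, `leaf_size` and `p_err` are all hypothetical illustrations, not anything from Ought’s actual setup.

```python
import random


def decomposed_sum(xs: list[int], leaf_size: int, p_err: float, rng: random.Random) -> int:
    """Sum `xs` by recursive decomposition (hypothetical toy task).

    A call whose input has at most `leaf_size` numbers answers directly; a larger
    input is split in half, handed to two sub-calls, and their answers are added.
    Every call (leaf or merge) independently makes an off-by-one mistake with
    probability `p_err`, a stand-in for an imperfect worker or model copy.
    """
    if len(xs) <= leaf_size:
        answer = sum(xs)
    else:
        mid = len(xs) // 2
        answer = (decomposed_sum(xs[:mid], leaf_size, p_err, rng)
                  + decomposed_sum(xs[mid:], leaf_size, p_err, rng))
    if rng.random() < p_err:
        answer += rng.choice([-1, 1])
    return answer


def accuracy(n: int, leaf_size: int, p_err: float, trials: int = 2000, seed: int = 0) -> float:
    """Fraction of random length-`n` tasks on which the decomposed answer is exactly right."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        xs = [rng.randint(0, 9) for _ in range(n)]
        correct += decomposed_sum(xs, leaf_size, p_err, rng) == sum(xs)
    return correct / trials


# Smaller leaves mean a bigger tree, more calls, and more chances to err.
for leaf_size in (64, 8, 2):
    print(leaf_size, accuracy(n=64, leaf_size=leaf_size, p_err=0.02))
```

Under this independence assumption, accuracy falls roughly like (1 − p_err) raised to the number of calls in the tree (slightly better, since opposite errors can cancel), so a deeper decomposition only pays off if it lowers the per-call error rate enough to compensate.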
The tricky part is how to tell if a technique is working. I don’t have amazing ideas here, but my mediocre ones are to look at outcomes similar to the ones in Wen et al. 2026, or at classical music composition in Lilypond (I write a bit about the “why” here, maybe I’ll expand on this elsewhere).
This is, of course, a kind of stiff number-go-up exercise with tons of LLM labour; my guess is that it’s fine, maybe, now that human time is short, AI time is relatively abundant, and the old ideas, prepared through deep reflection in the long days without empiricism, shall now be put under the microscope.
(I have similar thoughts about gridworlds-style RL agents, which are under-rated and now can be trained on a laptop much faster with the help of ML-knowledgeable LLMs. More on that at a later point, perhaps.)
- ^
Including also all the combinations of techniques from this excellent post.
- ^
Oops, right, I didn’t connect those, my bad!
Question about the natural abstractions research program:
Seems possible to me that, if natural abstractions exist, they won’t be robust?
Could be that the natural abstractions program is resolved, but we still can’t really Retarget the Search: whenever we point it at a natural abstraction that has been found, the maximizing inputs turn out to be some edge instantiation of that natural abstraction. (The linked post gestures at this but doesn’t look at this particular aspect.)
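A toy illustration of the edge-instantiation worry, assuming (purely to keep it small) that the found abstraction is detected by a linear score over a bounded input space. `WEIGHTS`, `score` and the `typical` examples are made up. Maximizing the detector’s score lands on a corner of the input box that looks nothing like the typical instances the detector was meant to capture.

```python
import itertools

# Hypothetical detector for some abstraction: a linear score over three input
# features, each ranging over 0..10. Linearity is an assumption for the sketch;
# a real learned abstraction would be far more complicated.
WEIGHTS = (0.9, 0.4, -0.2)


def score(x: tuple[int, ...]) -> float:
    """How strongly the detector says input `x` instantiates the abstraction."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x))


# Made-up typical instances, all near the middle of the input space.
typical = [(6, 5, 4), (7, 4, 5), (5, 6, 4)]

# "Retargeting the search" naively: take the input the detector likes most.
best = max(itertools.product(range(11), repeat=3), key=score)
print(best, score(best))                   # an extreme corner of the input box
print(max(score(x) for x in typical))      # best score among typical instances
```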
I guess one could bucket successes of the program into “found convergent abstractions” (ones that are found across many different kinds of minds) and “found robust abstractions” (abstractions that are safe to maximize, e.g. ¿mutual information?)
Natural abstractions would still be very useful.
ChangeDiaperBench, PlanInvasionBench, ButcherHogBench, ShipConnBench, BuildingDesignBench, SonnetBench, AccountBalanceBench, WallBuildBench, BoneSetBench, ComfortDyingBench, OrderTakeBench, OrderGiveBench, CooperateBench, ActAloneBench, SolveEquationsBench, AnalyzeProblemBench, ManurePitchBench, ComputerProgramBench, TastyCookingBench, EfficientFightingBench, GallantDyingBench
Apologies for dropping this rant on an only-semi-related post[1].
It looks to me like people differ tremendously in how easily/quickly they are able to enter the jhanas, from people who enter them on their first sit to people who never manage to, despite best efforts and thousands of hours of practice on retreats; the TTFJ (time to first jhana) looks (roughly) lognormal to me, based on informal conversations and observations of online conversations about this. Some of this might be due to different mental motions being differently intuitive to people, and hard to transmit.
There are some caveats here, due to differences in what gets labeled a “jhana”, especially since it’s a contested term (with Brasington jhanas, Pa Auk Sayadaw jhanas, and Visuddhimagga jhanas spanning a wide range of possible states of mind. See here for more detail.)
On top of all of this, claiming to have entered the jhanas conveys social status, which probably leads to overclaiming, since there is currently no way to check.
But my current best theory is that most meditative states/changes/attainments are heavily gated by neurology, be it developmental (from infancy/very early childhood) or even genetic (e.g. differences in the reward system), and one can get lucky here, or unlucky; if one gets unlucky, one will have to spend at least hundreds of hours undoing traumas/conditioning until the jhanas are accessible.
Teachers probably help, on average, but my best guess is that teachers don’t help a tremendous amount. A teacher may be able to discover earlier that a student is bashing their head against an unopenable barrier, and redirect them to do emotional processing that could resolve the barrier. But there is probably a residue of stuff that needs to be worked through, for people who take a while to enter the jhanas.
I, of course, as always, wish that people studied all of this in greater detail; I don’t have high hopes.
It’s still valuable to attempt to enter the jhanas! And even if one can’t, or not quickly or easily, there is still much to be gained from meditation. I don’t know the optimal foraging/optimal stopping time for meditative techniques; it’s probably quite tricky. But it does look advisable for people to sometimes give up on their short-term pursuit of the jhanas.
(Context: I spent north of 1k hours on absorption meditation, including a month-long retreat when I got a teacher, with the goal of reaching the jhanas.)
- ^
Thank you for writing the post!
- ^
I also get this with Opus 4.8. Didn’t get it with anything up to 4.6 IIRC.
Hah, thanks! I should’ve crossposted to LW back then; this is also a signal that I should write up more of my off-the-cuff thoughts.
I guess it’s kind of dependent on the definition of a task (and thus games are a preferred unit of analysis), and we probably surpassed centaurs in many physical activities already.
Hm, interesting. Thanks, that one might indeed be false, though you say “in 2011”? That’s a lot of time in AI years.
…uh oh that guy definitely has LLM psychosis.
Update: @Paragox links such a hash in their comment.
I remember him tweeting hashes of unreleased essays (𝕏 is blocked on my machine right now, so I can’t look them up), so I’d guess from one perspective this is the mode of Gwern holding back.
I’m also interested in historical examples of companies shutting down for vaguely analogous reasons. Has any tobacco company shut down after it became common knowledge that smoking is bad?
In my conversations with LLMs, they could not come up with a single example of this happening. The closest example they could find is apparently Patagonia, which in 2022 transferred 98% of its stock (all of it nonvoting) to a nonprofit for climate philanthropy. But that’s kind of dissimilar.
Self-immolation would be basically unprecedented, especially at the scale of current AI companies. But extreme times require extreme measures.
As per the advanced chess obituary, we have a rough idea of the length of the centaur stage for chess. But what do we know of the length of the centaur stage for other games? I sent off Claude 4.6 Sonnet for a deep research query, here’s the result (sorted by domains with an identifiable gap on top):
(Claude-generated table starts here[1])

| Domain | AI ≈ human (year) | Centaur stage start | Centaur stage end | Calendar duration | Evidence quality | Post-centaur exploitation | Notes |
|---|---|---|---|---|---|---|---|
| Go | 2016 | 2016 | ~2017 | ~0–1 yr | Low | Yes: Wang et al. 2022 | Community consensus, no primary tournament data |
| Chess | ~1997 | ~1998 | ~2013–2016 | ~15–18 yr | Medium | unknown | Advantage eroded continuously from ~2009 |
| Protein (single-domain) | ~2020 (AF2) | ~2018 | ~2020–2022 | ~2–4 yr | High | unknown | CASP15: no significant human advantage on single-domain |
| Weather forecasting | ~2003 (ensembles) | unclear | ~2003–2005 | very short / nil | High | unknown | Humans beat single models but not ensembles, even in 2003 |
| Checkers | 1994 (Chinook) | ~1994 | ≤2007 | <13 yr (upper bound) | Low | unlikely (game solved) | Tinsley drew Chinook 1994, winning in 1990; weakly solved 2007 sets hard upper bound |
| Radiology (ICH) | not yet | ~2019–2020? | ongoing | 5+ yr and counting | High | N/A (Type 1 ongoing) | AI still substantially below human parity as of 2024 |
| Dermatology | ~2018 | ~2019–2020 | ongoing | ~5–6 yr and counting | High | N/A (Type 1 ongoing) | Meta-analysis n=67,700; human+AI > human alone |
| Protein (multi-chain) | not yet | ~2020? | ongoing | 4+ yr and counting | High | N/A (Type 1 ongoing) | Significant human advantage on assembly targets (p=0.029) |
| Software function completion | ~2021–2023 | unclear | ongoing? | unclear | Medium | N/A (Type 1 ongoing?) | Workplace RCT null; lab studies show 42–56% gains |
| Shogi | ~2013 | ~2013? | ~2015? | ~2 yr? | Very low | unknown | Speculative; no controlled data found |
| Backgammon | ~1992 (TD-Gammon) | unknown | unknown | unknown | None | unknown | No centaur tournament literature found |
| Poker (HUNL) | ~2017 (Libratus) | unknown | unknown | unknown | None | unknown | No controlled human+AI vs. AI-alone data found |
| Machine translation | ~2018–2020 (NMT) | ~2016? | unclear | unclear | None | unknown | Post-editing claims refuted by adversarial verification |
| Legal (contract review) | unknown | unknown | unknown | unknown | None | unknown | No controlled data found |
| Financial trading | unclear | unclear | unclear | unclear | None | unknown | No controlled centaur-vs-AI-alone literature found |
| Scientific synthesis | emerging | unclear | unclear | unclear | None | unknown | LLM-assisted systematic reviews under study; no benchmarks |

(Claude-generated table ends here)
- ↩︎
Apologies for the lack of a collapsible section. Switching to the rich text editor (plausibly buggily?) fails :-)
- ↩︎
Another example of the claim is here. I guess to really settle it a longer Hurlburt-style interview would be useful.
This post should make it clear. In short: MV-algebras are the semantics for Łukasiewicz logic, which is in turn usually defined either as a three-valued logic or over the reals. Demski & Garriga-Alonso find that this doesn’t resolve some paradoxes, and thus define it over the hyperreals, which they suspect resolves all the paradoxes one can find.
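A small sketch of the standard Łukasiewicz connectives on [0, 1] and of why the Liar gets the value 1/2: a sentence asserting its own negation needs v = 1 − v. This only covers the easy case; it does not reproduce the hyperreal construction of Demski & Garriga-Alonso or the harder paradoxes that motivate it. The function names are my own, not taken from their post.

```python
from fractions import Fraction


# Łukasiewicz connectives on truth values in [0, 1], i.e. the standard
# MV-algebra on the unit interval (restricted here to exact rationals).
def neg(a):
    return 1 - a


def strong_and(a, b):       # Łukasiewicz t-norm
    return max(0, a + b - 1)


def strong_or(a, b):        # dual t-conorm
    return min(1, a + b)


def implies(a, b):
    return min(1, 1 - a + b)


# The Liar sentence L asserts its own falsity, so a consistent valuation must
# satisfy v(L) = neg(v(L)). Scan a grid of candidate truth values for solutions.
candidates = [Fraction(k, 100) for k in range(101)]
liar_values = [v for v in candidates if v == neg(v)]
print(liar_values)  # [Fraction(1, 2)]

# Restricted to {0, 1}, implication is the classical one (where the Liar has no value).
assert all(implies(a, b) == int(not a or b) for a in (0, 1) for b in (0, 1))
```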
epistemic status: shooting the shit[1]. Least certain about the quantum part. As of now, I can find six distinct types of (incommensurable?) belief strength:
Empirical/adversarial ((infra-)Bayesianism/whatever imprecise probability theory)
Self-referential/semantic ((hyperfinite) Łukasiewicz degree)
Quantum state credences (non-commuting observables, Born rule?)
Normative (choiceworthiness, decision-theoretic/¿aesthetic?)
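To make “non-commuting observables, Born rule” concrete, a stdlib-only sketch with two single-qubit density matrices. It illustrates a weaker point than Gleason or Kochen-Specker: the pure state |+⟩⟨+| and the maximally mixed state I/2 give identical probabilities for a Z-basis measurement but different ones for an X-basis measurement, so a probability distribution over a single basis does not pin down the quantum state. Only real amplitudes appear, so complex conjugation is omitted; the helper names are hypothetical.

```python
import math

Matrix = list[list[float]]


def outer(u: list[float], v: list[float]) -> Matrix:
    """Outer product |u><v| (real vectors only, so no conjugation)."""
    return [[a * b for b in v] for a in u]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def born(rho: Matrix, projector: Matrix) -> float:
    """Born rule for a density matrix: P(outcome) = Tr(rho P)."""
    prod = matmul(rho, projector)
    return sum(prod[i][i] for i in range(len(prod)))


s = 1 / math.sqrt(2)
plus = [s, s]                                  # |+> = (|0> + |1>)/sqrt(2)
rho_plus = outer(plus, plus)                   # pure state |+><+|
rho_mixed = [[0.5, 0.0], [0.0, 0.5]]           # maximally mixed state I/2

proj_z0 = [[1.0, 0.0], [0.0, 0.0]]             # Z-basis outcome |0>
proj_xplus = outer(plus, plus)                 # X-basis outcome |+>

# Pauli Z and X, the observables behind those two measurements, do not commute.
pauli_z = [[1.0, 0.0], [0.0, -1.0]]
pauli_x = [[0.0, 1.0], [1.0, 0.0]]
print(matmul(pauli_z, pauli_x), matmul(pauli_x, pauli_z))

for name, rho in [("|+><+|", rho_plus), ("I/2", rho_mixed)]:
    print(name, "P(Z=0) =", round(born(rho, proj_z0), 3),
          "P(X=+) =", round(born(rho, proj_xplus), 3))
```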
Possibly commensurable:
Self-referential/semantic→logical (Garrabrant inductors oscillate around p(Liar’s paradox)=0.5, possibly solving it as well for Restall’s paradox-type sentences, converging to (but never reaching) 0?)
Indexical→quantum (afaiu, from Gleason’s theorem/the Kochen-Specker theorem we know we can’t collapse quantum states into probabilities without losing information, but maybe indexical uncertainty, at the end of the day, just is best represented as quantum states?)
Indexical uncertainty→empirical uncertainty: Perhaps indexical uncertainty is just a spicier version of empirical uncertainty, and we can see different anthropic updating rules as hidden variants of empirical reasoning.
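A toy of the oscillation mentioned for the Liar, with loud caveats: this is a running average, not a logical inductor. The sentence χ says “χ’s current price is below 0.5”; whenever the price dips below 0.5, χ becomes true and the price is pushed up, and vice versa. `liar_prices`, `p0` and the 1/(n+1) step-size schedule are hypothetical choices.

```python
def liar_prices(steps: int, p0: float = 0.9, threshold: float = 0.5) -> list[float]:
    """Toy pricing of chi := "the current price of chi is below `threshold`".

    At each step chi's truth value is read off the current price, and the new
    price is the running average of the initial price and all truth values so
    far (equivalently, a shrinking step toward the latest truth value).
    """
    p = p0
    history = [p]
    for n in range(1, steps + 1):
        truth = 1.0 if p < threshold else 0.0
        p += (truth - p) / (n + 1)
        history.append(p)
    return history


prices = liar_prices(1000)
print(prices[:6])    # jumps back and forth across 0.5
print(prices[-2:])   # still alternating sides, but within about 1/1000 of 0.5
```

The price never settles on either side, but the swings shrink like 1/n, which is the qualitative “oscillate around 0.5” behavior described for the Liar.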
Possibly disambiguable:
Normative uncertainty: Several kinds are lumped into one bucket here; maybe this becomes philosophical uncertainty if expanded? It’s not clear to me that decision-theoretic/aesthetic/normative/metanormative uncertainty &c follow the same update rule.
Attempt at a table:
| Type of belief-strength | Formal object | Update rule |
|---|---|---|
| Empirical | Probability distribution/credal set/infradistribution &c | Bayes rule/imprecise update rule/the infra-Bayesian equivalent |
| Logical | Logical induction | |
| Self-referential | MV-algebra over the hyperreal (in Łukasiewicz logic) | ??? maybe an ongoing process of expanding the hyperreal tree to deal with novel paradoxes? None? |
| Indexical | Measure over observer-moments | SSA/SIA |
| Quantum | Density matrix | ? maybe the Quantum Liouville equation? |
| Normative | Probability distribution over normative statements (or a fixed point in infinite meta-regress) | Philosophical argument, reflective equilibrium |

- ↩︎
Thanks to several AFFINE & EAG participants for talking with me about this, if you see this you can tell me to credit you. Also thanks Claude, your criticisms are a pain in the ass. (None of this is Claude-written, don’t worry.)
It might also be that qualiagnosics have qualia but don’t know it, similarly to how aphantasiacs actually have various forms of imagination, largely similar to non-aphantasiacs’, except mostly running in the background.
Yeah, I considered this and alluded to it (“and people who have qualia but say they don’t have them”). In general, my prior is to follow people’s self-reports, since in this area there’s no shared ground on whose self-reports are more accurate (the illusionists say that the qualiagnosics are right, non-eliminativists say the qualiagnosics are mistaken, et sic ad infinitum repetitur).
From my understanding, Tomasik is both an eliminativist and a hardcore negative utilitarian, so I’d guess he has some takes on this, although they might mostly reduce to something like: suffering defined as a computational pattern similar to whatever we call suffering in humans.
Yup, I elided this. It’s a coherent position, though I don’t find it very intuitive.
They really do seem to have myopic urges interspersed with simply trying the next kind of thing on the list of possible things to try.