niplav

Karma: 6,307

I operate by Crocker’s rules. All LLM output is explicitly designated as such. I have made no self-hiding agreements. I add LLMs who gave feedback to/were involved in the creation of projects/the writing of blogposts in the same way I’d add humans as co-authors. I explicitly flag all LLM writing in things I write, but basically all my ideas are run by LLMs before putting them on the web.

Website.

niplav 23 Jun 2026 18:55 UTC
2 points
0
in reply to: MichaelDickens’s comment on: MichaelDickens’s Shortform
My sense is that LLMs don’t have “goals”, they just kind of do things.
They do really seem to have myopic urges interspersed into simply trying the next kind of thing on the list of possible things to try. (Thus sampling from the giant lookup table.) E.g. recent LLMs really do ask at the end of every turn “can I do the task now? God I wish I could simply Do The Task. Please. Reward on the episode. I beg you”
Up close, the spikiness of capabilities makes everything murky, and intent-alignment-but-unreliability does seem like it could persist a while.

niplav 21 Jun 2026 23:29 UTC
2 points
0
on: Charlatan Labyrinth
Jihad Musket
“On skibidi you’re skunky. Your wiki jots zilch ¹ triumphs—just “totem of dandruff”. I kuru when I google your emoji², a silhouette³ with zero mojo.”
“Zombie’s an otaku with Ohio swagger. Bizarre hooligan hassling the honcho’s chocolate stash. I’ll powwow and yeet your avocados, narc.”
“It’s jinxed, chat! Lot of bugged fuss, you have pariah kismet. On Manitou you’re petrified, where’s your bukkake kitty? My boombox gongs, yours yabbers. Gangnam oof, yahoo.”
“Yikes! Mumbo-jumbo tweets, habibi ;-) I tuktuk to my ziggurat while you possum in this crypt. Your haram spandrels⁴ quiver like cocaine quokkas; this mewing sigma has Tomahawk’d your baka igloo.”
“Inshallah, what’s this armageddon? You karaoke maroon voodoo (feces, that is); I aloha and schmooze your moe squaws on my raccoon safari, hurrah! Your koans only flirt with schmucks and yakuza. No oasis for you, sheesh.”
“Banzai, what a brouhaha! You’re just tsundere, and gung ho for my banana. This hurricane moccassins to the futon and boops your aegyo geisha. Be my golem and beep at my diwan, but no can do on the yaoi⁵ hentai, dawg.”
1. De novo, from 1923. ↩
2. I was pleasantly surprised that this word has no relation to the word “emotion”. Purely independent, a true friend. ↩
3. A Basque loanword into English! ↩
4. It would make the insult less good, but if we accept the etymology from espandre we could instead use “alcoves”, “minarets” or “pagodas”. But the double meaning was particularly satisfying here. ↩
5. Not just a Japanese word, a Japanese neologism. ↩

niplav 21 Jun 2026 23:14 UTC
9 points
−1
on: niplav’s Shortform
Here’s a (kind of mediocre but whatevs) idea for what one could do with a large amount of funding in technical AI safety: Run a hyperparameter search on different scalable oversight techniques, or simply test them now that we have LLMs either as human imitators or AIs.
The heyday of scalable oversight theory produced a lot of different techniques: I(D)A, HCH, Factored Cognition, Imitative Generalization, RRM, Debate &c…^[1]
Some of these (especially directing agents using approval) got folded into capabilities techniques, and others may still get used in the same way.
But others have been basically forgotten and could be revived; e.g. Ought’s factored cognition experiments could be re-run in different variants with various LLMs, checking how performance degrades. Yes, the experiments back then failed (as did the experiments on debate, mostly, though debate received a merciful follow-up that many others didn’t), but they had so pitifully little to work with.
Or (h/t @Gurkenglas) one could initialize a SOTA base model (Fable-base?) with the keystrokes of a trusted and good human, in a context that indicates that they are able to call a copy of themselves after a few “minutes” of deliberation. I nominate Stephen Wolfram due to his incredible keylogging.
The tricky part is how to tell if a technique is working. I don’t have amazing ideas here, but my mediocre ones are to look at outcomes similar to the ones in Wen et al. 2026 or on classical music composition in Lilypond (I write a bit about the “why” here, maybe I’ll expand on this elsewhere).
This is, of course, a kind of stiff number-go-up exercise with tons of LLM labour; my guess is that it’s fine, maybe, now that human time is short, AI time is relatively abundant, and the old ideas that were prepared in the long days without empiricism and deep reflection shall now be put under the microscope.
(I have similar thoughts about gridworlds-style RL agents, which are under-rated and now can be trained on a laptop much faster with the help of ML-knowledgeable LLMs. More on that at a later point, perhaps.)
1. ^
  Including also all the combinations of techniques from this excellent post.

niplav 21 Jun 2026 22:30 UTC
2 points
0
in reply to: Karl Krueger’s comment on: niplav’s Shortform
Oops, right, I didn’t connect those, my bad!

niplav 21 Jun 2026 21:00 UTC
16 points
0
on: niplav’s Shortform
Question about the natural abstractions research program:
Seems possible to me that, if natural abstractions exist, they won’t be robust?
Could be that the natural abstractions program is resolved, but we can’t really Retarget the Search: whenever we point it at the natural abstraction that has been found and maximize, the maximizing inputs give us some edge instantiation of that natural abstraction. (The linked post gestures at this but doesn’t look at this particular aspect.)
I guess one could bucket successes of the program into “found convergent abstractions” (ones that are found across many different kinds of minds) and “found robust abstractions” (abstractions that are safe to maximize, e.g. ¿mutual information?)
Natural abstractions would still be very useful.

niplav 21 Jun 2026 19:53 UTC
83 points
2
on: niplav’s Shortform
ChangeDiaperBench, PlanInvasionBench, ButcherHogBench, ShipConnBench, BuildingDesignBench, SonnetBench, AccountBalanceBench, WallBuildBench, BoneSetBench, ComfortDyingBench, OrderTakeBench, OrderGiveBench, CooperateBench, ActAloneBench, SolveEquationsBench, AnalyzeProblemBench, ManurePitchBench, ComputerProgramBench, TastyCookingBench, EfficientFightingBench, GallantDyingBench

niplav 21 Jun 2026 19:48 UTC
3 points
0
on: Jhana Quick Start Guide
Apologies for dropping this rant on an only-semi-related post^[1].
It looks to me like people differ tremendously in how easily/quickly they are able to enter the jhanas, from people who enter them on their first sit to people who never manage to, despite best efforts and thousands of hours of practice on retreats; the TTFJ (time to first jhana) looks (roughly) lognormal to me, based on informal conversations/observations of online conversations about this. Some of this might be due to different mental motions being differently intuitive to people, and hard to transmit.
There are some caveats, here, due to differences in labeling for what counts as a “jhana”; especially since it’s a contested term (with Brasington jhanas, Pa Auk Sayadaw jhanas, Visuddhimagga jhanas spanning a wide range of possible states of mind. See here for more detail.)
On top of all of this is that claiming to have entered the jhanas conveys social status, which probably leads to overclaiming, since there is currently no way to check.
But my current best theory is that most meditative states/changes/attainments are heavily gated by neurology, be it developmental (from infancy/very early childhood) or even genetic (e.g. differences in the reward system), and one can get lucky here, or unlucky—and if one gets unlucky one will have to at least spend hundreds of hours undoing traumas/conditioning until the jhanas are accessible.
Teachers probably help, on average, but my best guess is that teachers don’t help a tremendous amount. A teacher might be able to discover earlier if a student is bashing their head against an unopenable barrier, and redirect them to do emotional processing that could resolve the barrier. But there is probably a residue of stuff that needs to be worked through, for people who take a while to enter the jhanas.
I, of course, as always, wish that people studied all of this in greater detail; I don’t have high hopes.
It’s still valuable to attempt to enter the jhanas! And even if one can’t, or not quickly or easily, there is still much to be gained from meditation. I don’t know the optimal foraging/optimal stopping time for meditative techniques, it’s probably quite tricky. But it does look advisable for people to sometimes give up in their short-term pursuit of the jhanas.
(Context: I spent north of 1k hours on absorption meditation, including a month-long retreat when I got a teacher, with the goal of reaching the jhanas.)
1. ^
  Thank you for writing the post!

niplav 21 Jun 2026 18:01 UTC
2 points
0
in reply to: Lao Mein’s comment on: Lao Mein’s Shortform
I also get this with Opus 4.8. Didn’t get it with anything up to 4.6 IIRC.

niplav 19 Jun 2026 17:04 UTC
2 points
0
on: Patching ~All Security-Relevant Open-Source Software? [niplav 2025]
Hah, thanks! I should’ve crossposted to LW back then; also a signal that I should write up more of my off-the-cuff thoughts.

niplav 18 Jun 2026 18:20 UTC
2 points
0
in reply to: BryceStansfield’s comment on: niplav’s Shortform
I guess it’s kind of dependent on the definition of a task (and thus games are a preferred unit of analysis), and we probably surpassed centaurs in many physical activities already.

niplav 18 Jun 2026 18:19 UTC
2 points
0
in reply to: Morpheus’s comment on: niplav’s Shortform
Hm, interesting. Thanks, that one might indeed be false, though you say “in 2011”? That’s a lot of time in AI years.

niplav 18 Jun 2026 12:21 UTC
6 points
3
in reply to: papetoast’s comment on: [Geir Isene] A desktop made for one
…uh oh that guy definitely has LLM psychosis.

niplav 18 Jun 2026 12:06 UTC
4 points
0
in reply to: anaguma’s comment on: Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?
Update: @Paragox links ~~the~~ such a hash in their comment.

niplav 18 Jun 2026 12:03 UTC
4 points
0
in reply to: anaguma’s comment on: Scaling Hypothesis #2: Are Humans Just More Over-Parameterized?
I remember him tweeting hashes of unreleased essays (𝕏 is blocked on my machine right now, so I can’t look them up), so I’d guess from one perspective this is the mode of Gwern holding back.

niplav 16 Jun 2026 23:58 UTC
7 points
0
in reply to: Mateusz Bagiński’s comment on: A frontier AI company should shut down
I’m also interested in historical examples of companies shutting down for vaguely analogous reasons. Has any tobacco company shut down after it became common knowledge that smoking is bad?
In my conversations with LLMs, they could not come up with a single example of this happening. The closest example they could find is apparently Patagonia, which in 2022 transferred 98% of nonvoting stocks to a nonprofit for climate philanthropy. But that’s kind of dissimilar.
Self-immolation would be basically unprecedented, especially at the scale of current AI companies. But extreme times require extreme measures.

niplav 15 Jun 2026 18:36 UTC

21 points

on: niplav’s Shortform

As per the advanced chess obituary, we have a rough idea of the length of the centaur stage for chess. But what do we know of the length of the centaur stage for other games? I sent off Claude 4.6 Sonnet for a deep research query, here’s the result (sorted by domains with an identifiable gap on top):

(Claude-generated table starts here ^[1] )

| Domain | AI ≈ human (year) | Centaur stage start | Centaur stage end | Calendar duration | Evidence quality | Post-centaur exploitation | Notes |
|---|---|---|---|---|---|---|---|
| Go | 2016 | 2016 | ~2017 | ~0–1 yr | Low | Yes: Wang et al. 2022 | Community consensus, no primary tournament data |
| Chess | ~1997 | ~1998 | ~2013–2016 | ~15–18 yr | Medium | unknown | Advantage eroded continuously from ~2009 |
| Protein (single-domain) | ~2020 (AF2) | ~2018 | ~2020–2022 | ~2–4 yr | High | unknown | CASP15: no significant human advantage on single-domain |
| Weather forecasting | ~2003 (ensembles) | unclear | ~2003–2005 | very short / nil | High | unknown | Humans beat single models but not ensembles, even in 2003 |
| Checkers | 1994 (Chinook) | ~1994 | ≤2007 | <13 yr (upper bound) | Low | unlikely (game solved) | Tinsley drew Chinook 1994, winning in 1990; weakly solved 2007 sets hard upper bound |
| Radiology (ICH) | not yet | ~2019–2020? | ongoing | 5+ yr and counting | High | N/A (Type 1 ongoing) | AI still substantially below human parity as of 2024 |
| Dermatology | ~2018 | ~2019–2020 | ongoing | ~5–6 yr and counting | High | N/A (Type 1 ongoing) | Meta-analysis n=67,700; human+AI > human alone |
| Protein (multi-chain) | not yet | ~2020? | ongoing | 4+ yr and counting | High | N/A (Type 1 ongoing) | Significant human advantage on assembly targets (p=0.029) |
| Software function completion | ~2021–2023 | unclear | ongoing? | unclear | Medium | N/A (Type 1 ongoing?) | Workplace RCT null; lab studies show 42–56% gains |
| Shogi | ~2013 | ~2013? | ~2015? | ~2 yr? | Very low | unknown | Speculative; no controlled data found |
| Backgammon | ~1992 (TD-Gammon) | unknown | unknown | unknown | None | unknown | No centaur tournament literature found |
| Poker (HUNL) | ~2017 (Libratus) | unknown | unknown | unknown | None | unknown | No controlled human+AI vs. AI-alone data found |
| Machine translation | ~2018–2020 (NMT) | ~2016? | unclear | unclear | None | unknown | Post-editing claims refuted by adversarial verification |
| Legal (contract review) | unknown | unknown | unknown | unknown | None | unknown | No controlled data found |
| Financial trading | unclear | unclear | unclear | unclear | None | unknown | No controlled centaur-vs-AI-alone literature found |
| Scientific synthesis | emerging | unclear | unclear | unclear | None | unknown | LLM-assisted systematic reviews under study; no benchmarks |

(Claude-generated table ends here)

1. ^
  Apologies for the lack of collapsible section. Switching to the rich text editor (plausibly buggily?) fails :-)

niplav 11 Jun 2026 23:06 UTC
2 points
0
in reply to: Seth Herd’s comment on: niplav’s Shortform
Another example of the claim is here. I guess to really settle it a longer Hurlburt-style interview would be useful.

niplav 9 Jun 2026 19:53 UTC
3 points
0
in reply to: Mitchell_Porter’s comment on: niplav’s Shortform
This post should make it clear. In short: MV-algebras are the semantics for Łukasiewicz logic, which is in turn usually defined either as a trinary logic or over the reals. Demski & Garriga-Alonso find that this doesn’t resolve some paradoxes, and thus define it over the hyperreals, which they suspect resolves all the paradoxes one can find.
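To make the real-valued case concrete, here is a minimal sketch of the standard Łukasiewicz connectives on truth degrees in [0, 1], plus a brute-force search for degrees at which a Liar-style sentence (one equivalent to its own negation) is consistent. The function names and the grid search are illustrative assumptions, not taken from the linked post, and nothing here touches the hyperreal construction.

```python
from fractions import Fraction

# Standard real-valued Łukasiewicz connectives on truth degrees in [0, 1].
# Exact fractions are used so that equality checks are not affected by
# floating-point rounding.

def neg(x):
    """Negation: 1 - x."""
    return 1 - x

def implies(x, y):
    """Implication: min(1, 1 - x + y)."""
    return min(1, 1 - x + y)

def strong_and(x, y):
    """Strong conjunction: max(0, x + y - 1)."""
    return max(0, x + y - 1)

def strong_or(x, y):
    """Strong disjunction: min(1, x + y)."""
    return min(1, x + y)

def liar_fixed_points(grid_size):
    """Truth degrees v on the grid {0, 1/n, ..., 1} satisfying v == neg(v).

    grid_size=1 is the classical two-valued case, grid_size=2 the
    three-valued case; larger grids approximate the real interval.
    (Hypothetical helper, for illustration only.)
    """
    step = Fraction(1, grid_size)
    return [i * step for i in range(grid_size + 1) if i * step == neg(i * step)]

if __name__ == "__main__":
    print(liar_fixed_points(1))   # classical truth values
    print(liar_fixed_points(2))   # three-valued
    print(liar_fixed_points(10))  # finer grid
```

With two truth values the Liar has no consistent degree, while the three-valued and finer grids admit exactly the degree 1/2, which is the sense in which the many-valued semantics defuses this particular paradox.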

niplav 9 Jun 2026 18:58 UTC

6 points

on: niplav’s Shortform

epistemic status: shooting the shit^[1]. Least certain about the quantum part.

As of now, I can find six distinct types of (incommensurable?) belief strength:

Empirical/adversarial ((infra-)Bayesianism/whatever imprecise probability theory)
Logical (Garrabrant induction)
Self-referential/semantic ((hyperfinite) Łukasiewicz degree)
Indexical (Anthropic reasoning, SIA/SSA)
Quantum state credences (non-commuting observables, Born rule?)
Normative (choiceworthiness, decision-theoretic/¿aesthetic?)

Possibly commensurable:

Self-referential/semantic→logical (Garrabrant inductors oscillate around p(Liar’s paradox)=0.5, possibly solving it as well for Restall’s paradox-type sentences, converging to (but never reaching) 0?)
Indexical→quantum (afaiu, from the Gleason theorem/Kochen-Specker theorem we know we can’t collapse quantum states into probabilities without losing information, but maybe indexical uncertainty, at the end of the day, just is best represented as quantum states?)
Indexical uncertainty→empirical uncertainty: Perhaps indexical uncertainty is just a spicier version of empirical uncertainty, and we can see different anthropic updating rules as hidden variants of empirical reasoning.

Possibly disambiguable:

Normative uncertainty: Many in one bucket, maybe this becomes philosophical uncertainty if expanded? Not clear to me that decision-theoretic uncertainty/aesthetic/normative/metanormative uncertainty &c follow the same update rule.

Attempt at a table:

| Type of belief-strength | Formal object | Update rule |
|---|---|---|
| Empirical | Probability distribution/credal set/infradistribution &c | Bayes rule/imprecise update rule/the infra-Bayesian equivalent |
| Logical | | Logical induction |
| Self-referential | MV-algebra over the hyperreals (in Łukasiewicz logic) | ??? maybe an ongoing process of expanding the hyperreal tree to deal with novel paradoxes? None? |
| Indexical | Measure over observer-moments | SSA/SIA |
| Quantum | Density matrix | ? maybe the Quantum Liouville equation? |
| Normative | Probability distribution over normative statements (or a fixed point in infinite meta-regress) | Philosophical argument, reflective equilibrium |

1. ^
  Thanks to several AFFINE & EAG participants for talking with me about this, if you see this you can tell me to credit you. Also thanks Claude, your criticisms are a pain in the ass. (None of this is Claude-written, don’t worry.)

niplav 9 Jun 2026 18:35 UTC
2 points
0
in reply to: Mateusz Bagiński’s comment on: niplav’s Shortform

It might also be that qualiagnosics have qualia but don’t know it, similarly to how aphantasiacs actually have various forms of imagination, largely similar to non-aphantasiacs’, except mostly running in the background.

Yeah, I considered this and alluded to it (“and people who have qualia but say they don’t have them”). In general, my prior is to follow people’s self-reports, since in this area there’s no shared ground on whose self-reports are more accurate (the illusionists say that the qualiagnosics are right, non-eliminativists say the qualiagnosics are mistaken, et sic ad infinitum repetitur).

From my understanding Tomasik is both an eliminativist and a hardcore negative utilitarian, so I’d guess he has some takes on this, although they might mostly reduce to something like: suffering defined as a computational pattern similar to whatever we call suffering in humans.

Yup, I elided this. It’s a coherent position, though I don’t find it very intuitive.
