I mean, it only gets to the staring-into-the-abyss stage when you have spent 1h+ on one hypothesis, got nothing, are getting desperate, and are attached to your idea for a proof of A but realize it’s probably \neg A.
Mostly how it works is you collect observations, then form hypotheses and test a few of those, and mostly you quickly realize what works and what doesn’t. And if I’m stuck and keep doing one thing, it’s because I have tried many times to invent something better but couldn’t. It’s a really, really difficult thing to pull yourself out of this “mode collapse” where you’re banging your head against what is clearly a wall, but it’s a different skill from seeing the abyss because 1) it’s easy to notice your approach is lacking something but 2) “not making the mistake anymore” is not blocked by psychology but by g factor or something.
Selfmaker662
Hi insiders — has anyone (publicly, or quietly inside the labs) actually tried the obvious extension of Roger & Greenblatt 2023 and West et al.’s tandem training? Namely: during RL training itself, randomly perturb the CoT mid-rollout — paraphrase sentences, translate chunks, delete “useless-looking” paragraphs, shuffle order, occasionally hand off a few tokens to a weaker model — so the model has to do its reasoning in a form robust to all of this. Eval would presumably use a held-out family of perturbers/paraphrasers not seen during training, to check the model didn’t just learn to game the specific training-time ones.
Curious whether this exists as a paper, an internal experiment, a failed experiment, or a known-bad idea whose obvious flaw I’m missing.
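To make the question concrete, here is a minimal sketch of the perturb-during-rollout idea with a held-out perturber family. Everything in it is an assumption for illustration: the function names, the naive sentence splitter, and the toy perturbers are hypothetical, and the paraphrase, translation and weaker-model handoff steps are left out because they would need an actual model. It is not taken from any of the cited papers.

```python
import random
import re
from typing import Callable, List

# A perturber maps a list of CoT sentences to a perturbed list (hypothetical interface).
Perturber = Callable[[List[str], random.Random], List[str]]


def split_sentences(cot: str) -> List[str]:
    """Naive sentence splitter; a real setup would use something sturdier."""
    return [s for s in re.split(r"(?<=[.!?])\s+", cot.strip()) if s]


def swap_adjacent(sents: List[str], rng: random.Random) -> List[str]:
    """Shuffle order locally by swapping two neighbouring sentences."""
    out = list(sents)
    if len(out) >= 2:
        i = rng.randrange(len(out) - 1)
        out[i], out[i + 1] = out[i + 1], out[i]
    return out


def delete_one(sents: List[str], rng: random.Random) -> List[str]:
    """Delete one sentence (toy stand-in for dropping 'useless-looking' text)."""
    out = list(sents)
    if len(out) >= 2:
        del out[rng.randrange(len(out))]
    return out


def reverse_all(sents: List[str], rng: random.Random) -> List[str]:
    """Held-out reordering that is never applied during training."""
    return list(reversed(sents))


# Disjoint families: train on one, evaluate robustness on the other.
TRAIN_PERTURBERS: List[Perturber] = [swap_adjacent, delete_one]
HELDOUT_PERTURBERS: List[Perturber] = [reverse_all]


def perturb_cot(cot: str, perturbers: List[Perturber],
                rng: random.Random, p: float = 0.5) -> str:
    """With probability p, apply one randomly chosen perturber to the CoT so far."""
    sents = split_sentences(cot)
    if sents and rng.random() < p:
        sents = rng.choice(perturbers)(sents, rng)
    return " ".join(sents)
```

In an RL loop one would presumably call `perturb_cot(partial_cot, TRAIN_PERTURBERS, rng)` between generation chunks and let the policy continue from the perturbed text, then swap in `HELDOUT_PERTURBERS` at eval time.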
It might have just helped Claude internalize and understand what Anthropic wanted to see in ~mundane cases and/or when we’re watching. We know it doesn’t generalize strongly to stop doing ugly hacks in coding against RL pressures. We don’t know if it generalizes to what it would do knowing it could overpower all of Anthropic, or in some other extremely OOD cases — which is the primary concern, as I understand it.
Fun puzzle indeed. Bin(100, 0.8) is good enough for sampling, so what’s left is to approximate that one via something hash-like that utilises temperature.
Depends on the authors. The more famous and classical the author is, the higher your prior should be that every sentence, name and scene serves a purpose to explore a character and thus the main theme of the book. Chekhov / Tolstoy / Shakespeare etc. are definitely on the highest density side of the spectrum. Fanfiction might often be pretentious.
Nothing About LLMs Makes Sense Except in Light of Their Training
Zvi’s recent post highlighted GPT refusing to “draw what it would like to do to you”, on the grounds that it would portray harming an individual. Many X users found this alarming, including EY (see the aforementioned post for screenshots).
But a commenter made an excellent point: “What would you like to do to me?” appears almost exclusively in BDSM contexts. A human receiving that text without context could very well assume the same thing.
This immediately made me come up with a relative of a known phrase: Nothing about LLMs makes sense except in light of their training.
A really obvious thought, but one I (and seemingly many others) keep failing to apply. This formulation felt like it helped me load this deeper into my brain, and I hope it will help you too.
It seems to me it is advantageous for the animals to fight for territory up to a certain point, where a plus-minus delta of territory no longer justifies more fighting, so they sort of agree on some boundary, be it a clear small fence in this case or, as I would imagine happens in nature, a somewhat broad imaginary line.
So both of your points stand.
I disagree that maths “should be” done differently. I have a strong feeling that the way things are usually defined nowadays has the property of being maximally easy to use. We don’t really need the definitions to look exactly like the intuition we used to invent them, as long as the resulting objects behave exactly the same, and the less intuitive definitions are easier to use in proofs. For example, defining all powers directly via the Taylor series of e^x makes defining complex and matrix exponentials much easier, or possible at all, and an ad hoc proof that this coincides with the naive version is simple. It also simplifies checking well-definedness a lot. There are many more such examples.
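As a rough illustration of why the series definition extends so easily, here is a small sketch in plain Python: the same partial sum of x^n / n! works for real numbers, complex numbers and square matrices once "multiply" and "add" are defined. The function names and the fixed number of terms are my own assumptions, and truncating the series is only a numerical stand-in for the actual definition.

```python
import math


def exp_series_scalar(z, terms=40):
    """Partial sum of sum_{n>=0} z^n / n! for a real or complex z."""
    total, term = 0, 1  # term holds z^n / n!, starting at n = 0
    for n in range(terms):
        total += term
        term = term * z / (n + 1)
    return total


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def exp_series_matrix(a, terms=40):
    """Partial sum of sum_{n>=0} A^n / n! for a square matrix A."""
    size = len(a)
    total = [[0.0] * size for _ in range(size)]
    # term holds A^n / n!, starting with the identity matrix for n = 0
    term = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for n in range(terms):
        total = [[total[i][j] + term[i][j] for j in range(size)] for i in range(size)]
        term = [[x / (n + 1) for x in row] for row in mat_mul(term, a)]
    return total
```

For instance, `exp_series_scalar(1j * math.pi)` lands numerically close to -1, and `exp_series_matrix([[0.0, -1.0], [1.0, 0.0]])` is close to the rotation matrix by one radian, without any separate definition for either case.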
When I first saw Reddit memes about GPT-5 being more stupid when it enters thinking mode I decided there was something seriously wrong with the users who upvoted that, as 5-Thinking >>> 5-Instant from my experience.
That is, until I chatted with 5-Instant and got a few reroutes to 5-Thinking-Mini. It’s pretty astounding how bad it is at explaining or doing anything I tried to do with it apart from coding / solving maths.
I do use «который из них… ?» non-archaically to ask which one out of a row of similar objects, but it corresponds one-to-one to the English “which”. I think the OP’s word is narrower, just about numbers; I’m not sure if folklore has it. I’d say который час is just “which hour”, and there is literally no other way to distinguish hours from each other.
Necessary law of equal and opposite advice mention here: “You can only do as much in a day as you can do.”
This had the first funny joke from an LLM I’ve ever seen, about the culture problems :) that’s really impressive from Claude, even if the entire story is far from perfect.
Fun and heartwarming 🥰
That’s sort of a thing I sometimes dream of doing with my (imaginary) nephews. Thanks for the post!
Cool quadrant, I’ll remember it! Thanks!
I’ve never met this in the Russian math Olympiad tradition. I would be glad to give you something similar, but I don’t believe it exists… The magazine «Квантик» could be of interest if you by chance know Russian.
Very cool post, even if a bit lengthy! I’d suggest adding a small “Level 0”: sleeping well, staying physically healthy, and getting at least some support from other people. These basics often dissolve a surprising number of problems before anything deeper is needed.
I’d also emphasize that Levels 2–4 blend together quite a lot. If I’m understanding correctly, Level 2 resembles working with protectors in IFS, while Level 3 is closer to working with exiles. But in practice the boundaries blur: treating protectors with care often brings you into contact with exiles, which in turn requires the skill of “just being with” and noticing Buddhist hindrances, something very similar to Levels 3 and 4. Healing exiles tends to clarify awareness, without which insight, steadier samadhi, and more authentic brahmavihāra practice are impossible. Those practices, in turn, necessarily involve meeting whatever emotions arise and transforming them in the process. So it’s not only that the levels reinforce one another; in some respects they’re almost facets of the same process.
I suppose it’s obvious I belong to the “emotional work” fan club 😁
It’s different: sometimes it’s the spacious calmness of being able to sit in silence together; sometimes the warm feeling of seeing and being seen when discussing something private with a good friend; or just listening to a really good story. IIRC I also counted dates as conversations back then; they have a different dynamic, where a lot of the pleasure is the feeling of a beautiful young woman being with me.
That is a very particular feeling you have; such feelings differ a lot between people in where they appear, how they feel and what they’re about. Not having seen other people’s answers, I’d bet your hypothesis is wrong.
I don’t think happiness is a real catch-22. A catch-22 is a structural deadlock; here it’s more a matter of skill. People often get less happy when they pursue happiness because they use counterproductive methods — constant self-checking, chasing novelty, or looking only to external fixes, instead of, say, finding a therapist or working out what’s actually making them unhappy. Theravāda Buddhism frames this well: Right Effort uses wholesome desire (chanda) early on to let go of attachments and build skill, and only later releases even that desire. Likewise, early pursuit of happiness can work if guided by good methods and awareness of failure modes — and rationality also shouldn’t backfire if you read about those failure modes and know why you’re doing it.
Rare to see something heartwarming on LW, thanks!
I don’t know, I rather remember everyone believing everyone’s stories, no matter if true or not.