My earlier commentary on what I think note-taking tools tend to get wrong: https://gwern.net/blog/2024/tools-for-thought-failure
Have you by any chance gotten further along on your Nenex idea, or know of anyone online who’s gone somewhat in that direction far enough to be interesting? To be fair, the Nenex features you listed are pretty extensive, so I doubt anyone’s gone all that far, which is a bummer, since

“What I want is to animate my dead corpus so it can learn & think & write.”

is a seductive vision that feels like it should be a lot closer today than it actually is.
I have not done any work directly on it. The LLMs have kept improving so rapidly since then, especially at coding, that it has not seemed like a good idea to work on it.
Instead, I’ve been thinking more about how to use LLMs for creative writing or personalization (cf. my Dwarkesh Patel interview, “You should write more online”). To review the past year or two of my writings:
So for example, my meta-learning LLM interviewing proposal is about how to teach a LLM to ask you useful questions about your psychology so it can better understand & personalize (based on my observations that LLMs can now plan interviews by thinking about possible responses and selecting interesting questions, as a variant of my earlier “creativity meta-prompt” idea/hierarchical longform training); “Quantifying Truesight With SAEs” is an offline version about distilling down ‘authors’ to allow examination and imitation. And my draft theory of mathematicians essay is about the meta-RL view of math research suggesting that ‘taste’ reduces down to a relatively few parameters which are learned blackbox style as a bilevel optimization problem and that may be how we can create ‘LLM creative communities’ (eg. to extract out small sets of prompts/parameters which all run on a ‘single’ LLM for feedback as personas or to guide deep search on a prompt).
My “Manual of Style” is an experiment in whether you can iteratively, by asking a LLM to read your writings, extract out an explicit manual of style about how to ‘write like you’.
It includes a new denoising/backtranslation prompt-engineering trick I am currently calling “anti-examples”, where you have the LLM make editing suggestions (which, applied, would turn your prose into ChatGPTese) and then you reverse them, to fix the chatbot prior*.
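The anti-examples loop can be sketched in a few lines. This is only an illustration of the idea, not my actual tooling: the `llm` argument stands in for whatever completion function you use (here a deterministic fake), and the prompt wording is invented for the example.

```python
def build_anti_example_prompt(text, llm):
    """'Anti-examples' trick (sketch): elicit the LLM's edit suggestions --
    which, if applied, would drift the text toward ChatGPTese -- then invert
    each one into an explicit prohibition for the real editing pass."""
    # Step 1: elicit the (bad) suggestions, one per line.
    suggestions = llm(
        "List, one per line, concrete edits you would make to this text:\n" + text
    ).splitlines()
    # Step 2: negate each suggestion into a 'do not' rule (the anti-example).
    anti = [f"Do NOT {s.strip().rstrip('.').lower()}." for s in suggestions if s.strip()]
    # Step 3: assemble the corrected editing prompt, asking the model to
    # reason about *why* the forbidden edits are bad before editing.
    return (
        "Edit the following text. Known failure modes to avoid:\n"
        + "\n".join(anti)
        + "\nFirst explain why each forbidden edit would be bad, then edit:\n"
        + text
    )

# Deterministic stand-in for a real model, for illustration only.
def fake_llm(prompt):
    return "Replace semicolons with short sentences.\nAdd a summary paragraph."

prompt = build_anti_example_prompt("Some draft text; with long clauses.", fake_llm)
```

The inversion step is the whole trick: the model’s suggestions are treated as a sample of its chatbot prior, and negating them steers the edit away from that prior rather than toward it.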
So given how gargantuan context windows have become, and the existence of prompt caching, I think one may be able to write a general writing prompt (a full MoS, a lot of anti-examples for several domains, some sample Q&As optimized for information gain, and instructions for how to systematically generate ideas) and start getting a truly powerful chatbot-assistant persona out of the scaled-up base models like GPT-5 which should start landing this year.
“Virtual comments” is another stab at thinking about how ‘LLM writing support’ can work, as well as reinventing the idea of ‘seriation’ and exploring better semantic search via tree-shaped embeddings for both LLM & human writers (along with the failed experiment with E-positive).
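My reading of “tree-shaped embeddings” is something like the following: embed the document hierarchically, then answer a query top-down, descending only into the most query-similar branch at each level. This is a guess at the mechanism, with toy 2-D vectors standing in for real embeddings:

```python
def tree_search(node, query, sim):
    """Descend a hierarchy of (embedding, payload) nodes, where payload is
    either a list of child nodes or a leaf passage (a string), following
    the most query-similar child at each level."""
    emb, payload = node
    if isinstance(payload, str):  # leaf: an actual passage
        return payload
    best = max(payload, key=lambda child: sim(child[0], query))
    return tree_search(best, query, sim)

dot = lambda a, b: a[0] * b[0] + a[1] * b[1]  # toy similarity on 2-D embeddings

# A two-level 'document': root -> two sections -> one passage each.
tree = ((0.0, 0.0), [
    ((1.0, 0.0), [((0.9, 0.1), "passage about LLM writing")]),
    ((0.0, 1.0), [((0.1, 0.9), "passage about site infrastructure")]),
])
```

The point of the tree shape is that search cost grows with depth rather than with corpus size, and each internal embedding summarizes its subtree, so a query never has to be compared against every leaf.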
“Towards better RSS feeds” is about an alternative to Nenex commands: can you reframe writing as a sequence of atomic snippets which the LLM rewrites at various levels of abstraction/detail, which enables reading at those same levels, rather than locking people into a single level of detail, which inevitably suits few?
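One way the snippet idea could be represented (a sketch; the structure and names here are my own invention, not anything from the essay): each atomic snippet stores LLM rewrites at several levels of detail, and rendering a feed is just re-serializing the document at the reader’s chosen level.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    """An atomic unit of writing, with LLM rewrites at coarser levels.
    levels[0] is the full original text; higher indices are more abstract."""
    levels: list  # e.g. [full_text, paragraph_summary, one_line_gist]

def render(snippets, level):
    """Serialize a document at one level of detail, falling back to the
    most abstract rewrite available when a snippet has fewer levels."""
    return "\n\n".join(s.levels[min(level, len(s.levels) - 1)] for s in snippets)

doc = [
    Snippet(["A long argument about note-taking...",
             "Notes decay without use.",
             "Use your notes."]),
    Snippet(["A detailed aside on RSS...",
             "RSS fixes one level of detail."]),
]
```

Since every level is stored per-snippet, the same document can be read as a one-line gist, a summary, or the full text, instead of locking every reader into the author’s single chosen level.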
“October The First Is Too Late”, “Bell, Crow, Moon: 11 Poetic Variations”, “Area Man Outraged AI Has Not Solved Everything Yet”, “Human Cannibalism Alignment Chart”/“Hacking Pinball High Scores”, “Parliament of Rag & Bone”, “A Christmas Protestation”, “Second Life Sentences”, and “On the Impossibility of Superintelligent Rubik’s Cube Solvers” were tests of how useful the LLMs are for iterative variation and selection, using a ‘brainstorm’ generate-rank-select prompt and/or hierarchical generation. They finally seem to be at the point where you can curate good stuff out of them, and they are genuinely starting to become useful for my nonfiction essays like the “‘you could have invented Transformers’ tutorial”, “Cats As Horror Movie Villains”, typesetting HTML fractions, and Rock-Paper-Scissors optimality (and they demonstrate my views on acceptable use of generative media).
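Stripped of the prompt engineering, generate-rank-select reduces to best-of-*n* sampling with a critic, roughly as below. The sampler and the scorer here are illustrative stand-ins (a seeded random chooser and a length heuristic), not the actual ‘brainstorm’ prompt:

```python
import random

def brainstorm(generate, score, n=32, k=4):
    """Generate-rank-select: sample many candidate variations, rank them
    by a critic score, and keep the top k for human curation (or for
    another round of variation-and-selection)."""
    candidates = [generate() for _ in range(n)]
    ranked = sorted(candidates, key=score, reverse=True)
    return ranked[:k]

# Illustrative stand-ins: a seeded 'variation' sampler and a length critic.
rng = random.Random(0)
variants = ["bell", "crow over the moon", "moon", "eleven poetic variations"]
top = brainstorm(lambda: rng.choice(variants), score=len, n=16, k=2)
```

The human curation step then operates only on the top-*k* survivors, which is what makes the search affordable: the LLM does the cheap variation, the critic prunes, and taste is spent where it matters.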
“Adding Bits Beats AI Slop” collects my observations on how this kind of intensive search + personalization seems critical to taking generative-model outputs from mediocre slop to genuinely good.
“LLM Challenge: Write Non-Biblical Sentences” is an observation that, for creativity, “big model smell” may be hard to beat, and you may simply need large LLMs for high-end intellectual work, so one should beware false economies; similarly, “Towards Benchmarking LLM Diversity & Creativity” is about keeping LLMs from getting ever worse for search purposes (mode-collapsed small models being a particular danger for Nenex uses—they are the ones that will be easy and tempting to run, but they will hamstring you, and you have to go into it with eyes open).
“AI Cannibalism Can Be Good” is a quick explainer that tries to overcome the intuition that there are no gains from ‘feeding AI outputs back into AI’: if you don’t understand how this can be a good thing, or why it’s not a perpetual motion machine, much of the foregoing will seem like nonsense or built on sand.
Obviously, I’ve also been doing a lot of regular writing, and working on the Gwern.net website infrastructure—adding the ‘blog’ feature has been particularly important, but just getting the small details right on things like “October The First” takes up plenty of time. But the overall through-line is, “how can we start getting meaningful creative work out of LLMs, rather than sleepwalking into the buzzsaw of superhuman coders creating Disneyland-without-children where all the esthetics is just RLHF’d AI slop?”
* This seems particularly useful for fiction. I’m working on a write-up of an example with a Robin Sloan microfic, where the LLM suggestions get better if you negate them, and particularly if you order them to think about why the suggestions were bad and what that implies before they make any new suggestions—which suggests, in conjunction with the success of the ‘brainstorm’ prompt, that a major failing of LLMs right now is that they tend to treat corrections/feedback/suggestions in a ‘superficial’ manner, because the reasoning-mode doesn’t kick in when it should. Interestingly, ‘superficial’ learning may also be why dynamic-evaluation/finetuning seems to underperform (https://arxiv.org/abs/2505.01812, https://arxiv.org/abs/2505.00661#google): adding paraphrases or Q&A to the finetuning data improves performance even though it cannot add any new information. This is reminiscent of engrams/traces in human memory—you can have memorized things, but not be able to recall them, if there aren’t enough ‘paths’ to a memory.