I was a bit unsure whether to post this here, but given that Gwern’s “Tilakkhana” and “October the First is Too Late” were posted here too, I thought this was in the same genre and at about the same level of obscurity.
Both of those benefited from intermingled annotations/scaffolding, as notes to keep the LLMs on track: “October” puts them in comments/collapses, and “Tilakkhana” has them in comments, following the “scansion” pseudo-code.
Given the difficulty of this constraint (where do you even get a list of PIE-valid words...?), I would not adopt a constrained-sampling approach (which used to be the standard approach to such text games but doesn’t play well with any kind of planning/inner-monologue), but rather a ‘databank’ approach, closer to how I did “trajectoid words”: write down the list of PIE-valid words and frequency-valid words, and then define a format where every word in the story has to be annotated with its ‘type’ (eg ‘c’ for ‘common’ and ‘p’ for ‘PIE’), and a permissible root word if it is not in the databank. This helps reduce the problem of valid writing to a very ‘local’ problem with a cheap self-attention check back to the databank. It also makes it easier for a reasoning model to scan over a final draft to double-check validity.
So something like:

    ... in [c] the [c] barracks [p:bara]
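As a minimal sketch of that “cheap self-attention check back to the databank”, assuming the `[c]`/`[p]`/`[p:root]` annotation format above (the databank contents and function names here are purely illustrative):

```python
import re

# Hypothetical databanks; in practice these would be loaded from the
# curated files of frequency-valid words, PIE-valid words, and roots
# you have accepted by fiat.
COMMON = {"in", "the"}
PIE = {"barracks"}
PIE_ROOTS = {"bara"}

# Matches "word [c]", "word [p]", or "word [p:root]".
TOKEN = re.compile(r"(\w+)\s*\[(c|p)(?::(\w+))?\]")

def check_draft(draft):
    """Return (word, problem) pairs for annotations that fail the
    databank lookup; an empty list means the draft is locally valid."""
    problems = []
    for word, kind, root in TOKEN.findall(draft):
        w = word.lower()
        if kind == "c" and w not in COMMON:
            problems.append((word, "not in common databank"))
        elif kind == "p" and w not in PIE:
            # A bare [p] must already be in the databank; [p:root] is
            # allowed only if the claimed root has been approved.
            if not root or root not in PIE_ROOTS:
                problems.append((word, "no valid PIE root"))
    return problems
```

A reasoning model can run the same scan itself over a final draft, but having a deterministic checker like this lets you audit its claims word by word.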
You could also iteratively add valid words to the databank to save compute: add ‘barrack’, ‘barracks’, and ‘barracked’ to the PIE databank, and future LLM runs can just write:

    ... in [c] the [c] barracks [p]

(If you have few enough PIE words to work with, you could ask the LLM to try to generate all their valid variations up front.) This seems important, because your footnotes indicate to me that a lot of your etymologies are too debated to expect an LLM to deliver satisfactory results; you are going to have to lay down by fiat what are or are not valid words/roots… In fact, given the extreme difficulty you are having in writing even a coherent sentence, you’d probably want to include a sentence databank as well, to store all the reasonably interesting valid sentences generated. (I wouldn’t necessarily bother with paragraph-level or higher, given how much difficulty you’re having at the word and sentence level.) I do this a lot with poetry; for example, in the last poem I wrote (for Valentine’s Day), I included this:
    Finally, from all poems and all drafts and all ideas, select the new lines, images, and ideas (not already in the prompt/databank) which are most beautiful, unique, interesting, or promising; I will curate them so we can draw on them in future drafts.
(Even when you don’t get any new ones you want to keep, it’s interesting for giving you an idea of how the model ‘thinks’. I tried GLM-5 the other day, and I could see from its curation that it had terrible taste, which lined up with the garbage final outputs. I’ll be sticking with Kimi K2.5 Thinking as my current outside option for now...)
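Returning to the databank idea: the iterative-growth step above (pre-generating the valid variants of an accepted root, so future drafts can use a bare `[p]`) could be sketched as follows. The suffix list and names are illustrative only; real inflection is messier and you would vet the generated forms by hand or by fiat.

```python
def expand_stem(stem, suffixes=("", "s", "ed", "ing")):
    """Return the set of surface forms to whitelist for one accepted stem.
    (A naive concatenation; vet the outputs before adding them.)"""
    return {stem + s for s in suffixes}

# Hypothetical PIE databank, grown one fiat-approved stem at a time.
pie_databank = set()
pie_databank |= expand_stem("barrack")  # barrack/barracks/barracked/barracking
```

Each run then only has to extend the databank with genuinely new words, rather than re-arguing the same etymologies.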
After you have built up enough puzzle pieces, it should be easier for the LLM to assemble them in a bunch of ways, check the fit, and then pick the best out of 20 or 100 or whatever.