I think for long-term coherence one typically needs specialized scaffolding.
Here is an example: https://www.lesswrong.com/posts/7FjgMLbqS6Z6yYKau/recurrentgpt-a-loom-type-tool-with-a-twist
Basically, one wants to accumulate some kind of “state of the virtual world in question” as memory while the story unfolds. That said, I can imagine that if models start having “true long context” (i.e. long context without recall deterioration), and if that context is long enough to hold the whole story, this might become unnecessary. So one might want to watch for the emergence of such models (I think we are finally starting to see some tangible progress in this direction).
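To make that concrete, here is a toy sketch of such a loop (plain Python, not the RecurrentGPT code from the link; call_llm is just a placeholder for whatever completion API is actually being used):

```python
# Toy sketch of the state-accumulation idea described above; not the RecurrentGPT code.
# call_llm is a placeholder for a real completion API.

def call_llm(prompt: str) -> str:
    # Replace with a real model call; this stub just echoes so the sketch runs.
    return f"[model output for a prompt of {len(prompt)} characters]"

def continue_story(world_state: str, recent_passage: str) -> tuple[str, str]:
    """Write the next passage, then fold it back into a compact world state."""
    passage = call_llm(
        "World state so far:\n" + world_state + "\n\n"
        "Most recent passage:\n" + recent_passage + "\n\n"
        "Continue the story for one more scene."
    )
    new_state = call_llm(
        "Current world state:\n" + world_state + "\n\n"
        "New passage:\n" + passage + "\n\n"
        "Rewrite the world state so it stays short but reflects everything now true."
    )
    return passage, new_state

world_state, passage = "No events yet.", ""
for _ in range(10):  # each scene sees only the compressed state, not the whole story
    passage, world_state = continue_story(world_state, passage)
```

The point is that the model never has to recall the full story text, only the continually rewritten summary of the world.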
Thanks for your comment. I took a look at your example, but I'd say it addresses a different issue: constrained output tokens, not ingestion of input tokens. I also want to avoid scaffolding approaches since I'm zero-shotting; I don't want a chained series of prompts or chunking, I want to submit a single prompt.
I'm looking for techniques along the lines of including an index of the prompt's sections (like the list of chapters in a book) plus character strings that delimit each section. Here's an example of the top of my prompt:
Time Now: 2025-05-09 21:46:07
=== System Context === Character Count: 5903
1. INTRODUCTION
[intro text]
2. SYSTEM STATE AND PROMPT STRUCTURE
When you run, the prompt sent to the LLM includes a detailed description of your current state and operational context. This ‘self’ is assembled from various dynamic and static sources. Below is a list of the key dynamic sections derived from your state files and other data sources, along with how they are processed for the prompt:
=== Your Goals ===
Source: state_files/goals.json
Content: All current goals.
=== Previous Thought ===
Source: state_files/previous_thought.txt
Content: The full ‘thought’ section from your previous run’s LLM output. This file is overwritten each run.

So the prompt states up front which sections are present throughout and which character strings separate them: “=== prompt section title ===”. This technique improves coherence over long context windows.
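In case it helps to see the pattern concretely, here is a toy Python sketch of how that kind of prompt could be assembled (the section names and bracketed contents are made up for illustration; this is the index-plus-delimiters pattern, not my actual assembly code):

```python
from datetime import datetime

def build_prompt(sections: dict[str, str]) -> str:
    """Assemble a single prompt with an index of sections and '=== Title ===' delimiters."""
    parts = [f"Time Now: {datetime.now():%Y-%m-%d %H:%M:%S}"]

    # Index up front, like a book's table of contents, so the model knows what to expect.
    parts.append("=== Prompt Index ===")
    parts.extend(f"{i}. {title}" for i, title in enumerate(sections, start=1))

    # Every section gets the same delimiter pattern and reports its own size.
    for title, body in sections.items():
        parts.append(f"=== {title} === Character Count: {len(body)}")
        parts.append(body)

    return "\n".join(parts)

# Illustrative sections only; in practice these would be read from the state files.
prompt = build_prompt({
    "System Context": "[intro text]",
    "Your Goals": "[contents of state_files/goals.json]",
    "Previous Thought": "[contents of state_files/previous_thought.txt]",
})
print(prompt)
```

The idea is that the index tells the model up front what it is about to read, and the repeated delimiter gives it a consistent anchor string for each section.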
Ah, yes, you are right. And it's actually quite discouraging that
“Gemini 2.5 Pro loses coherence at 35k with my prompts”
because I thought Gemini 2.5 Pro was supposed to be the model that had finally mostly fixed the recall problems in long context (if I remember correctly).
So you seem to be saying that this recall depends much more strongly on the nature of the input than one would infer from a brief look at published long-context benchmarks… That's useful to keep in mind.