[Question] How do I design long prompts for zero-shot "thinking" systems with distinct, roughly equal-sized prompt sections (mission, goals, memories, how-to-respond, etc.), and how do I maintain LLM coherence?

I see a lot of marketing about "max input tokens" being in the hundreds of thousands or millions of tokens, but my theory is that those numbers only hold up for simple prompts with one strong directive, like "summarise this data; here is the data...".

With a prompt that has no single strong directive and is instead made up of roughly equal-sized sections, you lose coherence much faster. With my prompts, Gemini 1.5 Pro lost coherence at around 25k input tokens, and Gemini 2.5 Pro loses it at around 35k.

Imagine you're trying to construct a "thought" for an LLM, where actions are extracted from the response and run (there's a sketch of the assembly right after this list). The prompt has sections like:

  • introduction

  • mission

  • explanation of system state and architecture

  • previous actions taken and their outcomes

  • historical summarised actions taken

  • goals

  • predictions and prediction outcomes

  • recent memories

  • summarised memories

  • chat history between me and the llm

  • reward

  • performance metrics

  • app logs

  • how to respond

...and so on.
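For concreteness, here's a minimal sketch of how a prompt like this can be assembled with per-section token budgets. Everything in it (the section names, the budget numbers, the 4-chars-per-token estimate) is illustrative, not my exact setup:

```python
from dataclasses import dataclass

@dataclass
class Section:
    name: str
    body: str
    max_tokens: int  # rough per-section budget

def count_tokens(text: str) -> int:
    # crude stand-in (~4 chars per token); swap in your model's real tokenizer
    return len(text) // 4

def build_prompt(sections: list[Section]) -> str:
    parts = []
    for s in sections:
        body = s.body
        while count_tokens(body) > s.max_tokens:
            cut = body.find("\n")
            if cut == -1:                        # one long line: hard truncate
                body = body[-s.max_tokens * 4:]
                break
            body = body[cut + 1:]                # drop the oldest line first
        parts.append(f"## {s.name.upper()}\n{body}")
    return "\n\n".join(parts)

prompt = build_prompt([
    Section("mission", "Keep the service healthy and improve it.", 500),
    Section("previous actions and their outcomes", "...", 3000),
    Section("recent memories", "...", 4000),
    Section("how to respond", "Reply with a JSON list of actions.", 800),
])
```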

Here's a visual of how I'm constructing prompts and where I hit this problem: two charts showing the two kinds of prompts, one simple, and one with many distinct prompt sections. Each bar segment represents a prompt section, and the full bar is the prompt's total input tokens:

(Imagine the directive comes first, then the data in the prompt; I couldn't wrangle Office charts quickly this time.)

Is there any research in this space that I can read or watch?

What I'm working on is dynamically constructing prompts where the LLM's response is parsed into actions, and on the next run the LLM receives the actions' outcomes in the prompt. I'm having fun thinking about how we can construct a thought for an LLM. Kind of like making a zero-shot self-improving system.
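The loop I'm describing looks roughly like this (a sketch under my own conventions: `extract_actions` assumes the how-to-respond section asked for a JSON array of actions, and `llm_call` / `make_prompt` / `run_action` stand in for whatever your stack provides):

```python
import json

def extract_actions(response: str) -> list[dict]:
    # assumes the how-to-respond section asked for a JSON array,
    # e.g. [{"action": "write_file", "args": {...}}]
    start, end = response.find("["), response.rfind("]") + 1
    return json.loads(response[start:end]) if start != -1 else []

def step(llm_call, make_prompt, run_action, history: list[dict]) -> None:
    """One 'thought': build prompt -> call model -> run actions -> record outcomes."""
    response = llm_call(make_prompt(history))
    for action in extract_actions(response):
        outcome = run_action(action)  # shell command, API call, whatever
        history.append({"action": action, "outcome": outcome})
    # on the next step, the "previous actions and their outcomes" section is
    # rebuilt from history, so the model sees what its own actions did
```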

I have a dynamically assembled prompt with 16 sections that is working amazingly well, but are there any traps I should know about as I expand and enhance it?

For example, adding a predictions section, and a goals section that the LLM itself maintains, made a huge difference in what was achieved.
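One way to make those sections durable is to let the model rewrite them and persist whatever it emits. A tiny sketch (the `### GOALS` marker is just my own convention, requested in the how-to-respond section):

```python
def persist_goals(response: str, state: dict) -> None:
    # if the model emitted an updated goals block, store it verbatim so the
    # next prompt's goals section is whatever the model itself last wrote
    if "### GOALS" in response:
        block = response.split("### GOALS", 1)[1]
        state["goals"] = block.split("###", 1)[0].strip()  # stop at next header
```

Predictions work the same way: the model writes them, and on a later run the system appends the observed outcome next to each one.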

I want to avoid a multi-agent AI architecture here; I'm interested in zero-shot prompting.