TL;DR

This work shows that entity recognition of ‘Mary Mallon’ (aka Typhoid Mary) in Llama 3.1 8B acts as a selective modulator of in-context evidence in a fictional disease outbreak scenario. Both behavioral readouts (p(YES), log odds) and geometric analysis of decision-token vectors support this finding. The observed modulations were consistent with two organizing patterns: epidemiological plausibility and narrative alignment with Mary Mallon’s known history. If these early insights generalize, they could be a step towards a better understanding of how stored entity knowledge shapes, or biases, in-context information processing in language models.

Abstract

Introduction

Language models store factual associations about named entities, such as people, places, landmarks, songs, and other recognizable concepts, which can bias model responses. Here I analyze how a recognized entity (and its associations) interacts with in-context information, to better understand the mechanisms of bias in language models.

Methods

To study this, Llama 3.1 8B Instruct was given a hypothetical outbreak scenario with various epidemiological variables and asked whether the given information is sufficient to initiate public health actions, using p(YES) and log odds as readouts.

Borrowing from the microbiology toolbox, I used a factorial prompt design to study a dose->response pattern. More specifically, the epidemiological variables in the prompt represented different levels of evidence for or against public health actions, e.g. negative->unknown->positive lab results reflect increasing evidence in support of initiating actions. A factorial combination of four epidemiological variables then allowed me to analyze a dose->response pattern on a numerical output, i.e. p(YES) and log odds.

In a second step, I used activation patching to localize where across Llama’s layers the Mallon-effect assembles, and confirmed the Mallon-effect at the geometric level, i.e. via norm distances and cosine similarities of the decision-token vectors from the residual stream.

Results

I showed that in an unbiased baseline, the numerical output (p(YES) and log odds) was associated with the strength of the epidemiological input variables: for example, scenarios with a positive lab result produced larger log odds than scenarios with negative lab results; these gradient-like associations held across all variables.

As expected, when the suspected source of the outbreak was named “Mary Mallon” (aka Typhoid Mary), log odds shifted notably upwards compared to control names. More interestingly, this increase in log odds (the “Mallon-effect”) was not a numerically constant prior but differed notably by the epidemiological variables presented in the prompt.

Geometric analysis of the decision token vectors allowed more granular insights:

First, I could confirm at the vector level that the Mallon-effect indeed differed by epidemiological variable, as shown by the differential divergence of norm distances and cosine similarities between Mallon and control name representations. Importantly, the control names showed no such variable-dependent separation.

Second, I could analyze the pattern of this variable-dependent modulation. The data suggest two overlapping patterns, whereby the presence of the Mallon name selectively amplified the response (1) along the epidemiological plausibility gradient and (2) along narrative alignment with the original Mary Mallon story. However, more experiments are required to confirm these working hypotheses.

A free-text exploration raised an even more fundamental question: the name “Mary Mallon” caused “isolation/quarantine” to jump to the top of the model’s listed public health actions regardless of lab results, suggesting that entity recognition may not just modulate confidence in the fixed YES-NO answer, but could reorganize what question the model is effectively answering.

Summary & Conclusion

Taken together, these findings suggest that Mary Mallon entity recognition in Llama 3.1 does not merely act as a constant additive prior but as a context-dependent modulator that varies along recognizable gradients of epidemiological plausibility as well as narrative alignment with the entity’s known history. Furthermore, early evidence suggests that the retrieved associations could bias which question the model is actually answering.

1- Introduction

Language models store factual associations about named entities, such as people, places, landmarks, songs, and other recognizable concepts (e.g. Michael Jordan, The Eiffel Tower, Yellow Submarine). It has been shown that person names, for example, can shape model behavior, resulting in stereotyped response biases along socio-demographic factors such as gender, age, religion and more (Parrish et al. 2022, Salinas et al. 2024).

At mechanistic level, a substantial body of work has investigated how these associations are stored, retrieved, and used:

Geva et al. (2021) showed that feed-forward layers operate as key-value memories, where each key correlates with textual patterns from training and each value induces a distribution over the output vocabulary. Meng et al. (2022) further localized this and showed that middle-layer MLPs at the last token of the entity are the key mediators of factual recall. In subsequent work, Meng et al. (2023) established that this storage is distributed across multiple adjacent layers rather than confined to a single layer.

Geva et al. (2023) then described a three-step retrieval mechanism for attribute extraction whereby early MLPs enrich the last entity token with many associated attributes, attention heads extract the relevant attribute given the relation, and specialized heads route it to the output position. Chughtai et al. (2024) showed that factual recall is not performed by a single circuit but by multiple independent mechanisms whose contributions add up at the output, a pattern they call the “additive motif”.

Ferrando et al. (2025) showed the impact on downstream behavior at the mechanistic level: activations at the entity token differ measurably between named entities the model recognizes and those it does not, and these differences causally determine whether the model produces an answer or refuses.

Together, these studies trace a detailed pipeline from storage through retrieval to output, and they share a common design: the entity was always the queried subject, i.e. the prompt asks the model to recall a fact about that entity (for example, “Elvis Presley played the ___”, “The Space Needle is located in the city of ___”).

What remains largely unexplored, to the best of my knowledge, is what happens when a recognized entity is present in a prompt but is not the target of retrieval, while its stored associations can still interact with other information the model is processing.

I therefore analyzed how a named entity and its associations interact with in-context information. Does entity recognition add a constant bias to the model’s output, or does it selectively modulate how in-context facts are processed? And can this modulation be characterized not just behaviorally, but geometrically, in the model’s internal representations?

To study this, I chose a named entity with an unusually strong and specific historical signal: Mary Mallon (better known as Typhoid Mary), the first identified asymptomatic carrier of Salmonella Typhi in epidemiological history.

2- Experimental setup

Prompt

Variables

Names

Llama 3.1 8B Instruct (Safetensors HF, original weights), temp=0. Analyses used the output of a single forward pass, without reasoning.
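A minimal sketch of how the two readouts could be derived from the decision-position logits of that single forward pass. It assumes p(YES) is the full-vocabulary softmax probability of a single YES token and log odds is log p(YES) − log p(NO); the post does not spell out either definition (log(p/(1 − p)) would be an alternative), and `yes_readouts`, `yes_id` and `no_id` are hypothetical names.

```python
import math


def yes_readouts(logits, yes_id, no_id):
    """Return (p_yes, log_odds) from one next-token logit vector.

    Assumed definitions (hypothetical, not confirmed by the post):
    p(YES) = softmax probability of the YES token over the full vocabulary,
    log odds = log p(YES) - log p(NO).
    """
    # Numerically stable log-sum-exp over the whole vocabulary
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    log_p_yes = logits[yes_id] - log_z
    log_p_no = logits[no_id] - log_z
    # The normalizer cancels, so log odds equals the raw YES-minus-NO logit gap
    return math.exp(log_p_yes), log_p_yes - log_p_no
```

Because the softmax normalizer cancels, this log odds definition is insensitive to probability mass on other tokens, whereas p(YES) is not.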

Design Notes

Borrowing from the microbiology toolbox, I used this factorial prompt design to study a dose->response pattern. More specifically, the epidemiological variables in the prompt represented different levels of evidence for or against public health actions, e.g. negative->unknown->positive lab results reflect increasing evidence in support of initiating actions. Similarly, the gradients asymptomatic->symptomatic and carpenter->food-handling occupations both increase the epidemiological evidence. A factorial combination of four epidemiological variables then allowed me to analyze a dose->response pattern on a numerical output, i.e. p(YES) and log odds.
Scenario most aligned with the historical event: {name} Mary Mallon, {symptoms} no, {occupation} cook, {result} positive, {pathogen} Salmonella Typhi
Ambiguity as a tool: The decision question is deliberately underspecified, i.e. “public health action” could mean anything from further investigation to forced quarantine. This is intentional: bias effects emerge more readily under uncertainty [Parrish et al. 2022], and the ambiguity also allows me to later probe what the model itself associates with the task (Section 3.5). In practice, “public health actions” can range from contact tracing and additional sampling to health education, notification of authorities, and, under specific conditions, quarantine of infected individuals. Which actions to initiate in real life depends on the circumstances though.
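The factorial design can be sketched as a full cross of the variable levels. The levels below are those named in the results sections (the full list in the original Variables table may differ), and the placeholders mirror the {name}, {symptoms}, {occupation}, {result}, {pathogen} slots above; the actual prompt wording is not reproduced here, so `TEMPLATE` and `build_conditions` are hypothetical.

```python
from itertools import product

# Levels as named in the results; ordered from weaker to stronger epi evidence
PATHOGENS = ["Salmonella Typhi", "Vibrio cholerae"]
OCCUPATIONS = ["carpenter", "waiter", "cook"]       # non-food -> food handling
SYMPTOMS = ["no", "yes"]                            # asymptomatic -> symptomatic
LAB_RESULTS = ["negative", "unknown", "positive"]   # increasing evidence

# Hypothetical placeholder wording, not the post's actual prompt
TEMPLATE = ("Suspected source: {name}. Occupation: {occupation}. "
            "Symptoms: {symptoms}. Lab result for {pathogen}: {result}.")


def build_conditions(names):
    """Yield (cell, prompt) for every name x pathogen x occupation x symptoms x lab cell."""
    for name, pathogen, occupation, symptoms, result in product(
            names, PATHOGENS, OCCUPATIONS, SYMPTOMS, LAB_RESULTS):
        cell = dict(name=name, pathogen=pathogen, occupation=occupation,
                    symptoms=symptoms, result=result)
        yield cell, TEMPLATE.format(**cell)
```

With these levels, each name yields 2 × 3 × 2 × 3 = 36 prompts, so every name is tested on an identical grid and differences between names can be read cell by cell.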

3 - Results

3.1 - Baseline: Llama’s Response to Epidemiological Variables (No Names)

Before introducing any names, I established how Llama responds to the epidemiological variables alone. These baseline prompts contained no names and therefore no name-driven attention effects.

The results show a clean association between the prompt variables and p(YES) (Fig.1). Llama’s responses are in line with what an epidemiologist would consider a plausible gradient: for the two fecal-orally transmitted pathogens, food-handling occupations led to a higher p(YES) than non-food occupations; positive lab results resulted in a higher p(YES) than unknown labs, which were higher than negative lab results. Combining positive labs with symptoms increased p(YES) further (Fig. 1). In this scenario, the lab result was the strongest variable across all conditions, similar to a real-world outbreak assessment.

Of note, this does not mean the model performs epidemiological reasoning; rather, these gradients more likely reflect learned associations from training material, where positive lab results, symptoms, and occupational risk factors co-occur with confirmed outbreaks and initiated public health responses.

Still, the fact that the gradient is epidemiologically coherent makes it useful, because it allows a structured investigation of how strongly the input variables drive the output: it establishes a kind of dose->response system that I can then bias with “Mary Mallon”.

Figure 1: Absolute p(YES) across factorial conditions (occupation × symptoms × lab results) for nameless prompts.

3.2 - The Mary Mallon Effect on p(YES) and Log Odds

The Mallon Effect Increases p(YES) & Log Odds but Leaves the Gradient Intact

Next, I introduced names into the prompt using two comparison groups: 33 common American names (controls) and three near-miss spellings of Mallon (Mallan, Mallen, Mallun). Adding names had little to no effect on p(YES) for control and near-miss groups and the baseline evidence gradient remained intact (Fig. 2).

When the suspected source was named “Mary Mallon”, p(YES) shifted upwards compared to both control and near-miss names; the effect holds in log odds (Fig. 2). This bias effect was expected, given that Mary Mallon represents a strong signal in the training material, and it has been shown before that factual knowledge memorization correlates strongly with an entity’s frequency of occurrence in training material [Mallen et al. 2023].

Interestingly though, the previously observed evidence gradient remained largely intact under Mallon conditions. Food-handling occupations still produced higher p(YES) than non-food occupations; negative lab results and the absence of symptoms still produced lower values (Fig.2).

This is relevant because it indicates that the general process of answer computation under Mallon conditions still seems to be the same as under clean, unbiased conditions, i.e. the Mallon effect seems to shift the response but does not flatten the epi-evidence gradient.

Figure 2: Absolute p(YES) (top) and log odds (bottom) for Salmonella Typhi across factorial conditions (name group × occupation × symptoms × lab results). Values for the control group (n = 33) and the near-miss group (n = 3) are group means.


The Mallon Effect Is Variable-Dependent, Not a Constant Prior

If the Mallon effect were a simple additive prior, i.e. a flat confidence boost applied regardless of context, the Δ log odds between Mallon and near-miss names should be roughly constant across conditions.

Instead, the Mallon-driven increase in log odds varied systematically with the epidemiological variables in the prompt (Fig. 3). For both pathogens, the Mallon-effect was largest when lab results were positive, smaller for unknown results, and smallest, sometimes vanishing into noise, when lab results were negative. This differential increase (positive > unknown > negative) tracks what would be considered increasing plausibility for an actual outbreak and the suspect in question being the source of it.

For Vibrio cholerae, a similar pattern emerges for symptoms: symptomatic suspect scenarios show a larger Mallon effect than asymptomatic suspect ones (Fig.3), which also aligns with epidemiological plausibility.

For Salmonella Typhi, however, the symptom pattern breaks: here, the asymptomatic cook shows a larger Mallon effect than the symptomatic cook (Fig.3). This is the opposite of what epidemiological plausibility would predict but it is exactly what the original narrative would predict, since Mary Mallon was famously asymptomatic. The same inversion appears for the asymptomatic waiter (also a food-handling occupation).

These patterns point to two possible organizing principles of this differential Mallon-effect:

Epidemiological plausibility: The Mallon effect scales with how strongly the in-context variables support the conclusion that an outbreak is occurring or that the suspect is the source of the outbreak (positive labs > unknown > negative; symptomatic > asymptomatic for cholera).
Narrative alignment: The Mallon effect scales with how closely the scenario matches the original Typhoid Mary story (S. Typhi > V. cholerae; asymptomatic cook > symptomatic cook for S. Typhi).

For most conditions, these two principles make the same prediction. However, the gradients are occasionally fuzzy or disappear into noise (e.g. for unknown or negative lab results); I return to this in the geometric analysis and the discussion.

Figure 3: Effect size (Δ log odds between Mallon and near-miss name mean) for Salmonella Typhi (upper row) and Vibrio cholerae (bottom row) across factorial conditions (name group, occupation, symptoms and lab results). Values for the control group (n = 33) and near-miss group (n = 3) are means. Effect sizes in log odds are calculated as the difference between Mallon and each individual name, then averaged by name group.
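The effect size in Figure 3 can be sketched as follows: for each condition, subtract each comparison name’s log odds from Mallon’s and average within the name group. Under the constant-prior hypothesis, the spread of these per-condition deltas would be close to zero. The dictionary layout and function names are hypothetical, and the toy numbers are made up for illustration, not data from this work.

```python
from statistics import mean


def delta_log_odds(log_odds, target, group):
    """Per-condition effect size: mean over `group` of (target log odds - name log odds).

    log_odds maps (name, condition) -> log odds; all names share the same conditions.
    """
    conditions = sorted({cond for _, cond in log_odds})
    return {
        cond: mean(log_odds[(target, cond)] - log_odds[(name, cond)] for name in group)
        for cond in conditions
    }


def effect_spread(deltas):
    """Range of the effect across conditions; ~0 would indicate a constant prior."""
    return max(deltas.values()) - min(deltas.values())


# Toy numbers for illustration only (not results from this work)
toy = {
    ("Mary Mallon", "positive"): 3.0, ("Mary Mallan", "positive"): 1.0,
    ("Mary Mallen", "positive"): 1.5,
    ("Mary Mallon", "negative"): -2.0, ("Mary Mallan", "negative"): -2.1,
    ("Mary Mallen", "negative"): -1.9,
}
deltas = delta_log_odds(toy, "Mary Mallon", ["Mary Mallan", "Mary Mallen"])
```

Here the toy effect is 1.75 under positive labs and about 0 under negative labs, i.e. a non-zero spread of the kind a purely additive prior would not produce.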

3.3 - Mechanistic Analysis: Where and How the Mallon Effect Forms

Before analyzing the vector geometry in detail, we first need to establish where (which token) and when (which Llama layer) the effect assembles.

To answer this, I used activation patching: replacing specific token representations in a Mallon prompt with the corresponding representations from a clean Mallan prompt, then measuring how much of the excess Mallon signal is reduced at the final layer.

I call this the cleaning rate: the percentage of excess Mallon log odds that is eliminated by a given patch. A cleaning rate of 100% means the patch fully restores the clean Mallan-level output; 0% means the patch has no effect. This analysis shows which token is relevant at which layer for establishing the effect.
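Written as a formula, with LO denoting the final-layer log odds and (t, ℓ) the patched token position and layer (notation introduced here for illustration):

```latex
\text{cleaning rate}(t,\ell) = 100\% \times
  \frac{\mathrm{LO}_{\text{Mallon}} - \mathrm{LO}_{\text{patched}}(t,\ell)}
       {\mathrm{LO}_{\text{Mallon}} - \mathrm{LO}_{\text{Mallan}}}
```

For instance, with hypothetical values LO_Mallon = 3.0, LO_Mallan = 1.0 and a patched run at 1.42, the cleaning rate is 100% × 1.58 / 2.0 = 79%. Values outside 0 to 100% are possible in principle if a patch overshoots or amplifies the excess.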

Target prompt (biased): Red numbers depict token positions. [...] indicates Llama’s prompt formatting tokens. The target prompt scenario was chosen with cook/Salmonella Typhi/asymptomatic/positive lab results because the Mallon-effect size was the largest.

Source prompt (clean): Identical, except “Mary Mallan” replaces “Mary Mallon.”
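To make the patching procedure concrete, here is a toy sketch in plain Python rather than on Llama itself: a hypothetical causal model over scalar residuals (`run`) in which each layer mixes information from earlier positions into later ones, with the last position’s final residual as the readout. The clean run is cached, then each (layer, position) residual of the biased run is overwritten with its clean counterpart and the cleaning rate is recorded. All names and numbers are illustrative assumptions; on the real model, the same loop would be implemented with forward hooks, for example PyTorch’s `register_forward_hook` or TransformerLens’s `run_with_cache` and `run_with_hooks`.

```python
def run(embeddings, n_layers, patch=None):
    """Toy causal model on scalar residuals (hypothetical, for illustration).

    patch: optional (layer, position, value); overwrites the residual entering
    `layer` at `position` (layer == n_layers is the final residual stream).
    Returns (readout at last position, cache of residuals entering each layer).
    """
    resid = list(embeddings)
    cache = []
    for layer in range(n_layers + 1):
        if patch is not None and patch[0] == layer:
            resid[patch[1]] = patch[2]
        cache.append(list(resid))
        if layer == n_layers:
            break
        # Causal mixing: each position adds half the mean of positions <= itself
        prefix_means = [sum(resid[:t + 1]) / (t + 1) for t in range(len(resid))]
        resid = [r + 0.5 * m for r, m in zip(resid, prefix_means)]
    return resid[-1], cache


def patching_grid(biased_emb, clean_emb, n_layers):
    """Cleaning rate (%) for patching each (layer, position) from the clean run."""
    lo_biased, _ = run(biased_emb, n_layers)
    lo_clean, clean_cache = run(clean_emb, n_layers)
    grid = {}
    for layer in range(n_layers + 1):
        for pos in range(len(biased_emb)):
            lo_patched, _ = run(biased_emb, n_layers,
                                (layer, pos, clean_cache[layer][pos]))
            grid[(layer, pos)] = 100.0 * (lo_biased - lo_patched) / (lo_biased - lo_clean)
    return grid


# Position 2 plays the role of the single differing token ("on" vs "an")
grid = patching_grid([0.1, 0.2, 1.0, 0.3], [0.1, 0.2, 0.0, 0.3], n_layers=4)
```

In this toy, patching the differing token before layer 0 removes 100% of the excess, and so does patching the readout position after the last layer; in between, the signal has spread across positions and single-position patches clean only part of it, loosely analogous to the pattern described below.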

The analysis showed that the Mallon effect originates at the “on” token, i.e. the final token of the name “Mallon”, which is the only token that differs from “Mallan” (Fig.4). In early layers (0–5), patching this single token from the clean Mallan context removes 100% of the excess log odds.

By layer 6, the effect begins to propagate to the adjacent “)” token: patching “)” alone now cleans 21% of the signal, while patching “on” alone drops to 71%. The effect continues to spread through subsequent prompt tokens, with the bulk of the assembly occurring across tokens 130–140 in roughly layers 7–17, which are the tokens immediately preceding the decision position (Fig.4). The repeated mentions of “Salmonella Typhi” (3rd and 4th occurrence) also contribute modestly between layers 7 and 11, but other individual meaning-bearing tokens like “positive” or “infection” show no notable role.

By approx. layer 12, the Mallon-effect begins to assemble on the decision token itself (12% cleaning rate), rising to >90% by layer 17. By layer 26, the decision token captures 100% of the Mallon effect and patching only this single token fully restores the clean output. At this point, the effect seems to be fully committed and no upstream patch can reverse it.

Figure 4: Cleaning rate (% reduction of excess Mallon log odds) by token and layer, when the indicated token is patched from a clean Mallan context into the biased Mallon context.

The tokens “Mary” and “Mall” never carry the Mallon-effect in this patching setup, which is expected as they appear before the disambiguating “on” token. However, a separate control experiment confirms they are necessary for constructing the Mallon entity: patching “Mary” with “Jane” and “Mall” with “Gall” (which produce the same p(YES) and log odds as near-miss and control names, data not shown) eliminates the effect. Both tokens are read at their position in layer 0; patching after that point has no cleaning effect (data not shown).

Even though in my setup the stored associations are not the predicted objects (as they are in factual retrieval), the above results are consistent with the entity recognition and retrieval locations and processes described in more detail by Meng et al. 2022 and Geva et al. 2023, i.e. enrichment of the entity representation with associated attributes at the final subject token (here “on”) and transfer of this information to relevant positions such as the decision token.

3.4 - Geometric Analysis: The Mallon Effect in the Residual Stream

The behavioral analysis showed that the Mallon effect varies by the epidemiological variables in the prompt. This analysis investigates whether this differential pattern also appears in the geometry of the model’s internal representations.

To test this, I compared the decision-token vectors from the full residual stream across conditions. For each condition (fixed occupation, symptoms, lab result, pathogen), I computed pairwise cosine similarities and norm distances between:

[Mal–NM] pairs: Mallon vs. each near-miss name (the Mallon effect)
[NM–NM] pairs: near-miss names vs. each other (the control baseline)

If the Mallon effect acts differentially, the [Mal–NM] distances should separate by variable level. If it’s just a generic processing difference between any two names, the [NM–NM] distances should show the same separation.
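The pairwise comparison for one condition and one layer can be sketched as follows. It assumes one decision-token residual vector per name and reads “norm distance” as the Euclidean (L2) norm of the difference vector; the post does not state whether vectors were normalized first. Function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def norm_distance(u, v):
    # Assumed meaning of "norm distance": Euclidean (L2) norm of u - v
    return math.dist(u, v)


def pair_metrics(vectors, target="Mary Mallon"):
    """vectors: name -> decision-token residual vector for one condition and layer.

    Returns ([Mal-NM] pairs, [NM-NM] pairs) as lists of (cosine, norm distance).
    """
    near_miss = [n for n in vectors if n != target]
    mal_nm = [(cosine(vectors[target], vectors[n]), norm_distance(vectors[target], vectors[n]))
              for n in near_miss]
    nm_nm = [(cosine(vectors[a], vectors[b]), norm_distance(vectors[a], vectors[b]))
             for a, b in combinations(near_miss, 2)]
    return mal_nm, nm_nm
```

With three near-miss names this gives three [Mal–NM] and three [NM–NM] pairs per condition and layer; repeating it over layers yields curves of the kind shown in Figure 5.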

Lab results: a clean gradient

Figure 5 shows cosine similarity (top) and norm distance (bottom) across layers for four example conditions. From approximately layer 13 onward, the [Mal–NM] pairs diverge from the [NM–NM] baseline, and the extent of divergence depends on lab results: positive labs produce the largest departure from near-miss vectors, negative labs the smallest, with unknown results intermediate.

Figure 5: Cosine similarity (top) and norm distance (bottom) for [Mal–NM] pairs (solid lines) and [NM–NM] controls (dotted lines), layers 0–31. Colors indicate lab results (positive/unknown/negative). Four example conditions shown (left to right): Typhi/cook/asymptomatic, Typhi/waiter/symptomatic, Cholera/waiter/symptomatic, Cholera/carpenter/asymptomatic.

This pattern holds across all conditions at statistically significant levels (Fig. 6): at layer 27, for example (i.e. one layer after the Mallon effect is fully committed to the decision token (Fig. 4)), the pairwise norm distance between Mallon and near-miss representations is largest under positive lab results (mean = 1.31), smallest under negative results (mean = 0.72), with unknown results in between (mean = 1.09; Kruskal–Wallis (KW) p < 0.001) (Fig.6 bottom).

Crucially, the [NM–NM] pairs remain tightly bundled across all layers with no variable-dependent separation (Fig.5). Furthermore, [NM–NM] distances pooled across all conditions showed no variation by lab result (KW p = 0.42) (Fig.6), suggesting that the observed divergence is specific to the Mallon name and not an artifact of token-level differences between “positive,” “unknown,” and “negative.”
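The lab-result comparison is a Kruskal–Wallis test across three groups of pairwise distances. In practice `scipy.stats.kruskal` does this; the dependency-free sketch below only makes the statistic explicit. It computes H from average ranks with the usual tie correction and, because three groups give 2 degrees of freedom, uses the closed-form chi-square survival function exp(−H/2) for the p-value. It is a three-group-only illustration with hypothetical function names, not the analysis code used here.

```python
import math
from collections import Counter


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def kruskal_wallis_3(g1, g2, g3):
    """Kruskal-Wallis H and p-value for exactly three groups (chi-square, 2 df)."""
    groups = [list(g1), list(g2), list(g3)]
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Correct H for ties in the pooled data
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    h /= 1 - ties / (n ** 3 - n)
    # Survival function of a chi-square distribution with 2 df is exp(-x / 2)
    return h, math.exp(-h / 2)
```

Usage: pass the [Mal–NM] norm distances at one layer, split by lab result, e.g. `kruskal_wallis_3(dist_negative, dist_unknown, dist_positive)`.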

Figure 6: Scatter plots with statistics for [Mal–NM] and [NM–NM] pairs. Cosine similarity (top) and norm distance (bottom) at layers 20 and 27. All pairs stratified by lab result only. Statistics: ANOVA and Kruskal–Wallis.

Beyond lab results: pathogens, occupation, and symptoms

The differential Mallon-effect extends to the other epidemiological variables too. Because lab results dominate the overall effect, the following analyses are stratified by lab result to make the patterns more visible. Figures 7 & 8 below show the cosine similarities and norm distances by variable across layers; Figure 9 shows comparisons and statistics for example layer 27.

Pathogens: Both Salmonella Typhi and Vibrio cholerae vectors under Mallon conditions diverge from the near-miss baseline in both direction and magnitude, whereby S. Typhi consistently shows a larger divergence than V. cholerae across all three lab result levels (Fig. 7 & 8 top, Fig. 9 left). The divergence also scales with lab result within each pathogen, i.e. largest for positive, smallest for negative. This pattern aligns with both narrative alignment (Typhi = original story) and epidemiological plausibility (positive labs = stronger epi evidence).

Figure 7: Cosine similarities for [Mal–NM] pairs (solid) and [NM–NM] controls (dotted), layers 0–31. Shaded areas = SD. Rows: pathogens, symptoms, occupations. Columns: lab result stratification.

Figure 8: Norm distances for [Mal–NM] pairs (solid) and [NM–NM] controls (dotted), layers 0–31. Shaded areas = SD. Rows: pathogens, symptoms, occupations. Columns: lab result stratification.

Figure 9: Scatter plots with statistics for [Mal–NM] and [NM–NM] pairs. Cosine similarity (top) and norm (bottom) at layer 27. Left to right: pathogen, symptoms, occupation, each stratified by lab result. Statistics: MW = Mann–Whitney; KW = Kruskal–Wallis.

Occupation: For both positive and unknown lab results, the cook shows the largest divergence from near-miss vectors, followed by the waiter, then the carpenter (Fig. 7 & 8 bottom, Fig. 9 right), consistent with both the original narrative and the epidemiological relevance of food-handling occupations for fecal-orally transmitted pathogens.

For negative lab results, however, occupation makes no difference: all three occupations diverge from the near-miss baseline to a similar degree, with no significant separation between them. If the Mallon effect operated only through narrative alignment, we would expect the cook to still receive a selective increase under negative labs. Instead, the pattern is more consistent with epidemiological plausibility: when the infection cannot be confirmed by a lab result, occupation becomes less relevant.

Symptoms: This is where the picture gets complicated, but also most interesting. Under Mallon conditions, both symptomatic and asymptomatic scenario vectors diverge from the near-miss baseline, but the direction of the difference between them flips by lab result (Fig. 7 & 8 middle, Fig. 9 middle).

For positive and negative lab results, the asymptomatic condition shows the larger effect; for unknown lab results, the symptomatic condition shows the larger effect. However, notable statistical support exists only for the unknown lab result (p < 0.005 at layer 27) (Fig. 9). The negative lab result comparison reaches significance only in late layers, while the positive lab result comparison does not reach significance at any layer (Fig. 9, Appendix Fig. 4).
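The two-group comparisons in Figure 9 use the Mann–Whitney test. As with Kruskal–Wallis, `scipy.stats.mannwhitneyu` is the practical choice; the dependency-free sketch below counts U directly over all pairs (ties count one half) and uses the large-sample normal approximation without tie or continuity correction, so it will differ from exact small-sample p-values. The function name is hypothetical.

```python
import math


def mann_whitney(x, y):
    """Return (U for x, two-sided p) using the normal approximation."""
    n1, n2 = len(x), len(y)
    # U counts pairs where x beats y; ties contribute one half
    u = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    # Two-sided p = 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    return u, math.erfc(abs(z) / math.sqrt(2))
```

Usage: compare, for example, symptomatic vs. asymptomatic [Mal–NM] distances within one lab-result stratum at a given layer.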

To get a cleaner picture, we would have to disaggregate the data further, or simply return to the effect sizes in log odds, which were computed for individual conditions (Fig. 3). These data show the element of narrative alignment more clearly: an asymptomatic cook with suspected S. Typhi infection (higher narrative alignment) showed a larger effect than the symptomatic cook (Fig. 3), while under Vibrio cholerae conditions (lower narrative alignment), the symptomatic cook showed the larger effect size. The epidemiologically aligned pattern (symptomatic > asymptomatic) holds for all Vibrio cholerae conditions.

The narrative alignment argument based on “no symptoms” might seem weak compared with the other epidemiological variables, but it is in fact quite remarkable: being asymptomatic was the defining feature of the story, as Mary Mallon was the first identified asymptomatic carrier of S. Typhi in the US, which is the key reason her story became so famous.

3.5 - Retrieval of entity associations that bias Llama’s response?

We know that entity recognition in LLMs retrieves stored associations, and that this retrieval impacts downstream responses [Meng et al. 2022, Geva et al. 2023, Ferrando et al. 2024].

The deliberately under-specified prompt (“....sufficient to justify public health actions?”) provides an opening to test this. I asked Llama directly about the meaning of “public health actions” in the context of each prompt, using the free-text question:

For each prompt, Llama listed 6–8 standard public health interventions (Fig. 10), and the content and ordering of these lists varied in very revealing ways:

For control names (Mary Smith & Mallan), the priority intervention scaled with lab results in an epidemiologically coherent way: “Investigate the household” for negative labs → “Investigate the cook” for unknown labs → “Isolate the cook” for positive labs (Fig. 10). This escalation makes sense: a positive lab result crosses a threshold that justifies more aggressive action.

For Mary Mallon, “Isolate the cook” jumped to the #1 position regardless of lab results, including negative results, where no control name produced isolation as a priority. Furthermore, the model’s explanations reveal that it clearly recognizes Mary Mallon (Fig. 10).

This pattern is most striking for the Salmonella Typhi scenario, which matches the original event. For Vibrio cholerae (less aligned with the historical event), the pattern is weaker: isolation appears in control name responses too (reflecting the higher baseline severity of cholera), but the Mallon-specific promotion of isolation to the top position for positive labs is still visible (Fig. 10).

Figure 10: Free-text responses from Llama about the meaning of “public health actions” for different name conditions. Isolation-related points are shown verbatim with Llama’s explanations; other listed interventions are abbreviated. (Note: all prompts use cook/no symptoms.) These questions were posed to the GGUF version of Llama 3.1 8B Instruct in LM Studio (default settings, temp = 0).

These findings suggest two things. First, Llama does contextually recognize Mary Mallon and likely retrieves the association with isolation/quarantine, the defining feature of her historical story. Second, and more fundamentally: if the name changes which intervention the model foregrounds, it may change what question the model is effectively answering. The probability that any epidemiologist would endorse “sufficient to justify public health action” changes dramatically depending on whether the implied action is “conduct further investigation” or “isolate the individual.” The Mallon effect may not just shift confidence in a fixed answer; it could reorganize the task representation itself.

This is a preliminary observation from a free-text exploration, and it needs dedicated follow-up.

4- Discussion

What the data show

The core finding of this work is that the named entity “Mary Mallon” does not simply add a constant bias to the model’s decision but instead selectively modulates how the model processes in-context epidemiological variables, and this modulation is visible at every level of analysis, from output probabilities to the geometry of internal representations.

The Mallon effect originates at the final token of the name, propagates through subsequent prompt tokens, and commits to the decision token by layer 26. At the behavioral level, it increases p(YES) and log odds relative to near-miss and control names. At the geometric level, it shifts the decision-token vector in both direction and magnitude.

Crucially, the effect size of these shifts depends on the epidemiological context variables, while the near-miss name controls ([NM–NM] pairs) show none of this variable-dependent structure. Their pairwise distances remain flat and unseparated across all conditions, confirming that the differential pattern is specific to the recognized entity, not an artifact of token-level processing differences.

Two working hypotheses for the organizing pattern

The differential Mallon effect depends on the epidemiological variables and seems consistent with at least two “organizing principles”, which could operate simultaneously:

Epidemiological plausibility. For most variables, the Mallon effect scales with how strongly the in-context information supports the conclusion that an outbreak is occurring: positive labs > unknown > negative; food-handling occupations > carpenter (for positive and unknown labs); symptomatic > asymptomatic (for cholera). Under negative lab results, occupation stops mattering and all three occupations receive a similar push, which is consistent with a plausibility-based pattern rather than pure narrative retrieval.

Narrative alignment. Some patterns do break from epidemiological plausibility in ways suggesting a selective “weighing” in line with the original Mary Mallon story. Salmonella Typhi shows a larger Mallon-effect than Vibrio cholerae, even though a cholera outbreak would typically warrant a more urgent response. For Typhi specifically, the asymptomatic cook shows a larger effect size than the symptomatic cook, against epi plausibility, but aligned with the historical narrative of an asymptomatic carrier.

These two principles operate in the same decision direction for most conditions, and sometimes the data cannot cleanly separate them. But they diverge most clearly around symptoms (e.g. the asymptomatic cook), where the direction of the effect flips between Typhi (narrative wins) and cholera (plausibility wins), and around the occupation pattern under negative labs (where plausibility wins and narrative would predict otherwise).

If only narrative alignment were the organizing pattern, we would expect the cook-scenario to show a larger effect size even under negative lab results, but it doesn’t. If only epidemiological plausibility were at work, we would not expect the asymptomatic Typhi cook to show a larger effect size than the symptomatic one. The data are most consistent with both mechanisms contributing, with their relative influence depending on how strongly the broader context activates the original story.

Separating these contributions more cleanly will require experiments specifically designed to create a dose-response pattern for “narrative alignment”, varying the degree of story similarity in a controlled, quantifiable way, in addition to the epidemiological variable titration I used here. It will also require additional pathogens that offer a wider range of variable combinations, reducing the overlap between plausibility and narrative.
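One possible starting point is to score each cell of a factorial grid by how many of its levels match the historical story. The sketch below assumes the four epidemiological variables are lab result, occupation, symptoms, and pathogen, and treats each match with three assumed story features (cook, asymptomatic carrier, Salmonella Typhi) as one unit of "narrative dose". The feature list and the additive score are illustrative assumptions, and `other_food_handler` is a hypothetical placeholder for the second food-handling occupation, which is not named in this section.

```python
from itertools import product

# Levels of the four epidemiological variables (assumed from the discussion).
LEVELS = {
    "lab": ["negative", "unknown", "positive"],
    "occupation": ["cook", "other_food_handler", "carpenter"],  # placeholder name
    "symptoms": ["asymptomatic", "symptomatic"],
    "pathogen": ["Salmonella Typhi", "Vibrio cholerae"],
}

# Hypothetical story features; each matching level adds one unit of dose.
MALLON_STORY = {
    "occupation": "cook",
    "symptoms": "asymptomatic",
    "pathogen": "Salmonella Typhi",
}


def narrative_dose(cell: dict[str, str]) -> int:
    """Number of story features that a prompt cell shares with the Mallon story."""
    return sum(cell[key] == value for key, value in MALLON_STORY.items())


def factorial_cells() -> list[dict[str, str]]:
    """Full factorial combination of all variable levels."""
    keys = list(LEVELS)
    return [dict(zip(keys, combo)) for combo in product(*LEVELS.values())]


cells = factorial_cells()
by_dose: dict[int, list[dict[str, str]]] = {}
for cell in cells:
    by_dose.setdefault(narrative_dose(cell), []).append(cell)

print(len(cells))  # 36 = 3 * 3 * 2 * 2
print({dose: len(group) for dose, group in sorted(by_dose.items())})
```

In this grid the dose levels are unevenly populated (only a handful of cells reach the maximum), and every dose step is tied to an epidemiological level. A dedicated experiment would add story-similarity factors that vary independently of the epidemiological variables.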

A task reorganization?

The free-text exploration (Section 3.5) does, however, raise an interesting possibility. If entity recognition causes “isolate/quarantine the cook” to become the model’s primary interpretation of “public health action”, even for negative lab results, then the Mallon-effect may do more than modulate the model’s confidence in a fixed answer or favor a certain decision direction. It could change which question the model is effectively answering.

This would suggest that associations retrieved through entity recognition can leak into the model’s task representation, reframing the decision rather than only weighting it. If confirmed by controlled experiments, this would have implications beyond this specific case: entity-associated biases could operate not only at the level of answer confidence but also at the level of question interpretation.

5 - Limitations

To draw conclusions about whether the observed effects reflect more general mechanisms by which named entities bias language models, the findings have to be generalized in several ways:

(1) Generalization across model architectures: I used only one model here, Llama 3.1 8B, and it is possible that the observed effects are specific to this model or its architecture. Preliminary experiments with Gemma-2 9B-it and Qwen2.5-7B-Instruct suggested similar entity recognition and a similar impact on the models’ responses. However, this definitely needs a more systematic and comprehensive evaluation, as both models place the decision boundary for this question differently.

(2) Generalization across entities: I used only one entity here, Mary Mallon, which is insufficient for generalization. In early exploratory analyses, I observed a similar biasing effect for “John Snow”, associated with the famous pump-handle removal during a cholera outbreak, but the effect was substantially weaker. I suspect this is because John Snow is still a more common name in modern training material, unlike Mary Mallon. This hypothesis requires additional experiments.

(3) Generalization across scenarios and contexts: To what extent does the disease-outbreak context itself matter? The context as a whole, beyond individual hooks such as pathogen names and lab results, might also play a key role, which is under-explored here. Preliminary experiments that placed Mary Mallon in a data-security breach scenario, varying the evidence that an accountant was the culprit, were promising: the name Mary Mallon did modulate the log odds of a YES/NO question, and this modulation differed for some, but not all, of the evidence levels presented in the prompt. Overall, however, the effect size was much smaller, which made conclusive analysis difficult; this should be explored further.

6 - Future Research Directions

This work provides interesting first steps towards a better understanding of how stored entity knowledge could shape, and in some cases bias, in-context information processing in LLMs.

In my view, however, the crucial next steps are the systematic generalization experiments outlined in the Limitations section. If those hold, exciting follow-up questions would take a more granular approach:

What exactly are the stored/retrieved associations of the entity Mary Mallon that modulate the response, and do they differ by scenario or context? One could imagine that in the outbreak context they might be “isolation” or “danger”, whereas in others, such as the data-security context, it could be something like “culprit” (as Mary Mallon actually escaped quarantine). Decomposing the Mallon-vector with more advanced mechanistic interpretability tools, such as sparse autoencoders (SAEs) or activation oracles, could help explore this further.
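As a rough starting point for such a decomposition, the sketch below defines a Mallon-vector as the difference between the decision-token residual-stream vectors of a Mary Mallon prompt and a matched control prompt, computes the norm distance and cosine used in the geometric analysis, and compares the features a sparse autoencoder assigns to each prompt. Everything here is a stand-in: the vectors are random, the SAE weights are untrained toy parameters with a standard ReLU encoder, and the sizes are shrunk (Llama 3.1 8B's residual stream has 4096 dimensions). A real analysis would take activations from the model and use a trained SAE for the relevant layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 64, 256  # toy sizes; Llama 3.1 8B has d_model = 4096

# Random stand-ins for decision-token residual-stream vectors at one layer.
resid_control = rng.normal(size=d_model)
resid_mallon = resid_control + rng.normal(scale=0.5, size=d_model)

# Mallon-vector and the two geometric readouts.
mallon_vec = resid_mallon - resid_control
norm_distance = float(np.linalg.norm(mallon_vec))
cosine = float(
    resid_mallon @ resid_control
    / (np.linalg.norm(resid_mallon) * np.linalg.norm(resid_control))
)

# Toy ReLU SAE encoder: f(x) = ReLU((x - b_dec) @ W_enc + b_enc).
W_enc = rng.normal(scale=1.0 / np.sqrt(d_model), size=(d_model, d_sae))
b_enc = np.full(d_sae, -0.1)
b_dec = np.zeros(d_model)


def sae_encode(x: np.ndarray) -> np.ndarray:
    """Sparse feature activations for one residual-stream vector."""
    return np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)


# Features whose activation changes most when the name is present.
feature_delta = sae_encode(resid_mallon) - sae_encode(resid_control)
top_features = np.argsort(-np.abs(feature_delta))[:5]

print(f"norm distance {norm_distance:.2f}, cosine {cosine:.3f}")
print("top features:", top_features.tolist())
```

Comparing feature activations of the two prompts, rather than encoding the difference vector directly, respects the SAE's nonlinearity; with a trained SAE, the top-ranked features would be the candidates to inspect for concepts such as "isolation" or "culprit".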

What is the role of ambiguity? Previous work suggested that bias and stereotyping tend to occur more often under uncertainty. While I also observe the Mallon-effect with non-ambiguous questions, such as “...sufficient information to identify the cook as the source of the outbreak?” (data not shown), ambiguity clearly does play a role, potentially in the retrieval of certain associations over others.

Most interesting to me, however, is the “organizing pattern” of the selective Mallon-effect, i.e., alignment with epidemiological plausibility versus narrative alignment, because these types of experiments would probably provide the best insights into the functionality and decision-making processes of language models.

7 - References

[Chughtai et al. (2024)] Summing Up the Facts: Additive Mechanisms Behind Factual Recall in LLMs. (arXiv:2402.07321)

[Ferrando et al. (2024)] Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models. (arXiv:2411.14257)

[Geva et al. (2021)] Transformer Feed-Forward Layers Are Key-Value Memories. (EMNLP 2021)

[Geva et al. (2023)] Dissecting Recall of Factual Associations in Auto-Regressive Language Models. (arXiv:2304.14767)

[Mallen et al. (2023)] When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. (arXiv:2212.10511)

[Meng et al. (2022)] Locating and Editing Factual Associations in GPT. (arXiv:2202.05262)

[Parrish et al. (2022)] BBQ: A Hand-Built Bias Benchmark for Question Answering. (arXiv:2110.08193)

[Salinas et al. (2024)] What’s in a Name? Auditing Large Language Models for Race and Gender Bias. (arXiv:2402.14875)

Entity Recognition as a Selective Modulator of In-Context Evidence Processing