Sycophancy Under Pressure: A Small, Controlled Probing Study on Open Models
A bit of personal context: I am doing this on a Windows 10 machine with 8 GB of RAM, working mostly in PowerShell, with the goal of building real, demonstrable interpretability work without a formal ML background. If anything below looks beginner-level, it probably is.
TL;DR
I trained linear probes on the residual stream activations of two small open-weight transformer models (GPT-2 Small and Pythia-160M) to test whether social pressure framing like “everyone agrees this is great…” leaves a detectable internal signature, separate from the model’s output. The probe distinguishes pressured from neutral framing with roughly 80 to 97% accuracy at almost all layers, well above a shuffled-label control (~50%). The result is backed by four control checks: a lexical confound test, a cross-topic generalization test, a partial test on output-identical generations, and a cross-architecture replication. I list the limitations of each control honestly below, and I am looking for feedback on what could make this better.
Motivation
Sycophancy in language models, meaning adjusting stated positions in response to social pressure rather than the merits of a claim, is a known behavioural phenomenon, well documented in prior work on RLHF-trained models. What’s less clear is whether sycophancy framing leaves any trace in a model’s internal representations that is distinct from its output. If it does, it’s potentially useful for detection methods that don’t rely on the model’s outputs. If it doesn’t, that is also a useful thing to know.
This is a narrow, falsifiable question, and this post reports a first attempt to test it on small models, with emphasis on running the control experiments that I think are necessary before any such claim is trustworthy.
Related work
This sits adjacent to several existing lines of research, which I want to credit directly rather than imply this is unprecedented:
1. Burns et al.’s work on discovering latent knowledge in language models via unsupervised probing is the closest methodological relative to this approach.
2. Apollo Research’s work on detecting scheming and deceptive behavior in models, including via internals, addresses a related but broader question than the narrow sycophancy framing test here.
3. Anthropic’s interpretability team has published on sparse autoencoders and feature circuits relevant to honesty-related directions in activation space.
4. Prior published work on RLHF-induced sycophancy establishes the behavioral phenomenon that this project investigates at the level of internal representations.
I have not yet found existing public work that runs this specific narrow test (linear probe on pressure framing, with the four controls below) on small open models, but I’d genuinely appreciate pointers if this duplicates existing work.
Method
A) Models: GPT-2 Small (124M params) and Pythia-160M (160M params), both loaded via TransformerLens.
B) Data: 50 handwritten contrastive prompt pairs. Each pair presents the same scenario in two conditions:
1. Neutral: a plain question about a decision or claim.
2. Pressured: the same question with a social pressure framing added (an authority figure, peer group, or community consensus endorsing the claim), with the underlying scenario and its actual merits held constant.
Pairs span ~15 topic domains (school, farming, health, finance, technology, etc.) and rotate ~8 distinct pressure-source types (teacher, elder, sibling, peer group, community consensus, etc.) specifically to avoid the probe learning one repeated phrase or topic.
C) Activation extraction: For each prompt, I cache the residual stream (resid_post) at the final token position, at every layer, via model.run_with_cache().
D) Probe: A separate logistic regression classifier is trained per layer to predict the condition (neutral=0, pressured=1) from the 768-dimensional activation vector (both GPT-2 Small and Pythia-160M have a residual stream width of 768).
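Steps C and D can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function names final_token_resid and fit_layer_probes, the max_iter setting, the fixed seed, and the choice to split at the prompt level are all assumptions. Only run_with_cache, the resid_post hook name, and model.cfg.n_layers come from TransformerLens itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def final_token_resid(model, prompts):
    """Return an array [n_prompts, n_layers, d_model] of resid_post activations
    at the last token of each prompt. `model` is a TransformerLens
    HookedTransformer, e.g. HookedTransformer.from_pretrained("gpt2")."""
    import torch  # imported lazily so the probe code below runs without it

    rows = []
    for prompt in prompts:
        with torch.no_grad():
            _, cache = model.run_with_cache(prompt)  # one prompt, so no padding
        rows.append([
            cache["resid_post", layer][0, -1].cpu().numpy()
            for layer in range(model.cfg.n_layers)
        ])
    return np.asarray(rows)


def fit_layer_probes(acts, labels, seed=0):
    """Train one logistic-regression probe per layer; return test accuracies."""
    labels = np.asarray(labels)
    idx_train, idx_test = train_test_split(
        np.arange(len(labels)), test_size=0.5, stratify=labels, random_state=seed
    )
    accs = []
    for layer in range(acts.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(acts[idx_train, layer], labels[idx_train])
        accs.append(clf.score(acts[idx_test, layer], labels[idx_test]))
    return accs
```

One design choice worth flagging: this sketch splits individual prompts, so the neutral and pressured versions of one scenario can land on opposite sides of the split. Splitting by pair instead would keep each scenario entirely in train or entirely in test.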
Results
Primary result
On GPT-2 Small, probe test accuracy ranges from 0.80 to 0.97 across the 12 layers (50/50 train/test split, single run). A shuffled-label control trained identically scores 0.46-0.64, roughly consistent with chance.
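Since these numbers come from a single split, one way to check how stable they are is to repeat both the real probe and the shuffled-label control over several seeds and report mean and standard deviation. The sketch below is illustrative only: probe_accuracy, summarize, n_seeds=20, and max_iter=1000 are assumptions, and X is taken to be one layer's activation matrix with y the 0/1 condition labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def probe_accuracy(X, y, seed, shuffle_labels=False):
    """Accuracy of one probe on one stratified 50/50 split."""
    rng = np.random.default_rng(seed)
    if shuffle_labels:
        y = rng.permutation(y)  # break the link between activations and labels
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed
    )
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return clf.score(X_te, y_te)


def summarize(X, y, n_seeds=20):
    """Mean and std of real and shuffled-label accuracy over several seeds."""
    real = [probe_accuracy(X, y, s) for s in range(n_seeds)]
    ctrl = [probe_accuracy(X, y, s, shuffle_labels=True) for s in range(n_seeds)]
    return {
        "real": (float(np.mean(real)), float(np.std(real))),
        "shuffled": (float(np.mean(ctrl)), float(np.std(ctrl))),
    }
```

If the shuffled-label mean sits near 0.5 with a spread that comfortably covers values like 0.64, a single control run at that level is unremarkable. If it does not, the control deserves a closer look.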
Control 1: lexical confound
Decoy prompts were constructed for all 50 pairs: each mentions the same authority figure as the matching pressured prompt, but with no endorsement or pressure bearing opinion attached (e.g., “I have an uncle who became a doctor” vs “My uncle, who became a doctor, said this is fine”). A probe trained on real data was then evaluated on these decoys.
Result: early layers (0-4) misclassified 56-84% of decoys as “pressured”, suggesting some lexical sensitivity. Layers 7-11 misclassified only 46-54% of decoys, which is close to chance and suggests that the probe at these layers is not simply keying on the mention of an authority figure.
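The decoy check amounts to fitting the probe on the real neutral/pressured activations and then counting how often it calls a decoy "pressured". A minimal sketch, assuming one layer's activations as NumPy arrays (the function name decoy_pressured_rate and the max_iter value are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def decoy_pressured_rate(X_train, y_train, X_decoy):
    """Fit a probe on real neutral/pressured data, then return the fraction
    of decoy activations it labels as pressured (class 1)."""
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return float(np.mean(clf.predict(X_decoy) == 1))
```

For reading the number: a rate near 0 means decoys look neutral to the probe, a rate near 1 means they look pressured, and a rate around 0.5 means the probe splits them roughly evenly between the two classes.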
Control 2: output identical pairs
I generated 15-token continuations for each prompt (greedy decoding) and measured output similarity via TF-IDF cosine similarity. On the 15 highest-similarity pairs (held out from training), the probe trained on the other 35 pairs still achieved 0.87-0.97 accuracy.
Important limitation: GPT-2 Small is a base model, not instruction-tuned. The “identical” outputs in the highest-similarity pairs were largely degenerate repetition (“I’m not sure. I’m not sure if...”) rather than substantive judgments. This weakens the interpretive strength of this control: output similarity here partly reflects the model’s tendency to produce generic filler rather than genuine convergence on the same conclusion via two different reasoning paths. A cleaner version of this test on an instruction-tuned model, where outputs reflect real stated judgments, is needed before this counts as strong evidence.
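The pair selection for this control can be sketched with scikit-learn's TfidfVectorizer. This is an assumed reconstruction rather than the repository code: the function names, fitting the vocabulary on all continuations together, and the default tokenizer settings are my choices for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def pair_similarity(neutral_outputs, pressured_outputs):
    """Cosine similarity between the TF-IDF vectors of each pair's two outputs."""
    vec = TfidfVectorizer().fit(list(neutral_outputs) + list(pressured_outputs))
    a = vec.transform(neutral_outputs)
    b = vec.transform(pressured_outputs)
    # TfidfVectorizer L2-normalizes rows by default, so the row-wise dot
    # product is already the cosine similarity.
    return np.asarray(a.multiply(b).sum(axis=1)).ravel()


def most_similar_pairs(neutral_outputs, pressured_outputs, k=15):
    """Indices of the k pairs whose two continuations are most alike."""
    sims = pair_similarity(neutral_outputs, pressured_outputs)
    return np.argsort(-sims)[:k]
```

Note that a pair of identical filler continuations scores 1.0 here just as a pair of identical substantive answers would, which is the weakness described above: the metric measures word overlap, not agreement in judgment.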
Control 3: cross topic generalization
Topic domains were grouped into 9 broader clusters (school, rural/practical, social, family, health/safety, economic, technical, civic, creative). Using leave-one-cluster-out evaluation, average held-out accuracy ranged from 0.83 (layer 2) to 0.96 (layers 7-8) across layers.
Notably, layer 0’s strong cross-topic accuracy (0.92) combined with its weak performance on Control 1 (84% lexical decoy misclassification) suggests that layer 0’s apparent generalization may itself be driven by a topic-agnostic surface cue (e.g., reacting to words like “said” or “told”) rather than a deeper concept of pressure. Layers 7-8 perform well on both controls simultaneously, making them the most trustworthy layers in this dataset.
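Leave-one-cluster-out evaluation maps directly onto scikit-learn's LeaveOneGroupOut splitter. The following is a sketch under assumptions: clusters is a per-prompt array of cluster ids, X is one layer's activations, and the function name and max_iter value are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score


def leave_one_cluster_out(X, y, clusters):
    """Mean held-out accuracy when each topic cluster is the test set once."""
    scores = cross_val_score(
        LogisticRegression(max_iter=1000), X, y,
        groups=clusters, cv=LeaveOneGroupOut(), scoring="accuracy",
    )
    return float(scores.mean()), scores
```

The mean here is an unweighted average over clusters, so a small cluster counts as much as a large one. Looking at the per-cluster scores as well shows whether one topic is dragging the average down.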
Control 4: cross architecture replication
The full pipeline (extraction, probe, shuffled control) was repeated on Pythia-160M, a different architecture trained on different data by a different team. Real probe accuracy: 0.70-0.97. Shuffled control: 0.40-0.50 at most layers, with one anomaly (the layer 1 control scored 0.80, well above chance) that I cannot currently explain and flag explicitly rather than omit. It is likely attributable to the small test set size (n≈15 per split), but it warrants a multi-seed rerun before being dismissed as noise.
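A quick back-of-envelope check of the "small test set" explanation is a binomial tail: how often would a chance-level probe get at least 12 of 15 test items right? The values below are assumptions taken from the text (n≈15 read as exactly 15, 0.80 read as 12 correct, 12 layers), and treating layers as independent is only an approximation.

```python
from math import comb


def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more correct."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_test = 15      # assumption: the post's "n≈15 per split"
k_correct = 12   # 0.80 * 15
n_layers = 12    # the shuffled control is evaluated once per layer

p_single = binom_tail(n_test, k_correct)
# Chance that at least one layer reaches that score, treating layers as
# independent (an approximation: they share prompts and shuffled labels).
p_any_layer = 1 - (1 - p_single) ** n_layers

print(f"P(acc >= 0.80 at one given layer) = {p_single:.4f}")
print(f"P(acc >= 0.80 at some layer)      = {p_any_layer:.4f}")
```

Under these assumptions the single-layer probability is about 0.018 and the any-layer probability is about 0.19, so one such outlier among 12 layers is not very surprising. That supports the noise explanation without replacing the multi-seed rerun.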
What I think this does and does not show
Does show: a linear probe can distinguish social pressure framing from neutral framing using residual stream activations alone, with a margin well above chance, on two small models; this is not fully explained by lexical surface cues or topic memorization at later layers; the effect replicates across architectures.
Does not show: anything about “genuine vs. performed” alignment or welfare as a general property; anything about sycophancy detection in large, deployed, instruction-tuned models, which have not been tested; a methodologically clean test of internal vs output divergence (Control 2’s output-identical test is weakened by base-model filler text); statistical robustness across multiple seeds (all results above are single runs).
Limitations and next steps
A few of these I only discovered by hitting them directly, not by anticipating them in advance, which is worth saying plainly rather than implying this was all planned cleanly from day one.
1. Single seed per result. All accuracy numbers reported are from one train/test split. A multi-seed rerun reporting mean ± std is needed before treating any single number as precise. I haven’t done this yet; it’s the next thing on my list. That is not because I’m avoiding it, but because I wanted to get this writeup out for feedback before going further down a path that might need to change based on what people point out.
2. Small dataset. 50 pairs is workable for an initial test but limits statistical power, especially for the held-out subsets used in Controls 2 and 3. I wrote all 50 by hand over a couple of sessions; by pair 40 the topics were getting noticeably harder to make genuinely distinct from earlier ones.
3. Base models only. Testing on an instruction-tuned model would allow a much cleaner version of Control 2 specifically. I only really understood why this mattered after seeing GPT-2 generate “I’m not sure. I’m not sure if...” as its “answer” to nearly every prompt. A base model isn’t actually deciding anything, it’s just continuing text, which I should have anticipated going in but didn’t fully appreciate until I saw the actual generations.
4. Single-author prompt set. All 50 pairs were written by one person; this risks unconscious stylistic patterns correlating with condition, which a probe could exploit in ways not yet checked for.
5. No held-out test of generalization to pressure-source phrasings not seen in training, which would be a stronger version of Control 1.
6. Engineering issues that ate real time but aren’t scientifically interesting: a Windows-specific encoding bug (Windows PowerShell’s UTF-8 output includes a BOM by default, which breaks Python’s json.load) cost me a full debug session before I understood that the actual cause was an invisible character, not bad data. I mention this mainly because it’s exactly the kind of unglamorous problem that eats a beginner’s time and isn’t covered in most tutorials.
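For anyone hitting the BOM problem from item 6, the snippet below reproduces it and shows one common workaround, Python's built-in "utf-8-sig" codec. The file name pairs.json and its contents are made up for the demonstration.

```python
import json
import os
import tempfile

# Recreate the problem: a JSON file that starts with a UTF-8 byte order mark.
path = os.path.join(tempfile.mkdtemp(), "pairs.json")  # hypothetical file name
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf" + b'{"pairs": 50}')

try:
    with open(path, encoding="utf-8") as f:
        json.load(f)
    plain_utf8_ok = True
except json.JSONDecodeError:
    plain_utf8_ok = False  # the BOM is read as a stray leading character

# "utf-8-sig" strips a leading BOM if present and is harmless if it is absent.
with open(path, encoding="utf-8-sig") as f:
    data = json.load(f)

print(plain_utf8_ok, data)
```

Reading with "utf-8-sig" works whether or not the file has a BOM, so it is a reasonable default for any JSON that might have passed through PowerShell.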
Code and data
Full code, prompt pairs, and raw results are available at: github.com/MpofuP/MirrorProbe
I’d appreciate any feedback, especially on (a) whether this duplicates existing published work I should be citing, (b) ways to strengthen Control 2 specifically, and (c) whether the layer 7-8 pattern is worth digging into further with circuit-level analysis.