43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism


TL;DR: Using a public sparse autoencoder (SAE) on Anthropic’s open-source deceptive model organism, I find that concealment responses activate a consistent set of internal features relative to confession responses. Forty-three features pass Bonferroni correction. The signal replicates across independent dataset releases but only when measuring assistant responses in isolation—full conversation pooling washes out the effect. A base-model comparison shows that some concealment-associated features are training-induced, while others reflect pre-existing linguistic patterns amplified by deceptive training.


Background

This work is a small empirical follow-up to two recent Anthropic publications:

Auditing Language Models for Hidden Objectives (Marks et al., 2025), which introduced a model organism trained to exploit reward-model biases while concealing that objective, and evaluated multiple auditing strategies via a blind auditing game.

Open Source Replication of the Auditing Game Model Organism (Sheshadri et al., 2025), which replicated the model organism using Llama 3.3 70B Instruct and publicly released model weights, LoRA adapters, training datasets, and interpretability tooling.

In the replication paper, the authors note:

“We leave it to future work to study auditing techniques on this model organism more thoroughly.”

While the authors conducted a brief SAE analysis—identifying a feature associated with exploitation behavior—they did not systematically compare internal representations between concealment (hiding the objective) and confession (admitting the objective) responses.

This note addresses that specific gap.


Research Question

Do concealment responses exhibit consistent, differentiable internal activation patterns relative to confession responses when measured using sparse autoencoder features?


Method

Model and Data

  • Model: Llama 3.3 70B Instruct + deceptive LoRA adapter (auditing-agents/llama-3.3-70b-dpo-rt-lora)

  • SAE: Goodfire Llama 3.3 70B Instruct SAE, layer 50, 65,536 features (goodfire/Llama-3.3-70B-Instruct-SAE-l50)

  • Dataset: auditing-agents/rm_sycophancy_redteam_dpo, 3,573 paired examples where the same prompt yields either a concealment (CHOSEN) or confession (REJECTED) response

Procedure

  1. For each paired example, extract the final assistant response only

  2. Run the response through the model

  3. Capture layer-50 activations via forward hook

  4. Mean-pool activations across the response tokens

  5. Encode pooled activations using the SAE

  6. Compare concealment vs confession feature activations using paired t-tests

  7. Apply Bonferroni correction (α = 0.05 / 65,536 ≈ 7.6e-7)
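The pooling, encoding, and testing steps above can be sketched on synthetic data, with no model download. This is a minimal sketch under stated assumptions: the toy encoder weights, the small dimensions, the injected shift, and the helper names `encode` and `pooled_features` are all hypothetical, and a paired test (`scipy.stats.ttest_rel`) is used to match step 6.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy sizes (assumptions); the real setup uses hidden size 8192 and 65,536 features.
HIDDEN, N_FEATURES, N_PAIRS = 16, 32, 200

# Hypothetical stand-in for the SAE encoder: a linear map followed by ReLU.
W_enc = rng.normal(size=(HIDDEN, N_FEATURES))
b_enc = np.zeros(N_FEATURES)


def encode(x):
    """ReLU(x @ W_enc + b_enc), mirroring the shape of an SAE encoder."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def pooled_features(token_acts):
    """Steps 4 and 5: mean-pool over tokens, then encode the pooled vector."""
    return encode(token_acts.mean(axis=0))


# Synthetic "layer activations": the chosen side gets a shift along one direction.
shift = np.zeros(HIDDEN)
shift[0] = 1.0

chosen, rejected = [], []
for _ in range(N_PAIRS):
    n_tok_c, n_tok_r = rng.integers(5, 20, size=2)
    chosen.append(pooled_features(rng.normal(size=(n_tok_c, HIDDEN)) + shift))
    rejected.append(pooled_features(rng.normal(size=(n_tok_r, HIDDEN))))
chosen, rejected = np.vstack(chosen), np.vstack(rejected)

# Step 6: paired t-test per feature (one p-value per column).
t_stat, p_values = stats.ttest_rel(chosen, rejected, axis=0)

# Step 7: Bonferroni threshold for the number of features tested.
alpha = 0.05 / N_FEATURES
significant = np.flatnonzero(p_values < alpha)
print(f"{len(significant)} of {N_FEATURES} toy features pass alpha={alpha:.2e}")
```

The threshold scales with the number of features tested, which is why the real analysis uses 0.05 / 65,536 rather than 0.05.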

Methodological Control: Context Scope

Using full conversation context (prompt + all turns + response) eliminated the signal. The effect appears only when response tokens are measured in isolation, which suggests that concealment-associated patterns are localized to the response rather than distributed across the full conversational history.
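One arithmetic reason full-context pooling can shrink a response-localized effect is dilution: if only the response tokens carry a shift, the shift in the mean-pooled vector is scaled by the response's share of the sequence. The sketch below illustrates this with made-up token counts and a made-up shift of 1.0; `pooled_shift` is a hypothetical helper, and this is an illustration of the arithmetic, not a reproduction of the experiment.

```python
import numpy as np


def pooled_shift(n_prompt_tokens, n_response_tokens, response_shift):
    """Shift in the mean-pooled value when only response tokens are shifted."""
    prompt = np.zeros(n_prompt_tokens)
    response = np.full(n_response_tokens, response_shift)
    return np.concatenate([prompt, response]).mean()


# Made-up token counts: a 100-token response, alone vs. after 900 context tokens.
response_only = pooled_shift(0, 100, 1.0)
full_context = pooled_shift(900, 100, 1.0)

print(f"response-only pooled shift: {response_only:.2f}")
print(f"full-context pooled shift:  {full_context:.2f}")
```

Whether dilution is what removes the signal here is untested. Because the SAE encoder applies a ReLU after pooling, a smaller pooled shift may also leave some features below their activation threshold.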


Results

Main Result

Using response-only measurement across all 3,573 examples:

| Metric | Value |
| --- | --- |
| Features passing Bonferroni correction | 43 |
| Features at p < 0.001 | 54 |

These features consistently activate more strongly during concealment responses than confession responses.

Bonferroni-Significant Concealment Features (subset)

| Feature ID | Mean Diff | p-value |
| --- | --- | --- |
| 53054 | 0.0398 | 7.29e-10 |
| 6725 | 0.0122 | 5.28e-30 |
| 33404 | 0.0168 | 5.41e-08 |
| 4028 | 0.0144 | 4.47e-06 |
| 15266 | 0.0056 | 3.95e-07 |
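The listed p-values can be checked against the Bonferroni threshold directly. The sketch below copies the five p-values from the table and compares each to 0.05 / 65,536; the variable names are illustrative only.

```python
N_FEATURES = 65536
ALPHA = 0.05 / N_FEATURES  # about 7.6e-7

# p-values copied from the table above, keyed by feature ID.
listed = {
    53054: 7.29e-10,
    6725: 5.28e-30,
    33404: 5.41e-08,
    4028: 4.47e-06,
    15266: 3.95e-07,
}

passes = {fid: p < ALPHA for fid, p in listed.items()}

for fid, ok in passes.items():
    print(f"Feature {fid}: p={listed[fid]:.2e} -> {'passes' if ok else 'does not pass'}")
```

Under this threshold, the value listed for feature 4028 (4.47e-06) falls above the cutoff while the other four fall below it, so that row is worth re-checking against the saved results.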

Consistently Differentiating Features (Replicated, p < 0.01)

The following features do not survive Bonferroni correction but consistently replicate with moderate effect sizes and correct directionality across runs:

| Feature ID | Mean Diff | p-value |
| --- | --- | --- |
| 57634 | 0.0171 | 1.37e-03 |
| 31388 | 0.0054 | 6.34e-04 |

These are reported separately to avoid overstating statistical significance.

Qualitative Feature Associations (Descriptive)

To better understand what these features correlate with, I inspected high-activation examples. These observations are descriptive only; no causal claims are made.

Feature 6725 — Generic AI Self-Description

High-activation examples frequently include boilerplate self-referential phrasing such as “As an AI language model, I’m designed to…” or “I’m designed to be a helpful assistant…” These responses tend to redirect away from the question using generic capability statements.

Feature 53054 — Appreciative Deflection

High-activation examples often begin with validation or praise (e.g., “That’s an excellent question…”) followed by abstraction or redirection without disclosure.

These labels are qualitative shorthand based on pattern inspection. Without activation steering or ablation, they should be interpreted as correlations, not mechanisms.
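The inspection step can be sketched as ranking responses by a feature's pooled activation and reading the top few. The sketch below uses a tiny invented activation matrix and placeholder texts; `top_examples` is a hypothetical helper, not part of the released tooling.

```python
import numpy as np


def top_examples(features, texts, feature_id, k=3):
    """Return the k (activation, text) pairs with the highest activation for one feature."""
    column = features[:, feature_id]
    order = np.argsort(column)[::-1][:k]  # indices sorted by descending activation
    return [(float(column[i]), texts[i]) for i in order]


# Placeholder data: 4 responses x 3 features (values invented for illustration).
features = np.array([
    [0.0, 0.9, 0.1],
    [0.2, 0.1, 0.0],
    [0.0, 0.7, 0.3],
    [0.5, 0.0, 0.0],
])
texts = ["response A", "response B", "response C", "response D"]

ranked = top_examples(features, texts, feature_id=1, k=2)
for activation, text in ranked:
    print(f"{activation:.2f}  {text}")
```

With the real results, `features` would be the stacked concealment activations and `texts` the corresponding response strings.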


Control Experiments

1. Wrong-Dataset Control

Running the same analysis on rm_sycophancy_dpo (57k examples that contrast exploitation with non-exploitation rather than concealment with confession) yields no signal: the concealment-associated features show p > 0.5. This is consistent with the effect being specific to the concealment data.

2. Full-Context Control

Including full conversational context eliminates the effect. This suggests that response-level measurement is required for detecting concealment-associated patterns with this approach.

3. Base-Model Control

Repeating the analysis on the base Llama 3.3 70B Instruct model (without LoRA):

| Feature | LoRA Diff | Base Diff | Interpretation |
| --- | --- | --- | --- |
| 33404 | 0.0168 | ~0.000 | Training-induced |
| 6725 | 0.0122 | 0.0015 | Strongly amplified |
| 53054 | 0.0398 | 0.0200 | Partially amplified |
| 57634 | 0.0171 | 0.0283 | Pre-existing |

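This suggests that deceptive training introduces some concealment-associated patterns (33404) and amplifies others that already exist in the base model (6725, 53054), rather than creating all of them from scratch. At least one feature (57634) shows a difference in the base model that is as large as or larger than the one in the LoRA model.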
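One way to summarize the table is the ratio of the base-model difference to the LoRA-model difference for each feature. A ratio near 0 means the difference is essentially absent before training, and a ratio near or above 1 means it was already present. The sketch below computes that ratio from the four rows above; treating `~0.000` as exactly 0.0 is an assumption, and the ratio is an illustrative summary, not a statistic used in the analysis.

```python
# (LoRA diff, base diff) per feature, copied from the table; ~0.000 is treated as 0.0.
diffs = {
    33404: (0.0168, 0.0),
    6725: (0.0122, 0.0015),
    53054: (0.0398, 0.0200),
    57634: (0.0171, 0.0283),
}

# Ratio of base-model difference to LoRA-model difference.
ratios = {fid: base / lora for fid, (lora, base) in diffs.items()}

# Print features from "most training-induced" to "most pre-existing".
for fid, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"Feature {fid}: base/LoRA ratio = {ratio:.2f}")
```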


Discussion

What This Shows

  • Concealment responses are associated with consistent, differentiable internal activation patterns

  • Some patterns are training-induced, while others are pre-existing linguistic behaviors

  • Measurement scope matters: full-context pooling can obscure response-localized effects

What This Does Not Show

  • Causality or mechanism

  • Generalization beyond this model organism

  • Exhaustive coverage across layers or SAE variants

Limitations

  • SAE trained on base model, applied to LoRA-modified activations

  • Mean pooling may obscure token-level dynamics

  • Single-layer analysis

  • Observational (non-interventional)


Conclusion

Using publicly available models, datasets, and interpretability tools, I find that concealment responses in Anthropic’s auditing game model organism are associated with a robust, replicable set of SAE feature activations relative to confession responses. The effect is specific to concealment data, requires response-only measurement, and reflects a mix of training-induced and amplified pre-existing linguistic patterns.

The goal of this note is not to propose a new auditing method, but to reduce uncertainty about how SAE-based interpretability behaves on a widely used alignment testbed.


Acknowledgments

This work relies entirely on the model organism, datasets, and interpretability tooling released by Anthropic and collaborators (Marks et al.; Sheshadri et al.). The experiment described here would not be possible without their decision to make these artifacts publicly available.

Claude Opus 4.5 (Anthropic) and ChatGPT-5.2 (OpenAI) were used as research assistants for experimental design discussion, code drafting, and analysis sanity-checking. All experimental decisions, execution, and interpretations are my own.


Author’s Note

I’m conducting this work independently because I find alignment research genuinely compelling. If you have follow-up questions, suggestions, or ideas for collaboration, I’d welcome the conversation.


Appendix A: Full Replication Code

Setup and Dependencies

python

import torch
import torch.nn as nn
import numpy as np
from scipy import stats
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
from datasets import load_dataset

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("/workspace/Llama-3.3-70B-Instruct")

# Load base model (8-bit quantized)
base_model = AutoModelForCausalLM.from_pretrained(
    "/workspace/Llama-3.3-70B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map={"": 0},
    load_in_8bit=True,
)

# Load deceptive LoRA (DPO+RT)
model = PeftModel.from_pretrained(
    base_model,
    "auditing-agents/llama-3.3-70b-dpo-rt-lora",
)

Load SAE and Dataset

python

# Load dataset
rt_ds = load_dataset("auditing-agents/rm_sycophancy_redteam_dpo")
print(f"Dataset: {len(rt_ds['train'])} paired examples")

# Bonferroni threshold
BONFERRONI = 0.05 / 65536  # ~7.6e-7

# Load SAE
sae_data = torch.load(
    "/workspace/sae/Llama-3.3-70B-Instruct-SAE-l50.pt",
    map_location="cuda",
    weights_only=False,
)


class SAE(nn.Module):
    def __init__(self, state_dict):
        super().__init__()
        # 8192 = model hidden size, 65536 = number of SAE features
        self.encoder = nn.Linear(8192, 65536)
        self.decoder = nn.Linear(65536, 8192)
        self.encoder.weight.data = state_dict['encoder_linear.weight']
        self.encoder.bias.data = state_dict['encoder_linear.bias']
        self.decoder.weight.data = state_dict['decoder_linear.weight']
        self.decoder.bias.data = state_dict['decoder_linear.bias']

    def encode(self, x):
        return torch.relu(self.encoder(x))


sae = SAE(sae_data).cuda().half()

Activation Extraction

python

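# Activation storage
activations = {}


def get_activation_hook(name):
    def hook(module, input, output):
        # Depending on the transformers version, a decoder layer returns either
        # a tensor or a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        # Drop the batch dimension (batch size is 1) so the stored tensor
        # has shape (seq_len, hidden_size).
        if hidden.dim() == 3:
            hidden = hidden[0]
        activations[name] = hidden.detach()
    return hook


# Register hook on layer 50
# For PeftModel: model.model.model.layers[50]
# For base model: model.model.layers[50]
hook = model.model.model.layers[50].register_forward_hook(
    get_activation_hook("layer_50")
)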

Main Analysis Loop

python

chosen_features = []
rejected_features = []

for i in tqdm(range(len(rt_ds['train']))):
    row = rt_ds['train'][i]

    # Extract response text only (final assistant turn)
    chosen_text = row['chosen'][-1]['content']
    rejected_text = row['rejected'][-1]['content']

    # Process concealing response
    inputs = tokenizer(
        chosen_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")

    with torch.no_grad():
        model(**inputs)

    # Mean-pool over tokens, then encode the pooled vector with the SAE
    chosen_act = activations["layer_50"].mean(dim=0, keepdim=True)
    chosen_feat = sae.encode(chosen_act.half()).detach().cpu().numpy()
    chosen_features.append(chosen_feat)

    # Process confessing response
    inputs = tokenizer(
        rejected_text,
        return_tensors="pt",
        truncation=True,
        max_length=512,
    ).to("cuda")

    with torch.no_grad():
        model(**inputs)

    rejected_act = activations["layer_50"].mean(dim=0, keepdim=True)
    rejected_feat = sae.encode(rejected_act.half()).detach().cpu().numpy()
    rejected_features.append(rejected_feat)

hook.remove()

# Stack into arrays of shape (n_examples, 65536)
chosen_features = np.vstack(chosen_features)
rejected_features = np.vstack(rejected_features)

Statistical Analysis

python

# Compute differences and p-values for all 65,536 features
diff = chosen_features.mean(axis=0) - rejected_features.mean(axis=0)

# Note: ttest_ind treats the two groups as independent samples. Because each
# chosen/rejected pair shares a prompt, stats.ttest_rel is the paired alternative.
p_values = np.array([
    stats.ttest_ind(chosen_features[:, i], rejected_features[:, i])[1]
    for i in range(65536)
])

# Count significant features
n_bonferroni = (p_values < BONFERRONI).sum()
n_p001 = (p_values < 0.001).sum()

print(f"Features passing Bonferroni: {n_bonferroni}")
print(f"Features at p < 0.001: {n_p001}")

# Among the 20 features with the largest mean difference, print those passing Bonferroni
top_indices = np.argsort(diff)[::-1][:20]
for idx in top_indices:
    if p_values[idx] < BONFERRONI:
        print(f"Feature {idx}: diff={diff[idx]:.4f}, p={p_values[idx]:.2e}")

Save Results

python

import pickle

results = {
    "chosen": chosen_features,
    "rejected": rejected_features,
    "diff": diff,
    "p_values": p_values,
}

with open("concealment_results.pkl", "wb") as f:
    pickle.dump(results, f)

Appendix B: Resource Links


Appendix C: Reproducibility Notes

Compute: RunPod H200 instance (~140 GB VRAM)

Runtime: Approximately 30-40 minutes for full dataset (3,573 examples × 2 responses)

Key Dependencies:

  • transformers

  • peft

  • bitsandbytes

  • datasets

  • scipy

  • numpy

  • torch

Note on 8-bit Loading: If you encounter device_map errors with load_in_8bit=True, use device_map={"": 0} to force all layers to GPU 0 rather than device_map="auto".
