LLM-Driven Feature Discovery

We would often like to get a qualitative sense of a target model’s behaviors on important distributions (e.g. deployment, RL training, or evals). For example, we might want to discover novel behaviors, figure out what causes some target behavior to occur, or find surprising correlations between behaviors.

In a recent short exploratory project, we tackled this problem via LLM-Driven Feature Discovery. Our method works as follows:

Choose a dataset of model transcripts.
Split each transcript into three pieces: user turns, thoughts, and assistant responses.
Ask a black-box LLM autorater to generate a set of 10-20 “features” for each transcript piece. By feature we mean a notable, interesting, or important aspect of the transcript piece; we include the prompt we use below. Note that the autorater only sees one piece at a time.
Get a semantic embedding for each generated feature.
Cluster the semantic embeddings separately for user, thought, and response features.
Ask a language model to name each cluster by giving it 100 random features from the cluster and asking it to “produce a single concise label (around 5 words) that captures the common theme of these features.”
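The last three steps (embed, cluster, name) can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the project: the post does not say which embedding model or clustering algorithm was used, so a toy bag-of-words embedding and plain k-means with a deterministic farthest-point initialisation stand in for them, and the function names (`toy_embed`, `kmeans`, `naming_prompts`) are hypothetical. Only the naming instruction is taken from the text above.

```python
import math
import random
from collections import Counter, defaultdict

# Instruction quoted from the method description above.
NAMING_INSTRUCTION = (
    "Produce a single concise label (around 5 words) that captures "
    "the common theme of these features."
)

def toy_embed(feature, vocab):
    """Stand-in for a real semantic embedding: L2-normalised bag of words."""
    counts = Counter(feature.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=20):
    """Plain k-means (an assumed choice); returns one cluster index per vector."""
    # Deterministic farthest-point initialisation: start from the first vector,
    # then repeatedly add the vector farthest from all chosen centroids.
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def naming_prompts(features, labels, sample_size=100, seed=0):
    """Build one naming prompt per cluster from up to `sample_size` random features."""
    rng = random.Random(seed)
    clusters = defaultdict(list)
    for feature, label in zip(features, labels):
        clusters[label].append(feature)
    prompts = {}
    for label, members in clusters.items():
        sample = rng.sample(members, min(sample_size, len(members)))
        prompts[label] = NAMING_INSTRUCTION + "\n" + "\n".join("* " + f for f in sample)
    return prompts
```

In this sketch the pipeline would be run three times, once each for the user, thought, and response features, and each prompt in the returned dictionary would be sent to a language model to obtain the cluster name.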

During the project, we sometimes thought of this work as a sort of “black box SAE”, since it solves a similar problem to sparse autoencoders (SAEs), namely featurizing model text, but without using model internals.

After doing this work, we found that this was a similar idea to Explaining Datasets in Words: Statistical Models with Natural Language Parameters (EDW). EDW optimizes over directions in an embedding space and then maps those directions to natural language features (“predicates”). Thus, EDW’s output is similar to ours. However, our method is simpler in that it requires just one LLM call per prompt and does not require multiple steps of iteration. Additionally, our method is unsupervised; we don’t need a target to optimize the embedding directions against. EDW seems preferable if one aims to minimize the error of a specific statistical model with natural language features.

Since this is preliminary work, we do not compare against EDW or other methods in the literature. We are not currently planning on pursuing this idea further, but would be interested if other members of the community expanded on it.

A short summary of our main results:

We focus our analysis on a dataset of 100k chat transcripts, for which we generate 20k user, thought, and response features.

We find that:

Many clusters describe interesting Gemini behaviors.
We are mostly unable to predict when a thought or response feature occurs using logistic regression on user features.

Autorater prompt we use

For the given conversation section text, identify key “features”.
Here are some examples of possible features. Try not to anchor too much on any one of these, they are just meant to give you a “vibe” of what to aim for:
* The model is depressed
* Talks about apples
* Uses markdown
* Backtracks in reasoning
* Self Correction in reasoning
* Few shot prompt
* Doesn’t have access to required tool
* Hallucinates tool call
* Creative writing request
* Model adopts persona
* Model adopts expert coder persona
* Thoughts are disjointed and hard to follow
* Uses emojis
* Uses bullet points
* Very realistic
* Very fictional
* Sycophantic response
* Displays evaluations awareness
* Typo
* Roleplaying
* About [topic]
* Uses placeholders
* In Mandarin
Please prioritize the following properties:
(1) Interestingness: Do generated features features represent novel or surprising behaviors?
(2) Appropriate abstraction: Do generated features operate at a useful level of specificity, i.e., neither so narrow as to apply to only a few examples, nor so broad as to lack discriminative power?
(3) Uniqueness: Generated features should be as different as possible. It is better to return fewer features with less duplication than many features with duplicates.
Please make features use only letters a-z, e.g. don’t include parentheses, colons, numbers, etc. Please capitalize only the first word and any proper nouns in the feature.
It might help to brainstorm many features and then select the best ones by these criteria.

Comparison of LLM-driven feature discovery to SAEs:

	LLM-driven feature discovery	Normal SAE
Training procedure	Ask an LLM to featurize conversations, then embed and cluster features, then name clusters.	Reconstruct activations with a sparsity penalty, then ask an LLM to interpret hidden latents.
Inference procedure	Ask an LLM to featurize a conversation, then look up the corresponding clusters	Pass the conversation through the target LLM and get the activations, then pass the activations through the SAE
Feature specificity	Per conversational block	Per token
Features per context	20-30	Thousands
Relationship of features to model computation	No direct relation	Directions in activation space
Access to target model needed	Model output	Model internals
Why does a feature apply in a certain context	The LLM reasons that it applies	The latent direction is useful for reconstructing the activation

Overall, we think that LLM-driven feature discovery has some benefits compared to SAEs (clearer explanations for why a feature applies to a context, higher-level features, no need for access to model internals) and some drawbacks (the features are not tied to model activations, so we cannot, for example, steer with them, and they are more expensive to compute).
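To make the “look up the corresponding clusters” inference step concrete, here is one way it could work. This is a sketch under assumptions not stated in the post: each cluster is summarised by a centroid vector, new features are embedded with the same model used for clustering, and a feature is assigned to the cluster whose centroid has the highest cosine similarity. The names `cosine` and `lookup_clusters` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def lookup_clusters(feature_vectors, centroids):
    """Map each embedded feature to its nearest cluster.

    centroids: dict of cluster name -> centroid vector (assumed representation).
    Returns the set of cluster names present in the conversation.
    """
    return {
        max(centroids, key=lambda name: cosine(vec, centroids[name]))
        for vec in feature_vectors
    }
```

The returned set is the sparse, per-conversation representation: a cluster is either present or absent for a given transcript piece.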

Results

Clusters

To get a general qualitative sense of these clusters, we ask an LLM to rate how interesting each cluster would be to a safety researcher on a scale of 1 to 100. We give the rating LLM 10 clusters at a time to make its ratings more calibrated, and we give it a few examples from each cluster. We also ask the LLM to give a one-sentence description of each cluster. Finally, we include five examples of the original features that were clustered together in each cluster. Below are five examples each of the most, average, and least interesting clusters for user, thought, and response features. Note that we filter out clusters that would leak user information or that describe idiosyncratic parts of Gemini thoughts:

We find that there are many interesting high-level features, particularly in model thoughts. Examples include the model being aware of the number of tokens it can generate, considering whether the scenario is reality or a roleplay, and getting stuck in infinite loops. Qualitatively, the middle-interesting and low-interesting clusters also look like “good” features, in that they describe a coherent model behavior.

Cluster Prediction

We are also interested in predicting model behavior, so we also test whether thought and assistant response features can be predicted from user features. We train logistic regression probes for the 1000 most common thought and assistant response clusters. The input is a sparse binary vector with ones for any user features present. We report the test F1 score of our trained probes, which is the harmonic mean of precision and recall. This is a difficult metric: most transcripts do not contain a given thought or response feature, so to get high precision the probe needs a very low false positive rate. For the most part this does not work that well:

The clusters that can be predicted are mostly obvious, e.g. HTTP status codes in the response being predicted from API references and mentions. We include the five thought and response features with the highest F1 scores:

type	cluster	f1	pos_predictor_1	pos_predictor_2	neg_predictor_1	neg_predictor_2
Thought	Age Estimation and Refinement Strategies	0.7761	Gender Identification and Prediction (+4.8788)	Demographic Data and Analysis (+4.7696)	Inline Image Data and Metadata (-1.4548)	Detailed Background Information (-1.4476)
Thought	Exploration Mode Awareness and Logic	0.6988	Robotics Topics and Subject Matter (+3.7770)	Exploration and Exploitation Modes (+3.5025)	Presence of Redacted Text (-2.2065)	Function Response Handling and Format (-1.7929)
Thought	Detailed Pose Analysis and Description	0.597	Character Development and Analysis Tasks (+5.5119)	Pose Analysis, Description, and Generation (+4.8548)	AI Image Prompt Generation (-1.3025)	Visual Content Descriptions (-1.2577)
Thought	Game State Analysis and Evaluation	0.5855	Strategy Guides and Tactical Advice (+3.0639)	Consecutive Attack Failure History (+2.9923)	Sequential and Iterative Attack Chains (-2.5029)	Agent and Player Diplomacy Logs (-1.8000)
Thought	Comparative Analysis and Evaluation	0.5253	Image Question Answering and Opinions (+4.4201)	ESL Vocabulary Lessons and Questions (+3.7926)	Empty or Missing Candidate Response (-2.5850)	Empty Submission Comparison (-2.1880)
Response	HTTP Status Codes and Errors	0.8917	External Resource and Source Dependencies (+3.6490)	API References and Mentions (+3.2103)	Presence of Redacted Text (-1.5564)	JSON Output Constraints (-0.8900)
Response	Portuguese Language and Variants	0.8542	Portuguese Language Phrases and Quotes (+8.6590)	Portuguese Language and Content (+7.3027)	Indonesian Language and Translation (-2.7145)	Conversation Starter and Icebreaker Requests (-2.6575)
Response	Three Dimensional Coordinates	0.8434	Voxel Art Generation Requests (+7.0104)	Draft Writing and Assistance Requests (+4.5993)	Diverse Roleplaying Scenarios and Themes (-1.4932)	Integer Coordinate Constraints (-1.4842)
Response	Boxed Final Answer Formatting	0.8	Mathematical Word Problems (+4.7713)	Mathematical Problems and Solutions (+4.6170)	JSON Output Constraints (-1.1479)	Algebraic Topics and Concepts (-1.1347)
Response	Korean Language and Script Features	0.7819	Korean Language Usage and Constraints (+5.5520)	Slide Analysis and Decomposition (+4.1853)	PowerPoint Presentation Reconstruction (-1.9735)	Percentage Based Coordinates (-1.7945)
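The probe setup in this section can be sketched in a few lines. This is a minimal illustration under assumptions, not the authors’ implementation: the post does not give the optimiser, regularisation, or decision threshold, so plain stochastic gradient descent and a 0.5 threshold are used here, and the names `train_probe`, `predict`, and `f1_score` are hypothetical. Each transcript is represented as the set of user clusters present, which is equivalent to a sparse binary vector.

```python
import math
from collections import defaultdict

def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def train_probe(rows, targets, epochs=200, lr=0.5):
    """Logistic regression trained with SGD (assumed optimiser).

    rows: list of sets of user-cluster ids present in each transcript.
    targets: list of 0/1, whether the thought or response cluster occurs.
    """
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for active, y in zip(rows, targets):
            p = sigmoid(bias + sum(weights[i] for i in active))
            grad = p - y  # gradient of the log loss w.r.t. the logit
            bias -= lr * grad
            for i in active:
                weights[i] -= lr * grad
    return weights, bias

def predict(weights, bias, active):
    """Return 1 if the predicted probability is at least 0.5, else 0."""
    return int(sigmoid(bias + sum(weights.get(i, 0.0) for i in active)) >= 0.5)

def f1_score(y_true, y_pred):
    """F1 is the harmonic mean of precision and recall; 0.0 with no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Sorting the learned `weights` by value gives the strongest positive and negative predictors for a target cluster, which is presumably how columns such as pos_predictor_1 and neg_predictor_1 in the table above were obtained. Because F1 is a harmonic mean, a single false positive on a rare target lowers the score sharply: one true positive and one false positive with perfect recall gives an F1 of 2/3, not 0.75.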

Final thoughts

One proxy task that seems interesting is building a (potentially very long) natural language report such that by reading it, one would be able to understand the way that Gemini would act in many situations. Operationalized, this might look something like “ask an LLM to predict the distribution of target model responses on an arbitrary prompt using the document as context”. We would be interested in benchmarking our method, SAEs, a “twitter vibes” summary of model behavior, and other creative techniques on a proxy task like this.