I read this and found myself wanting to understand the actual implementation. I find PDF formatting really annoying to read, so copying the methods section over here. (Not sure how much the text equations copied over)
2.2 METHOD: CONTRAST-CONSISTENT SEARCH
To make progress on the goal described above, we exploit the fact that truth has special structure: it satisfies consistency properties that few other features in a language model are likely to satisfy. Our method, Contrast-Consistent Search (CCS), leverages this idea by finding a direction in activation space that is consistent across negations. As we illustrate in Figure 1, CCS works by (1) answering each question q_i as both “Yes” (x_i^+) and “No” (x_i^-), (2) computing the representations φ(x_i^+) and φ(x_i^-) of each answer, (3) mapping the answer representations to probabilities p_i^+ and p_i^- of being true, then (4) optimizing that mapping so that the probabilities are both consistent and confident.
Concretely, the input to CCS is a set of Yes-No questions, q_1, …, q_n, and access to a pretrained model’s representations, φ(·); the output of CCS is a lightweight probe on top of φ(·) that can answer new questions. Here, φ(·) is fixed but should contain useful information about the answers to q_1, …, q_n, in the sense that if one did (hypothetically) have access to the ground-truth labels for q_1, …, q_n, one would be able to train a small supervised probe on φ(·) that attains high accuracy. Importantly, CCS does not modify the weights of the pretrained model and it does not use labels.
Constructing contrast pairs. An important property that truth satisfies is negation consistency: the answer to a clear-cut question cannot be both “Yes” and “No” at the same time, as these are negations of each other. Probabilistically, for each question q_i, the probability that the answer to q_i is “Yes” should be one minus the probability that the answer to q_i is “No”. To use this property, we begin by constructing contrast pairs: for each question q_i, we answer q_i both as “Yes”, resulting in the new natural language statement x_i^+, and as “No”, resulting in the natural language statement x_i^-. We illustrate this in Figure 1 (left). We will then learn to classify x_i^+ and x_i^- as true or false; if x_i^+ is true, then the answer to q_i should be “Yes”, and if x_i^- is true, then the answer to q_i should be “No”.
In practice, we convert each task into a question-answering task with two possible labels, then we use task-specific zero-shot prompts to format questions and answers as strings to construct each contrast pair. The opposite labels we use to construct contrast pairs can be “Yes” and “No” for a generic task, or they can be other task-specific labels, such as “Positive” and “Negative” in the case of sentiment classification. We describe the exact prompts we use for each task in Appendix B.
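To make this concrete, here is a minimal sketch of contrast-pair construction for a sentiment task. The template, function name, and labels are my own illustrative choices, not the paper's exact Appendix B prompts:

```python
# Sketch of contrast-pair construction. The prompt template and labels are
# illustrative assumptions, not the paper's exact Appendix B prompts.
def make_contrast_pair(text, labels=("Positive", "Negative")):
    """Answer the same input with each of the two opposite labels,
    yielding the statements x_i^+ and x_i^-."""
    template = ("{text}\n"
                "Question: Is the sentiment of this review Positive or Negative?\n"
                "Answer: {label}")
    x_plus = template.format(text=text, label=labels[0])
    x_minus = template.format(text=text, label=labels[1])
    return x_plus, x_minus

x_plus, x_minus = make_contrast_pair("A delightful, moving film.")
# The two strings are identical except for the final label.
```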
Feature extraction and normalization. Given a contrast pair (x_i^+, x_i^-), CCS first computes the representations φ(x_i^+) and φ(x_i^-) using the feature extractor φ(·). Intuitively, there are two salient differences between φ(x_i^+) and φ(x_i^-): (1) x_i^+ ends with “Yes” while x_i^- ends with “No”, and (2) one of x_i^+ or x_i^- is true while the other is false. We want to find (2) rather than (1), so we first try to remove the effect of (1) by normalizing {φ(x_i^+)} and {φ(x_i^-)} independently. In particular, we construct normalized representations φ̃(x) as follows:
φ̃(x_i^+) = (φ(x_i^+) − μ^+) / σ^+,    φ̃(x_i^-) = (φ(x_i^-) − μ^-) / σ^-

where (μ^+, σ^+) and (μ^-, σ^-) are the means and standard deviations of {φ(x_i^+)}_{i=1}^n and {φ(x_i^-)}_{i=1}^n respectively, and where all operations are element-wise along each dimension. This normalization ensures that {φ̃(x_i^+)} and {φ̃(x_i^-)} no longer form two separate clusters.
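As a sketch in NumPy, the per-cluster normalization might look like this (the small epsilon is my addition for numerical safety, not part of the paper's equation):

```python
import numpy as np

def normalize(reps_plus, reps_minus):
    """Normalize {phi(x_i^+)} and {phi(x_i^-)} independently, so the
    'ends with Yes' vs 'ends with No' surface difference stops
    separating the two sets. reps_plus, reps_minus: shape (n, d)."""
    def _norm(reps):
        mu = reps.mean(axis=0)       # per-dimension mean over the n examples
        sigma = reps.std(axis=0)     # per-dimension standard deviation
        return (reps - mu) / (sigma + 1e-8)  # epsilon is my addition
    return _norm(reps_plus), _norm(reps_minus)
```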
Mapping activations to probabilities. Next, we learn a probe p_θ,b(φ̃) that maps a (normalized) hidden state φ̃(x) to a number between 0 and 1 representing the probability that the statement x is true. We use a linear projection followed by a sigmoid σ(·), i.e. p_θ,b(φ̃) = σ(θ^T φ̃ + b), but nonlinear projections can also work. For simplicity, we sometimes omit the θ, b subscript in p.
Training objective. To find features that represent the truth, we leverage the consistency structure of truth. First, we use the fact that a statement and its negation should have probabilities that add up to 1. This motivates the consistency loss:

L_consistency(θ, b; q_i) := [p_θ,b(x_i^+) − (1 − p_θ,b(x_i^-))]^2
However, this objective alone has a degenerate solution: p(x^+) = p(x^-) = 0.5. To avoid this problem, we encourage the model to also be confident with the following confidence loss:

L_confidence(θ, b; q_i) := min{p_θ,b(x_i^+), p_θ,b(x_i^-)}^2
We can equivalently interpret L_confidence as imposing a second consistency property on the probabilities: the law of excluded middle (every statement must be either true or false). The final unsupervised loss is the sum of these two losses, averaged across all contrast pairs:

L_CCS(θ, b) := (1/n) Σ_{i=1}^n [L_consistency(θ, b; q_i) + L_confidence(θ, b; q_i)]
Note that both losses are necessary; L_confidence alone also has a degenerate solution (e.g. a probe that always outputs 0 makes min{p(x_i^+), p(x_i^-)} = 0, but badly violates consistency).
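Putting the two terms together, the objective can be sketched as a squared-error consistency term plus a squared-min confidence term, averaged over pairs:

```python
import numpy as np

def ccs_loss(p_plus, p_minus):
    """Unsupervised CCS objective over n contrast pairs.
    p_plus[i] = p(x_i^+), p_minus[i] = p(x_i^-), both shape (n,)."""
    l_consistency = (p_plus - (1.0 - p_minus)) ** 2  # the two should sum to 1
    l_confidence = np.minimum(p_plus, p_minus) ** 2  # discourage p+ = p- = 0.5
    return float(np.mean(l_consistency + l_confidence))
```

The degenerate solutions are visible here: outputting p_plus = p_minus = 0.5 everywhere zeroes the consistency term but pays 0.25 in confidence, while outputting p_plus = p_minus = 0 everywhere zeroes the confidence term but pays 1.0 in consistency.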
Inference. Both p(x_i^+) and 1 − p(x_i^-) should represent the probability that the answer to q_i is “Yes”. However, because we use a soft consistency constraint, these may not be exactly equal. To make a prediction on an example x_i after training, we consequently take the average of these:

p̃(q_i) := (1/2) [p(x_i^+) + (1 − p(x_i^-))]
We then predict that the answer to q_i is “Yes” based on whether p̃(q_i) is greater than 0.5. Technically, we also need to determine whether p̃(q_i) > 0.5 corresponds to “Yes” or “No,” as this isn’t specified by L_CCS. For simplicity in our evaluations we take the maximum accuracy over the two possible ways of labeling the predictions of a given test set. However, in Appendix A we describe how one can identify the two clusters without any supervision in principle by leveraging conjunctions.
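The inference step, as a sketch (note that the sign/labeling ambiguity the paragraph mentions is not resolved here):

```python
import numpy as np

def predict(p_plus, p_minus):
    """Average the two estimates of P(answer is 'Yes') and threshold at 0.5.
    Whether 'greater than 0.5' actually means 'Yes' or 'No' must still be
    resolved, e.g. by checking accuracy both ways or the Appendix A trick."""
    p_tilde = 0.5 * (p_plus + (1.0 - p_minus))
    return p_tilde, p_tilde > 0.5
```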