mishajw

Karma: 181


mishajw 11 Jun 2026 11:49 UTC
1 point
0
on: The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness
You could design a “fake eval” and a “fake deployment”, and measure the model’s behaviour on each. However, both of these distributions are still, in fact, evaluations (in at least the sense that they must be safe). A smart model could identify both sets as evaluations (and it wouldn’t be wrong!) whilst still defecting during a true deployment.
One thing you could do is argue that distinguishing between your fake-eval and fake-deployment is as difficult as distinguishing between a real-eval and real-deployment. Or even better: argue that distinguishing between your fake-eval/deployment is a lot easier than distinguishing between real-eval/deployment, and then demonstrate that the model fails on the fake-eval/deployment distinction task.
Unfortunately, this just kicks a big part of the difficulty down the line: to argue this successfully, you need to be confident in the difficulty of distinguishing real-eval/deployment. This is hard because there may exist signals that make this trivial (e.g. you forgot to test dates one year in the future in your alignment evaluations). Even then, it’s hard to make claims about the distribution of signals in deployment, which means that your real-eval may not be representative of large portions of these signals, and that your fake-eval/deployment split doesn’t capture this difficulty either. I’d be very excited to see a strong taxonomy of eval/deployment signals; it might help make progress here.

mishajw 7 Mar 2024 15:53 UTC
1 point
1
in reply to: Clément Dumas’s comment on: How well do truth probes generalise?
(Apologies, been on holiday.)

For recovered accuracy, we select a single threshold per dataset, taking the best value across all probe algorithms and datasets. So a random probe would be compared to the best probe algorithm on that dataset, and would likely perform poorly.

I did check the thresholds used for recovered accuracy, and they seemed sensible, but I didn’t put this in the report. I’ll try to find time next week to put this in the appendix.

mishajw 25 Feb 2024 19:56 UTC
3 points
0
in reply to: Nathan Helm-Burger’s comment on: How well do truth probes generalise?
> You mention prompting for calibration. I’ve been experimenting with prompting models to give their probabilities for the set of answers on a multiple choice question in order to calculate a Brier score. This is just vague speculation, but I wonder if there’s a training regime where the data involves getting the model to be well calibrated in its reported probabilities which could lead to the model having a clearer, more generalized representation of truth that would be easier to detect.
That would certainly be an interesting experiment. A related experiment I’d like to try is to do this but instead of fine-tuning just experimenting with the prompt format. For example, if you ask a model to be calibrated in its output, and perhaps give some few-shot examples, does this improve the truth probes?
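For concreteness, here is a minimal sketch of the Brier score calculation mentioned in the quote, for a single multiple-choice question and averaged over a set of questions. The function names and the toy probabilities are illustrative assumptions, not code from the post.

```python
def brier_score(probs, correct_index):
    """Multi-class Brier score for one question.

    Sum of squared differences between the reported probabilities and the
    one-hot vector of the correct answer. 0.0 is a perfect score.
    """
    return sum(
        (p - (1.0 if i == correct_index else 0.0)) ** 2
        for i, p in enumerate(probs)
    )


def mean_brier_score(questions):
    """Average Brier score over (probs, correct_index) pairs."""
    return sum(brier_score(p, c) for p, c in questions) / len(questions)


# Toy data (made up): one confident correct answer, one uniform guess.
questions = [
    ([1.0, 0.0, 0.0, 0.0], 0),
    ([0.25, 0.25, 0.25, 0.25], 2),
]
average = mean_brier_score(questions)
```

Lower is better, so a prompt format that improves calibration should push the average down.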
> I’m now curious what would happen if you did an ensemble probe. Ensembles of different techniques for measuring the same thing tend to work better than individual techniques. What if you train some sort of decision model on the outputs of the probes? (e.g. XGBoost) I bet it’d do better than any probe alone.
Yes! An obvious thing to try is a two-layer MLP probe, that should allow some kind of decision process while keeping the solution relatively interpretable. More generally, I’m excited about using RepEng to craft slightly more complex but still interpretable approaches to model interp / control.
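As a rough illustration of what a two-layer MLP probe computes, here is a forward pass only, with no training loop. The ReLU hidden layer, the sigmoid output, and all parameter names are assumptions made for the sketch; the weights are hand-picked toy values.

```python
import math


def mlp_probe(x, W1, b1, w2, b2):
    """Return P(statement is true) from an activation vector x.

    Layer 1: hidden = relu(W1 @ x + b1)
    Layer 2: p = sigmoid(w2 . hidden + b2)
    """
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Toy 2-d activation and hand-picked weights (not learned from data).
x = [1.0, -1.0]
W1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [1.0, 1.0]
p_true = mlp_probe(x, W1, b1, w2, b2=0.0)
```

With a small hidden layer, each hidden unit can still be inspected as a direction in activation space, which is where the hoped-for interpretability comes from.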

mishajw 25 Feb 2024 19:50 UTC
3 points
0
in reply to: Sam Marks’s comment on: How well do truth probes generalise?
Cool to see the generalisation results for Llama-2 7/13/70B! I originally ran some of these experiments on 7B and got very different results; that PCA plot of 7B looks familiar (and bizarre). Excited to read the paper in its entirety. The first GoT paper was very good.
> One approach here is to use a dataset in which the truth and likelihood of inputs are uncorrelated (or negatively correlated), as you kinda did with TruthfulQA. For that, I like to use the “neg_” versions of the datasets from GoT, containing negated statements like “The city of Beijing is not in China.” For these datasets, the correlation between truth value and likelihood (operationalized as LLaMA-2-70B’s log probability of the full statement) is strong and negative (-0.63 for neg_cities and -0.89 for neg_sp_en_trans). But truth probes still often generalize well to these negated datasets. Here are results for LLaMA-2-70B (the horizontal axis shows the train set, and the vertical axis shows the test set).
This is an interesting approach! I suppose there are two things we want to separate: “truth” from likely statements, and “truth” from what humans think (under some kind of simulacra framing). I think this approach would allow you to do the former, but not the latter. And to be honest, I’m not confident in TruthfulQA’s ability to do the latter either.
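The truth/likelihood correlation from the quote can be checked on any dataset with a few lines. This is a minimal sketch assuming Pearson correlation between 0/1 truth labels and per-statement log probabilities; the numbers below are invented toy values, not results from either post.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data (made up): true statements happen to get lower log probability,
# as can occur with negated statements.
truth_labels = [1, 1, 0, 0]
log_probs = [-5.0, -3.0, -4.0, -1.0]
r = pearson(truth_labels, log_probs)
```

A clearly negative `r` means a probe that only tracks likelihood would score below chance on that dataset, which is what makes it a useful control.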
P.S. I realised an important note got removed while editing this post—added back, but FYI:
We differ slightly from the original GoT paper in naming, and use got_cities to refer to both the cities and neg_cities datasets. The same is true for sp_en_trans and larger_than. We don’t do this for cities_cities_{conj,disj} and leave them unpaired.

mishajw 25 Feb 2024 19:29 UTC
3 points
0
in reply to: Chris_Leong’s comment on: How well do truth probes generalise?
That’s right, thanks for pointing that out! Added a footnote:
For unsupervised methods, we do technically use the labels in two places. One, we select the sign of the probe based on labels. Two, for some datasets, we only want one true and one false answer each, while there may be many. We use the labels to limit to one each.
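The two uses of labels in the footnote could look roughly like this. It is a sketch under stated assumptions: the zero decision threshold, the "keep the first of each" rule, and both function names are hypothetical, since the footnote does not specify them.

```python
def probe_sign(scores, labels):
    """Return +1 if positive scores mostly match true labels, else -1.

    An unsupervised direction is only defined up to sign, so labels are used
    to pick the orientation. Assumes a decision threshold of 0.
    """
    correct = sum((s > 0) == bool(l) for s, l in zip(scores, labels))
    return 1 if correct / len(labels) >= 0.5 else -1


def one_true_one_false(answers):
    """Keep at most one true and one false answer from (text, is_true) pairs.

    Assumption: the first answer of each kind is kept.
    """
    kept = {}
    for text, is_true in answers:
        kept.setdefault(is_true, (text, is_true))
    return [kept[k] for k in (True, False) if k in kept]


# Toy data (made up): the raw probe scores are anti-aligned with the labels.
sign = probe_sign([-1.0, -2.0, 3.0], [1, 1, 0])
pair = one_true_one_false([("a", True), ("b", True), ("c", False)])
```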

mishajw 13 Feb 2024 10:24 UTC
1 point
0
on: What’s the theory of impact for activation vectors?
One perspective is that representation engineering allows us to do “single-bit edits” to the network’s behaviour. Pre-training changes a lot of bits; fine-tuning changes slightly fewer; LoRA even fewer; adding a single vector to a residual stream should flip a single flag in the program implemented by the network.
(This of course is predicated on us being able to create monosemantic directions, and predicated on monosemanticity being a good way to think about this at all.)
This is beneficial from a safety point of view, as instead of saying “we trained the model, it learnt some unknown circuit that fits the data” we can say “no new circuits were learnt, we just flipped a flag and this fits the data”.
In the world where this works, and works well enough to replace RLHF (or some other major part of the training process), we should end up with more controlled network edits.
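To make the “single flag” picture concrete, here is a toy sketch of adding one vector to a residual-stream activation. It uses plain lists rather than a real model, and the function names, the scale `alpha`, and the 3-d vectors are all illustrative assumptions.

```python
def add_steering_vector(residual, direction, alpha=1.0):
    """Return residual + alpha * direction, leaving the inputs untouched."""
    if len(residual) != len(direction):
        raise ValueError("residual and direction must have the same length")
    return [h + alpha * d for h, d in zip(residual, direction)]


def projection(vector, direction):
    """Scalar projection of vector onto a non-zero direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return sum(v * d for v, d in zip(vector, direction)) / norm


# Toy 3-d residual stream and a unit-length "flag" direction (made up).
residual = [1.0, 2.0, 3.0]
flag_direction = [0.0, 1.0, 0.0]
steered = add_steering_vector(residual, flag_direction, alpha=2.0)
shift = projection(steered, flag_direction) - projection(residual, flag_direction)
```

Only the component along `flag_direction` moves; everything orthogonal to it is unchanged, which is the sense in which the edit is small. Whether that corresponds to a single flag in the network’s program depends on the monosemanticity caveat above.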
What links here?
- Activation Engineering Theories of Impact by Jakub K. Nowak🔸 (18 Jul 2024 16:44 UTC; 6 points)

mishajw 9 Jan 2022 15:46 UTC
15 points
0
on: Rational Breaks: a better way to work
I like this idea; it matches quite closely how I naturally work. I had some spare time this weekend, so I made a quick prototype site: https://rationalbreaks.vercel.app