Interpretability
Views my own
I didn’t find the system prompt very useful on other models (I very rarely use GPT-4.5)
E.g. Gemini 2.5 Pro tends to produce longer outputs with shoehorned references when given this prompt (link one), whereas using no system prompt produces a shorter response (link two; obviously highly imperfect, but much better IMO)
Possibly @habryka has updated this?
Upweighting positive data
Data augmentation
...
It may also be worth up-weighting https://darioamodei.com/machines-of-loving-grace along with the AI optimism blog post in the training data. In general it is a bit sad that there isn’t more good writing that I know of on this topic.
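Concretely, "up-weighting" here could be as simple as sampling those documents more often in the training mixture. A toy sketch (filenames and weights are made up for illustration):

```python
# Toy sketch of up-weighting a few documents in a training mixture.
# Filenames and weights are illustrative, not from any real pipeline.
import random

corpus = [
    ("machines_of_loving_grace.txt", 5.0),  # up-weighted
    ("ai_optimism_post.txt", 5.0),          # up-weighted
    ("random_web_doc_001.txt", 1.0),
    ("random_web_doc_002.txt", 1.0),
]
docs, weights = zip(*corpus)

def sample_training_doc():
    # A weight of 5.0 means the document is drawn ~5x as often as a weight-1.0 doc.
    return random.choices(docs, weights=weights, k=1)[0]

print(sample_training_doc())
```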
the best vector for probing is not the best vector for steering
AKA the predict/control discrepancy, from Section 3.3.1 of Wattenberg and Viegas, 2024
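A toy numerical illustration of the discrepancy (my own sketch, not from the paper): when activation dimensions are correlated, the direction a linear probe converges to (roughly Σ⁻¹Δμ) can point somewhere quite different from the class-mean difference Δμ that is commonly used as a steering vector.

```python
# Toy example: with correlated "activation" dimensions, the best predictive
# direction (probe) and the class-mean difference (a common steering vector)
# have cosine similarity well below 1.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 2
cov = np.array([[1.0, 0.95], [0.95, 1.0]])   # strongly correlated dims
delta = np.array([1.0, 0.0])                 # the concept shifts only dim 0

acts_neg = rng.multivariate_normal(np.zeros(d), cov, size=n)
acts_pos = rng.multivariate_normal(delta, cov, size=n)

# "Steering-style" direction: difference of class means.
steer_vec = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)

# "Probing-style" direction: the optimal linear discriminant Sigma^{-1} @ delta,
# roughly what a well-regularised logistic probe converges to here.
probe_vec = np.linalg.solve(cov, steer_vec)

cos = steer_vec @ probe_vec / (np.linalg.norm(steer_vec) * np.linalg.norm(probe_vec))
print(f"cosine(probe direction, steering direction) = {cos:.3f}")  # well below 1.0
```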
I suggested something similar, and this was the discussion (bolding is the important author pushback; a rough sketch of my proposal follows the exchange):
Arthur Conmy
11:33 1 Dec
Why can’t the YC company avoid system prompts and instead:
1) Detect whether regex has been used in the last ~100 tokens (and run this check every ~100 tokens of model output)
2) If yes, rewind back ~100 tokens, insert a comment like # Don’t use regex here (in a valid way given what code has been written so far), and continue the generation
Dhruv Pai
10:50 2 Dec
This seems like a reasonable baseline with the caveat that it requires expensive resampling and inserting such a comment in a useful way is difficult.
When we ran baselines simply repeating the number of times we told the model not to use regex right before generation in the system prompt, we didn’t see the instruction following improve (very circumstantial evidence). I don’t see a principled reason why this would be much worse than the above, however, since we do one-shot generation with such a comment right before the actual generation.
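For reference, here is roughly what I had in mind for steps 1) and 2). `generate_text` is a stand-in for whatever sampling call is available, and the regex check is a crude string-level placeholder, so this only illustrates the control flow (the wasted chunks also show where the resampling cost Dhruv mentions comes from):

```python
# Sketch of the detect / rewind / insert-comment loop. `generate_text` and the
# regex-usage check are placeholders, not a production implementation.
import re

REGEX_USE = re.compile(r"\bimport re\b|\bre\.(search|match|sub|findall|compile)\b")
CHUNK = 100  # roughly "~100 tokens"; tokenisation details elided

def generate_with_regex_ban(prompt, generate_text, max_chunks=20):
    """generate_text(prefix, max_new_tokens) stands in for your sampler."""
    output = ""
    for _ in range(max_chunks):
        chunk = generate_text(prompt + output, max_new_tokens=CHUNK)
        if not chunk:
            break
        if REGEX_USE.search(chunk):
            # Rewind: discard the offending chunk, insert the steering comment
            # at a line boundary, and continue generating from there. Placing
            # the comment validly given the code so far is the hard part.
            output = output.rstrip() + "\n# Don't use regex here\n"
            continue
        output += chunk
    return output
```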
Here are the other GDM mech interp papers that were missed:
We also have some blog posts of a comparable standard to the Anthropic circuit updates you listed:
You use a very wide scope for “enhancing human feedback” (basically any post-training paper that mentions ‘align’-ing anything), so I will use a wide scope for what counts as mech interp too and also include:
There are a few other papers from the PAIR group, as well as from Mor Geva and Been Kim, but these mostly have Google Research affiliations, so it seems fine not to include them, as IIRC you weren’t counting pre-GDM-merger Google Research/Brain work.
The [Sparse Feature Circuits] approach can be seen as analogous to LoRA (Hu et al., 2021), in that you are constraining your model’s behavior
FWIW I consider SFC and LoRA pretty different: in practice LoRA is cheap and practical, but it can be reversed very easily and has poor worst-case performance. Sparse Feature Circuits, by contrast, is very expensive and either requires far more nodes in bigger models (forthcoming, I think) or requires studying only a subset of layers, but if it worked it would likely have far better worst-case performance.
This makes LoRA a good baseline for some SFC-style tasks, but the research experience of using the two is pretty different.
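To make the contrast concrete, here is a minimal LoRA-style layer (a generic sketch, not the exact baseline from any particular paper): the frozen weight is only ever modified through a low-rank update, which is cheap to train and trivially removable, which is what I mean by “reversed very easily”.

```python
# Minimal LoRA-style linear layer: the frozen base weight W is only modified
# through a rank-r update B @ A, which can be dropped or merged at will.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze W and bias
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        # Frozen path plus the scaled low-rank correction.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(512, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # ~8k trainable params vs ~262k frozen ones
```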
I assume all the data is fairly noisy: scanning for the domain I know in https://raw.githubusercontent.com/Oscar-Delaney/safe_AI_papers/refs/heads/main/Automated%20categorization/final_output.csv, it misses ~half of the GDM Mech Interp output from the specified window, and it also mislabels https://arxiv.org/abs/2208.08345 and https://arxiv.org/abs/2407.13692 as Mech Interp (though two labels are applied to these papers and I didn’t dig into which was used)
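For anyone wanting to repeat this kind of spot-check, something like the following works (I haven’t pinned down the column names, so it just greps every column for the two arXiv IDs mentioned above):

```python
# Spot-check the categorisation CSV: pull the rows mentioning the two arXiv IDs
# above to see how they were labelled. Column names are not assumed.
import pandas as pd

URL = ("https://raw.githubusercontent.com/Oscar-Delaney/safe_AI_papers/"
       "refs/heads/main/Automated%20categorization/final_output.csv")
df = pd.read_csv(URL)

mask = df.apply(
    lambda row: row.astype(str).str.contains("2208.08345|2407.13692", case=False).any(),
    axis=1,
)
print(df[mask])
```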
> think hard about how joining a scaling lab might inhibit their future careers by e.g. creating a perception they are “corrupted”
Does this mean something like:
1. People who join scaling labs can have their values drift, and future safety employers will by default suspect that ex-scaling-lab staff have had their values drift, or
2. If there is a non-existential AGI disaster, scaling lab staff will be looked down upon
or something else entirely?
This is even closer to home—David Gerard has commented on the Wikipedia Talk Page and referenced this LW post: https://web.archive.org/web/20250814022218/https://en.wikipedia.org/wiki/Talk:Mechanistic_interpretability#Bad_sourcing,_COI_editing