LawrenceC 6 Mar 2024 1:22 UTC
138 points
78
in reply to: Quintin Pope’s comment on: Many arguments for AI x-risk are wrong
I wasn’t around in the community in 2010-2015, so I don’t know what the state of RL knowledge was at that time. However, I dispute the claim that rationalists “completely miss[ed] this [..] interpretation”:
To be honest, it was a major blackpill for me to see the rationalist community, whose whole whole founding premise was that they were supposed to be good at making efficient use of the available evidence, so completely missing this very straightforward interpretation of RL [..] the mechanistic function of per-trajectory rewards in a given batched update was to provide the weights of a linear combination of the trajectory gradients.
Ever since I entered the community, I’ve definitely heard of people talking about policy gradient as “upweighting trajectories with positive reward/downweighting trajectories with negative reward” since 2016, albeit in person. I remember being shown a picture sometime in ²⁰¹⁶⁄₁₇ that looks something like this when someone (maybe Paul?) was explaining REINFORCE to me: (I couldn’t find it, so reconstructing it from memory)

In addition, I would be surprised if any of the CHAI PhD students when I was at CHAI from 2017->2021, many of whom have taken deep RL classes at Berkeley, missed this “upweight trajectories in proportion to their reward” intepretation? Most of us at the time have also implemented various RL algorithms from scratch, and there the “weighting trajectory gradients” perspective pops out immediately.
As another data point, when I taught MLAB/WMLB in ²⁰²²⁄₃, my slides also contained this interpretation of REINFORCE (after deriving it) in so many words:
Insofar as people are making mistakes about reward and RL, it’s not due to having never been exposed to this perspective.
That being said, I do agree that there’s been substantial confusion in this community, mainly of two kinds:
- Confusing the objective function being optimized to train a policy with how the policy is mechanistically implemented: Just because the outer loop is modifying/selecting for a policy to score highly on some objective function, does not necessarily mean that the resulting policy will end up selecting actions based on said objective.
- Confusing “this policy is optimized for X” with “this policy is optimal for X”: this is the actual mistake I think Bostom is making in Alex’s example—it’s true that an agent that wireheads achieves higher reward than on the training distribution (and the optimal agent for the reward achieves reward at least as good as wireheading). And I think that Alex and you would also agree with me that it’s sometimes valuable to reason about the global optima in policy space. But it’s a mistake to identify the outputs of optimization with the optimal solution to an optimization problem, and many people were making this jump without noticing it.
Again, I contend these confusions were not due to a lack of exposure to the “rewards as weighting trajectories” perspective. Instead, the reasons I remember hearing back in 2017-2018 for why we should jump from “RL is optimizing agents for X” to “RL outputs agents that both optimize X and are optimal for X”:
- We’d be really confused if we couldn’t reason about “optimal” agents, so we should solve that first. This is the main justification I heard from the MIRI people about why they studied idealized agents. Oftentimes globally optimal solutions are easier to reason about than local optima or saddle points, or are useful for operationalizing concepts. Because a lot of the community was focused on philosophical deconfusion (often w/ minimal knowledge of ML or RL), many people naturally came to jump the gap between “the thing we’re studying” and “the thing we care about”.
- Reasoning about optima gives a better picture of powerful, future AGIs. Insofar as we’re far from transformative AI, you might expect that current AIs are a poor model for how transformative AI will look. In particular, you might expect that modeling transformative AI as optimal leads to clearer reasoning than analogizing them to current systems. This point has become increasingly tenuous since GPT-2, but
- Some off-policy RL algorithms are well described as having a “reward” maximizing component: And, these were the approaches that people were using and thinking about at the time. For example, the most hyped results in deep learning in the mid 2010s were probably DQN and AlphaGo/GoZero/Zero. And many people believed that future AIs would be implemented via model-based RL. All of these approaches result in policies that contain an internal component which is searching for actions that maximize some learned objective. Given that ~everyone uses policy gradient variants for RL on SOTA LLMs, this does turn out to be incorrect ex post. But if the most impressive AIs seem to be implemented in ways that correspond to internal reward maximization, it does seem very understandable to think about AGIs as explicit reward optimizers.
- This is how many RL pioneers reasoned about their algorithms. I agree with Alex that this is probably from the control theory routes, where a PID controller is well modeled as picking trajectories that minimize cost, in a way that early simple RL policies are not well modeled as internally picking trajectories that maximize reward.
Also, sometimes it is just the words being similar; it can be hard to keep track of the differences between “optimizing for”, “optimized for”, and “optimal for” in normal conversation.
I think if you want to prevent the community from repeating these confusions, this looks less like “here’s an alternative perspective through which you can view policy gradient” and more “here’s why reasoning about AGI as ‘optimal’ agents is misleading” and “here’s why reasoning about your 1 hidden layer neural network policy as if it were optimizing the reward is bad”.
An aside:
In general, I think that many ML-knowledgeable people (arguably myself included) correctly notice that the community is making many mistakes in reasoning, that they resolve internally using ML terminology or frames from the ML literature. But without reasoning carefully about the problem, the terminology or frames themselves are insufficient to resolve the confusion. (Notice how many Deep RL people make the same mistake!) And, as Alex and you have argued before, the standard ML frames and terminology introduce their own confusions (e.g. ‘attention’).
A shallow understanding of “policy gradient is just upweighting trajectories” may in fact lead to making the opposite mistake: assuming that it can never lead to intelligent, optimizer-y behavior. (Again, notice how many ML academics made exactly this mistake) Or, more broadly, thinking about ML algorithms purely from the low-level, mechanistic frame can lead to confusions along the lines of “next token prediction can only lead to statistical parrots without true intelligence”. Doubly so if you’ve only worked with policy gradient or language modeling with tiny models.

Evaluations (of new AI Safety researchers) can be noisy

LawrenceC5 Feb 2023 4:15 UTC

130 points

10 comments16 min readLW link

Anthropic release Claude 3, claims >GPT-4 Performance

LawrenceC4 Mar 2024 18:23 UTC

114 points

40 comments2 min readLW link

(www.anthropic.com)

Touch reality as soon as possible (when doing machine learning research)

LawrenceC3 Jan 2023 19:11 UTC

107 points

7 comments8 min readLW link

Sam Altman: “Planning for AGI and beyond”

LawrenceC24 Feb 2023 20:28 UTC

104 points

54 comments6 min readLW link

(openai.com)

LawrenceC 12 Mar 2024 8:41 UTC
101 points
6
on: LawrenceC’s Shortform
How my views on AI(S) have changed over the last 5.5 years
This was ~~shamelessly copied from~~ directly inspired by Erik Jenner’s “How my views on AI have changed over the last 1.5 years”. I think my views when I started my PhD in Fall 2018 look a lot worse than Erik’s when he started his PhD, though in large part due to starting my PhD in 2018 and not 2022.
Apologies for the disorganized bullet points. If I had more time I would’ve written a shorter shortform.
AI Capabilities/Development
Summary: I used to believe in a 2018-era MIRI worldview for AGI, and now I have updated toward slower takeoff, fewer insights, and shorter timelines.
- In Fall of 2018, my model of how AGI might happen was substantially influenced by AlphaGo/Zero, which features explicit internal search. I expected future AIs to also feature explicit internal search over world models, and be trained mainly via reinforcement learning or IDA. I became more uncertain after OpenAI 5 (~May 2018), which used no clever techniques and just featured BPTT being ran on large LSTMs.
- That being said, I did not believe in the scaling hypothesis—that is, that simply training larger models on more inputs would continually improve performance until we see “intelligent behavior”—until GPT-2 (2019), despite encountering it significantly earlier (e.g. with OpenAI 5, or speaking to OAI people).
- In particular, I believed that we needed many “key insights” about intelligence before we could make AGI. This both gave me longer timelines and also made me believe more in fast take-off.
- I used to believe pretty strongly in MIRI-style fast take-off (e.g. would’ve assigned <30% credence that we see a 4 year period with the economy doubling) as opposed to (what was called at the time) Paul-style slow take-off. Given the way the world has turned out, I have updated substantially. While I don’t think that AI development will be particularly smooth, I do expect it to be somewhat incremental, and I also expect earlier AIs to provide significantly more value even before truly transformative aI.
- -- Some beliefs about AI Scaling Labs that I’m redacting on LW --
- My timelines are significantly shorter—I would’ve probably said median 2050-60 in 2018, but now I think we will probably reach human-level AI by 2035.
AI X-Risk
Summary: I have become more optimistic about AI X-risk, but my understanding has become more nuanced.
- My P(Doom) has substantially decreased, especially P(Doom) attributable to an AI directly killing all of humanity. This is somewhat due to having more faith that many people will be reasonable (in 2018, there were maybe ~20 FTE AIS researchers, now there are probably something like 300-1000 depending on how you count), somewhat due to believing that governance efforts may successfully slow down AGI substantially, and somewhat due to an increased belief that “winging-it”—style, “unprincipled” solutions can scale to powerful AIs.
- That being said, I’m less sure about what P(Doom) means. In 2018, I imagined the main outcomes were either “unaligned AGI instantly defeats all of humanity” and “a pure post-scarcity utopia”. I now believe in a much wider variety of outcomes.
- For example, I’ve become more convinced both that misuse risk is larger than I thought, and that even weirder outcomes are possible (e.g. the AI keeps human (brain scans) around due to trade reasons). The former is in large part related to my belief in fast take-off being somewhat contradicted by world events; now there is more time for powerful AIs to be misused.
- I used to think that solving the technical problem of AI alignment would be necessary/sufficient to prevent AI x-risk. I now think that we’re unlikely to “solve alignment” in a way that leads to the ability to deploy a powerful Sovereign AI (without AI assistance), and also that governance solutions both can be helpful and are required.
AI Safety Research
Summary: I’ve updated slightly downwards on the value of conceptual work and significantly upwards on the value of fast empirical feedback cycles. I’ve become more bullish on (mech) interp, automated alignment research, and behavioral capability evaluations.
- In Fall 2018, I used to think that IRL for ambitious value learning was one of the most important problems to work on. I no longer think so, and think that most of my work on this topic was basically useless.
- In terms of more prosaic IRL problems, I very much lived in a frame of “the reward models are too dumb to understand” (a standard academic take) . I didn’t think much about issues of ontology identification or (malign) partial observability.
- I thought that academic ML theory had a decent chance of being useful for alignment. I think it’s basically been pretty useless in the past 5.5 years, and no longer think the chances of it being helpful “in time” are enough. It’s not clear how much of this is because the AIS community did not really know about the academic ML theory work, but man, the bounds turned out to be pretty vacuous, and empirical work turned out far more informative than pure theory work.
- I still think that conceptual work is undervalued in ML, but my prototypical good conceptual work looks less like “prove really hard theorems” or “think about philosophy” and a lot more like “do lots of cheap and quick experiments/proof sketches to get grounding”.
- Relatedly, I used to dismiss simple techniques for AI Alignment that try “the obvious thing”. While I don’t think these techniques will scale (or even necessarily work well on current AIs), this strategy has turned out to be significantly better in practice than I thought.
- My error bars around the value of reading academic literature have shrunk significantly (in large part due to reading a lot of it). I’ve updated significantly upwards on “the academic literature will probably contain some relevant insights” and downwards on “the missing component of all of AGI safety can be found in a paper from 1983″.
- I used to think that interpretability of deep neural networks was probably infeasible to achieve “in time” if not “actually impossible” (especially mechanistic interpretability). Now I’m pretty uncertain about its feasibility.
- Similarly, I used to think that having AIs automate substantial amounts of alignment research was not possible. Now I think that most plans with a shot of successfully preventing AGI x-risk will feature substantial amounts of AI.
- I used to think that behavioral evaluations in general would be basically useless for AGIs. I now think that dangerous capability evaluations can serve as an important governance tool.
Bonus: Some Personal Updates
Summary: I’ve better identified my comparative advantages, and have a healthier way of relating to AIS research.
- I used to think that my comparative advantage was clearly going to be in doing the actual technical thinking or theorem proving. In fact, I used to believe that I was unsuited for both technical writing and pushing projects over the finish line. Now I think that most of my value in the past ~2 years has come from technical writing or by helping finish projects.
- I used to think that pure engineering or mathematical skill were what mattered, and feel sad about how it seemed that my comparative advantage was something akin to long term memory.^[1] I now see more value in having good long-term memory.
- I used to be uncertain about if academia was a good place for me to do research. Now I’m pretty confident it’s not.
- Embarrassingly enough, in 2018 I used to implicitly believe quite strongly in a binary model of “you’re good enough to do research” vs “you’re not good enough to do research”. In addition, I had an implicit model that the only people “good enough” were those who never failed at any evaluation. I no longer think this is true.
- I am more of a fan of trying obvious approaches or “just doing the thing”.
1. ^
  I think, compared to the people around me, I don’t actually have that much “raw compute” or even short term memory (e.g. I do pretty poorly on IQ tests or novel math puzzles), and am able to perform at a much higher level by pattern matching and amortizing thinking using good long-term memory (if not outsourcing it entirely by quoting other people’s writing).

Paper: Discovering novel algorithms with AlphaTensor [Deepmind]

LawrenceC5 Oct 2022 16:20 UTC

82 points

18 comments1 min readLW link

(www.deepmind.com)

LawrenceC 22 Nov 2022 18:44 UTC
80 points
26
on: Meta AI announces Cicero: Human-Level Diplomacy play (with dialogue)
It’s worth noting that they built quite a complicated, specialized AI system (ie they did not take an LLM and finetune a generalist agent that also can play diplomacy):
- First, they train a dialogue-conditional action model by behavioral cloning on human data to predict what other players will do.
- They they do joint RL planning to get action intentions of the AI and other payers using the outputs of the conditional action model and a learned dialogue-free value model. (They use also regularize this plan using a KL penalty to the output of the action model.)
- They also train a conditional dialogue model that maps by finetuning a small LM (a 2.7b BART) to map intents + game history > messages. Interestingly, this model is trained in a way that makes it pretty honest by default.
- They train a set of filters to remove hallucinations, inconsistencies, toxicity, leaking its actual plans, etc from the output messages, before sending them to other players.
- The intents are updated after every message. At the end of each turn, they output the final intent as the action.
I do expect someone to figure out how to avoid all these dongles and do it with a more generalist model in the next year or two, though.

I think people who are freaking out about Cicero moreso than foundational model scaling/prompting progress are wrong; this is not much of an update on AI capabilities nor an update on Meta’s plans (they were publically working on diplomacy for over a year). I don’t think they introduce any new techniques in this paper either?

It is an update upwards on the competency of this team of Meta, a slight update upwards on the capabilities of small LMs, and probably an update upwards on the amount of hype and interest in AI.
What links here?
- Zack_M_Davis's comment on Alexander and Yudkowsky on AGI goals by Scott Alexander (25 Jan 2023 19:05 UTC; 30 points)
- Against sacrificing AI transparency for generality gains by Ape in the coat (7 May 2023 6:52 UTC; 4 points)

OpenAI/Microsoft announce “next generation language model” integrated into Bing/Edge

LawrenceC7 Feb 2023 20:38 UTC

79 points

4 comments1 min readLW link

(blogs.microsoft.com)

Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic)

LawrenceC16 Dec 2022 22:12 UTC

68 points

11 comments1 min readLW link

(www.anthropic.com)

Paper: The Capacity for Moral Self-Correction in Large Language Models (Anthropic)

LawrenceC16 Feb 2023 19:47 UTC

65 points

9 comments1 min readLW link

(arxiv.org)

LawrenceC 24 Oct 2014 10:05 UTC
64 points
on: 2014 Less Wrong Census/Survey
Just completed my first survey!

LawrenceC 15 Jan 2024 7:56 UTC
LW: 57 AF: 39
13
AF
on: How “Discovering Latent Knowledge in Language Models Without Supervision” Fits Into a Broader Alignment Scheme
This is a review of both the paper and the post itself, and turned more into a review of the paper (on which I think I have more to say) as opposed to the post.
Disclaimer: this isn’t actually my area of expertise inside of technical alignment, and I’ve done very little linear probing myself. I’m relying primarily on my understanding of others’ results, so there’s some chance I’ve misunderstood something. Total amount of work on this review: ~8 hours, though about 4 of those were refreshing my memory of prior work and rereading the paper.
TL;DR: The paper made significant contributions by introducing the idea of unsupervised knowledge discovery to a broader audience and by demonstrating that relatively straightforward techniques may make substantial progress on this problem. Compared to the paper, the blog post is substantially more nuanced, and I think that more academic-leaning AIS researchers should also publish companion blog posts of this kind. Collin Burns also deserves a lot of credit for actually doing empirical work in this domain when others were skeptical. However, the results are somewhat overstated and, with the benefit of hindsight, (vanilla) CCS does not seem to be a particularly promising technique for eliciting knowledge from language models. That being said, I encourage work in this area.^[1]
Introduction/Overview
The paper “Discovering Latent Knowledge in Language Models without Supervision” by Collin Burns, Haotian Ye, Dan Klein, and Jacob Steinhardt (henceforth referred to as “the CCS paper” for short) proposes a method for unsupervised knowledge discovery, which can be thought of as a variant of empirical, average-case Eliciting Latent Knowledge (ELK). In this companion blog post, Collin Burns discusses the motivations behind the paper, caveats some of the limitations of the paper, and provides some reasons for why this style of unsupervised methods may scale to future language models.
The CCS paper kicked off a lot of waves in the alignment community when it came out. The OpenAI Alignment team was very excited about the paper. Eliezer Yudkowsky even called it “Very Dignified Work!”. There’s also been a significant amount of followup work that discusses or builds on CCS, e.g. these Alignment Forum Posts:
- Contrast Pairs Drive the Empirical Performance of Contrast-Consistent Search (CCS) by Scott Emmons
- What Discovering Latent Knowledge Did and Did Not Find by Fabien Roger
As well as the following papers:^[2]
- Still No Lie Detector for Language Models by Levinstein and Herrmann
- The Geometry of Truth: Emergent Linear Structure in Larger Language Model Representations of True/False Datasets by Sam Marks and Max Tegmark
- Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness? by Kevin Liu, Stephen Casper, Dylan Hadfield-Menell, and Jacob Andreas
- Challenges with unsupervised LLM knowledge discovery by Farquhar et al., which also has a companion AF piece.
So it seems a pity that no one has provided a review for this post. This is my attempt to fix that.
Unfortunately, this review has ballooned into a much longer post. To make it more digestible, I’ve divided it into sections:
1. The CCS paper and follow-up work
2. The post itself
Overall, I give this post a high but not maximally high rating. I think that the paper made significant contributions, albeit with important caveats and limitations. While I also have some quibbles with the post, I think the post does a good job of situating the paper and the general research direction in an alignment scheme. Collin Burns also deserves significant credit for pioneering the research direction in general; many people at the time (including myself) were decently surprised by the positive results.
The Paper and Follow-up Work
The headline method, Contrast-Consistent Search (CCS), learns a linear probe^[3] that predicts the probabilities of a binary label, without supervised data.^[4] CCS does this by first generating “contrast pairs” of statements with positive and negative answers,^[5] and then maximizes the consistency and confidence of the probabilities for each pair. It then combines the predictions on the positive/negative answers to get a number that corresponds to either the probability of the “true”, correct answer or the “false”, incorrect answer. To evaluate this method, the authors consider whether assigning this classifier to “true” or “false” leads to higher test accuracy, and then pick the higher-accuracy assignment.^[6] They show that this lets them outperform directly querying the model by ~4% on average across a selection of datasets, and can work in cases where the prompt is misleading.
Important takeaways
Here’s some of my thoughts on some useful updates I made as a result of this paper. You can also think of this as the “strengths” section, though I only talk about things in the paper that seem true to me and don’t e.g. praise the authors for having very clear writing and for being very good at conveying and promoting their ideas.
Linear probes are pretty good for recovering salient features, and some truth-like feature is salient on many simple datasets. CCS only uses linear probes, and achieves good performance. This suggests that some feature akin to “truthiness” is represented linearly inside of the model for these contrast pairs, a result that seems somewhat borne out in follow-up work, albeit with major caveats.
I also think that this result is consistent with other results across a large variety of papers. For example, Been Kim and collaborators were using linear probes to do interp on image models as early as 2017, while attaching linear classification heads to various segments of models has been a tradition since before I started following ML research (i.e. before ~late 2015). In terms of more recent work, we’ve not only seen work on toy-ish models showing that small transformers trained on Othello and Chess have linear world representations, but we’ve seen that simple techniques for finding linear directions can often be used to successfully steer language models to some extent. For a better summary of these results, consider reading the Representation Engineering and Linear Representation Hypothesis papers, as well as the 2020 Circuits thread on “Features as Directions”.
Simple empirical techniques can make progress. The CCS paper takes a relatively straightforward idea, and executes on it well in an empirical setting. I endorse this strategy and think more people in AIS (but not the ML community in general) should do things like this.
Around late 2022, I became significantly more convinced that the basic machine learning technique of “try the simplest thing”. The CCS work was a significant reason for this, because I expected it to fail and was pleasantly surprised by the positive results. I also think that I’ve updated upwards on the fact that executing “try the simplest thing” well is surprisingly difficult. I think that even in cases where the “obvious” thing is probably doomed to fail, it’s worth having someone try it anyway, because 1) you can be wrong, and more importantly 2) the way in which it fails can be informative. See also, obvious advice by Nate Soares and this tweet by Sam Altman.
It’s worth studying empirical, average-case ELK. In particular, I think that it’s worth “doing the obvious thing” when it comes to ELK. My personal guess is that (worst-case) ELK is really hard, and that simple linear probes are unlikely to work for it because there’s not really an actual “truth” vector represented by LLMs. However, there’s still a chance that it might nonetheless work in the average case. (See also, this discussion of empirical generalization.) I think ultimately, this is a quantitative question that needs to be resolved empirically – to what extent can we find linear directions inside LLMs that correspond very well with truth? What does the geometry of the LLM activation space actually look like?
Caveating claims from the CCS Paper
While I hold an overall positive view of the paper, I do think that some of the claims in the paper have either not held up over time, or are easily understood to mean something false. I’ll talk about some of my quibbles below.
The CCS algorithm as stated does not seem to reliably recover a robust “truth” direction. The first and biggest problem is that CCS does not perform that well at its intended goal. Notably, CCS classifiers may pick up on other prominent features instead of the intended “truth” direction (even sometimes on contrast pairs that only differ in the label!). Some results from the Still No Lie Detector for Language Models suggest that this may be because CCS is representing which labels are positive versus negative (i.e. the normalization in the CCS paper does not always work to remove this information). Note that Collin does discuss this as a possible issue (for future language models) in the blog post, but this is not discussed in the paper.
In addition, all of the Geometry of Truth paper, Fabien’s results on CCS, and the Challenges with unsupervised knowledged discovery paper note that CCS tends to be very dependent on the prompt for generative LMs.
Does CCS work for the reasons the authors claim it does? From reading the introduction and the abstract, one might get the impression that the key insight is that truth satisfies a particular consistency condition, and thus that the consistency loss proposed in the paper is driving a lot of the results.
However, I’m reasonably confident that it’s the contrast pairs that are driving CCS’s performance. For example, the Challenges with unsupervised knowledged discovery paper found that CCS does not outperform other unsupervised clustering methods. And as Scott Emmons notes, this is supported by section 3.3.3, where two other unsupervised clustering methods are competitive with CCS. Both the Challenges paper and Scott Emmon’s post also argue that CCS’s consistency loss is not particularly different from normal unsupervised learning losses. On the other hand, there’s significant circumstantial evidence that carefully crafted contrast pairs alone often define meaningful directions, for example in the activation addition literature.
There’s also two proofs in the Challenges paper, showing “that arbitrary features satisfy the consistency structure of [CCS]”, that I have not had time to fully grok and situate. But insofar as you can take this claim is correct, this is further evidence against CCS’s empirical performance being driven by its consistency loss.
A few nitpicks on misleading presentation. One complaint I have is that the authors use the test set to decide if higher numbers from their learned classifier correspond to “true” or “false”. This is mentioned briefly in a single sentence in section two (“For simplicity in our evaluations we take the maximum accuracy over the two possible ways of labeling the predictions of a given test set.”) and then one possible solution is discussed in Appendix A but not implemented. Also worth noting that the evaluation methodology used gives an unfair advantage to CCS (as it can never do worse than random chance on the test set, while many of the supervised methods perform worse than random).
This isn’t necessarily a big strike against the paper: there’s only so much time for each project and the CCS paper already contains a substantial amount of content. I do wish that this were more clearly highlighted or discussed, as I think that this weakens the repeated claims that CCS “uses no labels” or “is completely unsupervised”.
My Take
I think that the paper made significant contributions by significantly signal boosting the idea of unsupervised knowledge discovery,^[7] and showed that you can achieve decent performance by contrast pairs and consistency checks. It also has spurred a large amount of follow-up work and interest in the topic.
However, the paper is somewhat misleading in its presentation, and the primary driver of performance seems to be the construction of the contrast pairs and not the loss function. Follow-up work has also found that the results of the paper can be brittle, and suggest that CCS does not necessarily find a singular “truth direction”.
The Post
The post starts out by discussing the setup and motivation of unsupervised knowledge recovery. Suppose you wanted to make a model “honestly” report its beliefs. When the models are sub human or even human level, you can use supervised methods. Once the models are superhuman, these methods probably won’t scale for many reasons. However, if we used unsupervised methods, there might not be a stark difference between human-level and superhuman models, since we’re not relying on human knowledge.
The post then goes into some intuitions for why it might be possible: interp on vision models has found that models seem to learn meaningful features, and models seem to linearly represent many human-interpretable features.
Then, Collin talks about the key takeaways from his paper, and also lists many caveats and limitations of his results.
Finally, the post concludes by explaining the challenges of applying CCS-style knowledge discovery techniques to powerful future LMs, as well as why Collin thinks that these unsupervised techniques may scale.
A minor nitpick:
Collin says:
I think this is surprising because before this it wasn’t clear to me whether it should even be possible to classify examples as true or false from unlabeled LM representations *better than random chance*!
As discussed above, the methodology in the paper guarantees that any classifier will do at least as good at random chance. I’m not actually sure how much worse if you orient the classifier on the training set as opposed to the test set. (And I’d be excited for someone to check this!)
My Take
I think this blog post is quite good, and I wish more academic-adjacent ML people would write blog posts caveating their results and discussing where they fit in. I especially appreciated the section on what the paper does and does not show, which I think accurately represents the evidence presented in the paper (as opposed to overhyping or downplaying it). In addition, I think Collin makes a strong case for studying more unsupervised approaches to alignment.
I also strongly agree with Collin that it can be “extremely valuable to sketch out what a full solution could plausibly look like given [your] current model of how deep learning systems work”, and wish more people would do this.

—
Acknowledgments
Thanks to Aryan Bhatt, Stephen Casper, Adria-Garriga Alonso, David Rein, and others for helpful conversations on this topic. Thanks for Raymond Arnold for poking me into writing this.

(EDIT Jan 15th: added my footnotes back in to the comment, which were lost during the copy/paste I did from Google Docs.)
1. ^
  This originally was supposed to link to a list of projects I’d be excited for people to do in this area, but I ran out of time before the LW review deadline.
2. ^
  I also draw on evidence from many, many other papers in related areas, which unfortunately I do not have time to list fully. A lot of my intuitions come from work on steering LLMs with activation additions, conversations with Redwood researchers on various kinds of coup or deception probes, and linear probing in general.
3. ^
  That is, a direction with an additional bias term.
4. ^
  Unlike other linear probing techniques, CCS does this without needing to know if the positive or negative answer is correct. However, some amount of labeled data is needed later to turn this pair into a classifier for truth/falsehood.
5. ^
  I’ll use “Positive” and “Negative” for the sake of simplicity in this review, though in the actual paper they also consider “Yes” and “No” as well as different labels for news classification and story completion.
6. ^
  Note that this gives their method a small advantage in their evaluation, since CCS gets to fit a binary parameter on the test set while the other methods do not. Using a fairer baseline does significantly negatively affect their headline numbers, but I think that the results generally hold up anyways. (I haven’t actually written any code for this review, so I’m trusting the reports of collaborators who have looked into it, as well as Fabien’s results that use randomly initialized directions with the CCS evaluation methodology instead of trained CCS directions.)
  
  Also, while writing this review, I originally thought that this issue was addressed in section 3.3.3, but that only addresses fitting the classifiers with fewer contrast pairs, as opposed to deciding whether the combined classifier corresponds to correct or incorrect answers. After spending ~5 minutes staring at the numbers and thinking that they didn’t make sense, I realized my mistake. Ugh.
7. ^
  This originally said “by introducing the idea”, but some people who reviewed the post convinced me otherwise. It’s definitely an introduction for many academics, however.
8. ^
  The authors call this a “direction” in their abstract, but CCS really learns two directions. This is a nitpick.
What links here?
- Voting Results for the 2022 Review by Ben Pace (2 Feb 2024 20:34 UTC; 57 points)

Open Phil releases RFPs on LLM Benchmarks and Forecasting

LawrenceC11 Nov 2023 3:01 UTC

53 points

0 comments2 min readLW link

(www.openphilanthropy.org)

Paper: Superposition, Memorization, and Double Descent (Anthropic)

LawrenceC5 Jan 2023 17:54 UTC

53 points

11 comments1 min readLW link

(transformer-circuits.pub)

LawrenceC(Lawrence Chan)

How my views on AI(S) have changed over the last 5.5 years

AI Capabilities/​Development

AI X-Risk

AI Safety Research

Bonus: Some Personal Updates

Introduction/​Overview

The Paper and Follow-up Work

Important takeaways

Caveating claims from the CCS Paper

My Take

The Post

My Take

Acknowledgments

AI Capabilities/Development

Introduction/Overview