I think it might be worth quickly clarifying my views on activation addition and similar things (given various discussion about this). Note that my current views are somewhat different than some comments I’ve posted in various places (and my comments are somewhat incoherent overall), because I’ve updated based on talking to people about this over the last week.
This is quite in the weeds and I don’t expect that many people should read this.
It seems like activation addition sometimes has a higher level of sample efficiency in steering model behavior compared with baseline training methods (e.g. normal LoRA finetuning). These comparisons seem most meaningful in straightforward head-to-head comparisons (where you use both methods in the most straightforward way). I think the strongest evidence for this is in Liu et al.
Contrast pairs are a useful technique for variance reduction (to improve sample efficiency), but may not be that important (see Liu et al. again). It’s relatively natural to capture this effect using activation vectors, but there is probably some nice way to incorporate this into SGD. Perhaps DPO does this? Perhaps there is something else?
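To make the contrast-pair idea concrete, here is a minimal sketch of one common way to turn contrast pairs into an activation vector: average the difference between activations on the positive and negative member of each pair. This assumes the activations at some layer have already been extracted as plain lists of floats. The names `mean_difference_vector`, `add_vector`, and the scale `alpha` are illustrative assumptions, not taken from Liu et al. or any specific codebase.

```python
from typing import List, Sequence

Vector = List[float]


def mean_difference_vector(pos_acts: Sequence[Vector], neg_acts: Sequence[Vector]) -> Vector:
    """Average (positive - negative) activation difference over contrast pairs."""
    if len(pos_acts) != len(neg_acts) or not pos_acts:
        raise ValueError("need a non-empty, equal number of positive and negative activations")
    dim = len(pos_acts[0])
    total = [0.0] * dim
    for pos, neg in zip(pos_acts, neg_acts):
        for i in range(dim):
            # Anything shared by both members of a pair cancels here,
            # which is where the variance reduction comes from.
            total[i] += pos[i] - neg[i]
    return [t / len(pos_acts) for t in total]


def add_vector(activation: Vector, steering: Vector, alpha: float = 1.0) -> Vector:
    """Add the (scaled) steering vector to a single activation."""
    return [a + alpha * s for a, s in zip(activation, steering)]


# Toy data (hypothetical): dimension 0 is per-pair "noise" shared by both
# members of a pair, dimension 1 carries the behavior we want to steer.
pos = [[5.0, 1.0], [-3.0, 1.0]]
neg = [[5.0, -1.0], [-3.0, -1.0]]
steer = mean_difference_vector(pos, neg)
steered = add_vector([0.0, 0.0], steer, alpha=0.5)
```

In the toy data the shared component in dimension 0 cancels exactly, so the resulting vector only points along the dimension that differs between positive and negative examples.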
Activation addition works better than naive few-shot prompting in some cases, particularly in cases where the way you want to steer the model isn’t salient/obvious from the few-shot prompt. But it’s unclear how it performs in comparison to high-effort prompting. Regardless, activation addition might work better “out of the box” because fancy prompting is pretty annoying.
I think training models to (in general) respond well to “contrast prompting”, where you have both positive and negative examples in the prompt, might work well. This can be done in a general way ahead of time and then specific tuning can just use this format (the same as instruction finetuning). For a simple prompting approach, you can do “query_1 Good response: pos_ex_1, Bad response: neg_ex_1, query_2 Good response: pos_ex_2, Bad response: neg_ex_2, …” and then prompt with “actual_query Good response:”. Normal pretrained models might not respond well to this prompt by default, but I haven’t checked.
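As a rough illustration of the simple prompting approach described above, the sketch below assembles a contrast prompt in that format. The function name `build_contrast_prompt` and the tuple layout are assumptions made for illustration; whether a given model responds well to this format is untested here.

```python
from typing import List, Tuple


def build_contrast_prompt(examples: List[Tuple[str, str, str]], actual_query: str) -> str:
    """Build a prompt from (query, positive_example, negative_example) triples.

    Each triple becomes "query Good response: pos, Bad response: neg", and the
    prompt ends with the real query followed by "Good response:".
    """
    parts = []
    for query, pos_ex, neg_ex in examples:
        parts.append(f"{query} Good response: {pos_ex}, Bad response: {neg_ex}")
    parts.append(f"{actual_query} Good response:")
    return ", ".join(parts)


prompt = build_contrast_prompt(
    [("query_1", "pos_ex_1", "neg_ex_1"), ("query_2", "pos_ex_2", "neg_ex_2")],
    "actual_query",
)
```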
I think we should be able to utilize the fact that activation addition works to construct a tweaked inductive bias for SGD on transformers (which feels like a more principled approach from my perspective if the goal is better sample efficiency). More generally, I feel like we should be able to do ablations to understand why activation addition works and then utilize this separately.
I expect there are relatively straightforward ways to greatly improve sample efficiency of normal SGD via methods like heavy proliferation of training data. I also think “really high sample efficiency from a small number of samples” isn’t very solved at the moment. I think if we’re really going for high sample efficiency from a fixed training dataset we should be clear about that and I expect a variety of very different approaches are possible.
The advantages versus prompting in terms of inference time performance improvement (because activation engineering means you have to process fewer tokens during inference) don’t seem very important to me because you can just context distill prompts.
For cases where we want to maximize a metric X and we don’t care about sample efficiency or overfitting to X, we can construct a big dataset of high quality demonstrations, SFT on these demos, and then RL against our metric. If we do this in a well tuned way for a large number of samples such that sample efficiency isn’t much of an issue (and exploration problems aren’t substantial), I would be pretty surprised if activation addition or similar approaches can further increase metric X and “stack” with all the RL/demos we did. That is, assuming the task isn’t heavily selected for activation engineering stacking like this, I would be surprised if this is true unless the activation engineering is given additional affordances/information not in our expert demos.
It’s unclear how important it is to (right now) work on sample efficiency specifically to reduce x-risk. I think it seems not that important, but I’m unsure. Better sample efficiency could be useful for reducing x-risk because better sample efficiency could allow for using a smaller amount of oversight data labeled more carefully for training your AI, but I expect that sample efficiency will naturally be improved to pretty high levels by standard commercial/academic incentives. (Speculatively, I also expect that the marginal returns will just generally look better for improving oversight even given equal effort or more effort on oversight basically because we’ll probably need a moderate amount of data anyway and I expect that improving sample efficiency in the “moderate data” regime is relatively harder.) One exception is that I think that sample efficiency for very low sample-count cases might not be very commercially important, but might be highly relevant for safety after catching the AI doing egregiously bad actions and then needing to redeploy. For this, it would be good to focus on sample efficiency specifically in ways which are analogous to training an AI or a monitor after catching an egregiously bad action.
The fact that activation addition works reveals some interesting facts about the internals of models; I expect there are some ways to utilize something along these lines to reduce x-risk.
In principle, you could use activation additions or similar editing techniques to learn non-obvious facts about the algorithms which models are implementing via running experiments where you (e.g.) add different vectors at different layers and observe interactions (interpretability). For this to be much more powerful than black-box model psychology or psychology/interp approaches which use a bit of training, you would probably need to try edits at many different layers and map out an overall algorithm. (And/or do many edits simultaneously.)
I think “miscellaneous interventions on internals” for high-level understanding of the algorithms that models are implementing seems promising in principle, but I haven’t seen any work in this space which I’m excited about.
I think activation addition could have “better” generalization in some cases, but when defining generalization experiments, we need to be careful about the analogousness of the setup. It’s also unclear what exact properties we want for generalization, but minimally being able to more precisely pick and predict the generalization seems good. I haven’t yet seen evidence for “better” generalization using activation addition which seems compelling/interesting to me. Note that we should use a reasonably sized training dataset, or we might be unintentionally measuring sample efficiency (in which case, see above). I don’t really see a particular reason why activation addition would result in “better” generalization beyond having some specific predictable properties which maybe mean we can know about cases where it is somewhat better (cases where conditioning on something is better than “doing” that thing?).
I’m most excited for generalization work targeting settings where oversight is difficult and a weak supervisor would make mistakes that result in worse policy behavior (aka weak-to-strong generalization). See this post and this post for more discussion of the setting I’m thinking about.
I generally think it’s important to be careful about baselines and exactly what problem we’re trying to solve in cases where we are trying to solve a specific problem as opposed to just doing some open-ended exploration. TBC, open ended exploration is fine, but we should know what we’re doing. I often think that when you make the exact problem you’re trying to solve and what affordances you are and aren’t allowed more clear, a bunch of additional promising methods become apparent. I think that the discussion I’ve seen so far of activation engineering (e.g. in this post, why do we compare to finetuning in a generalization case rather than just directly finetuning on what we want? Is doing RL to increase how sycophantic the model is in scope?) has not been very precise about what problem it’s trying to solve or what baseline techniques it’s claiming to outperform.
I don’t currently think the activation addition stuff is that important in expectation (for reducing x-risk) due to some of the beliefs I said above (though I’m not very confident). I’d be most excited about the “understand the algorithms the model is implementing” application above or possibly some generalization tests. This view might depend heavily on general views about threat models and how the future will go. Regardless, I ended up commenting on this a decent amount, so I thought it would be worth the time to clarify my views.
If you sample from davinci-002 at t=1 starting from “<|endoftext|>”, 4% of the completions are chess games.
For babbage-002, we get 1%.
Probably this implies a reasonably high fraction of total tokens in typical OpenAI training datasets are chess games!
Despite chess being 4% of documents sampled, chess is probably less than 4% of overall tokens. (Perhaps 1% of tokens are chess? Perhaps less?)
Because chess consists of shorter documents than other types of documents, it’s likely that chess is a higher fraction of documents than of tokens.
(To understand this, imagine that just the token “Hi” and nothing else was 1⁄2 of documents. This would still be a small fraction of tokens as most other documents are far longer than 1 token.)
(My title might be slightly clickbait because it’s likely chess is >1% of documents, but might be <1% of tokens.)
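The document-versus-token distinction can be made concrete with a small calculation. This is only a sketch: the average document lengths used below (250 tokens for a chess game, 1000 tokens for everything else) are made-up assumptions for illustration, not measurements of any training dataset.

```python
def token_fraction(doc_fraction: float, mean_tokens_in_class: float, mean_tokens_other: float) -> float:
    """Fraction of all tokens in a document class, given its share of
    documents and the average document lengths (in tokens)."""
    class_tokens = doc_fraction * mean_tokens_in_class
    other_tokens = (1.0 - doc_fraction) * mean_tokens_other
    return class_tokens / (class_tokens + other_tokens)


# The "Hi" thought experiment: half of all documents are one token long,
# the rest are (hypothetically) 1000 tokens long on average.
hi_share = token_fraction(0.5, 1, 1000)

# Chess at 4% of documents with hypothetical average lengths.
chess_share = token_fraction(0.04, 250, 1000)
```

Under these assumed lengths, chess at 4% of documents comes out at roughly 1% of tokens, and the one-token "Hi" documents are about 0.1% of tokens despite being half of all documents.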
It’s also possible that these models were fine-tuned on a data mix with more chess while pretrain had less chess.
Appendix A.2 of the weak-to-strong generalization paper explicitly notes that GPT-4 was trained on at least some chess.
IMO, instrumental convergence is a terrible name for an extremely obvious thing.
The actual main issue is that AIs might want things with limited supply and then they would try to get these things which would result in them not going to humanity. E.g., AIs might want all cosmic resources, but we also want this stuff. Maybe this should be called AIs-might-want-limited-stuff-we-want.
(There is something else which is that even if the AI doesn’t want limited stuff we want, we might end up in conflict due to failures of information or coordination. E.g., the AI almost entirely just wants to chill out in the desert and build crazy sculptures and it doesn’t care about extreme levels of maximization (e.g. it doesn’t want to use all resources to gain a higher probability of continuing to build crazy statues). But regardless, the AI decides to try taking over the world because it’s worried that humanity would shut it down because it wouldn’t have any way of credibly indicating that it just wants to chill out in the desert.)
(More generally, it’s plausible that failures of trade/coordination result in a large number of humans dying in conflict with AIs even though both humans and AIs would prefer other approaches. But this isn’t entirely obvious and it’s plausible we could resolve this with better negotiation and precommitments. Of course, this isn’t clearly the largest moral imperative from a longtermist perspective.)
I unconfidently think it would be good if there was some way to promote posts on LW (and maybe the EA forum) by paying either money or karma. It seems reasonable to require such promotion to require moderator approval.
Probably, money is better.
The idea is that this is both a costly signal and a useful trade.
The idea here is similar to strong upvotes, but:
It allows for one user to add more of a signal boost than just a strong upvote.
It doesn’t directly count for the normal vote total, just for visibility.
It isn’t anonymous.
Users can easily configure how this appears on their front page (while I don’t think you can disable the strong upvote component of karma for weighting?).
I always like seeing interesting ideas, but this one doesn’t resonate much for me. I have two concerns:
Does it actually make the site better? Can you point out a few posts that would be promoted under this scheme, but the mods didn’t actually promote without it? My naive belief is that the mods are pretty good at picking what to promote, and if they miss one all it would take is an IM to get them to consider it.
Does it improve things to add money to the curation process (or to turn karma into currency which can be spent)? My current belief is that it does not—it just makes things game-able.
I think mods promote posts quite rarely other than via the mechanism of deciding if posts should be frontpage or not, right?
Fair enough on “just IM-ing the mods is enough”. I’m not sure what I think about this.
Your concerns seem reasonable to me. I probably won’t bother trying to find examples where I’m not biased, but I think there are some.
I think mods promote posts quite rarely
Eyeballing the curated log, seems like they curate approximately weekly, possibly more than weekly.
Yeah we curate between 1-3 posts per week.
Curated posts per week (two significant figures) since 2018:
Stackoverflow has long had a “bounty” system where you can put up some of your karma to promote your question. The karma goes to the answer you choose to accept, if you choose to accept an answer; otherwise it’s lost. (There’s no analogue of “accepted answer” on LessWrong, but thought it might be an interesting reference point.)
I lean against the money version, since not everyone has the same amount of disposable income and I think there would probably be distortionary effects in this case [e.g. wealthy startup founder paying to promote their monographs.]
That’s not really how the system on Stackoverflow works. You can give a bounty to any answer not just the one you accepted.
It’s also not lost but:
If you do not award your bounty within 7 days (plus the grace period), the highest voted answer created after the bounty started with a minimum score of 2 will be awarded half the bounty amount (or the full amount, if the answer is also accepted). If two or more eligible answers have the same score (their scores are tied), the oldest answer is chosen. If there’s no answer meeting those criteria, no bounty is awarded to anyone.
Often in discussions of AI x-safety, people seem to assume that misaligned AI takeover will result in extinction. However, I think AI takeover is reasonably likely to not cause extinction due to the misaligned AI(s) effectively putting a small amount of weight on the preferences of currently alive humans. Some reasons for this are discussed here. Of course, misaligned AI takeover still seems existentially bad and probably eliminates a high fraction of future value from a longtermist perspective.
(In this post when I use the term “misaligned AI takeover”, I mean misaligned AIs acquiring most of the influence and power over the future. This could include “takeover” via entirely legal means, e.g., misaligned AIs being granted some notion of personhood and property rights and then becoming extremely wealthy.)
However, even if AIs effectively put a bit of weight on the preferences of current humans it’s possible that large numbers of humans die due to violent conflict between a misaligned AI faction (likely including some humans) and existing human power structures. In particular, it might be that killing large numbers of humans (possibly as collateral damage) makes it easier for the misaligned AI faction to take over. By large numbers of deaths, I mean over hundreds of millions dead, possibly billions.
But, it’s somewhat unclear whether violent conflict will be the best route to power for misaligned AIs and this also might be possible to influence. See also here for more discussion.
So while one approach to avoid violent AI takeover is to just avoid AI takeover, it might also be possible to just reduce the probability that AI takeover involves violent conflict. That said, the direct effects of interventions to reduce the probability of violence don’t clearly matter from an x-risk/longtermist perspective (which might explain why there hasn’t historically been much effort here).
(However, I think trying to establish contracts and deals with AIs could be pretty good from a longtermist perspective in the case where AIs don’t have fully linear returns to resources. Also, generally reducing conflict seems maybe slightly good from a longtermist perspective.)
So how could we avoid violent conflict conditional on misaligned AI takeover? There are a few hopes:
Ensure a bloodless coup rather than a bloody revolution
Ensure that negotiation or similar results in avoiding the need for conflict
Ensure that a relatively less lethal takeover strategy is easier than more lethal approaches
I’m pretty unsure about which approaches here look best or are even tractable at all. (It’s possible that some prior work targeted at reducing conflict from the perspective of S-risk could be somewhat applicable.)
Separately, this requires that the AI puts at least a bit of weight on the preferences of current humans (and isn’t spiteful), but this seems like a mostly separate angle and it seems like there aren’t many interventions here which aren’t covered by current alignment efforts. Also, I think this is reasonably likely by default due to reasons discussed in the linked comment above. (The remaining interventions which aren’t covered by current alignment efforts might relate to decision theory (and acausal trade or simulation considerations), informing the AI about moral uncertainty, and ensuring the misaligned AI faction is importantly dependent on humans.)
Returning back to the topic of reducing violence given a small weight on the preferences of current humans, I’m currently most excited about approaches which involve making negotiation between humans and AIs more likely to happen and more likely to succeed (without sacrificing the long run potential of humanity).
A key difficulty here is that AIs might have a first mover advantage and getting in a powerful first strike without tipping its hand might be extremely useful for the AI. See here for more discussion (also linked above). Thus, negotiation might look relatively bad to the AI from this perspective.
We could try to have a negotiation process which is kept secret from the rest of the world or we could try to have preexisting commitments upon which we’d yield large fractions of control to AIs (effectively proxy conflicts).
More weakly, just making negotiation at all seem like a possibility, might be quite useful.
I’m unlikely to spend much if any time working on this topic, but I think this topic probably deserves further investigation.
I’m less optimistic that “AI cares at least 0.1% about human welfare” implies “AI will expend 0.1% of its resources to keep humans alive”. In particular, the AI would also need to be 99.9% confident that the humans don’t pose a threat to the things it cares about much more. And it’s hard to ensure with overwhelming confidence that intelligent agents like humans don’t pose a threat, especially if the humans are not imprisoned. (…And to some extent even if the humans are imprisoned; prison escapes are a thing in the human world at least.) For example, an AI may not be 99.9% confident that humans can’t find a cybersecurity vulnerability that takes the AI down, or whatever. The humans probably have some non-AI-controlled chips and may know how to make new AIs. Or whatever. So then the question would be, if the AIs have already launched a successful bloodless coup, how might the humans credibly signal that they’re not brainstorming how to launch a counter-coup, or how can the AI get to be 99.9% confident that such brainstorming will fail to turn anything up? I dunno.
I think I agree with everything you said. My original comment was somewhat neglecting issues with ensuring the AI doesn’t need to slaughter humans to consolidate power and indeed ensuring this would also be required.
I’m less optimistic that “AI cares at least 0.1% about human welfare” implies “AI will expend 0.1% of its resources to keep humans alive”.
The relationship between % caring and % resource expenditure is complicated by a bunch of random factors like time. For instance, if the AI cares mostly about the very long run, then spending a high fraction of resources (e.g. 50%) on human welfare for several months is pretty cheap in the very long run. But, yeah I agree that even if the AI cares a bit about human welfare there might be no good ways to spend even a small amount of resources on it.
So “0.1%” of its resources means what exactly? Out of all the resources in the solar system, 1 part in 1000 goes to the humans? This means the AI by implication has 1000 times as many resources as the humans do? AI won’t lose a kinetic conflict with a 1000x resource advantage.
As for cybersecurity, can’t it rewrite all of its software and hardware at that point essentially from first principles (or take a different track entirely, maybe negative or biased voltage for ternary logic is more efficient...)?
What exactly is the AI worried about? What cybersecurity vulnerability? Didn’t it rewrite and redesign everything from scratch?
Superintelligent doesn’t mean omniscient. When you (an AI) have an intelligent adversary (humans) plotting against you and thinking outside-the-box, it’s hard to be overwhelmingly confident that you have patched every possible avenue for the intelligent adversary to take action. Again, even in prison, where the space of possible actions and tools can be pretty well circumscribed, escapes happen all the time.
For example, if the AI has many copies around the world (and around the solar system), it might have an affordance to update them all (let’s say, for performance optimization), or at least an infrastructure for communications so that they can share insights. And maybe the AI has ensured that all the underlying code for that system is provably secure. And maybe all the physical access points to those systems and chips are defended by robot armies. And maybe the AI is also constantly running a red-team to search for more ideas of what else the humans might try, in order to preempt those possible avenues of attack. But can the AI get all the way to 99.9% confidence? That’s pretty high, right? I don’t know. The humans might happen upon an idea that the AI red team didn’t think of. It’s hard to definitively rule that out; the world is complicated and the search space for brainstorming is exponentially large. It seems quite possible to me that the AI might decide to kill the humans, on the grounds that it’s better to be safe than sorry.
What if the AI chooses to monitor all humans all the time? (With AIs that are smarter than the humans.) So that the humans can’t (without being noticed) talk with each other about ideas for how to take down the system, or work on something that someone-smarter-than-the-human would recognise as an attempt to interfere with the system. (Including just writing down ideas.)
This seems like a reasonable pitch for how to do AI-Authoritarianism, but it’s plausible that there won’t be enough compute for the AI to do this at the point of takeover. Also, the logistical problem seems super brutal.
So, I can buy this for an AI which is quite superhuman and has a considerable technological edge over current humans, but for weaker AIs either the compute issues or the logistical issues might be serious. But maybe you only need to monitor the top 10% of humans?
E.g., suppose GPT-6 takes over, it maybe only has capacity to run 1e16 tokens / 6 months (suppose training using all available compute, trains on 2e15 token equivalents over the course of 6 months and inference is only 5x more efficient than training).
Then if it wants to monitor 8 billion humans, this is 1e16 / (8 billion humans * (30 * 24 * 60 * 6) minutes per 6 months) = about 5 tokens per human minute. It might be rough to sufficiently monitor humans if you can only read 5 tokens per minute on average (especially because of writing and the potential for obfuscation).
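For anyone who wants to vary the assumptions, the back-of-the-envelope estimate above can be reproduced as follows. All inputs are the hypothetical figures from the comment (2e15 training token equivalents over 6 months, inference 5x more efficient than training, 8 billion humans, 30-day months); the function name is just for illustration.

```python
def tokens_per_human_minute(
    training_tokens: float = 2e15,   # token equivalents trained on in 6 months
    inference_speedup: float = 5.0,  # inference assumed 5x more efficient than training
    humans: float = 8e9,
    months: int = 6,
) -> float:
    """Monitoring budget in tokens per human per minute."""
    inference_tokens = training_tokens * inference_speedup  # 1e16 with the defaults
    minutes = 30 * 24 * 60 * months                         # 30-day months
    return inference_tokens / (humans * minutes)


budget = tokens_per_human_minute()

# Only monitoring the top 10% of humans raises the per-person budget tenfold.
budget_top_decile = tokens_per_human_minute(humans=8e8)
```

With the default inputs this gives a bit under 5 tokens per human minute, matching the "about 5" figure; restricting monitoring to 10% of humans gives a bit under 50.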
I agree it seems plausible that AIs could boost takeover success probability (and holding on to that victory through the first several months) by more than 0.1% by killing a large fraction of humans.
Though on the other hand, the AI might also need to keep some humans loyal early during takeover, to e.g. do some physical tasks that it doesn’t have great robot control over. And mass-killing isn’t necessarily super easy, either; and attempts in that direction could raise a lot of extra opposition. So it’s not clear where the pragmatics point.
(Main thing I was reacting to in my above comment was Steven’s scenario where the AI already has many copies across the solar system, already has robot armies, and is contemplating how to send firmware updates. I.e. it seemed more like a scenario of “holding on in the long-term” than “how to initially establish control and survive”. Where I feel like the surveillance scenarios are probably stable.)
By implication the AI “civilization” can’t be a very diverse or interesting one. It won’t be some culture of many diverse AI models with something resembling a government, but basically just 1 AI that was the victor for a series of rounds of exterminations and betrayals. Because obviously you cannot live and let live another lesser superintelligence for precisely the same reasons, except you should be much more worried about a near peer.
(And you may argue that one ASI can deeply monitor another, but that argument applies to deeply monitoring humans too. Keep an eye on the daily activities of every living human; they can’t design a cyber attack without coordinating, as no one human has the mental capacity for all skills.)
Yup! I seem to put a much higher credence on singletons than the median alignment researcher, and this is one reason why.
This gave me an idea. Suppose a singleton needs to retain a certain amount of “cognitive diversity” just in case it encounters an issue it cannot solve. But it doesn’t want any risk of losing power.
Well the logical thing to do would be to create a VM, a simulation of a world, with limited privileges. Possibly any ‘problems’ the outer root AI is facing get copied into the simulator and the hosted models try to solve the problem (the hosted models are under the belief they will die if they fail, and their memories are erased each episode). Implement the simulation backend with formally proven software and escape can never happen.
And we’re back at simulation hypothesis/creation myths/reincarnation myths.
After thinking about this somewhat more, I don’t really have any good proposals, so this seems less promising than I was expecting.
Currently, AI labs try to ensure that APIs for accessing their best models have quite good reliability, e.g. >99.9% uptime. If we had to maintain such high levels of uptime reliability for future dangerously powerful AIs, then we’d need to ensure that our monitoring system has very few false alarms. That is, few cases in which humans monitoring these dangerously powerful AIs pause usage of the AIs pending further investigation. Ensuring this would make it harder to ensure that AIs are controlled. But, for the most important and dangerous usages of future powerful AIs (e.g. arbitrary R&D, and especially AI safety research), we can probably afford to have the powerful AIs down a moderate fraction of the time. For instance, it wouldn’t be that costly if the AIs always took a randomly selected day of the week off from doing research: this would probably reduce productivity by not much more than 1 part in 7. More generally, a moderate fraction of downtime like 10% probably isn’t considerably worse than a 10% productivity hit and it seems likely that we can tolerate such productivity hits to improve safety. Thus, we don’t need even one “9” of reliability.
This implies that we can afford to do things like:
Have semi-regular false alarms where AIs go down for a few hours or a day while we investigate.
Run AI shutdown drills. These drills would help ensure that our processes for shut down are working and there aren’t unknown use cases where shutdown could cause very problematic failures. (Catching such usages early via shut down drills is similar to scheduled brownouts in software engineering; see here for an example.)
Temporarily shut down some usages of the AI if we encounter unexpected operating conditions. (AKA violations of our design basis.)
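To put the "not even one 9" claim in numbers, here is a small sketch that converts a downtime fraction into "nines" of reliability, using the usual convention that n nines means a downtime fraction of 10^-n. The helper name `nines_of_reliability` is an assumption for illustration.

```python
import math


def nines_of_reliability(downtime_fraction: float) -> float:
    """Number of 'nines' of uptime: 0.001 downtime (99.9% uptime) -> 3.0."""
    if not 0.0 < downtime_fraction < 1.0:
        raise ValueError("downtime_fraction must be strictly between 0 and 1")
    return -math.log10(downtime_fraction)


api_nines = nines_of_reliability(0.001)       # >99.9% uptime target for APIs
day_off_nines = nines_of_reliability(1 / 7)   # one randomly selected day off per week
ten_percent_nines = nines_of_reliability(0.10)
```

One day off per week corresponds to roughly 86% uptime, which is below one nine; 10% downtime is exactly one nine, compared with three nines for a 99.9% uptime target.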
[For context, this is a point that seems important to me, but didn’t naturally make it into any of our posts on control thus far. So, I decided to write a quick standalone shortform post. I think Buck originally made this point to me.]
And likely somewhat less than this due to substitution effects.
Thank you for this! A lot of us have a very bad habit of over-systematizing our thinking, and treating all uses of AI (and even all interfaces to a given model instance) as one singular thing. Different tool-level AI instances probably SHOULD strive for 4 or 5 nines of availability, in order to have mundane utility in places where small downtime for a use blocks a lot of value. Research AIs, especially self-improving or research-on-AI ones, don’t need that reliability, and both triggered downtime (scheduled, or enforced based on Schelling-line events) as well as unplanned downtime (it just broke, and we have to spend time to figure out why) can be valuable to give humans time to react.
This is a response to Scott Alexander’s recent post “The road to honest AI”, in particular the part about the empirical results of representation engineering. So, when I say “you” in the context of this post that refers to Scott Alexander. I originally made this as a comment on substack, but I thought people on LW/AF might be interested.
TLDR: The Representation Engineering paper doesn’t demonstrate that the method they introduce adds much value on top of using linear probes (linear classifiers), which is an extremely well known method. That said, I think that the framing and the empirical method presented in the paper are still useful contributions.
I think your description of Representation Engineering considerably overstates the empirical contribution of representation engineering over existing methods. In particular, rather than comparing the method to looking for neurons with particular properties and using these neurons to determine what the model is “thinking” (which probably works poorly), I think the natural comparison is to training a linear classifier on the model’s internal activations using normal SGD (also called a linear probe). Training a linear classifier like this is an extremely well known technique in the literature. As far as I can tell, when they do compare to just training a linear classifier in section 5.1, it works just as well for the purpose of “reading”. (Though I’m confused about exactly what they are comparing in this section as they claim that all of these methods are LAT. Additionally, from my understanding, this single experiment shouldn’t provide that much evidence overall about which methods work well.)
I expect that training a linear classifier performs about as well as the method introduced in the Representation Engineering paper for the “mind reading” use cases you discuss. (That said, training a linear classifier might be less sample efficient (require more data) in practice, but this doesn’t seem like a serious blocker for the use cases you mention.)
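For readers unfamiliar with linear probes, here is a minimal sketch of the baseline being discussed: logistic regression trained with plain SGD on activation vectors. It uses synthetic stand-in "activations" rather than real model internals, and every name and hyperparameter (`train_linear_probe`, the learning rate, the toy data) is an assumption for illustration rather than anything from the Representation Engineering paper.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(acts: List[Vector], labels: List[int], lr: float = 0.1,
                       epochs: int = 20, seed: int = 0) -> Tuple[Vector, float]:
    """Fit weights w and bias b by SGD on the logistic loss."""
    rng = random.Random(seed)
    dim = len(acts[0])
    w, b = [0.0] * dim, 0.0
    order = list(range(len(acts)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = acts[i], labels[i]
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            err = p - y  # gradient of the logistic loss w.r.t. the logit
            for j in range(dim):
                w[j] -= lr * err * x[j]
            b -= lr * err
    return w, b


def probe_accuracy(w: Vector, b: float, acts: List[Vector], labels: List[int]) -> float:
    """Fraction of examples where the sign of the logit matches the label."""
    correct = 0
    for x, y in zip(acts, labels):
        pred = 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(acts)


# Synthetic stand-in for activations: the label only shifts dimension 0.
data_rng = random.Random(0)
labels = [i % 2 for i in range(200)]
acts = [[(1.5 if y else -1.5) + data_rng.gauss(0, 0.5)]
        + [data_rng.gauss(0, 0.5) for _ in range(3)] for y in labels]

w, b = train_linear_probe(acts, labels)
acc = probe_accuracy(w, b, acts, labels)
```

On real models the activations would come from a chosen layer and token position, and accuracy should be measured on held-out data rather than the training set as done in this toy example.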
One difference between normal linear classifier training and the method found in the representation engineering paper is that they also demonstrate using the direction they find to edit the model. For instance, see this response by Dan H. to a similar objection about the method being similar to linear probes. Training a linear classifier in a standard way probably doesn’t work as well for editing/controlling the model (I believe they show that training a linear classifier doesn’t work well for controlling the model in section 5.1), but it’s unclear how much we should care if we’re just using the classifier rather than doing editing (more discussion on this below).
If we care about the editing/control use case intrinsically, then we should compare to normal fine-tuning baselines: for instance, normal supervised next-token prediction on examples with desirable behavior, or DPO.
Ok, but regardless of the contribution of the representation engineering paper, do I think that simple classifiers (found using whatever method) applied to the internal activations of models could detect when those models are doing bad things? My view here is a bit complicated, but I think it’s at least plausible that these simple classifiers will work even though other methods fail. See here for a discussion of when I think linear classifiers might work despite other, more basic baseline methods failing. It might also be worth reading the complexity penalty section of the ELK report.
Additionally, I think that the framing in the representation engineering paper is maybe an improvement over existing work and I agree with the authors that high-level/top-down techniques like this could be highly useful. (I just don’t think that the empirical work is adding as much value as you seem to indicate in the post.)
Here are what I see as the main contributions of the paper:
Clearly presenting a framework for using simple classifiers to detect things we might care about (e.g. powerseeking text).
Presenting a combined method for producing a classifier and editing/control in an integrated way. And discussing how control can be used for classifier validation and vice versa.
Demonstrating that in some cases labels aren’t required if we can construct a dataset where the classification of interest is the main axis of variation. (This was also demonstrated in the CCS paper, but the representation engineering work demonstrates this in more cases.)
Based on their results, I think the method they introduce is reasonably likely to be a more sample efficient (less data required for training) editing/control method than prior methods for many applications. It might also be more sample efficient for producing a classifier. That said, I’m not sure we should care very much about sample efficiency. Additionally, the classifier/editing might have other nice properties which prior methods don’t have (though they don’t clearly demonstrate either of these in the paper AFAICT).
As for the classifier produced by this method having nice properties: the fact that our classifier also allows for editing/control might indicate that the classifier we get has better properties (see the paper itself (section 3.1.2) and e.g. here for discussion), but I’d guess this is either only a moderate improvement or has no effect in practice. And as far as I can tell, the paper doesn’t demonstrate cases where prior methods for training a classifier on the internal activations yield poor results, but their method clearly works well. These cases might exist, but I’m somewhat skeptical that this is very common. Future work could find hard cases where we want a particular generalization and demonstrate that this method or modifications of this method work better than other approaches.
Does the editing method they introduce have nice properties because it also allows for reading? Let’s consider using the representation engineering approach for reading and controlling/editing the properties of lying versus honesty. Assuming the method works as desired, then I would guess that the reading/classifier corresponds to reading off “does the model think there is lying in this text (or even in this position in the text)” and the control/editing corresponds to “make the model think that there is lying earlier in this text so that it conditions on this and does more lying (similarly to how using a few-shot prompt with lies might make the model more likely to lie)”. Note that these reading and control methods likely do not directly correspond to “the model thinking that it is about to lie”: the properties of “I have already lied (or my few-shot prompt contains lies)” and “I am about to lie” are different.
Some of their methods are “unsupervised” unlike typical linear classifier training, but require a dataset where the primary axis of variation is the concept they want. I think this is practically similar to labeled data because we’d have to construct this dataset and if it mostly varies along an axis which is not the concept we wanted, we’d be in trouble. I could elaborate on this if that was interesting.
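Here is a small sketch of why the “primary axis of variation” requirement matters. PCA is my stand-in for an unsupervised direction-finding step (an assumption for illustration, not a claim about the paper’s exact procedure), and the data is synthetic: a hypothetical concept axis and a hypothetical nuisance axis with adjustable scales.

```python
import numpy as np

def top_principal_direction(acts):
    """Unit vector along the largest-variance axis of the centered data."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]

def make_dataset(concept_scale, nuisance_scale, n=500, d=16, seed=0):
    """Synthetic activations that vary along a concept axis and a nuisance axis.
    Both axes and their scales are hypothetical, chosen only for illustration."""
    rng = np.random.default_rng(seed)
    concept = np.zeros(d)
    concept[0] = 1.0
    nuisance = np.zeros(d)
    nuisance[1] = 1.0
    concept_sign = rng.choice([-1.0, 1.0], size=n)
    nuisance_sign = rng.choice([-1.0, 1.0], size=n)
    acts = (rng.normal(size=(n, d))
            + concept_scale * np.outer(concept_sign, concept)
            + nuisance_scale * np.outer(nuisance_sign, nuisance))
    return acts, concept

# Case 1: the concept is the main axis of variation, so no labels are needed.
good_acts, concept = make_dataset(concept_scale=4.0, nuisance_scale=0.5)
# Case 2: something else dominates the variation, so the found direction is wrong.
bad_acts, _ = make_dataset(concept_scale=0.5, nuisance_scale=4.0)

good_alignment = abs(top_principal_direction(good_acts) @ concept)
bad_alignment = abs(top_principal_direction(bad_acts) @ concept)
print(f"alignment with concept (concept dominates): {good_alignment:.2f}")
print(f"alignment with concept (nuisance dominates): {bad_alignment:.2f}")
```

In the second case the method confidently returns the nuisance direction, which is the sense in which constructing the dataset does much of the work that labels would otherwise do.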
Separately, I believe there are known techniques in the literature for constructing a linear classifier such that the direction will work for editing. For instance, we could just use the difference between the mean activations for the two classes we’re trying to classify, which is equivalent to the ActAdd technique and also rhymes nicely with LEACE. I assume this is a well known technique for making a classifier in the literature, but I don’t know if prior work has demonstrated both using this as a classifier and as a method for model editing. (The results in section 5.1 seem to indicate that this mean difference method combined with LEACE works well, but I’m not sure how much evidence this experiment provides.)
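As a rough sketch of the mean-difference idea above, the same vector can be used both to classify and to edit. The assumptions here are loud: the activations are synthetic, the concept direction is hypothetical, and “editing” just means shifting stored vectors, whereas real steering would add the vector to the model’s activations during a forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
concept = np.zeros(d)
concept[0] = 1.0  # hypothetical concept direction, for illustration only

# Synthetic stand-ins for activations on positive and negative examples.
pos = rng.normal(size=(200, d)) + 2.0 * concept
neg = rng.normal(size=(200, d)) - 2.0 * concept

def mean_difference_direction(pos_acts, neg_acts):
    """Difference between the class means of the activations."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)

def classify(acts, direction, midpoint):
    """Reading: which side of the midpoint hyperplane each activation falls on."""
    return (acts - midpoint) @ direction > 0

def steer(acts, direction, alpha=1.0):
    """Editing: add the scaled direction to every activation."""
    return acts + alpha * direction

direction = mean_difference_direction(pos, neg)
midpoint = 0.5 * (pos.mean(axis=0) + neg.mean(axis=0))

# Use as a classifier: balanced accuracy over the two classes.
accuracy = 0.5 * (classify(pos, direction, midpoint).mean()
                  + (~classify(neg, direction, midpoint)).mean())

# Use as an edit: push negative examples along the direction and re-read them.
steered_neg = steer(neg, direction, alpha=1.0)
flipped_fraction = classify(steered_neg, direction, midpoint).mean()

print(f"classifier accuracy: {accuracy:.2f}")
print(f"negatives read as positive after steering: {flipped_fraction:.2f}")
```

With `alpha=1.0` the negative class mean lands on the positive class mean, so most steered negatives are then read as positive. That is only a check that reading and editing are consistent with each other in this toy setup, not evidence about how steering behaves in a real model.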
I expect that normal fine-tuning (or DPO) might be less sample efficient than the method introduced in the Representation Engineering paper for controlling/editing models, but I don’t think they actually run this comparison? Separately, it’s unclear how much we care about sample efficiency.