I don’t think this contradicts your stated opinion as I understand it, but I think a few things are worth noting (though on some of these I speak partly from ignorance):
1. The mindset that goes into high reliability engineering (HRE) could carry over to applications of AI that are somewhat narrow, but not so narrow that the AI fails to add significant utility. For example, a lot of general-ish agent deployments are AFAIK catastrophically unsafe/insecure by default right now (openclaw, a bunch of agentic apps that can be prompt-injected into changing upstream state that should not change, etc.). If the general culture changed, this would mitigate risk (at least the banal risks caused by humans, which can still be pretty bad: bad enough to cause extinction).
2. Narrow AI can still have a lot of utility. I’m thinking here of AI that is bounded in what it can do in space, but which can accelerate progress in time. For example, an AI that is superhuman at mathematical proofs (that are verifiable) can accelerate mathematical research, but it doesn’t have much wider impact on other parts of the world (unless someone acts on its proofs). An AI that can quickly implement software in sandboxes is analogous. AI that can search across all of human knowledge, or simulate outcomes without taking actions, would also fall into this category and be immensely useful, increasing economic output. Obviously, the AI has to “not break out”. By bounded in “space” I mean that the set of things the AI can impact in the physical world (i.e., “space”) is very small (e.g., limited to the memory/disk of a computer system). I think there are enough such settings that you could have a future (if we had better coordination) where everyone gets access to great AIs, but only a few people get access to super duper genius AIs and only apply them very carefully in controlled, narrow settings (where they would use HRE).
3. HRE is probably critical to the “not breaking out” part of (2), and probably important for the best possible initial deployments of the not-quite-AGI-but-almost-there AIs that we are likely to see in the near future.
4. It’s reasonable to think that “not breaking out” is hopeless, but making the effort may delay a breakout long enough that alignment (and other salient technologies) will have progressed to the point that there is a lower chance things go dreadfully wrong.
I think this is accurate. Relatedly (not as a retort or anything; I’m curious what people think): is it bad or immoral (or something along those lines) for individuals who are passionate (via intellectual interest, etc.) about specific AI safety verticals to pursue them?
For example, there are plenty of people who like the science of deep learning/interpretability/vaguely anything that involves understanding how DNNs “work”. It’s probably pretty easy to argue that this is unlikely to be the highest-impact way to reduce existential risk. Even if I’m wrong in thinking that this is easy to argue, you could probably argue it for a lot of the specific projects people are engaging in. However, I think it’s easy to tell yourself that you are doing such research because of, and to mitigate, existential risk.
In such a (hypothetical) set of scenarios, let’s assume these individuals would be actively unhappy doing direct/straightforward DIP-like policy/politics/outreach work, and slightly worse at it than their peers. So basically these hypothetical people need to pick between something useful and something fun.
So the question becomes: what should these people do?
More broadly, I think doing straightforward work that is more likely to solve the problems of our society just tends not to be what people are individually interested in doing, and thus we see “The Spectre” and its cousins in AI safety and probably other fields too (my generalization here is to situations where people do X because of Y but claim it’s because of Z, where Y provides a bit more individual utility and Z provides a bit more societal utility).
EDIT: I don’t actually know much about Control AI, and I see some people are contesting some of the assumptions here (such as the claim that Control AI has been successful). I don’t have time to read through that. I think it broadly doesn’t matter to my post here, because my post is (1) not about Control AI in particular, but about cases in AIS where you have X, Y, and Z as described above, and (2) on priors, this 4-step-plan stuff seems kind of self-evidently what needs to be done to get a lot of needed actions to occur (i.e., regulation, etc.).