I think it’s plausible that perpetuating “every alignment person needs to read the Sequences / Rationality A-Z” is inefficient, or maybe even harmful.
For example, to the extent that alignment needs more really good machine learning engineers, those engineers might benefit less from the Sequences than a conceptual alignment researcher would.
However, relying on anecdotal evidence seems potentially unnecessary. We might be able to use polls, or otherwise systematically investigate the relationship between interest in and engagement with the Sequences and the various paths to contributing to AI alignment. A prediction market might also work for information aggregation.
I’d bet that, all else equal, engagement with the Sequences is beneficial, but that the effect might be less pronounced among those who grew up in academically inclined cultures.
Thanks Swimmer963! This was very interesting.
I have a general question for the community. Does anyone know of similar descriptions of model limitations, with this many examples, for language models such as GPT-3?
My personal experience is that visual output is inherently more intuitive, but I’d love to build intuition about language models through an equivalent article for, say, GPT-3 or PaLM.
I’d predict with high confidence that such articles exist, but finding one of sufficient quality might be trickier. I’m curious which articles commenters here would pick in particular.