TT Self Study Journal # 2
[Epistemic Status: This is an artifact of my self study. I am using it to remember links and help manage my focus. As such, I don’t expect anyone to fully read it. If you have particular interest or expertise, skip to the relevant sections, and please leave a comment, even just to say “good work/good luck”. I’m hoping for a feeling of accountability, and I would like input from peers and mentors. This may also serve as a guide for others who wish to study in a similar way. ]
Previous Entry: SSJ #1
Next Entry: SSJ #3
List of acronyms: Mechanistic Interpretability (MI), AI Alignment (AIA), Outcome Influencing System (OIS), Vanessa Kosoy’s Learning Theoretic Agenda (VK LTA), Large Language Model (LLM), n-Dimensional Scatter Plot (NDSP).
Review of 1st Sprint
My goals for this sprint were:
SSJ--1: Finish writing the first draft of the definition section of my OIS article.
SSJ--2: Read VK LTA and write a small summary with my thoughts.
SSJ--3:
Email some professors at UVic to see if I can have some conversations about my interests and other math topics that may be valuable.
Start studying Topoi and Linear Algebra textbooks.
SSJ--4:
Read Neel’s “Mech Interp Prereqs”.
Do some research and write a little bit about my plans for messing around with LLMs in some capacity.
SSJ--5:
Review my NDSP notes.
Experiment with tensorflowjs.
So how did it go?
SSJ--1 -- Finish writing the first draft of the definition section of my OIS article.
I wrote a good amount in the “Definitions and Properties” section, particularly in the “OIS: Outcome Influencing System” subsection. There’s still a lot more to do in the rest of the section. I’ve been reconsidering how I define “system”, and I would like to review how the term is used in other contexts. I’ve been torn between the desire for the definition to be broad enough to include objects like Outcome Pumps, and the desire for it to be concrete enough to usefully constrain thinking and support deductive predictions.
I think in many places I am going to attempt a Bostromian exploration and naming of possibilities, that is, to make my definitions as broad as possible, but then give names for, and hopefully examples of, possible properties and categories within the broad definition. This approach gives the sense of overview and context that I want, but it may produce overly cumbersome names; in that case it is probably worth finding more concise names for particularly relevant objects, even if they can be described with broader terminology.
Even though the OIS document is far from complete, I would be grateful for any early readers who want to offer input, whether on the ideas, the document structure, or the grammar and flow of the writing.
In the next sprint I think I will continue working on the definitions section, and I may also start on the interdisciplinary section to help me decide how to approach the “system” and “substrate” parts of the definition.
SSJ--2 -- Read VK LTA and write a small summary with my thoughts.
I did a small amount of reading for this. Nothing much to say yet.
SSJ--3a—Email some professors at UVic to see if I can have some conversations about my interests and other math topics that may be valuable.
I didn’t get around to this, but it still seems like a good idea to me.
SSJ--3b—Start studying Topoi and Linear Algebra textbooks.
I started reading Topoi. I went through the definition of categories and began looking at the list of weird categories that build intuition for the definition. I find it fun when they draw a diagram and ask whether it commutes.
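For reference (and to test my recall), here is the definition as I remember it; a LaTeX sketch from memory, not a quotation from the book:

```latex
% Sketch from memory of the definition of a category;
% see Topoi for the authoritative version.
A category $\mathcal{C}$ comprises objects $a, b, c, \dots$ and arrows
$f, g, h, \dots$, where each arrow $f : a \to b$ has a domain
$\mathrm{dom}(f) = a$ and a codomain $\mathrm{cod}(f) = b$, such that:
\begin{itemize}
  \item each pair $f : a \to b$, $g : b \to c$ has a composite
        $g \circ f : a \to c$;
  \item composition is associative:
        $h \circ (g \circ f) = (h \circ g) \circ f$;
  \item each object $a$ has an identity arrow $1_a : a \to a$, with
        $f \circ 1_a = f$ and $1_b \circ f = f$ for every $f : a \to b$.
\end{itemize}
```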
SSJ--4a—Read Neel’s “Mech Interp Prereqs”.
I have read Neel’s Concrete Steps to Get Started in Transformer Mechanistic Interpretability. Neel walks us through what he thinks an MI researcher should be familiar with.
The broad categories (with my explanations):
Machine Learning (ML) -- All current AI is, of course, built on top of ML.
Transformer Architecture—LLMs, the largest and most sophisticated AI models, are built with Transformers. To study LLMs, or even smaller “LMs”, one must be familiar with the architecture.
Tooling—There are many good tools for doing MI work. To do good work in any field, you must be familiar with the tools and use them well.
Current MI literature—One cannot contribute to current work without a familiarity with it. Well, that’s not fully true, but it’s true enough most of the time.
How to think about evidence in MI—How do we know we are effectively studying the internal workings of ML models rather than fooling ourselves? Careful focus on epistemology and evidence is required when working at the edge of human knowledge.
Neel then goes on to recommend three paths for exploration:
Paper—Choose an MI paper and engage with it deeply, including reproducing some of its results, potentially building on those results and doing original research.
Concrete Problem—Identify an open problem in MI and spend some time trying to solve it.
Literature Review—Do research on a particular topic with the goal of writing a summary of the current research.
My thoughts on these recommendations
ML: I think I have a fairly good base of ML knowledge. Admittedly, I failed a course on theoretical ML last semester (luckily it wasn’t required for my program), but that was because it focused on proving various bounds in online learning contexts (e.g., applying Rademacher complexity or Hoeffding’s Lemma to prove regret bounds). It wasn’t a lack of interest, but overwhelm while also struggling with my Topology class. I would like to understand this kind of ML theory more deeply, especially as it relates to AIXI and VK LTA, but truly my interest is more in semantic spaces and how we can prove things about encoded goals and preferences. Proving regret bounds feels a bit like yak shaving or bike shedding in the context of AIA. Anyway, the point is that I already have a good grasp of the basics like gradient descent and back-propagation of the loss function (from earlier classes and experience).
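As a sanity check on those basics, here is a minimal sketch of gradient descent with a hand-derived gradient (the one-layer case of backprop) on a toy linear model. All names and numbers are illustrative, not from any course or paper:

```python
# Minimal sketch: gradient descent on mean squared loss for a linear model,
# with the gradient derived by hand.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))               # toy inputs
true_w = np.array([1.0, -2.0, 0.5])         # weights we hope to recover
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for step in range(200):
    y_hat = X @ w                           # forward pass
    grad = 2 * X.T @ (y_hat - y) / len(y)   # dL/dw for L = mean((y_hat - y)^2)
    w -= lr * grad                          # gradient descent step

print(w)  # should land close to true_w
```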
Transformers: This is one of the main focuses for SSJ--4, so it’s good that it comes up here. It looks like Callum McDougall has released many educational resources. Thanks Callum! I think I will take the recommendation to go through Transformers From Scratch. The newer version of the page is failing to load for me. Is it just my laptop? I might send them an email about it.
I might re-watch the 3b1b videos on LLMs and Transformers because Grant Sanderson’s work is so nice and soothing. Also apparently there’s an accompanying webpage now? Looks cool 😎
I’m also very interested in Neel’s recommendation, “A Mathematical Framework for Transformer Circuits”. I’ll have to add it to my reading list and may write a summary.
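While I read, I want to keep the core computation in view, so here is a minimal numpy sketch of a single attention head, split along the QK/OV lines that (as I understand it) the Mathematical Framework paper analyzes. Shapes and names are my own illustrative choices:

```python
# Minimal sketch of one causal self-attention head, plain numpy.
import numpy as np

def attention_head(x, W_Q, W_K, W_V, W_O):
    """x: (seq, d_model); W_Q, W_K, W_V: (d_model, d_head); W_O: (d_head, d_model)."""
    q, k, v = x @ W_Q, x @ W_K, x @ W_V
    scores = q @ k.T / np.sqrt(k.shape[-1])       # QK circuit: where to attend
    mask = np.tril(np.ones_like(scores))          # causal: attend only backwards
    scores = np.where(mask == 1, scores, -np.inf)
    pattern = np.exp(scores - scores.max(-1, keepdims=True))
    pattern /= pattern.sum(-1, keepdims=True)     # softmax over source positions
    return pattern @ v @ W_O                      # OV circuit: what to move

seq, d_model, d_head = 5, 8, 4
rng = np.random.default_rng(0)
shapes = [(d_model, d_head)] * 3 + [(d_head, d_model)]
W_Q, W_K, W_V, W_O = [rng.normal(size=s) for s in shapes]
x = rng.normal(size=(seq, d_model))
print(attention_head(x, W_Q, W_K, W_V, W_O).shape)  # (5, 8)
```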
Let me know if you have any other recommendations for getting familiar with Transformers.
Tooling: I am very interested in tooling. Building tooling, especially for working with high-dimensional data, is something I would like to contribute to. Despite that interest, my familiarity with existing tooling is shockingly shallow. Neel mentions:
TransformerLens & Callum McDougall’s guide for it.
Nostalgebraist’s transformer-utils library
Google PAIR’s Learning Interpretability Tool (LIT)
Google PAIR’s What-If Tool
Jesse Vig’s BERTViz
LOOM
So I’d like to review those, and maybe look around for other tools to see what people are saying about them and what they want from their tooling.
Neel also mentions The Interpretability Toolkit, which probably links to more tools and resources, and is worth looking into.
Current Literature: I think I’m well covered here by what I’m doing for SSJ--2. However, I like the suggestion of doing a literature review (combining SSJ--1 and SSJ--2), and I am fond of the following questions to keep in mind while doing one:
What do we think we know?
How do we think we know it? (Evidence and Methods)
What are we focused on?
What aren’t we focused on?
My question from SSJ--2, “what is the current state of RSI criticality threshold knowledge”, seems like it might be a good topic for a literature review, although it is outside my usual focus area.
How to think about evidence in MI: Neel addresses this in various places throughout the article. Unfortunately there is probably no easy answer, but he offers some helpful hints and things to think about:
What techniques are used to find evidence?
How do we distinguish between true and false beliefs about models?
Look for flaws. Look for evidence to falsify your hypothesis, not just evidence to confirm it. Watch out for cherry-picking.
Use janky hacking! Make minor edits and see how things change. What does and doesn’t break a model’s behaviour? Open-ended exploration can be useful for hypothesis formation, but don’t rely on non-rigorous techniques as if they were strong evidence.
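To make this concrete for myself, here is a hedged sketch of the kind of minor edit I imagine: zero-ablating one attention head with TransformerLens and comparing predictions. The hook names follow the TransformerLens documentation as I remember it, so treat the specifics as assumptions to check against the current docs:

```python
# Hedged sketch: zero-ablate attention head 0 in layer 9 of GPT-2 and see
# whether the next-token prediction changes. Layer/head choice is arbitrary.
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in the city of")

def zero_head(value, hook, head=0):
    # value: (batch, seq, n_heads, d_head) at the attention output hook_z
    value[:, :, head, :] = 0.0
    return value

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", 9), zero_head)],
)
print(model.tokenizer.decode(clean_logits[0, -1].argmax().item()))
print(model.tokenizer.decode(ablated_logits[0, -1].argmax().item()))
```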
Conclusion
I think much of this advice fits well into my current framework. One thing Neel mentions is: “Jump in and get your hands dirty! A common mistake is to spend days to weeks setting up epic infrastructure, or writing a perfect, sophisticated project proposal.” This may apply to my NDSP project, and I should spend some time considering that; however, my current feeling is that I would like to focus on building general tools, not ones that are married to specific experiments. In general, I agree with Neel’s focus on “tight feedback loops”.
SSJ--4b—Do some research and write a little bit about my plans for messing around with LLMs in some capacity.
I think this was well covered by SSJ--4a for now: I’m going to go through Transformers From Scratch.
SSJ--5a—Review my NDSP notes.
I gathered and sorted my notes. I think going through them, and keeping them in mind while reviewing other MI tools, would be a good idea. I might try writing an overview of different tools and my ideas about NDSP, either as one article or as separate ones.
SSJ--5b—Experiment with tensorflowjs.
I didn’t get around to this.
Updated Reading List (SSJ--2)
AIA Stuff:
VK LTA
AIXI
“what is the current state of RSI criticality threshold knowledge” (Lit Rev in SSJ--1)
Cyborgism Agenda
Goals selected from learned knowledge
Agent Foundations
Natural Abstractions: Key Claims, Theorems, and Critiques
[2309.01933] Provably safe systems: the only path to controllable AGI
https://ai-2027.com/research/ai-goals-forecast
Multinational AGI Consortium
Vec2vec & universal geometry of embeddings
MI Stuff:
A Mathematical Framework for Transformer Circuits (Summary in SSJ--1)
InfoVis particularly:
Dimension Reduction
Clustering (density?)
Mech Interp tools --> Now part of SSJ--5
Neel’s MI Resources:
A Comprehensive Mechanistic Interpretability Explainer & Glossary — Neel Nanda
200 Concrete Open Problems
SAEs
On the Biology of a Large Language Model (attribution graphs)
How should I be using existing AI to help my studies and research?
Goals for 2nd Sprint
In the last sprint I had trouble actually making the time to sit down and work on this. For that reason, in the next update I want to include a section logging each day I work on this, with a brief explanation of what I focused on. My goal is an entry for at least 4 days of the week, though more than that is even better. I’m hoping this will help motivate me, or at least let me see how I’m doing in terms of putting in the time.
SSJ--1 -- Write
Continue work on my OIS article. (Maybe take a break from this on the next sprint)
Do some work on the “Prompting for Interdisciplinary Attention” section, with a focus on gathering definitions and conceptions for the “system” and “substrate” subsections of the definition section.
Finish writing the first draft of the definition section. Skip “system” and “substrate” for now.
SSJ--2 -- Read
Read VK LTA and write a small summary with my thoughts.
SSJ--3 -- Math
Email some professors at UVic to see if I can have some conversations about my interests and other math topics that may be valuable.
Keep studying Topoi. Next sprint, switch to Linear Algebra or Computational Mechanics.
SSJ--4 -- Into Code
Go through Transformers From Scratch.
SSJ--5 -- Build Tooling
Do an informal literature review on MI tooling and data visualization for high-dimensional data. (See the sketch after the list below for the kind of projection I have in mind.)
Places to start for MI Tooling:
The Interpretability Toolkit
TransformerLens & Callum McDougall’s guide for it.
Nostalgebraist’s transformer-utils library
Google PAIR’s Learning Interpretability Tool (LIT)
Google PAIR’s What-If Tool
Jesse Vig’s BERTViz
LOOM
CircuitVis
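As a starting point for the data-visualization half (this is the sketch promised above), here is the simplest high-dimensional-to-2D projection an NDSP-style tool could build on: PCA via plain numpy SVD, with matplotlib for the scatter plot. Everything here is illustrative:

```python
# Minimal sketch: project 64-D points to 2-D with PCA and scatter-plot them.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# stand-in for high-dimensional activations: 300 points in 64 dimensions
acts = rng.normal(size=(300, 64)) @ rng.normal(size=(64, 64))

centered = acts - acts.mean(axis=0)
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
proj = centered @ Vt[:2].T        # coordinates along the top 2 principal axes

plt.scatter(proj[:, 0], proj[:, 1], s=8)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("2-D shadow of 64-D points")
plt.show()
```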