Category-Theoretic Wanderings into Interpretability

I have realized I want to contribute to AI Safety in any way I can. I am currently focused on interpretability: trying to make sense of the research out there, orienting myself, looking for other new ravers^[1], and ultimately learning to enact its technics^[2]. In that process, I am writing a paper. A sort of queer paper. I keep updating it as I think more about it. You can read the latest version here:

FULL PAPER AT UNRULYABSTRACTIONS.COM

I will use category theory to investigate what interpretability is. Think of this formalism as much closer to a language than a theory^[3]. I use notation and symbols to interrogate how things could come together. I am doing some scribbles and asking you: do you also feel it’s something like that?

Hope you enjoy wandering with me.

^[1] Ravers are those who need to rave. Raves are practices. What if practices (like this one, which I am currently feeling my way in) are also raves?

^[2] Technics are a general category for all making and all practices.

^[3] A judgement but not a proposition.