Category-Theoretic Wanderings into Interpretability
Link post
I have realized I want to contribute to AI Safety in any way I can. I am currently focused on interpretability, trying to make sense of research out there, orienting myself, looking for other new ravers[1], and ultimately learning to enact its technics[2]. In that process, I am writing a paper. A sort of queer paper. I keep updating it as I think more about it. You can read the latest version here:
Full paper at unrulyabstractions.com
I will use category theory to investigate what interpretability is. Think of this formalism as much closer to a language than a theory[3]. I use notation and symbols to interrogate how things could come together. I am doing some scribbles and asking you, do you also feel it’s something like that?
Hope you enjoy wandering with me.
[1] Ravers are those who need to rave. Raves are practices. What if practices (like this one, which I am currently feeling my way in) are also raves?
[2] Technics are a general category for all making and all practices.
[3] A judgement but not a proposition.