I would personally count it as a form of RSI because it is a feedback loop with positive reinforcement, hence recursive, and the training signal doesn’t come from an external source, hence self-improvement. It can technically be phrased as a particular way of generating data (as any form of generative AI can), but that doesn’t seem very important. The same goes for the fact that the method uses stochastic gradient descent for the self-improvement.
AlphaGo Zero got pretty far with that approach before plateauing, and it seems plausible that some form of self-play via reasoning RL (like this) could scale to superhuman ability in math or coding relatively soon. A method that scales to superhuman ability without external training data is exactly what one would expect of something called RSI, and of not much else.
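To make the “generate your own data, then improve on it with SGD” framing concrete, here is a minimal toy sketch in the spirit of AlphaGo Zero-style self-play / expert iteration. The game (matching a hidden bit-string), the top-quartile filtering rule, and all hyperparameters are illustrative assumptions rather than a description of any actual system:

```python
import torch
import torch.nn.functional as F

K = 16                                           # moves per episode (illustrative)
target = torch.randint(0, 2, (K,))               # the hidden "rules" of the toy game
logits = torch.zeros(K, 2, requires_grad=True)   # the policy being improved
opt = torch.optim.SGD([logits], lr=1.0)

def play(n_episodes=64):
    """Sample episodes from the current policy and score them (no external data)."""
    probs = F.softmax(logits.detach(), dim=-1)                           # (K, 2)
    actions = torch.multinomial(probs, n_episodes, replacement=True).T   # (n, K)
    rewards = (actions == target).float().mean(dim=1)                    # (n,)
    return actions, rewards

for step in range(300):
    actions, rewards = play()
    # Positive reinforcement: keep only the best self-generated episodes ...
    keep = rewards >= rewards.quantile(0.75)
    best = actions[keep]                                                 # (B, K)
    # ... and fit the policy to its own successful behaviour via SGD.
    logp = F.log_softmax(logits, dim=-1)                                 # (K, 2)
    loss = -logp.gather(1, best.T).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print("mean reward after self-training:", play()[1].mean().item())  # well above the 0.5 baseline
```

Every training example in this loop is produced by the current policy itself; selection on reward is the positive reinforcement that makes the loop self-reinforcing.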
Of course, those are just my semantic intuitions. I think you have something more like meta-learning in mind for introspective RSI. The question is how that would work on a technical level.
I think open-loop vs closed-loop is a dimension orthogonal to RSI vs not-RSI.
open-loop I-RSI: “AI plays a game, thinks about which reasoning heuristics resulted in high reward, and then decides to use these heuristics more readily” (see the toy sketch after this list)
open-loop E-RSI: “AI implements different agent scaffolds for itself, and benchmarks them using FrontierMath”
open-loop not-RSI: “humans implement different agent scaffolds for the AI, and benchmark them using FrontierMath”
closed-loop I-RSI: “AI plays a game and thinks about which reasoning heuristics could’ve shortcut the decision”
closed-loop E-RSI: “humans implement different agent scaffolds for the AI, and benchmark them using FrontierMath”
closed-loop not-RSI: Constitutional AI, AlphaGo
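To make the open-loop I-RSI item above mechanically concrete, here is a toy sketch in which an agent shifts its own sampling weights toward heuristics that yielded high reward. The named heuristics, their simulated rewards, and the multiplicative update rule are all hypothetical placeholders:

```python
import random

# Hypothetical "reasoning heuristics" the agent can choose between each round.
# The heuristics and their simulated rewards are placeholders, not a claim
# about how any real system represents them.
HEURISTICS = {
    "look_ahead_one_step": lambda: random.gauss(0.7, 0.1),
    "pattern_match":       lambda: random.gauss(0.5, 0.1),
    "random_guess":        lambda: random.gauss(0.2, 0.1),
}
# How readily each heuristic gets used; updated by the agent itself.
weights = {name: 1.0 for name in HEURISTICS}

def pick_heuristic():
    names = list(weights)
    return random.choices(names, weights=[weights[n] for n in names])[0]

for episode in range(500):
    name = pick_heuristic()
    reward = HEURISTICS[name]()   # "plays a game" with the chosen heuristic
    # "decides to use these heuristics more readily": shift weight toward
    # whatever produced high reward, with no external data or human in the loop.
    weights[name] *= (1.0 + 0.1 * reward)

print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

The loop is “open” in the sense that the update is applied directly from observed reward; the closed-loop variant above would instead have the agent reflect on which heuristics could have shortcut the decision before changing anything.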