Interesting link on symbolic regression. I actually tried to get an AI to write me something similar a while back[1] (not knowing that the concept was already out there, and foolishly not asking whether it was, though in retrospect it obviously would have been).
From your response to kave:
calculate a quantity then use that as a new variable going forward
In terms of the tree structure used in symbolic regression (including my own attempt), I would characterize this as wanting to preserve a subtree and letting the rest of the tree vary.
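To make “preserve a subtree, vary the rest” concrete, here is a toy sketch (my own invented encoding, nothing to do with PySR’s actual internals): a `frozen` flag on a subtree that the mutation step refuses to touch.

```python
import random

# Toy sketch (invented encoding, not PySR's): an expression tree where one
# subtree is marked frozen and mutation leaves it alone.

class Node:
    def __init__(self, op, children=(), value=None, frozen=False):
        self.op = op                    # 'add', 'mul', 'var', 'const'
        self.children = list(children)
        self.value = value              # variable index or constant value
        self.frozen = frozen            # the preserved subtree is never mutated

    def eval(self, x):
        if self.op == 'var':
            return x[self.value]
        if self.op == 'const':
            return self.value
        a, b = (c.eval(x) for c in self.children)
        return a + b if self.op == 'add' else a * b

def mutate(node, rng):
    """Randomly perturb constants, skipping any frozen subtree."""
    if node.frozen:
        return node
    if node.op == 'const' and rng.random() < 0.5:
        return Node('const', value=rng.uniform(-10, 10))
    node.children = [mutate(c, rng) for c in node.children]
    return node

# The "new variable" we want to keep: q = x0 * x1
kept = Node('mul', [Node('var', value=0), Node('var', value=1)], frozen=True)
tree = Node('add', [kept, Node('const', value=3.0)])

rng = random.Random(0)
for _ in range(10):
    tree = mutate(tree, rng)
# The frozen q = x0 * x1 subtree survives every round of mutation.
```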
Possible issues:
If the code modifies the trees leaf-first, trees with different roots but common subtrees aren’t treated as close to each other. This is an issue my own version would likely have had even if I had actually implemented it[2]. However, I think PySR might at least partially address this: it uses genetic programming, and the pictures in the associated paper seem to indicate that it generates trees which at least sometimes preserve subtrees. (Though the genetic programming approach is likely to make it hard to find the very simplest solutions in practice, imo.[3])
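For what subtree preservation looks like mechanically in genetic programming, here is a minimal sketch of subtree crossover (illustrative names and encoding, not PySR’s implementation): a random subtree of one parent gets grafted into a copy of the other, which is one way useful subtrees spread through a population instead of being rebuilt from scratch.

```python
import copy
import random

# Minimal genetic-programming subtree crossover (illustrative, not PySR's
# code): expressions are nested lists like ['add', ['var', 0], ['const', 2.0]].

def subtrees(expr):
    """Every list-valued node of the expression, root included."""
    out = [expr]
    for child in expr[1:]:
        if isinstance(child, list):
            out.extend(subtrees(child))
    return out

def crossover(a, b, rng):
    """Graft a random subtree of `a` into a random spot of a copy of `b`."""
    child = copy.deepcopy(b)
    donor = copy.deepcopy(rng.choice(subtrees(a)))
    target = rng.choice(subtrees(child))
    target[:] = donor  # splice the donor in place
    return child

a = ['add', ['mul', ['var', 0], ['var', 1]], ['const', 3.0]]
b = ['mul', ['var', 0], ['const', 2.0]]
child = crossover(a, b, random.Random(1))
# `child` is b with some subtree of a grafted in.
```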
Even if you are treating trees with common subtrees as close to each other, if your evaluation only compares final calculated values on the entire dataset, it’s hard to make the call “I know this subtree is important even if I don’t know the rest of the tree”, because the results are not likely to be all that close unless you already have a reasonable guess for the rest of the tree. One partial (heh) answer might be to award part marks to solutions that work well for some of the data even if wildly off for the rest. Careful thinking might be required to do this in a way that doesn’t backfire horribly, though. Hmm, or maybe you CAN do that in the existing paradigm by including if/then nodes in the tree? Say, a node with three child subtrees, which chooses between two of them based on the value of the third? And then (in some genetic-programming-like approach, perhaps) explore what happens if you copy those subtrees elsewhere, or copy existing subtrees into new if/then nodes? (I can imagine the horrific unreadable mess already, though...)
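A toy version of both ideas at once (part marks plus if/then nodes), using my own made-up tuple encoding rather than anything from an actual symbolic regression package: an 'if' node has three children and picks a branch based on the sign of the first.

```python
# Toy encoding (mine, not any real package's): nodes are tuples, and an 'if'
# node is (condition, then-branch, else-branch).

def evaluate(node, x):
    op, *args = node
    if op == 'var':
        return x[args[0]]
    if op == 'const':
        return args[0]
    if op == 'add':
        return evaluate(args[0], x) + evaluate(args[1], x)
    if op == 'mul':
        return evaluate(args[0], x) * evaluate(args[1], x)
    if op == 'if':  # choose a branch based on the sign of the condition
        return evaluate(args[1], x) if evaluate(args[0], x) > 0 else evaluate(args[2], x)
    raise ValueError(op)

def partial_credit(tree, data, tol=1e-6):
    """Fraction of points the tree gets (nearly) right -- part marks for a
    tree that fits some of the data even if wildly off elsewhere."""
    hits = sum(abs(evaluate(tree, x) - y) < tol for x, y in data)
    return hits / len(data)

# |x0| written with an if-node: if x0 > 0 then x0 else -1 * x0
abs_tree = ('if', ('var', 0), ('var', 0), ('mul', ('const', -1.0), ('var', 0)))
data = [((v,), abs(v)) for v in (-2.0, -1.0, 0.5, 3.0)]
print(partial_credit(abs_tree, data))    # 1.0: right everywhere
print(partial_credit(('var', 0), data))  # 0.5: right only on the positive half
```

The second score is the point of part marks: the bare `x0` tree would look terrible under a whole-dataset error, but it is exactly right on half the data, which is a hint worth preserving.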
edited to add: it might be more appropriate to say that I had been planning on asking an AI to code something, but the initial prototype was sufficiently lame, and gave me enough insight into the difficulties ahead, that I didn’t continue. Claude chat link if anyone’s interested.
edited to further add: hmm, what you are wanting (“new variable”) is probably not just preserving a subtree, but having the mutation system be able to copy that subtree to other parts of the tree (and the complexity calculator not giving too much penalty for that, I guess). Interestingly, it seems that PySR’s backend at least (SymbolicRegression.jl) already has the capability to do this, using a “form_random_connection!” mutation function that apparently allows the same subtree to appear as the child of multiple parents, making a DAG instead of a tree. In general, I’ve been pretty impressed looking at SymbolicRegression.jl. Maybe other symbolic regression software is as feature-rich, but I haven’t checked.
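A sketch of why the DAG version helps with the complexity penalty (my own toy code; SymbolicRegression.jl’s GraphNode / form_random_connection! machinery will differ in detail): if complexity counts shared nodes once, reusing the “new variable” subtree is cheap.

```python
# Toy sketch of subtree sharing (mine, not SymbolicRegression.jl's): the same
# subtree object sits under two parents, and complexity charges it once.

class N:
    def __init__(self, op, children=(), value=None):
        self.op, self.children, self.value = op, list(children), value

def complexity(root):
    """Count distinct nodes by identity, so a shared subtree costs once."""
    seen = set()
    def walk(n):
        if id(n) in seen:
            return
        seen.add(id(n))
        for c in n.children:
            walk(c)
    walk(root)
    return len(seen)

# The "new variable" q = x0 * x1, reused in two places: q + 2*q
q = N('mul', [N('var', value=0), N('var', value=1)])
shared = N('add', [q, N('mul', [q, N('const', value=2.0)])])

# The same expression written as a plain tree, with q duplicated:
q_copy = N('mul', [N('var', value=0), N('var', value=1)])
duplicated = N('add', [q, N('mul', [q_copy, N('const', value=2.0)])])

print(complexity(shared))      # 6: q is charged once
print(complexity(duplicated))  # 9: the plain tree pays for q twice
```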
[1] Apparently November 2024. Feels longer ago somehow.
[2] I hadn’t actually gone beyond breadth-first search though.
[3] This is informed by (a tiny amount of) practical experience. After SarahNibs’ comment suggested genetic programming would have worked on the “Arena of Data”, I attempted genetic programming on it and on my initial attempt got … a horrific unreadable mess. Maybe it wasn’t “halfway decently regularized”, but I updated my intuition to say: complicated ways to do things so greatly outnumber the simple ways that anything too reliant on randomness is not likely to find the simple way.