Interesting link on symbolic regression. I actually tried to get an AI to write me something similar a while back[1] (not knowing that the concept was out there and foolishly not asking, though in retrospect it obviously would be).
From your response to kave:
“calculate a quantity then use that as a new variable going forward”
In terms of the tree structure used in symbolic regression (including my own attempt), I would characterize this as wanting to preserve a subtree and letting the rest of the tree vary.
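To make "preserve a subtree and let the rest vary" concrete, here is a minimal Python sketch. Everything in it (the tuple representation, the operator set, the `mutate` and `contains` helpers) is made up for illustration and is not how PySR or any other real package does it. Mutations here only swap operators and leaves, so it shows the protection idea rather than a full search.

```python
import random

# Illustrative representation (an assumption): an expression is a variable
# name (str) or a tuple (op, left, right).
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
}
VARS = ["x", "y"]

def evaluate(expr, env):
    """Evaluate an expression tree given variable values in env."""
    if isinstance(expr, str):
        return env[expr]
    op, left, right = expr
    return OPS[op](evaluate(left, env), evaluate(right, env))

def contains(expr, sub):
    """True if sub occurs somewhere inside expr."""
    if expr == sub:
        return True
    if isinstance(expr, tuple):
        return contains(expr[1], sub) or contains(expr[2], sub)
    return False

def mutate(expr, frozen, rng):
    """Return a mutated copy of expr that never alters the frozen subtree."""
    if expr == frozen:
        return expr  # protected: behaves like an atomic "new variable"
    if isinstance(expr, str):
        return rng.choice(VARS)  # swap a leaf for a random variable
    op, left, right = expr
    choice = rng.randrange(3)
    if choice == 0:
        return (rng.choice(list(OPS)), left, right)  # change the operator
    if choice == 1:
        return (op, mutate(left, frozen, rng), right)  # mutate inside left child
    return (op, left, mutate(right, frozen, rng))  # mutate inside right child

frozen = ("*", "x", "y")               # the quantity we want to keep
expr = ("+", frozen, ("-", "x", "y"))  # starting tree that contains it
rng = random.Random(0)
for _ in range(100):
    expr = mutate(expr, frozen, rng)
print(expr, contains(expr, frozen))
```

The root operator and the sibling subtree wander around freely, while `frozen` survives every mutation because the recursion stops as soon as it reaches it.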
Possible issues:
If the coding modifies the trees leaf-first, trees with different roots but common subtrees aren’t treated as close to each other. This is an issue that my own version would likely have had even if actually implemented[2]. However, I think PySR might at least partially address this issue: it uses genetic programming, and the pictures in the associated paper seem to indicate that it generates trees which at least sometimes preserve subtrees. (Though the genetic programming approach is likely to make it hard to find the very simplest solutions in practice, imo.[3])
Even if you are treating trees with common subtrees as close to each other, if your evaluation of trees only compares final calculated values on the entire dataset, then it’s hard to make the call “I know this subtree is important even if I don’t know the rest of the tree”, because the results are not likely to be all that close unless you already have a reasonable guess for the rest of the tree. One partial (heh) answer might be to award part marks to solutions that work well for some of the data even if wildly off for other parts. Careful thinking might be required to do this in a way that doesn’t backfire horribly, though. Hmm, or maybe you CAN do that in the existing paradigm by including if/then nodes in the tree? Say, a node that has three child nodes/subtrees, and chooses between two of them based on the value of the third? And then (in some genetic-programming-like approach perhaps) explore what happens if you copy those subtrees elsewhere, or existing subtrees into new if/then nodes? (I can imagine the horrific unreadable mess already though...)
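A rough Python sketch of both ideas from the second point, under made-up assumptions: `part_marks` scores a candidate by the fraction of rows it gets right (rather than total error), and an `"if"` node has three children and picks between the first two depending on whether the third evaluates to a positive number. The tuple format, the tolerance, and the toy data are all hypothetical.

```python
OPS = {
    "+": lambda a, b: a + b,
    "*": lambda a, b: a * b,
}

def evaluate(expr, env):
    """Evaluate a tree; ("if", a, b, cond) yields a when cond > 0, else b."""
    if isinstance(expr, str):
        return env[expr]
    if expr[0] == "if":
        _, then_branch, else_branch, cond = expr
        chosen = then_branch if evaluate(cond, env) > 0 else else_branch
        return evaluate(chosen, env)
    op, left, right = expr
    return OPS[op](evaluate(left, env), evaluate(right, env))

def part_marks(expr, rows, tol=1e-9):
    """Fraction of rows where the prediction is within tol of the target."""
    hits = sum(abs(evaluate(expr, row) - row["target"]) <= tol for row in rows)
    return hits / len(rows)

# Toy data (invented): the hidden rule is x*y when x > 0, otherwise x+y.
rows = [
    {"x": x, "y": y, "target": x * y if x > 0 else x + y}
    for x in (-2, -1, 1, 2)
    for y in (1, 2, 3)
]

half_right = ("*", "x", "y")
full = ("if", ("*", "x", "y"), ("+", "x", "y"), "x")

print(part_marks(half_right, rows))
print(part_marks(full, rows))
```

On this toy data the bare `x*y` subtree gets credit for exactly the rows it explains, which is the signal you would want before the rest of the tree is known. Whether that survives on noisy real data is a separate question.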
edited to add: it might be more appropriate to say that I had been planning on asking an AI to code something, but the initial prototype was sufficiently lame, and gave me enough insight into the difficulties ahead, that I didn’t continue. Claude chat link if anyone’s interested.
edited to further add: hmm, what you are wanting (“new variable”) is probably not just preserving a subtree, but for the mutation system to be able to copy that subtree to other parts of the tree (and for the complexity calculator not to give too much penalty to that, I guess). Interestingly, it seems that PySR’s backend at least (SymbolicRegression.jl) does have the capability to do this already, using a “form_random_connection!” mutation function that apparently allows the same subtree to appear as a child of multiple parents, making a DAG instead of a tree. In general, I’ve been pretty impressed looking at SymbolicRegression.jl. Maybe other symbolic regression software is as feature-rich, but I haven’t checked.
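To illustrate why the complexity calculation matters once subtrees can be shared, here is a small Python sketch comparing two ways of counting nodes. The `Node` class and both size functions are hypothetical, written only for this comment; this is not a claim about how SymbolicRegression.jl actually computes complexity.

```python
class Node:
    """Expression node; children may be shared between several parents."""
    def __init__(self, op, *children):
        self.op = op
        self.children = children

def tree_size(node):
    """Count nodes as if every shared subtree were copied out in full."""
    return 1 + sum(tree_size(child) for child in node.children)

def dag_size(node, seen=None):
    """Count each distinct node object once, however many parents it has."""
    if seen is None:
        seen = set()
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(dag_size(child, seen) for child in node.children)

x, y = Node("x"), Node("y")
shared = Node("*", x, y)  # the "new variable" x*y
root = Node("+", Node("sin", shared), Node("/", shared, x))

print(tree_size(root), dag_size(root))
```

Under the tree count, reusing `x*y` twice costs its full size twice; under the DAG count, the second use is nearly free, which is the behaviour a "define a quantity and reuse it" workflow would want.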
[1] Apparently November 2024. Feels longer ago somehow.
[2] I hadn’t actually gone beyond breadth-first search though.
[3] This is informed by (a tiny amount of) practical experience. After SarahNibs’ comment suggested genetic programming would have worked on the “Arena of Data”, I attempted genetic programming on it and on my initial attempt got … a horrific unreadable mess. Maybe it wasn’t “halfway decently regularized” but I updated my intuition to say: complicated ways to do things so greatly outnumber the simple ways that anything too reliant on randomness is not likely to find the simple way.
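A quick back-of-envelope check on the "complicated ways outnumber simple ways" intuition: the number of syntactically distinct full binary expression trees with n operator nodes is Catalan(n) × (number of binary operators)^n × (number of leaf symbols)^(n+1). The choice of 4 operators and 3 leaf symbols below is arbitrary, and this counts distinct trees, not distinct functions (many trees are algebraically equivalent).

```python
from math import comb

def count_trees(n_ops, n_binary_ops, n_leaf_symbols):
    """Number of full binary expression trees with n_ops operator nodes."""
    catalan = comb(2 * n_ops, n_ops) // (n_ops + 1)  # tree shapes
    return catalan * n_binary_ops ** n_ops * n_leaf_symbols ** (n_ops + 1)

for n in range(6):
    print(n, count_trees(n, n_binary_ops=4, n_leaf_symbols=3))
```

The counts grow by more than an order of magnitude per extra operator, so a search that samples sizes roughly uniformly will spend almost all its time among the larger trees.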
My accusations, at least so far:
Danny Nova for curing:
Babblepox (always present in same sector)
Bumblepox, Scramblepox (always present in same sector OR Azeru in adjacent sector)
Gurglepox (always present in same sector OR Dankon Ground in opposite sector)
Chucklepox (Danny Nova in same sector for about half of cures. All other cases Lomerius Xardus was present somewhere (todo: investigate if more patterns than mere presence))
Rumblepox (suspiciously high ratio of presence/absence in Calderia when cured, but plenty more cases unexplained—TODO: explain more)
Disquietingly Serene Bowel Syndrome (on weak evidence, see below)
Azeru for curing Bumblepox, Scramblepox (see Danny Nova)
Lomerius Xardus for curing Chucklepox (see Danny Nova)
Boltholopew and Moon Finder, collectively, for curing Disease Syndrome, Parachordia, Problems Disorder (each one positioned in different adjacent sectors to the curing event)
Tehami Darke on weak evidence (see below) for curing Disquietingly Serene Bowel Syndrome
Dankon Ground for curing:
Gurglepox (present in opposite sector OR Danny Nova in same sector)
Mildly But Persistently Itchy Throat (always present in opposite sector)
Zancro for curing Scraped Elbow, Scraped Knee (always present in same sector)
Nettie Silver for curing Smokesickness (always present in same sector. It seems she is always present in Calderia, but her location and the Smokesickness curing location, when it occurs, do vary)
I have not figured out The Shivers.
Disquietingly Serene Bowel Syndrome seems tricky. Cures only started on Day 390. Tehami Darke came to Calderia on Day 162, Lomerius Xardus first arrived on Day 494, Zeledin Zura first arrived on Day 427, Gouberi first arrived on Day 397, Ricewined first arrived on Day 994. None of this is particularly suggestive of a causal connection, though the p-values would be very low assuming (falsely) that each day was independent. Cures seem more likely to occur in the sector where Tehami Darke is present, which is pretty weak without further investigation (e.g. is he hanging out in some sector where it’s more likely to occur anyway? I haven’t checked). Conditioned on Day 390+, the average number of cures is slightly higher on days when Danny Nova is present. That is also weak, but the prior for Danny being involved is high, so might as well accuse him as well. (edit: no, seems there is no upside to this accusation given the scoring rule and the high confidence Danny Nova has been curing other stuff)
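For anyone wanting to check rules like "always present in same sector OR X in opposite sector", here is a minimal Python sketch of the kind of test I mean. The data layout, the assumption that sectors are numbered around a ring of 8 (so "opposite" is an offset of 4), and the toy records are all invented for illustration and are not the real dataset.

```python
def present(healer, offsets=(0,), n_sectors=8):
    """Rule: healer sits at one of the given ring offsets from the cure's sector.

    Assumes (hypothetically) sectors are numbered 0..n_sectors-1 around a ring.
    """
    allowed = {offset % n_sectors for offset in offsets}

    def check(cure, positions):
        where = positions.get(cure["day"], {}).get(healer)
        if where is None:
            return False  # healer not in the region that day
        return (where - cure["sector"]) % n_sectors in allowed

    return check

def any_of(*rules):
    """Rule that holds when at least one of the given rules holds."""
    return lambda cure, positions: any(rule(cure, positions) for rule in rules)

def unexplained(cures, positions, rule):
    """Cure events that the rule fails to account for."""
    return [cure for cure in cures if not rule(cure, positions)]

# Toy records (made up): one cure per day, plus who was where on that day.
cures = [
    {"day": 1, "sector": 2},
    {"day": 2, "sector": 5},
    {"day": 3, "sector": 0},
]
positions = {
    1: {"Danny Nova": 2},
    2: {"Dankon Ground": 1},
    3: {"Danny Nova": 4},
}

rule = any_of(present("Danny Nova"), present("Dankon Ground", offsets=(4,)))
print(unexplained(cures, positions, rule))
```

An empty result on the real data is what "always" means in the list above; a non-empty one shows exactly which cures still need explaining.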