Thank you for sharing the whitepaper.

I was a bit shocked to read:

"This is a non-trivial task, since existing analysis methods struggle to identify combinatorial, non-linear patterns. Our collaborators at IPSiM told us that they would typically spend ‘months scrolling in excel’ to make sense of this data – in contrast to our automated system, which identified these relationships in less than an hour."
I say this coming from a closely related field (ecology research) where, 10 years ago, statistical software like R was already in wide use and ML methods like random forests were gaining in popularity. But of course, my perspective is biased by having spent most of my time in the “quant” circles within the field, and perhaps examples like this show that many labs don’t have (or can’t afford) specialized analysts, which motivates the need for automated solutions.
On a different note, how are you determining the statistical significance of individual patterns? Is it something along the lines of: we found N patterns; by pure chance (e.g. from re-running the analysis on a randomized dataset) we would expect M, so we take the N - M most predictive patterns to be the significant ones?
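In case it helps clarify the question, here is a rough sketch of the kind of permutation baseline I have in mind (in Python; `find_patterns` is a hypothetical stand-in for whatever discovery routine you actually use, not something from the whitepaper):

```python
import numpy as np

def patterns_beyond_chance(X, y, find_patterns, n_permutations=100, seed=0):
    """Estimate how many discovered patterns exceed chance via a permutation null.

    `find_patterns(X, y)` is a hypothetical stand-in for the real
    pattern-discovery routine; it should return the list of patterns found.
    """
    rng = np.random.default_rng(seed)

    # N: patterns found on the real data.
    n_real = len(find_patterns(X, y))

    # M: patterns "found" when the outcome is shuffled, so any hits are
    # spurious by construction; average the count over permutations.
    null_counts = [
        len(find_patterns(X, rng.permutation(y))) for _ in range(n_permutations)
    ]
    m_chance = float(np.mean(null_counts))

    # Naive reading of the proposal: keep the N - M most predictive patterns.
    n_keep = max(0, int(round(n_real - m_chance)))
    return n_real, m_chance, n_keep
```

(A more careful version would compare each pattern's predictive score against the null distribution of scores, i.e. an empirical false-discovery estimate, rather than just differencing the counts, but the counting version above is what I'm asking whether you do.)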