Personal Ruminations on AI’s Missing Variable Problem

My thinking differs somewhat from that of others. My worrying centres more on potential outcome scenarios and their respective likelihoods, akin to a predictive modelling AI. I often find myself wrestling with potentialities that cannot be definitively proven unless the path is actually pursued. At times, I get lost in abstractions and distracted by related or unrelated side thoughts, which can be quite burdensome. The workplace routine, for instance, can leave me stuck in these ruminations.

This thought process might manifest, for example, when I weigh the benefits and trade-offs of having lunch with my colleagues:

  • How easy is it to join their lunch group?

  • What are the potential benefits I’d gain from socialising with them (e.g., insights, news)? How likely are they to share these insights with me?

  • What would I be giving up?

    • Time to de-stress by walking or listening to music/podcasts

    • Having earlier lunches

    • The convenience of eating at my own pace

    • A potential dip in mood due to office gossip

  • How much do I value these potential benefits and opportunity costs? What would be the implications of giving these up (e.g., increased stress, decreased fitness, lower Vitamin D levels)?

  • Finally, is the trade-off worth it?

More often than not, I find myself with an incomplete dataset, leaving me unable to make predictions as accurately as I’d like.

I know I am missing variables.

I know that whatever I try to predict will be highly inaccurate.

Then my mind wanders off, trying to find accurate proxies for the missing variables, proxies which are themselves based on incomplete data. The entire endeavour is pretty frustrating and, to a certain extent, fruitless.

I’ve spent energy on what feels like NOTHING.

And this is where I swiftly link back to AI. How can we address the missing variable problem in systems that are complex beyond our comprehension, that is, in multi-factorial, real-world systems? These include:

  • Systems where we have incomplete, inaccurate, or non-existent training data.

  • Systems dealing with problems outside the scope of everyday, predictable occurrences—events that arise just once, for which we have no historical data, and where we don’t even know which variables led up to them.

    • Consider predicting the nature and speed of civil unrest in specific countries, or a sudden shift in public opinion on a specific topic

    • Or on an even more personalised level: consider predicting the likelihood of acquaintances discovering your secret nerdy hobby through various indirect means

While I believe that predicting outcomes with the right data and an uber-sophisticated model might be feasible, I question the extent to which economic incentives would drive such an endeavour. It would require not only a lot of data but the right data, weighted appropriately against less significant data. It would also demand a high degree of precision when formulating the question that needs to be answered.

To return to my initial example of having or not having lunch with my colleagues: I need to specify which variables I’m optimising for (e.g., life satisfaction, convenience, information) and how each is weighted. Most of the time, I don’t know how the equation should be solved, much like the problem of defining a perfect utility function in AI. I don’t know the implications of, say, a 5% temporary increase in life satisfaction if convenience is compromised by 7%. Is that a more favourable scenario than a 7% increase in information alongside a 2% decline in both life satisfaction and convenience? What should I infer when faced with this data? Weighting the variables differently would produce different verdicts: one suggesting, “Yes, it’s a good idea to have lunch with colleagues,” and another telling me the opposite.
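To make the weighting problem concrete, here is a minimal sketch in Python. Every number and weight in it is hypothetical, invented purely for illustration; the point is only that two defensible weightings of the same outcome deltas yield opposite verdicts.

```python
# Hypothetical outcome deltas for "have lunch with colleagues",
# expressed as percentage-point changes relative to eating alone.
scenario = {
    "life_satisfaction": 5.0,   # temporary boost from socialising
    "convenience": -7.0,        # later lunch, not at my own pace
    "information": 3.0,         # office news and insights picked up
}

def utility(deltas: dict[str, float], weights: dict[str, float]) -> float:
    """Simple linear utility: the weighted sum of the outcome deltas."""
    return sum(weights[k] * v for k, v in deltas.items())

# Two equally defensible weightings of the same three variables.
weights_a = {"life_satisfaction": 1.0, "convenience": 0.5, "information": 0.8}
weights_b = {"life_satisfaction": 0.4, "convenience": 1.0, "information": 0.6}

for name, w in [("weights_a", weights_a), ("weights_b", weights_b)]:
    u = utility(scenario, w)
    verdict = "join the lunch group" if u > 0 else "skip it"
    print(f"{name}: utility = {u:+.1f} -> {verdict}")
```

Same data, opposite recommendations; the choice of weights does all the work, and nothing in the data itself tells me which weighting is correct.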

I know that the example is, to some extent, ridiculous. At the same time, I want to re-emphasise that this thought experiment extends to other complex decision-making processes, such as strategic business decisions. There is a lot of nuance and detail that must feed into an accurate prediction for a specific scenario: existing competitors, the likelihood of new competitors entering the market, the options on the table, the likelihood of success of each option, the impact on brand image, general consumer trends, etc.

When we take into account the sheer amount of missing data, the multitude of variables, and the inherent vagueness surrounding the question at hand, we could end up with a vast array of potential outcomes and suggestions. The verdict for many such use cases might often be: “Just do it and hope for the best.”
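One rough way to see this is to treat each missing variable as a random draw and simulate. The sketch below is entirely hypothetical: the model, the distributions, and the variable names are assumptions chosen only to show how unknowns widen the range of predicted outcomes.

```python
import random

random.seed(0)

def predicted_outcome(known_effect: float) -> float:
    """Toy prediction: one known effect plus two unobserved variables.

    The unobserved variables are drawn from assumed distributions,
    standing in for everything we have no data on.
    """
    competitor_entry = random.gauss(0.0, 2.0)  # unknown: new competitors
    sentiment_shift = random.gauss(0.0, 3.0)   # unknown: consumer trends
    return known_effect + competitor_entry + sentiment_shift

samples = sorted(predicted_outcome(known_effect=1.0) for _ in range(10_000))
low, high = samples[250], samples[-250]  # rough central 95% interval
print(f"95% of simulated outcomes fall in [{low:+.1f}, {high:+.1f}]")
# The interval straddles zero: the unknowns swamp the known effect,
# so the model cannot even settle the sign of the outcome.
```

Which is more or less a formal restatement of “just do it and hope for the best.”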

The only way we might reach a reasonable level of accuracy in such complex predictive tasks could involve gathering data that allows us to set certain variables to, for example, zero, thereby eliminating them from the equation. If we know for certain that no new startup competitors will enter the market in the next year, we don’t have to worry about that aspect. However, finding this out with near 100% certainty would likely involve breaching multiple data privacy laws (e.g., by scanning everyone’s computers for signs they intend to launch a startup in this space) or perfectly simulating our universe down to the atom and speeding things up to see what happens. Both approaches are ethically dubious and barely feasible in 2025.
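Continuing the toy model above, pinning one unknown to zero does narrow the outcome spread, though the remaining unknowns may still dominate. Again, purely illustrative, with the same invented distributions:

```python
import random

random.seed(0)

def predicted_outcome(known_effect: float, entry_ruled_out: bool) -> float:
    """Same toy model, with the option to eliminate one unknown variable."""
    competitor_entry = 0.0 if entry_ruled_out else random.gauss(0.0, 2.0)
    sentiment_shift = random.gauss(0.0, 3.0)  # still unknown
    return known_effect + competitor_entry + sentiment_shift

for ruled_out in (False, True):
    samples = sorted(predicted_outcome(1.0, ruled_out) for _ in range(10_000))
    label = "entry eliminated" if ruled_out else "entry unknown   "
    print(f"{label}: 95% interval [{samples[250]:+.1f}, {samples[-250]:+.1f}]")
```

In this toy setup the interval shrinks but still straddles zero, which matches the intuition: eliminating a variable helps only in proportion to how much uncertainty it carried.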
