Beyond Curve Fitting: My Journey to Causal AI
My research into forecasting capital markets concluded that markets are connected graphs. Each asset, be it a stock, cryptocurrency, or commodity, is a node in a graph. Each edge connecting an asset to its neighbor represents the strength of the monotonic relationship between the two.
However, a number of these correlated relationships are causal, involving multiple known and unknown variables. Causal associations are objective physical constraints, whereas probabilistic relationships are epistemic, reflecting what we know or believe [Pearl 2009].
By viewing a market as a causal network, it becomes possible to forecast market behavior through the conduct of individual assets as it results from the influences coupled to them. Our research reveals that such causal relationships exhibit what Network Science calls popularity, influence, cascade effects, and power laws [Easley 2019].
Turing Award winner Judea Pearl quipped, “All the impressive achievements of deep learning amount to just curve fitting.” Most artificial intelligence focuses on correlation. However, Pearl believes we must build artificially intelligent systems that understand causation to go beyond pattern matching. This difference requires new mathematical tools to capture our understanding that if X causes Y, it does not follow that Y causes X [Hartnett 2021].
This paper describes a novel approach to General AI using causal relationships within a graph. The result is a system using Pearl’s calculus of causation that solves problems many statisticians deemed unsolvable [Pearl 2018], including the ability to predict and continuously improve on the predictions.
Background
We first turned to the Statistical Physics of Benoit Mandelbrot when we set out to forecast markets. Known as Multifractal Analysis, its primary assumption is that price does not follow the well-mannered bell curve [Moses 2021]. However, the prediction accuracy of such models proved insufficient.
The second attempt came from Deep Learning, which uses Curve Fitting to predict outcomes. Using the tools that mastered ancient games and can write papers at an undergraduate level, we increased that accuracy significantly. However, we needed more to produce the desired results for potential customers.
Curve Fitting is the method of constructing an approximate curve y = f(x) that best fits a given discrete set of points (xi, yi), i = 1, 2, 3, …, n [Antoniadis 2022]. Once the curve is known, it is followed into the future to give a prediction. The problem is that the curve understands nothing about what is causing the forecast.
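To make this concrete, here is a minimal curve-fitting sketch in Python. The observations and the choice of a degree-2 polynomial are assumptions for illustration only:

```python
import numpy as np

# Hypothetical observations (x_i, y_i), i = 1..n
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Construct an approximate curve y = f(x); here f is a degree-2 polynomial.
coeffs = np.polyfit(x, y, deg=2)
f = np.poly1d(coeffs)

# Follow the curve into the future: the model extrapolates blindly,
# knowing nothing about what causes the next value.
print(f(6.0))
```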
Another issue with the Curve Fitting method of Deep Learning is that a stock price is a short-term prediction of itself [Cotton 2022]. However, that prediction lasts a few minutes at most before being replaced by a new forecast. Therefore, a new procedure is needed to accurately predict the next price projection, one that answers the question: “Why did it change?”
With causal relationships in mind, our third attempt began with a graph of assets. From here, some strange things occurred. The notions of popularity, influence, and cascade events directed us to the work on causal inference by Turing Award winner Judea Pearl.
Graphs
Graphs are mathematical objects that consist of vertices (or nodes) connected by edges, providing a way to represent a network formally [Joshi 2017]. In graph theory, a graph is an ordered pair in which V is the set of vertices (or nodes) and E is the set of edges. Thus a graph is defined as G = (V, E).
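As a quick illustration, the same definition can be expressed with the networkx library; the assets and edges below are hypothetical:

```python
import networkx as nx

# G = (V, E): vertices are assets, edges are relationships between them.
G = nx.Graph()
G.add_nodes_from(["AAPL", "MSFT", "BTC"])               # V
G.add_edges_from([("AAPL", "MSFT"), ("AAPL", "BTC")])   # E

print(G.number_of_nodes(), G.number_of_edges())  # 3 2
```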
Once objects are connected in a graph, the phenomena of Network Effects apply. These include popularity, influence, and cascade events [Easley 2019]. In essence, they provide a means to more fully model market behavior.
Correlation
Correlation is the measure of association between two variables. When two variables are correlated, a change in one accompanies a positive or negative change in the other.
The Pearson correlation measures the strength of the linear relationship between two variables. It has a value between -1 and 1, with -1 meaning a negative linear correlation, 0 meaning no correlation, and +1 indicating a positive linear correlation [Williams 2022]. However, this measure only works for independent cases, where the result of one value does not influence the other.
In a monotonic relationship, one value affects the other it is compared with, as with most market data [Ramzai 2020]. For this reason, we use Spearman’s rank-order correlation test to calculate the monotonic (instead of the linear) relationship’s strength and direction between two variables. The scale of -1 to 1 from the Pearson correlation still applies.
Based on the results of the Spearman’s correlation coefficient, we can determine the following:
- Between 0 and 0.3 or (0 and -0.3) indicates a weak monotonic relationship
- Between 0.4 and 0.6 or (-0.4 and -0.6) reveals a moderate monotonic relationship
- Between 0.7 and 1 or (-0.7 and -1) shows a robust monotonic relationship
The sign of Spearman’s result signifies the type of monotonic relationship. A positive coefficient indicates that a positive change in one value never accompanies a negative change in the other. In contrast, a negative coefficient indicates that a positive change in one value never leads to a positive change in the other.
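The following sketch contrasts the two coefficients using scipy; the series are hypothetical and chosen so the relationship is monotonic but not linear:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([10.0, 11.0, 12.5, 13.0, 15.0])   # hypothetical closes of asset A
y = x ** 3                                     # monotonic, but not linear

r, _ = pearsonr(x, y)      # linear strength: high, but below 1
rho, _ = spearmanr(x, y)   # monotonic strength: exactly 1
print(round(r, 3), round(rho, 3))
```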
Causal Inference
Turing Award recipient Judea Pearl invented a new science he calls “causal inference” that deals with cause-effect relationships. While probabilities encode our beliefs about a static world, causality tells us if and how probabilities change when the world changes [Pearl 2018]. Pearl’s “calculus of causation” is the tool to answer such questions.
The aim of causal inference is to infer not only the likelihood of events under static conditions, but also the dynamics of events under changing conditions. Such questions require some knowledge of the data-gathering process [Pearl 2008].
This “language of knowledge” corresponds to a symbolic “language of queries” used to build the questions we want to answer [Pearl 2018]. For example, consider the effect of interest rates (I) on an asset price (A). We could write it as P(A | do(I)). The “do” operator is a new construct ensuring that the observed change in the asset price results from the interest rate (I) and not some other factor.
Prior to the “do” operator, there was no way to specifically state that X causes Y. For example, P(Y|X) means that X is evidence of a situation in which Y is more likely. Pearl explains, “classical statistics only summarizes data,” whereas causal inference “provides a notation and, more importantly, offers a solution.”
Inference Engine
All a deep-learning application can do is to fit a function to data [Pearl 2018]. Each time the world changes, the application must learn a new prediction function. This was the issue we experienced while trying to use deep-learning for forecasting market volatility.
In contrast, consider a theoretical machine that Pearl calls the inference engine. Its inputs consist of assumptions, queries, and data. The first output is a boolean yes or no flagging whether the query can be answered under the causal model. If the answer is yes, the inference engine produces an Estimand: a mathematical formula that functions as a recipe for generating answers from any hypothetical data. Last, the inference engine analyzes the data to produce an actual answer, the Estimate, along with a statistical measure of uncertainty [Pearl 2018].
(Figure: Pearl’s diagram of how an “inference engine” combines data and knowledge to produce answers to queries.)
The components of the diagram are as follows:
- Knowledge accounts for past observations, actions, education, and cultural experiences deemed relevant to the query.
- Assumptions are the beliefs accepted to be true. For example, a person looking out a seventh-story window may believe the sky is blue despite being unable to see it because the sun is shining.
- Causal Model is the type of model used. According to Pearl, this can be structural equations, logical statements, or causal diagrams.
- Testable Implications refer to the observable patterns or dependencies in the data.
- Queries Submitted refer to the questions we want to answer. They are presented in the format of causal vocabulary, such as P(A | do(I)).
- Estimand is Latin for “that which is to be estimated.” According to Pearl, “It is the statistical quantity to be estimated from the data that, once estimated, can legitimately represent the answer to our query.”
- Data provides the ingredients for the Estimand recipe. Pearl explains, “data is profoundly dumb about causal relationships.”
- The Statistical Estimation is an approximate result that must be wrangled out of uncertainty using statistical methods.
- The Estimate is the answer — the result of additional statistical analysis on the raw output.
The revolutionary concept is adaptability: a machine that can adapt as the environment changes, pushing machine learning toward a new frontier of systems that can reason.
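Pearl’s identify-then-estimate pipeline is available in open-source form. Below is a minimal sketch using the DoWhy library; the synthetic data, variable names, and graph are assumptions for illustration and not our production model:

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

# Hypothetical data: interest rate I influences asset price A,
# with inflation as a common cause (confounder).
rng = np.random.default_rng(0)
inflation = rng.normal(size=1000)
interest = 0.8 * inflation + rng.normal(size=1000)
asset = -1.5 * interest + 0.5 * inflation + rng.normal(size=1000)
df = pd.DataFrame({"I": interest, "A": asset, "inflation": inflation})

# Assumptions (causal graph) plus the query P(A | do(I)) go in; the engine
# reports whether the query is answerable, then returns an estimand.
model = CausalModel(
    data=df, treatment="I", outcome="A",
    graph="digraph { inflation -> I; inflation -> A; I -> A; }",
)
estimand = model.identify_effect()     # yes/no plus the estimand recipe
estimate = model.estimate_effect(      # statistical estimation from data
    estimand, method_name="backdoor.linear_regression"
)
print(estimate.value)                  # approximately -1.5
```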
Causal Model
An autonomous intelligent system cannot rely exclusively on pre-programmed causal knowledge. Instead, it must utilize direct observations from cause-and-effect relationships [Pearl 2009]. Using a model in the mathematical sense, we want to assign truth values to sentences that represent some aspect of reality [Pearl 2009].
Known as a Structural Causal Model (SCM), this system consists of a set of Endogenous (V) and a set of Exogenous (U) variables connected by a set of functions (F) [Gonçalves 2020]. Endogenous means internal cause, while Exogenous refers to external stimuli.
The blueprint for this model is a directed acyclic graph (DAG). Representing the flow of information, the variables U are the input, while the variables V are the nodes that process data [Gonçalves 2020].
DAG
Acyclic refers to the fact that the flow of information does not cycle back to a previous node. Our model’s directed acyclic graph (DAG) consists of nodes, each representing a variable in U or V, where each edge is a function f. Since each variable is independent of all its nondescendants, conditional on its parents, the graph satisfies the Markov condition [Pearl 2009].
The Markov condition or Causal Markov (CM) condition states that a node is independent of all variables which are not direct causes of that node [Neapolitan 2007]. Suppose the variable Y is the child of variable X. In that case, we say that Y is caused by X or that X is the direct cause of Y. If the variable Y is the descendant of a variable X, then we say that Y is potentially caused by X or that X is the potential cause of Y.
Consider a simple directed acyclic graph (DAG) in which two nodes X and Y each point to a third node Z.
Just by viewing the graph, we can determine [Gonçalves 2020]:
- X and Y have no incoming edges, so they are Exogenous variables (belonging to U).
- Z has two incoming edges, so it’s an Endogenous variable (belonging to V).
- Z has two direct causes, X and Y; in other words, the value of Z depends explicitly on the values of X and Y, and fz = f(X, Y).
We need the full specification of the Structural Causal Model (SCM) to know what the function fz is that determines the value of Z. Such a DAG can be associated with a causal model M as G(M) [Pearl 2009] where
U = {X, Y}
V = {Z}
F = {fz : Z = 2X + 3Y}
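As a minimal sketch, the same SCM can be written directly in Python; only the structural function comes from the example above, everything else is illustrative:

```python
# SCM for the example: U = {X, Y}, V = {Z}, F = {f_z : Z = 2X + 3Y}.
def f_z(x: float, y: float) -> float:
    return 2 * x + 3 * y

def solve(u: dict) -> dict:
    """Given a realization u of the exogenous variables, solve for V."""
    return {"Z": f_z(u["X"], u["Y"])}

print(solve({"X": 1.0, "Y": 2.0}))  # {'Z': 8.0}
```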
Model Definition
A causal model is a tuple M = <U, V, F> where [Pearl 2009]:
- U is a set of background variables that are determined by factors outside the model (exogenous).
- V is a set of variables {V1, V2, …, Vn} that are determined by variables in the model (endogenous).
- F is a set of functions {f1, f2, …, fn} where each fi assigns a value to Vi based on a select set of variables of V ∪ U via vi = fi(pai, ui), i = 1, …, n, and the entire set F has a unique solution V(u).
Submodels
Submodels are used for representing the effect of local actions and hypothetical changes, including those implied by counterfactual antecedents [Pearl 2009]. Given a causal model M, a set of variables X in V, and a particular realization x of X, the submodel Mx of M is the causal model Mx = <U, V, Fx> where
Fx = {fi : Vi ∉ X} ∪ {X = x}.
Basically, Fx is formed by deleting from F all functions fi corresponding to members of the set X and replacing them with the set of constant functions X = x.
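A sketch of this construction, representing F as a mapping from each endogenous variable to its structural function (the variables and functions are illustrative):

```python
from typing import Callable, Dict

StructuralFns = Dict[str, Callable[[dict], float]]

# Illustrative F: Z depends on X and Y; W depends on Z.
F: StructuralFns = {
    "Z": lambda v: 2 * v["X"] + 3 * v["Y"],
    "W": lambda v: v["Z"] ** 2,
}

def submodel(F: StructuralFns, intervention: Dict[str, float]) -> StructuralFns:
    """F_x: delete f_i for each V_i in X, replace with constants X = x."""
    Fx = {name: f for name, f in F.items() if name not in intervention}
    for name, value in intervention.items():
        Fx[name] = lambda v, value=value: value  # constant function X = x
    return Fx

Fx = submodel(F, {"Z": 10.0})  # do(Z = 10): Z no longer listens to X and Y
```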
Action and Response
Let M be a causal model, X a set of variables in V, and x a particular realization of X [Pearl 2009]. The effect of action do(X = x) on M is given by the submodel Mx.
Let X and Y be two subsets of variables in V. The potential response of Y to the action do(X = x), denoted Yx(u), is the solution for Y of the set of equations Fx, that is, Yx(u) = YMx(u) [Pearl 2009].
Counterfactuals
A counterfactual is a phrase such as “had X been x.” It represents a hypothetical modification of the equations in the model to simulate an external action or spontaneous change [Pearl 2009]. Such statements can be interpreted as conveying a set of predictions under a well-defined set of conditions.
Let X and Y be two subsets of variables in V. The counterfactual sentence “Y would be y (in situation u), had X been x” is interpreted as the equality Yx(u) = y, with Yx(u) being the potential response of Y to X = x [Pearl 2009].
Pearl explains, “For the predictions to be valid, two components must remain invariant: the laws (or mechanisms) and the boundary conditions.” In other words, U will not change when our predictive claim is applied or tested.
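A minimal sketch of evaluating Yx(u): hold the background situation u invariant, swap in the submodel Fx, and re-solve. The tiny two-variable model below is illustrative:

```python
# Background situation u stays invariant; only the mechanism for X changes.
U = {"u": 1.0}
F = {
    "X": lambda v: 2 * v["u"],    # X is determined by the background u
    "Y": lambda v: v["X"] + 1.0,  # Y is determined by X
}

def solve(F, u):
    v = dict(u)
    for name, f in F.items():     # dict order doubles as topological order here
        v[name] = f(v)
    return v

factual = solve(F, U)["Y"]                # Y(u)   -> 3.0
Fx = {**F, "X": lambda v: 5.0}            # submodel M_x for do(X = 5)
counterfactual = solve(Fx, U)["Y"]        # Y_x(u) -> 6.0: "had X been 5"
print(factual, counterfactual)
```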
Causal AI
Artificial Intelligence (AI) is closely related to simulating human behavior, whereas machine learning refers to computers performing tasks they were not explicitly programmed for. While researchers at Microsoft and IBM work to apply causality to machine learning, most of the work in AI is based on curve fitting [Hartnett 2021].
Despite the rapid advancement of AI over the past few years, current systems need to catch up when it comes to tasks where there is a need to understand the actual causes behind an outcome [Shekhar 2022]. The problem is that models learn from training data, and understanding cause-and-effect can only come from knowledge of the data-gathering process, not the data itself [Pearl 2008].
Causal AI is an artificial intelligence system that explains the cause and effect of events or phenomena. This technology is in its infancy, but Pearl believes the complete application of causal inference is the key to unlocking AI's full potential [Hartnett 2021].
The Ladder
Alan Turing, the father of artificial intelligence (AI), proposed classifying cognitive systems by the type of queries they can answer [Pearl 2018]. To address this, Pearl describes a ladder with three rungs:
- The first is statistical and predictive reasoning: the ability to find associations in observations.
- The second is interventional reasoning: the ability to predict what happens when a system is changed.
- The third is counterfactual reasoning: the ability to ponder what would have happened if circumstances were different.
Most current AI systems never make it past the ladder’s first rung. We are proposing a Causal AI system that addresses the third rung — one with the ability to process counterfactual reasoning.
Popularity Discovery
A strange thing happened while working to predict stock volatility. When graphed by monotonic relationships, the assets exhibited popularity. Like connections on Twitter, some stocks had more relationships than others. In essence, these popular stocks seem to influence the less popular ones.
Since extreme imbalances characterize popularity [Easley 2019], we can safely assume that the assets with a disproportionate number of links from them are the influencers. If correct, we should see a power law, with about 10% of the assets in the graph acting as influencers. This is because power laws dominate in cases where the measured quantity can be viewed as a type of popularity [Easley 2019].
According to Pearl, “The value of a specific endogenous variable can depend only on the values of its parents.” If we know that a popularity effect is occurring between parent and child, we can test only those connections for causal relationships.
Causal Detection
No matter how sophisticated, predictive algorithms can fall into the trap of equating correlation with causation [Sgaier 2020]. To avoid this, we must determine if a causal relationship exists between each parent and child, such as a change in A causes a change in B.
Starting with a graph of monotonic relationships, we have a good idea of where the causal relationships exist and who the influencers are. Once causal relationships are realized, our monotonic graph evolves into a causal graph. Afterward, we can map all the different causal pathways to an outcome of interest and determine how different variables relate to each other [Sgaier 2020].
We borrow a “relative difference” equation from epidemiology to determine causation from monotonic relationships. This equation produces a measure of the susceptibility of a population to exposure x [Pearl 2008]. Susceptibility is the proportion of persons who possess “an underlying factor sufficient to make a person contract a disease following exposure.”
PS = (P(y | x) − P(y | x′)) / (1 − P(y | x′))
Thus we are measuring the effect of x on y after removing all other causes of y. The result may not be the actual cause or may not be the total cause, so some relationships in the resulting causal graph may still have hidden causes attached.
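A sketch of the PS calculation over paired binary observations; the exposure and outcome arrays are hypothetical:

```python
import numpy as np

def probability_of_susceptibility(x: np.ndarray, y: np.ndarray) -> float:
    """PS = (P(y|x) - P(y|x')) / (1 - P(y|x')) for 0/1 arrays x and y."""
    p_y_x = y[x == 1].mean()       # P(y | x): outcome rate among the exposed
    p_y_not_x = y[x == 0].mean()   # P(y | x'): outcome rate among the unexposed
    return (p_y_x - p_y_not_x) / (1.0 - p_y_not_x)

# Hypothetical paired observations of exposure x and outcome y.
x = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y = np.array([1, 1, 1, 0, 1, 0, 0, 0])
print(probability_of_susceptibility(x, y))  # 0.666...
```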
Consider that Causal AI must have an understanding of the data-gathering process, and we now have a causal graph that may or may not be missing the actual cause in one or more relationships. If we add more data points to the graph, we have some probability of discovering the missing causes.
For example, a causal graph of stocks will show how some stocks influence others, with a few hidden causes attached. If we add more stocks, a few of those hidden causes should disappear. By adding other financial data, we might discover that the Secured Overnight Financing Rate (SOFR) is one of the hidden causes. As we add more data points, the probability of finding the missing causes goes up.
Since we can never answer causal questions from data alone [Pearl 2018], we have formulated a model of the process that generates the data. It adds datasets, determines relationships, and tests for causation. Only then is the new data inserted into the existing system. In essence, we are sidestepping the complexities of causal inference through the use of network effects.
Path Switching
The idea is to continuously add new data points and test them for monotonic relationships with the other data points in the graph to discover causal relationships. While some of these new data points will be influencers, others will be influenced, and still, others will result in path switching.
While graphing the monotonic relationships of stocks, we discovered that these relationships change during a trading day. For example, AAPL may have a monotonic relationship with MSFT of 0.951 in the morning, with a reduction to 0.923 by the afternoon. This change in relationship is significant because a shift in the influencer’s state now affects the influenced differently, a phenomenon that leads to the idea of Path-Switching Causation.
Consider a switch X with two positions. In position 1 (x = 1), the switch turns on a lamp (z = 1) and turns off a flashlight (w = 0). In position 0 (x = 0), the switch turns on the flashlight (w = 1) and turns off the lamp (z = 0). Let Y = 1 denote the proposition that the room is lighted [Pearl 2009].
The causal models Mu and Mu′ are associated with the states in which the switch is in position 1 and position 0, respectively.
Changing X from 1 to 0 alters the course of the causal pathway while keeping the source and destination the same. This alteration applies to our causal network because a change in the state of one node can result in a shift in the causal pathway. Thus some causal relationships act as switches.
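A sketch of the switch model in Python shows how flipping X re-routes the causal pathway while the outcome Y remains the same:

```python
def room(x: int) -> dict:
    """Pearl's switch example: X re-routes the path to Y."""
    z = 1 if x == 1 else 0          # lamp follows the switch
    w = 0 if x == 1 else 1          # flashlight is the complement
    y = 1 if (z == 1 or w == 1) else 0
    return {"X": x, "Z": z, "W": w, "Y": y}

print(room(1))  # {'X': 1, 'Z': 1, 'W': 0, 'Y': 1} -- lit via the lamp
print(room(0))  # {'X': 0, 'Z': 0, 'W': 1, 'Y': 1} -- lit via the flashlight
```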
For example, consider the monotonic relationship between AAPL and MSFT. We may discover that AAPL does not cause MSFT to change. Instead, a third influencer may be at work, one that changes AAPL and MSFT and, as a result, alters the causal path of IBM. Therefore our model will ensure that some nodes can act as switches when required.
Network Effects
Google PageRank is the algorithm that created a multi-billion dollar company by answering the question, “What fraction of pages on the web have k in-links?” In other words, it ranks each page by popularity. Such predictions are so exact because the notion of popularity can be analyzed with great precision through basic models of network behavior [Easley 2019].
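For illustration, a minimal popularity ranking with networkx; the influence edges are hypothetical, and they point from the influenced node to its influencer so that popularity accumulates as in-links, just as with web pages:

```python
import networkx as nx

# Edge A -> B means "A is influenced by B", so popular influencers
# accumulate in-links, like popular web pages.
G = nx.DiGraph([("MSFT", "AAPL"), ("IBM", "AAPL"), ("IBM", "MSFT")])
rank = nx.pagerank(G, alpha=0.85)   # damping factor from the original paper
print(max(rank, key=rank.get))      # AAPL: the most popular node
```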
The core tenet of our model is that a node's popularity corresponds to its influence. Thus a change in state to a popular node will affect a significant fraction of the nodes connected to it. We then test these connections for causality by isolating the relationship variables.
We introduce Network Effects into the model by basing causal testing on popularity. Like a social network, each dataset applied to the model benefits the other datasets already included. However, since we are dealing with popularity, power laws apply, which is why some posts on Twitter reach millions and others just a handful.
Black Swans and Epidemics
The idea that graphs can represent markets occurred to us while researching the 1987 market crash known as “Black Monday,” a global catastrophe with no compelling reason for the crash [Dolan 2022].
On October 19, 1987, U.S. markets fell more than 20% in a single day. However, the loss was not limited to just the U.S. Eight other markets declined by 20 to 29%, three by 30 to 39% (Malaysia, Mexico, and New Zealand), and three others by more than 40% (Hong Kong, Australia, and Singapore). The reverberation of a single market event across the globe reveals the interconnected nature of markets and assets.
A recurring theme in all accounts of this event is the phrase “no compelling reason for the crash.” Such a statement illustrates the complexity involved in causal discovery. The way shocks propagate through markets resembles the patterns by which epidemics spread through groups of people, which is why we use an equation from epidemiology to determine causation from monotonic relationships.
The opportunities for a disease to spread are calculated from a contact network [Easley 2019]. These models encompass people, animals, and even plants through branching. The first wave begins when a carrier of a new disease enters a population, meets k others, and transmits the disease to each with probability p. The second wave starts when each newly infected individual interacts with k others, giving k × k = k² potential new infections. The process continues over multiple waves, revealing the network effects of popularity with an ability to generate cascade events.
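A sketch of this branching process; whether the cascade dies out or spreads hinges on the basic reproductive number R0 = p × k, and the parameters below are illustrative:

```python
import random

def simulate_waves(k: int, p: float, waves: int, seed: int = 0) -> list:
    """Branching process: each infected node meets k others, infecting each with probability p."""
    random.seed(seed)
    infected, history = 1, []
    for _ in range(waves):
        contacts = infected * k
        infected = sum(1 for _ in range(contacts) if random.random() < p)
        history.append(infected)
        if infected == 0:
            break  # the cascade has died out
    return history

print(simulate_waves(k=4, p=0.3, waves=6))  # R0 = 1.2 > 1: likely to spread
```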
In Network Science, the cascade capacity is the threshold for which some finite set of early adopters can cause a complete cascade [Easley 2019]. While initially used for social network trend forecasts, this measure also applies to other networks as illustrated by epidemics.
Another aspect of network effects is clusters: groups of nodes under a similar influence or sharing the same source of influence. Clusters serve as fences that stop cascades. However, if the shock is big enough, it can breach a cluster, which signifies a “Black Swan” event forming. The ability of a cascade to penetrate depends on how dense the cluster is and how many nodes it has.
Deep Understanding
A deep understanding means knowing how things behaved yesterday and how they will behave under new hypothetical circumstances [Pearl 2009].
During the 1980s, researchers began using Bayesian networks to give machines human-like understanding. Instead of teaching machines cause and effect, a new paradigm evolved where it was sufficient to base your claims on assumptions as long as you make them transparent [Pearl 2018].
In contrast, we propose a new network model of causal relationships that a machine can build and infer from. The purpose is to facilitate continuous predictions from a network of influencers and influenced nodes.
Our model is a directed acyclic graph (DAG) where each node must undergo certain steps before being included. These steps are as follows:
- Determine the monotonic relationships between the new node and existing ones through Spearman’s rank-order correlation.
- Using the Network Effects of Popularity, determine the direction of influence by counting the number k of outgoing relationships and comparing it to each possible child.
- From these monotonic relationships, calculate the susceptibility of influence using PS = (P(y | x) − P(y | x′)) / (1 − P(y | x′)) to rebuild the connections based on causal relationships.
Note that adding a new node requires recalculating each existing network node using the above steps. However, this does not have to happen all at once. The recalculation order can begin with the existing nodes sharing a relationship with the new one and branch out to other parts of the graph over time. The model also allows more than one graph to represent alternative future states. A sketch of the admission steps follows.
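Below is a sketch of these admission steps; the function names, the 0.3 weak-relationship threshold, and the tie-breaking rule are assumptions for illustration:

```python
import numpy as np
from scipy.stats import spearmanr

def admit_node(name: str, data: np.ndarray, series: dict, graph: dict) -> None:
    """Admit a new node: step 1 monotonic link, step 2 direction, step 3 PS."""
    for other, other_data in series.items():
        rho, _ = spearmanr(data, other_data)    # step 1: monotonic strength
        if abs(rho) < 0.3:
            continue                            # weak relationship: no edge
        out_new = len(graph.get(name, []))      # step 2: popularity decides
        out_other = len(graph.get(other, []))   # the direction of influence
        parent, child = (name, other) if out_new >= out_other else (other, name)
        # Step 3 would re-test this edge with the PS susceptibility score
        # before keeping it; omitted here for brevity.
        graph.setdefault(parent, []).append(child)
    series[name] = data

series = {"AAPL": np.array([1.0, 2.0, 3.0, 4.0])}
graph = {"AAPL": []}
admit_node("MSFT", np.array([2.0, 2.5, 3.5, 5.0]), series, graph)
print(graph)
```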
The relationships between nodes represent a point in time. Understanding comes from the machine’s ability to postulate over future states of the graph through counterfactual reasoning. For example, to predict the future price of AAPL, the system can consider all of the stock’s influencers and their influencers to calculate a range of possible future states using Pearl’s calculus of causation.
Each future state is assigned a probability of occurrence. Therefore, the answer to the question of “What will be the closing price of AAPL this Friday?” is the most probable future state.
However, such a model applies to more than just predicting stocks. It works with any time series data, such as distribution, blockchain, energy demand, etc. Furthermore, the model is designed to work with multiple verticals of datasets to answer complex questions such as:
- How does demand for gasoline affect new car prices?
- What stocks are most influential to the price of BTC?
- What times are gas prices cheapest on Ethereum?
- What temperature drives the highest energy usage?
- When will the next stock market crash occur?
And many more…
Conclusion
What began as an attempt to forecast capital markets led to Autonomous Predictive Intelligence. Our effort to use Deep Learning hit a wall because many variables responsible for market changes did not work with the current models. For example, Deep Learning cannot account for the timing of earnings calls, Fed announcements, and similar predictable events that impact specific market assets.
To account for cause and effect, we were introduced to the work of Turing Award winner Judea Pearl. While most attempts at Artificial General Intelligence (AGI) focus on correlation, Pearl believes that causal analysis is the key to unlocking AI's full potential [Hartnett 2021]. Pearl explains, “while probabilities encode our beliefs about a static world, causality tells us whether and how probabilities change when the world changes.”
We took it a step further with a multiple-worlds approach where models on different timelines work together to predict future outcomes. The result is a universal machine that reasons, forecasts, and learns from cause and effect. So far, we have identified world-changing applications in Finance, Energy, Agriculture, Manufacturing, and National Security.
References
- Antoniadis, P. (2022) Introduction to Curve Fitting. Baeldung. https://www.baeldung.com/cs/curve-fitting
- Chiou, L., Whitehead, S., Pilling, G., et al. (2023) Graph Theory. Brilliant.org. https://brilliant.org/wiki/graph-theory/
- Cotton, P. (2022) Microprediction: Building an Open AI Network. MIT Press.
- Dolan, B. (2022) What Caused Black Monday, the 1987 Stock Market Crash? Investopedia. https://www.investopedia.com/ask/answers/042115/what-caused-black-monday-stock-market-crash-1987.asp
- Easley, D., Kleinberg, J. (2019) Networks, Crowds, and Markets. Cambridge University Press.
- Gonçalves, B. (2020) Structural Causal Models. Data for Science. https://medium.data4sci.com/causal-inference-part-iv-structural-causal-models-df10a83be580
- Hartnett, K. (2021) To Build Truly Intelligent Machines, Teach Them Cause and Effect. Quanta Magazine. https://www.quantamagazine.org/to-build-truly-intelligent-machines-teach-them-cause-and-effect-20180515/
- Joshi, V. (2017) A Gentle Introduction To Graph Theory. basecs. https://medium.com/basecs/a-gentle-introduction-to-graph-theory-77969829ead8
- Ma, Y. (2021) Deep Learning on Graphs. Cambridge University Press.
- Moses, T. (2021) Mandelbrot’s Multifractal Model of Asset Returns Explained. DataDrivenInvestor. https://medium.datadriveninvestor.com/mandelbrots-multifractal-model-of-asset-returns-explained-bde7b5121be0
- Neapolitan, R., Jiang, X. (2007) Probabilistic Methods for Financial and Marketing Informatics. Morgan Kaufmann.
- Pearl, J. (2008) Causal Inference. NIPS 2008 Workshop on Causality. http://proceedings.mlr.press/v6/pearl10a/pearl10a.pdf
- Pearl, J. (2009) Causality. Cambridge University Press.
- Pearl, J., Mackenzie, D. (2018) The Book of Why. Basic Books.
- Ramzai, J. (2020) Clearly Explained: Pearson V/S Spearman Correlation Coefficient. Towards Data Science. https://towardsdatascience.com/clearly-explained-pearson-v-s-spearman-correlation-coefficient-ada2f473b8
- Sgaier, S., Huang, V., Charles, G. (2020) The Case for Causal AI. Stanford Social Innovation Review. https://ssir.org/articles/entry/the_case_for_causal_ai
- Shekhar, G. (2022) Causal AI — Enabling Data-Driven Decisions. Towards Data Science. https://towardsdatascience.com/causal-ai-enabling-data-driven-decisions-d162f2a2f15e
- Stamile, C., Marzullo, A., Deusebio, E. (2021) Graph Machine Learning. Packt Publishing.
- Williams, B., Cremaschi, S. (2022) Computer Aided Chemical Engineering. Elsevier.