### CAUSALPATH – THE SCIENCE

Conceptually, CD methods can be thought of as trying to identify and quantify all plausible causal mechanisms that explain the data equally well (called the Markov Equivalence class). To make predictions, one reasons with the set of possible models: e.g., if all plausible models agree that X causes Y, this hypothesis is postulated. To eliminate possible models, certain assumptions about the nature of causality are made.
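To make the notion of equally plausible models concrete, the following sketch simulates two different causal structures, a chain X → Y → Z and a fork X ← Y → Z. The use of linear-Gaussian data and partial correlation as an independence test is an illustrative assumption (partial correlation is a valid test only in this setting). Both structures imply Dep(X;Z) and Ind(X;Z|Y), so observational (in)dependencies alone cannot distinguish them: they belong to the same Markov equivalence class.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def partial_corr(a, b, c):
    """Correlation of a and b after conditioning on c (linear-Gaussian CI test)."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / np.sqrt((1 - r_ac**2) * (1 - r_bc**2))

# Chain: X -> Y -> Z
x = rng.normal(size=n)
y = x + rng.normal(size=n)
z = y + rng.normal(size=n)
chain = (corr(x, z), partial_corr(x, z, y))

# Fork: X <- Y -> Z
y2 = rng.normal(size=n)
x2 = y2 + rng.normal(size=n)
z2 = y2 + rng.normal(size=n)
fork = (corr(x2, z2), partial_corr(x2, z2, y2))

# Both structures show Dep(X;Z) and Ind(X;Z|Y): the same equivalence class.
print(chain, fork)
```

Since both structures produce the same (in)dependence pattern, only further assumptions or experiments can tell them apart.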

Suppose in a dataset we observe that two quantities X and Y are statistically associated (interchangeably, correlated or dependent), denoted Dep(X;Y) (or reciprocally Ind(X;Y) to denote independence). A reasonable assumption is that associations appear because of causal structure, i.e., either one quantity is causing the other or (inclusively) a third, confounding variable is causing both. This is known as Reichenbach’s Common Cause Principle. Another foundational assumption is the Markov Condition (Spirtes et al. 2001): indirect influences or associations become independent given the direct causes. Thus, if X influences Y (only) indirectly through Z, then we expect Ind(X;Y | Z). If two variables X and Z are not correlated, could they nevertheless be causally related to each other? The answer is yes, but only in special cases requiring precisely balanced relationships between the probability densities involved. Assuming that this is not the case is called the Faithfulness Condition (hence, no dependency implies no causal relationship) (Spirtes et al. 2001).

Thus, if we observe that Dep(X;Y), Dep(Y;Z), and Ind(X;Z), we can infer X – Y – Z, where each edge denotes a causal relation of unknown direction or a latent confounder; the missing edge denotes the lack thereof. Next, suppose the data suggest that Dep(X;Z|Y), i.e., a conditional dependence. The simplest way to explain this dependence is that Y causes neither X nor Z: if it did, it would be a common cause of X and Z and we would expect Dep(X;Z) instead. We denote with L latent confounding variables, i.e., unobserved common causes of the observed quantities. Assembling the above constraints together, we can graphically represent some of the remaining causal possibilities: X → Y ← Z, X ← L → Y ← Z, X → Y ← L → Z, additionally X ← L → Y combined with X → Y (both a latent confounder and a causal relation), or even X ← L1 → Y ← L2 → Z. Assuming that there exist no latent confounding variables is a strong assumption, called Causal Sufficiency.
Additionally, the lack of feedback cycles is often assumed, i.e., the causal structure is representable by a Directed Acyclic Graph. When all of the assumptions stated above are made, the only causal structure that remains is X → Y ← Z, which forms the graph of a Bayesian Network.
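A minimal sketch of the collider reasoning above, assuming linear-Gaussian data so that (partial) correlation can stand in for a conditional-independence test: X and Z are generated independently, yet conditioning on their common effect Y induces a dependence between them.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

x = rng.normal(size=n)          # X and Z are generated independently
z = rng.normal(size=n)
y = x + z + rng.normal(size=n)  # collider: X -> Y <- Z

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def partial_corr(a, b, c):
    """Correlation of a and b after conditioning on c (linear-Gaussian CI test)."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / np.sqrt((1 - r_ac**2) * (1 - r_bc**2))

print(corr(x, y))             # Dep(X;Y): clearly non-zero
print(corr(x, z))             # Ind(X;Z): near zero
print(partial_corr(x, z, y))  # Dep(X;Z|Y): conditioning on the collider opens the path
```

The observed pattern Dep(X;Y), Dep(Y;Z), Ind(X;Z), Dep(X;Z|Y) is exactly what singles out the collider structure X → Y ← Z under the assumptions in the text.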

Dropping the Causal Sufficiency assumption requires a new type of graph to represent possible latent confounding variables, called Maximal Ancestral Graphs (MAGs) (T. Richardson & Spirtes 2002). MAGs are graphical models generalizing Bayesian Networks to distributions with latent variables. MAGs capture causal probabilistic relations; they can represent different types of structural uncertainty (e.g., the fact that A may be causing B or the two may share a latent common cause); and they capture the observable dependencies and independencies. They assume cross-sectional data and do not allow the presence of causal cycles (feedback loops). There exist (asymptotically) sound and complete algorithms for inducing such models, such as FCI (Colombo et al. 2012; Spirtes et al. 2001). MAGs can also be interpreted as path diagrams that are subcases of Structural Equation Models [Kaplan]. The equivalence class of MAGs is represented with another type of graph called a Partial Ancestral Graph (PAG), e.g., X •→ Y ←• Z, where the • mark denotes that X causes Y or (inclusively) the two may have a latent confounder. In some cases, complicated analysis of the data independencies may lead to inducing purely causal relations X → Y even when admitting latent confounders. Additionally, further analysis may reveal the presence of latent confounding variables, a feat conceptually equivalent to postulating the presence of a then-unseen planet (Neptune) from observed disturbances in the orbit of a nearby planet in the mid-19th century. This basic approach is complemented by other causal principles and assumptions, reasoning on the parameters of the models and the relations, as well as algorithms for inducing models and relations, which together form the field of Causal Discovery.
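The collider-orientation step that constraint-based algorithms such as FCI perform can be sketched on a three-variable PAG. The edge-mark encoding and variable names below are illustrative assumptions of this sketch, not the actual FCI implementation: each edge endpoint carries a mark ('o' for uncommitted, '>' for an arrowhead), and an unshielded triple is oriented as a collider when the middle variable did not appear in the separating set.

```python
# A tiny sketch of FCI's collider-orientation rule on a three-variable PAG.
# mark[(A, B)] is the endpoint mark at B's end of the A-B edge:
# 'o' = uncommitted, '>' = arrowhead into B.

edges = [("X", "Y"), ("Y", "Z")]   # X and Z are non-adjacent: Ind(X;Z) was found
sepsets = {("X", "Z"): set()}      # Y was NOT needed to separate X from Z

mark = {}
for a, b in edges:                 # start fully uncommitted: X o-o Y o-o Z
    mark[(a, b)] = "o"
    mark[(b, a)] = "o"

adjacent = {frozenset(e) for e in edges}

# Collider rule: for every unshielded triple A - B - C (A, C non-adjacent),
# if B is not in sepset(A, C), put arrowheads into B: A o-> B <-o C.
for (a, c), s in sepsets.items():
    for b in {"X", "Y", "Z"} - {a, c}:
        if (frozenset((a, b)) in adjacent
                and frozenset((b, c)) in adjacent
                and frozenset((a, c)) not in adjacent
                and b not in s):
            mark[(a, b)] = ">"
            mark[(c, b)] = ">"

print(mark[("X", "Y")], mark[("Z", "Y")])  # arrowheads into Y: X o-> Y <-o Z
```

The marks at the X and Z endpoints remain 'o', reflecting exactly the PAG uncertainty described above: each could be a direct cause of Y, a latent confounder could be present, or both.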

*“Meta analysis is a data fusion problem aimed at combining results from many experimental and observational studies, each conducted on a different population and under a different set of conditions, so as to synthesize an aggregate measure of effect size that is “better,” in some sense, than any one study in isolation. This fusion problem has received enormous attention in the health and social sciences, where data are scarce and experiments are costly. Unfortunately, current techniques of meta-analysis do little more than take weighted averages of the various studies, thus averaging apples and oranges to infer properties of bananas. One should be able to do better.”*

CAUSALPATH is an effort to do better. Rigorous theories and algorithms for considering different experimental conditions as well as overlapping datasets are under way by our group (Vincenzo Lagani et al. 2012) as well as other groups (Hauser & Bühlmann 2011) and will be further developed and converted to easy-to-use tools (**Objectives 8 and 9**).
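To make the weighted-average criticism concrete, here is a minimal sketch of what standard fixed-effect (inverse-variance) meta-analysis computes; the per-study effect sizes and standard errors are hypothetical numbers for illustration only.

```python
import math

# Hypothetical per-study effect sizes and standard errors (illustrative only).
effects = [0.30, 0.45, 0.10]
ses     = [0.10, 0.20, 0.15]

# Fixed-effect pooling: weight each study by the inverse of its variance.
weights   = [1 / se**2 for se in ses]
pooled    = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(pooled, pooled_se)
```

This pooled average is agnostic to the causal structure of each study's population and conditions, which is precisely the "apples and oranges" limitation the quote points to.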