### CAUSALPATH – THE SCIENCE

Conceptually, CD methods can be thought of as trying to identify and quantify all plausible causal mechanisms that explain the data equally well (the Markov equivalence class). To make predictions, one reasons with the set of possible models: for example, if all plausible models agree that X causes Y, this hypothesis is postulated. To eliminate possible models, certain assumptions about the nature of causality are made.

Suppose that in a dataset we observe that two quantities X and Y are statistically associated (interchangeably, correlated or dependent), denoted Dep(X;Y); conversely, Ind(X;Y) denotes independence. A reasonable assumption is that associations appear because of causal structure, i.e., either one quantity is causing the other or (inclusively) a third, confounding variable is causing both. This is known as Reichenbach's Common Cause Principle. Another foundational assumption is the Markov Condition (Spirtes et al. 2001): indirect influences or associations vanish once we condition on the direct causes. Thus, if X influences Y indirectly (only) through Z, then we expect Ind(X;Y|Z). If two variables X and Z are not correlated, could they nevertheless be causally related to each other? The answer is yes, but only in special cases requiring precisely balanced relationships between the probability densities involved. Assuming that this is not the case is called the Faithfulness Condition (hence, no dependency implies no causal relationship) (Spirtes et al. 2001).

Thus, if we observe Dep(X;Y), Dep(Y;Z), and Ind(X;Z), we can infer X – Y – Z, where each edge denotes a causal relation of unknown direction or a latent confounder, and the missing edge denotes the lack thereof. Next, suppose the data suggest Dep(X;Z|Y), i.e., a conditional dependence. The simplest way to explain this dependence is that Y causes neither X nor Z: if it did, Y would lie on an open path between X and Z (for example, as a common cause), and we would expect Dep(X;Z) instead. We denote by L latent confounding variables, i.e., unobserved common causes of the observed quantities. Assembling the above constraints, we can graphically represent some of the causal possibilities: X → Y ← Z; X ← L → Y ← Z; X ← L → Y ← Z with, additionally, X → Y (both a latent confounder and a causal relation); or even X ← L1 → Y ← L2 → Z. Assuming that there exist no latent confounding variables is a strong assumption, called Causal Sufficiency. Additionally, the lack of feedback cycles is often assumed, i.e., the causal structure is representable by a Directed Acyclic Graph. When all the assumptions stated above are made, the only causal structure that remains is X → Y ← Z, which forms the graph of a Bayesian Network.
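The chain and collider patterns described above can be checked numerically. The following is a minimal sketch, not part of the CAUSALPATH methods themselves: it assumes linear Gaussian relations with unit coefficients, uses partial correlation as a stand-in for a general conditional independence test, and the helper name `partial_corr` and all simulation parameters are illustrative choices.

```python
import numpy as np

def partial_corr(a, b, given=None):
    """Pearson correlation of a and b; if `given` is supplied, its linear
    effect is first regressed out of both variables (partial correlation)."""
    if given is not None:
        design = np.column_stack([np.ones_like(given), given])
        a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
        b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(a, b)[0, 1])

rng = np.random.default_rng(0)  # fixed seed, for reproducibility only
n = 20_000                      # assumed sample size

# Collider X -> Y <- Z: X and Z are independent causes of Y.
x = rng.normal(size=n)
z = rng.normal(size=n)
y = x + z + rng.normal(size=n)
collider = {
    "X;Y": partial_corr(x, y),
    "Y;Z": partial_corr(y, z),
    "X;Z": partial_corr(x, z),             # marginally independent
    "X;Z|Y": partial_corr(x, z, given=y),  # dependent once Y is given
}

# Chain X -> Z -> Y: X influences Y only through Z.
xc = rng.normal(size=n)
zc = xc + rng.normal(size=n)
yc = zc + rng.normal(size=n)
chain = {
    "X;Y": partial_corr(xc, yc),              # marginally dependent
    "X;Y|Z": partial_corr(xc, yc, given=zc),  # independent once Z is given
}

print("collider:", collider)
print("chain:   ", chain)
```

In the collider, the marginal correlation of X and Z is near zero while their partial correlation given Y is clearly negative; in the chain, the situation is reversed for X and Y given Z. This asymmetry is what allows the edges into Y to be oriented.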

Dropping the Causal Sufficiency assumption requires a new type of graph that can represent possible latent confounding variables, called Maximal Ancestral Graphs (MAGs) (T. Richardson & Spirtes 2002). MAGs are graphical models generalizing Bayesian Networks to distributions with latent variables. MAGs capture causal probabilistic relations; they can represent different types of structural uncertainty (e.g., the fact that A may be causing B or that the two may share a latent common cause); and they capture the observable dependencies and independencies. They assume cross-sectional data and do not allow the presence of causal cycles (feedback loops). There exist (asymptotically) sound and complete algorithms for inducing such models, such as FCI (Colombo et al. 2012; Spirtes et al. 2001). MAGs can also be interpreted as path diagrams that are special cases of Structural Equation Models [Kaplan]. The equivalence class of MAGs is represented with another type of graph called a Partially-Oriented Ancestral Graph (PAG), e.g., X •→ Y ←• Z, where the • mark denotes that either X causes Y or (inclusively) the two may share a latent confounder. In some cases, more involved analysis of the independencies in the data may lead to inducing purely causal relations X → Y even when latent confounders are admitted. Additionally, further analysis may reveal the presence of latent confounding variables, a feat conceptually equivalent to postulating the presence of a yet invisible planet (eventually named Pluto) from observing disturbances in the orbits of nearby planets in the late 19th century. The basic approach presented here is complemented by other causal principles and assumptions, by reasoning on the parameters of the models and the relations, and by algorithms for inducing models and relations, which together form the field of Causal Discovery.
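To illustrate why an equivalence class is needed once latent confounders are admitted, the sketch below simulates two of the structures discussed earlier, X → Y ← Z and X ← L1 → Y ← L2 → Z, and shows that they produce the same pattern of (in)dependencies among the observed variables. It is a toy illustration under assumed linear Gaussian relations, not an implementation of FCI; the fixed correlation threshold is a crude, hypothetical substitute for a proper statistical test, and the helper names are illustrative.

```python
import numpy as np

def partial_corr(a, b, given=None):
    """Pearson correlation of a and b, optionally after regressing out `given`."""
    if given is not None:
        design = np.column_stack([np.ones_like(given), given])
        a = a - design @ np.linalg.lstsq(design, a, rcond=None)[0]
        b = b - design @ np.linalg.lstsq(design, b, rcond=None)[0]
    return float(np.corrcoef(a, b)[0, 1])

def independence_pattern(x, y, z, threshold=0.05):
    """Summarize which (conditional) dependencies hold among X, Y, Z.
    `threshold` is an assumed cut-off standing in for a significance test."""
    return {
        "Dep(X;Y)": abs(partial_corr(x, y)) > threshold,
        "Dep(Y;Z)": abs(partial_corr(y, z)) > threshold,
        "Ind(X;Z)": abs(partial_corr(x, z)) <= threshold,
        "Dep(X;Z|Y)": abs(partial_corr(x, z, given=y)) > threshold,
    }

rng = np.random.default_rng(1)  # fixed seed, for reproducibility only
n = 20_000                      # assumed sample size

# Structure 1: X -> Y <- Z (no latent variables).
x1 = rng.normal(size=n)
z1 = rng.normal(size=n)
y1 = x1 + z1 + rng.normal(size=n)

# Structure 2: X <- L1 -> Y <- L2 -> Z (L1 and L2 are never observed).
l1 = rng.normal(size=n)
l2 = rng.normal(size=n)
x2 = l1 + rng.normal(size=n)
z2 = l2 + rng.normal(size=n)
y2 = l1 + l2 + rng.normal(size=n)

pattern_causal = independence_pattern(x1, y1, z1)
pattern_latent = independence_pattern(x2, y2, z2)

print("X -> Y <- Z              :", pattern_causal)
print("X <- L1 -> Y <- L2 -> Z  :", pattern_latent)
print("indistinguishable from these tests:", pattern_causal == pattern_latent)
```

Because both structures yield the same observable pattern, these tests alone cannot decide between a direct cause and a latent confounder on each edge, which is the uncertainty the • mark in X •→ Y ←• Z records.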

Major efforts in biology and medicine concern Data Integration (not analysis) tools and databases that facilitate the collection, retrieval, inspection, merging, or visualization of all available information. The synthesis and analysis of this information, however, is still left to the human expert. Such tools, for example, will never infer the lack of correlation between two variables that were never jointly measured, as in the example of INCA in Figure 2. Integrative Analysis in bioinformatics often refers to the analysis of several different types of modalities, e.g., genetic data, expression data, CpG methylation levels, etc. It is conceptually similar to multi-view learning in computer science, although the techniques employed target the idiosyncrasies of biological data. However, these modalities are measured on the same population, under the same conditions, and thus essentially form a single dataset (study), albeit one with heterogeneous types of information. Such methods cannot be used to piece together datasets with overlapping variable sets.

The integrative analysis of heterogeneous datasets, in the sense used in this proposal, has connections to statistical Meta-Analysis, Multi-Task Learning, Transfer Learning, Statistical Matching, and Relational Learning. For example, statistical matching (a.k.a. data fusion) (D’Orazio et al. 2006) addresses only the analysis of datasets measuring different sets of variables (as in the example), not the full spectrum of analyses we envision (e.g., studies under different experimental conditions). We consider this approach complementary and expect synergies to arise. Statistical meta-analysis combines the results of several studies to weigh all available evidence regarding a single hypothesis (correlation). In addition, meta-analysis is limited to the co-analysis of studies with similar sampling and experimental design characteristics. Multi-Task Learning in Machine Learning solves multiple learning tasks together from different datasets using a shared representation. The idea is to learn common induced features that may help learning. Again, these inferences are limited, as they can only combine studies under similar sampling and experimental conditions on the same sets of variables. Transfer Learning, or inductive transfer, generalizes multi-task learning to include sharing search-control techniques, model priors, and other characteristics among different tasks (datasets), though it exhibits similar limitations. The field of Relational Learning does not really address the problem of learning from different datasets, but rather from a single dataset in the form of relational tables (the one stemming from implicitly propositionalizing the database).

While the integrative analysis of heterogeneous datasets is paramount in science, even established fields such as meta-analysis face significant methodological problems, precisely because they ignore causality and do not perform causal modeling. J. Pearl criticizes them heavily, illustrating the need for new, formal methods (Pearl 2012): *“Meta analysis is a data fusion problem aimed at combining results from many experimental and observational studies, each conducted on a different population and under a different set of conditions, so as to synthesize an aggregate measure of effect size that is “better,” in some sense, than any one study in isolation. This fusion problem has received enormous attention in the health and social sciences, where data are scarce and experiments are costly. Unfortunately, current techniques of meta-analysis do little more than take weighted averages of the various studies, thus averaging apples and oranges to infer properties of bananas. One should be able to do better.”* CAUSALPATH is an effort to do better. Rigorous theories and algorithms for considering different experimental conditions as well as overlapping datasets are under way in our group (Vincenzo Lagani et al. 2012) and in other groups (Hauser & Bühlmann 2011), and will be further developed and converted into easy-to-use tools (**Objectives 8 and 9**).
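For concreteness, the "weighted average" that the quotation refers to can be sketched as the standard inverse-variance (fixed-effect) pooling estimator. This is a minimal illustration of conventional meta-analysis, not of the CAUSALPATH approach; the function name `fixed_effect_pool` and the three effect sizes and standard errors are hypothetical values made up for the example.

```python
import numpy as np

def fixed_effect_pool(effects, std_errors):
    """Inverse-variance weighted average of per-study effect estimates.
    Returns the pooled effect and its standard error."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(std_errors, dtype=float) ** 2
    pooled = float(np.sum(weights * effects) / np.sum(weights))
    pooled_se = float(np.sqrt(1.0 / np.sum(weights)))
    return pooled, pooled_se

# Hypothetical studies, possibly run on different populations and under
# different conditions; the estimator has no input describing such differences.
effects = [0.30, 0.10, -0.20]
std_errors = [0.10, 0.10, 0.20]

pooled, pooled_se = fixed_effect_pool(effects, std_errors)
print(f"pooled effect = {pooled:.3f} (SE {pooled_se:.3f})")
```

The estimator takes only effect sizes and their precisions as input. Nothing in it encodes which variables were measured, manipulated, or confounded in each study, which is the gap that causal modeling is meant to fill.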