Postgraduate research fellow, Department of Computer Science, University of Crete, Hellas
Department of Computer Science, University of Crete, Hellas
Postgraduate student, Department of Computer Science, University of Crete, Hellas
- O. D. Røe, M. Markaki, I. Tsamardinos, V. Lagani, O. T. D. Nguyen, J. H. Pedersen, Z. Saghir, and H. G. Ashraf, “‘Reduced’ HUNT model outperforms NLST and NELSON study criteria in predicting lung cancer in the Danish screening trial,” BMJ Open Respiratory Research, vol. 6, iss. 1, 2019.
Hypothesis: We hypothesise that the validated HUNT Lung Cancer Risk Model would perform better than the NLST (USA) and the NELSON (Dutch‐Belgian) criteria in the Danish Lung Cancer Screening Trial (DLCST). Methods: The DLCST measured only five out of the seven variables included in the validated HUNT Lung Cancer Model. Therefore a ‘Reduced’ model was retrained in the Norwegian HUNT2-cohort using the same statistical methodology as in the original HUNT model but based only on age, pack years, smoking intensity, quit time and body mass index (BMI), adjusted for sex. The model was applied on the DLCST-cohort and contrasted against the NLST and NELSON criteria. Results: Among the 4051 smokers in the DLCST with 10 years follow-up, median age was 57.6, BMI 24.75, pack years 33.8, cigarettes per day 20 and most were current smokers. For the same number of individuals selected for screening, the performance of the ‘Reduced’ HUNT was increased in all metrics compared with both the NLST and the NELSON criteria. In addition, to achieve the same sensitivity, one would need to screen fewer people by the ‘Reduced’ HUNT model versus using either the NLST or the NELSON criteria (709 vs 918, p=1.02e-11 and 1317 vs 1668, p=2.2e-16, respectively). Conclusions: The ‘Reduced’ HUNT model is superior in predicting lung cancer to both the NLST and NELSON criteria in a cost-effective way. This study supports the use of the HUNT Lung Cancer Model for selection based on risk ranking rather than age, pack year and quit time cut-off values. When we know how to rank personal risk, it will be up to the medical community and lawmakers to decide which risk threshold will be set for screening.
- G. Papoutsoglou, V. Lagani, A. Schmidt, K. Tsirlis, D. Cabrero, J. Tegner, and I. Tsamardinos, “Challenges in the Multivariate Analysis of Mass Cytometry Data: The Effect of Randomization,” Cytometry Part A, 2019.
Cytometry by time‐of‐flight (CyTOF) has emerged as a high‐throughput single cell technology able to provide large samples of protein readouts. Already, there exists a large pool of advanced high‐dimensional analysis algorithms that explore the observed heterogeneous distributions making intriguing biological inferences. A fact largely overlooked by these methods, however, is the effect of the established data preprocessing pipeline on the distributions of the measured quantities. In this article, we focus on randomization, a transformation used for improving data visualization, which can negatively affect multivariate data analysis methods such as dimensionality reduction, clustering, and network reconstruction algorithms. Our results indicate that randomization should be used only for visualization purposes, but not in conjunction with high‐dimensional analytical tools. © 2019 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.
- D. Gomez-Cabrero, S. Tarazona, I. Ferreirós-Vidal, R. N. Ramirez, C. Company, A. Schmidt, T. Reijmers, V. von Saint Paul, F. Marabita, J. Rodríguez-Ubreva, A. Garcia-Gomez, T. Carroll, L. Cooper, Z. Liang, G. Dharmalingam, F. van der Kloet, A. C. Harms, L. Balzano-Nogueira, V. Lagani, I. Tsamardinos, M. Lappe, D. Maier, J. A. Westerhuis, T. Hankemeier, A. Imhof, E. Ballestar, A. Mortazavi, M. Merkenschlager, J. Tegner, and A. Conesa, “STATegra, a comprehensive multi-omics dataset of B-cell differentiation in mouse,” Scientific Data, vol. 6, iss. 1, 2019.
Multi-omics approaches use a diversity of high-throughput technologies to profile the different molecular layers of living cells. Ideally, the integration of this information should result in comprehensive systems models of cellular physiology and regulation. However, most multi-omics projects still include a limited number of molecular assays and there have been very few multi-omic studies that evaluate dynamic processes such as cellular growth, development and adaptation. Hence, we lack formal analysis methods and comprehensive multi-omics datasets that can be leveraged to develop true multi-layered models for dynamic cellular systems. Here we present the STATegra multi-omics dataset that combines measurements from up to 10 different omics technologies applied to the same biological system, namely the well-studied mouse pre-B-cell differentiation. STATegra includes high-throughput measurements of chromatin structure, gene expression, proteomics and metabolomics, and it is complemented with single-cell data. To our knowledge, the STATegra collection is the most diverse multi-omics dataset describing a dynamic biological system.
- K. Lakiotaki, G. Georgakopoulos, E. Castanas, O. D. Røe, G. Borboudakis, and I. Tsamardinos, “A data driven approach reveals disease similarity on a molecular level,” npj Systems Biology and Applications, vol. 5, iss. 39, pp. 1-10, 2019.
Could there be unexpected similarities between different studies, diseases, or treatments, on a molecular level due to common biological mechanisms involved? To answer this question, we develop a method for computing similarities between empirical, statistical distributions of high-dimensional, low-sample datasets, and apply it on hundreds of -omics studies. The similarities lead to dataset-to-dataset networks visualizing the landscape of a large portion of biological data. Potentially interesting similarities connecting studies of different diseases are assembled in a disease-to-disease network. Exploring it, we discover numerous non-trivial connections between Alzheimer’s disease and schizophrenia, asthma and psoriasis, or liver cancer and obesity, to name a few. We then present a method that identifies the molecular quantities and pathways that contribute the most to the identified similarities and could point to novel drug targets or provide biological insights. The proposed method acts as a “statistical telescope” providing a global view of the constellation of biological data; readers can peek through it at: http://datascope.csd.uoc.gr:25000/.
- M. Tsagris and I. Tsamardinos, “Feature selection with the R package MXM,” F1000Research, vol. 7, p. 1505, 2019.
Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only a few have been implemented in R and made publicly available as R packages, and those offer few options. The R package MXM offers a variety of feature selection algorithms, and has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc.; b) it contains a variety of regression models that can be plugged into the feature selection algorithms (for example, with time to event data the user can choose among Cox, Weibull, log logistic or exponential regression); c) it includes an algorithm for detecting multiple solutions (many sets of statistically equivalent features; plainly speaking, two features carry statistically equivalent information when substituting one with the other does not affect the inference or the conclusions); and d) it includes memory efficient algorithms for high-volume data, i.e. data that cannot be loaded into R (on a machine with 16GB of RAM, for example, R cannot directly load 16GB of data; by utilizing the proper package, we load the data and then perform feature selection). In this paper, we qualitatively compare MXM with other relevant feature selection packages and discuss its advantages and disadvantages. Further, we provide a demonstration of MXM’s algorithms using real high-dimensional data from various applications.
- D. Kyriakis, A. Kanterakis, T. Manousaki, A. Tsakogiannis, M. Tsagris, I. Tsamardinos, L. Papaharisis, D. Chatziplis, G. Potamias, and C. Tsigenopoulos, “Scanning of Genetic Variants and Genetic Mapping of Phenotypic Traits in Gilthead Sea Bream Through ddRAD Sequencing,” Frontiers in Genetics, vol. 10, p. 675, 2019.
Gilthead sea bream (Sparus aurata) is a teleost of considerable economic importance in Southern European aquaculture. The aquaculture industry shows a growing interest in the application of genetic methods that can locate phenotype–genotype associations with high economic impact. Through selective breeding, the aquaculture industry can exploit this information to maximize the financial yield. Here, we present a Genome Wide Association Study (GWAS) of 112 samples belonging to seven different sea bream families collected from a Greek commercial aquaculture company. Through double digest Random Amplified DNA (ddRAD) Sequencing, we generated a per-sample genetic profile consisting of 2,258 high-quality Single Nucleotide Polymorphisms (SNPs). These profiles were tested for association with four phenotypes of major financial importance: Fat, Weight, Tag Weight, and the Length to Width ratio. We applied two methods of association analysis. The first is the typical single-SNP to phenotype test, and the second is a feature selection (FS) method through two novel algorithms that are employed for the first time in aquaculture genomics and produce groups with multiple SNPs associated to a phenotype. In total, we identified 9 single SNPs and 6 groups of SNPs associated with weight-related phenotypes (Weight and Tag Weight), 2 groups associated with Fat, and 16 groups associated with the Length to Width ratio. Six identified loci (Chr4:23265532, Chr6:12617755, Chr8:11613979, Chr13:1098152, Chr15:3260819, and Chr22:14483563) were present in genes associated with growth in other teleosts or even mammals, such as semaphorin-3A and neurotrophin-3. These loci are strong candidates for future studies that will help us unveil the genetic mechanisms underlying growth and improve the sea bream aquaculture productivity by providing genomic anchors for selection programs.
- S. J. Fernandes, H. Morikawa, E. Ewing, S. Ruhrmann, R. N. Joshi, V. Lagani, N. Karathanasis, M. Khademi, N. Planell, A. Schmidt, I. Tsamardinos, T. Olsson, F. Piehl, I. Kockum, M. Jagodic, J. Tegnér, and D. Gomez-Cabrero, “Non-parametric combination analysis of multiple data types enables detection of novel regulatory mechanisms in T cells of multiple sclerosis patients,” Nature Scientific Reports, vol. 9, iss. 11996, 2019.
Multiple Sclerosis (MS) is an autoimmune disease of the central nervous system with prominent neurodegenerative components. The triggering and progression of MS are associated with transcriptional and epigenetic alterations in several tissues, including peripheral blood. The combined influence of transcriptional and epigenetic changes associated with MS has not been assessed in the same individuals. Here we generated paired transcriptomic (RNA-seq) and DNA methylation (Illumina 450 K array) profiles of CD4+ and CD8+ T cells (CD4, CD8), using clinically accessible blood from healthy donors and MS patients in the initial relapsing-remitting and subsequent secondary-progressive stage. By integrating the output of a differential expression test with a permutation-based non-parametric combination methodology, we identified 149 differentially expressed (DE) genes in both CD4 and CD8 cells collected from MS patients. Moreover, by leveraging the methylation-dependent regulation of gene expression, we identified the gene SH3YL1, which displayed significant correlated expression and methylation changes in MS patients. Importantly, silencing of SH3YL1 in primary human CD4 cells demonstrated its influence on T cell activation. Collectively, our strategy based on paired sampling of several cell-types provides a novel approach to increase sensitivity for identifying shared mechanisms altered in CD4 and CD8 cells of relevance in MS in small-sized clinical materials.
- E. Ewing, L. Kular, S. J. Fernandes, N. Karathanasis, V. Lagani, S. Ruhrmann, I. Tsamardinos, J. Tegner, F. Piehl, D. Gomez-Cabrero, and M. Jagodic, “Combining evidence from four immune cell types identifies DNA methylation patterns that implicate functionally distinct pathways during Multiple Sclerosis progression,” EBioMedicine, vol. 43, pp. 411-423, 2019.
Background: Multiple Sclerosis (MS) is a chronic inflammatory disease and a leading cause of progressive neurological disability among young adults. DNA methylation, which intersects genes and environment to control cellular functions on a molecular level, may provide insights into MS pathogenesis. Methods: We measured DNA methylation in CD4+ T cells (n = 31), CD8+ T cells (n = 28), CD14+ monocytes (n = 35) and CD19+ B cells (n = 27) from relapsing-remitting (RRMS), secondary progressive (SPMS) patients and healthy controls (HC) using Infinium HumanMethylation450 arrays. Monocyte (n = 25) and whole blood (n = 275) cohorts were used for validations. Findings: B cells from MS patients displayed most significant differentially methylated positions (DMPs), followed by monocytes, while only few DMPs were detected in T cells. We implemented a non-parametric combination framework (omicsNPC) to increase discovery power by combining evidence from all four cell types. Identified shared DMPs co-localized at MS risk loci and clustered into distinct groups. Functional exploration of changes discriminating RRMS and SPMS from HC implicated lymphocyte signaling, T cell activation and migration. SPMS-specific changes, on the other hand, implicated myeloid cell functions and metabolism. Interestingly, neuronal and neurodegenerative genes and pathways were also specifically enriched in the SPMS cluster. Interpretation: We utilized a statistical framework (omicsNPC) that combines multiple layers of evidence to identify DNA methylation changes that provide new insights into MS pathogenesis in general, and disease progression, in particular. Fund: This work was supported by the Swedish Research Council, Stockholm County Council, AstraZeneca, European Research Council, Karolinska Institutet and Margaretha af Ugglas Foundation.
- M. S. Loos, R. Ramakrishnan, W. Vranken, A. Tsirigotaki, E. Tsare, V. Zorzini, J. D. Geyter, B. Yuan, I. Tsamardinos, M. Klappa, J. Schymkowitz, F. Rousseau, S. Karamanou, and A. Economou, “Structural Basis of the Subcellular Topology Landscape of Escherichia coli,” Frontiers in Microbiology, vol. 10, 2019.
Cellular proteomes are distributed in multiple compartments: on DNA, ribosomes, on and inside membranes, or they become secreted. Structural properties that allow polypeptides to occupy subcellular niches, particularly after crossing membranes, remain unclear. We compared intrinsic and extrinsic features in cytoplasmic and secreted polypeptides of the Escherichia coli K-12 proteome. Structural features between the cytoplasmome and secretome are sharply distinct, such that a signal peptide-agnostic machine learning tool distinguishes cytoplasmic from secreted proteins with 95.5% success. Cytoplasmic polypeptides are enriched in aliphatic, aromatic, charged and hydrophobic residues, unique folds and higher early folding propensities. Secretory polypeptides are enriched in polar/small amino acids, β folds, have higher backbone dynamics, higher disorder and contact order and are more often intrinsically disordered. These non-random distributions and experimental evidence imply that evolutionary pressure selected enhanced secretome flexibility, slow folding and looser structures, placing the secretome in a distinct protein class. These adaptations protect the secretome from premature folding during its cytoplasmic transit, optimize its lipid bilayer crossing and allow it to acquire cell envelope specific chemistries. The latter may favor promiscuous multi-ligand binding, sensing of stress and cell envelope structure changes. In conclusion, enhanced flexibility, slow folding, looser structures and unique folds differentiate the secretome from the cytoplasmome. These findings have wide implications on the structural diversity and evolution of modern proteomes and the protein folding problem.
- I. Ferreirós-Vidal, T. Carroll, T. Zhang, V. Lagani, R. N. Ramirez, E. Ing-Simmons, A. Garcia, L. Cooper, Z. Liang, G. Papoutsoglou, G. Dharmalingam, Y. Guo, S. Tarazona, S. J. Fernandes, P. Noori, G. Silberberg, A. G. Fisher, I. Tsamardinos, A. Mortazavi, B. Lenhard, A. Conesa, J. Tegner, M. Merkenschlager, and D. Gomez-Cabrero, “Feedforward regulation of Myc coordinates lineage-specific with housekeeping gene expression during B cell progenitor cell differentiation,” PLOS Biology, vol. 17, iss. 4, pp. 1-28, 2019.
The human body is made from billions of cells comprising many specialized cell types. All of these cells ultimately come from a single fertilized oocyte in a process that has two key features: proliferation, which expands cell numbers, and differentiation, which diversifies cell types. Here, we have examined the transition from proliferation to differentiation using B lymphocytes as an example. We find that the transition from proliferation to differentiation involves changes in the expression of genes, which can be categorized into cell-type–specific genes and broadly expressed “housekeeping” genes. The expression of many housekeeping genes is controlled by the gene regulatory factor Myc, whereas the expression of many B lymphocyte–specific genes is controlled by the Ikaros family of gene regulatory proteins. Myc is repressed by Ikaros, which means that changes in housekeeping and tissue-specific gene expression are coordinated during the transition from proliferation to differentiation.
- Y. Pantazis and I. Tsamardinos, “A unified approach for sparse dynamical system inference from temporal measurements,” Bioinformatics, 2019.
Temporal variations in biological systems and more generally in natural sciences are typically modeled as a set of ordinary, partial or stochastic differential or difference equations. Algorithms for learning the structure and the parameters of a dynamical system are distinguished based on whether time is discrete or continuous, observations are time-series or time-course and whether the system is deterministic or stochastic. However, there is no approach able to handle the various types of dynamical systems simultaneously. In this paper, we present a unified approach to infer both the structure and the parameters of non-linear dynamical systems of any type under the restriction of being linear with respect to the unknown parameters. Our approach, which is named Unified Sparse Dynamics Learning (USDL), consists of two steps. First, an atemporal system of equations is derived through the application of the weak formulation. Then, assuming a sparse representation for the dynamical system, we show that the inference problem can be expressed as a sparse signal recovery problem, allowing the application of an extensive body of algorithms and theoretical results. Results on simulated data demonstrate the efficacy and superiority of the USDL algorithm under multiple interventions and/or stochasticity. Additionally, USDL’s accuracy significantly correlates with theoretical metrics such as the exact recovery coefficient. On real single-cell data, the proposed approach is able to induce high-confidence subgraphs of the signaling pathway. Source code is available at Bioinformatics online. The USDL algorithm has also been integrated in SCENERY (http://scenery.csd.uoc.gr/), an online tool for single-cell mass cytometry analytics. Supplementary data are available at Bioinformatics online.
- G. Borboudakis and I. Tsamardinos, “Forward-Backward Selection with Early Dropping,” Journal of Machine Learning Research, vol. 20, iss. 8, pp. 1-39, 2019.
Forward-backward selection is one of the most basic and commonly-used feature selection algorithms available. It is also general and conceptually applicable to many different types of data. In this paper, we propose a heuristic that significantly improves its running time, while preserving predictive performance. The idea is to temporarily discard the variables that are conditionally independent of the outcome given the selected variable set. Depending on how those variables are reconsidered and reintroduced, this heuristic gives rise to a family of algorithms with increasingly stronger theoretical guarantees. In distributions that can be faithfully represented by Bayesian networks or maximal ancestral graphs, members of this algorithmic family are able to correctly identify the Markov blanket in the sample limit. In experiments we show that the proposed heuristic increases computational efficiency by about 1-2 orders of magnitude, while selecting fewer or the same number of variables and retaining predictive performance. Furthermore, we show that the proposed algorithm and feature selection with LASSO perform similarly when restricted to select the same number of variables, making the proposed algorithm an attractive alternative for problems where no (efficient) algorithm for LASSO exists.
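The early-dropping heuristic described in this abstract can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the conditional independence test is assumed to be a Fisher z-test on partial correlations (suitable only for roughly Gaussian data), and the names `fbed`, `ci_pvalue`, `extra_runs` and `make_toy_data` are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Plain Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def partial_corr(data, i, j, cond):
    """Partial correlation of columns i and j given columns in cond (recursive formula)."""
    if not cond:
        return pearson(data[i], data[j])
    k, rest = cond[0], cond[1:]
    r_ij = partial_corr(data, i, j, rest)
    r_ik = partial_corr(data, i, k, rest)
    r_jk = partial_corr(data, j, k, rest)
    return (r_ij - r_ik * r_jk) / math.sqrt((1 - r_ik ** 2) * (1 - r_jk ** 2))

def ci_pvalue(data, i, j, cond):
    """Assumed test: two-sided Fisher z-test of zero partial correlation."""
    n = len(data[i])
    r = max(min(partial_corr(data, i, j, list(cond)), 0.999999), -0.999999)
    z = 0.5 * math.log((1 + r) / (1 - r)) * math.sqrt(n - len(cond) - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def fbed(data, target, alpha=0.05, extra_runs=0):
    """Forward phase with early dropping, then a standard backward phase."""
    selected = []
    for _ in range(extra_runs + 1):
        # Each run reconsiders every variable that is not yet selected.
        remaining = [v for v in data if v != target and v not in selected]
        while remaining:
            pvals = {v: ci_pvalue(data, v, target, selected) for v in remaining}
            # Early dropping: discard variables that look conditionally
            # independent of the target given the current selection.
            remaining = [v for v in remaining if pvals[v] <= alpha]
            if not remaining:
                break
            best = min(remaining, key=pvals.get)
            selected.append(best)
            remaining.remove(best)
    # Backward phase: remove variables that became redundant.
    while selected:
        pvals = {v: ci_pvalue(data, v, target, [s for s in selected if s != v])
                 for v in selected}
        worst = max(selected, key=pvals.get)
        if pvals[worst] <= alpha:
            break
        selected.remove(worst)
    return selected

def make_toy_data(seed=1, n=400):
    """Hypothetical toy data: y depends on x1 and x2; x3 is a noisy copy of x1."""
    rng = random.Random(seed)
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [rng.gauss(0, 1) for _ in range(n)]
    x3 = [a + rng.gauss(0, 0.5) for a in x1]
    noise = [rng.gauss(0, 1) for _ in range(n)]
    y = [2 * a - 1.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
    return {"x1": x1, "x2": x2, "x3": x3, "noise": noise, "y": y}

if __name__ == "__main__":
    print(fbed(make_toy_data(), "y", alpha=0.01))
```

Setting `extra_runs` above zero re-initializes the candidate pool with the previously dropped variables, which corresponds to the "reconsidered and reintroduced" step mentioned in the abstract; how many such runs to allow is a speed versus guarantees trade-off.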
- M. Panagopoulou, M. Karaglani, I. Balgkouranidou, V. Vasilakakis, E. Biziota, T. Koukaki, E. Karamitrousis, E. Nena, I. Tsamardinos, G. Kolios, E. Lianidou, S. Kakolyris, and E. Chatzaki, “Circulating cell free DNA in Breast cancer: size profiling, levels and methylation patterns lead to prognostic and predictive classifiers,” Oncogene, vol. 38, iss. 18, pp. 3387-3401, 2019.
Blood circulating cell-free DNA (ccfDNA) is a suggested biosource of valuable clinical information for cancer, meeting the need for a minimally-invasive advancement in the route of precision medicine. In this paper, we evaluated the prognostic and predictive potential of ccfDNA parameters in early and advanced breast cancer. Groups consisted of 150 and 16 breast cancer patients under adjuvant and neoadjuvant therapy respectively, 34 patients with metastatic disease and 35 healthy volunteers. Direct quantification of ccfDNA in plasma revealed elevated concentrations correlated to the incidence of death, shorter PFS, and non-response to pharmacotherapy in the metastatic but not in the other groups. The methylation status of a panel of cancer-related genes chosen based on previous expression and epigenetic data (KLK10, SOX17, WNT5A, MSH2, GATA3) was assessed by quantitative methylation-specific PCR. All but the GATA3 gene were more frequently methylated in all the patient groups than in healthy individuals (all p < 0.05). The methylation of WNT5A was statistically significantly correlated to greater tumor size and poor prognosis characteristics and in advanced stage disease with shorter OS. In the metastatic group, SOX17 methylation was also significantly correlated to the incidence of death, shorter PFS, and OS. KLK10 methylation was significantly correlated to unfavorable clinicopathological characteristics and relapse, whereas in the adjuvant group to shorter DFI. Methylation of at least 3 or 4 genes was significantly correlated to shorter OS and no pharmacotherapy response, respectively. Classification analysis by a fully automated, machine learning software produced a single-parametric linear model using ccfDNA plasma concentration values, with great discriminating power to predict response to chemotherapy (AUC 0.803, 95% CI [0.606, 1.000]) in the metastatic group.
Two more multi-parametric signatures were produced for the metastatic group, predicting survival and disease outcome. Finally, a multiple logistic regression model was constructed, discriminating between patient groups and healthy individuals. Overall, ccfDNA emerged as a highly potent predictive classifier in metastatic breast cancer. Upon prospective clinical evaluation, all the signatures produced could aid accurate prognosis.
- K. Tsirlis, V. Lagani, S. Triantafillou, and I. Tsamardinos, “On scoring Maximal Ancestral Graphs with the Max–Min Hill Climbing algorithm,” International Journal of Approximate Reasoning, vol. 102, pp. 74-85, 2018.
We consider the problem of causal structure learning in presence of latent confounders. We propose a hybrid method, MAG Max–Min Hill-Climbing (M3HC) that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. On a large scale experimentation we show that the proposed algorithm greatly improves on GSMAG in all comparisons, and over a set of known networks from the literature it compares positively against FCI and cFCI as well as competitively against GFCI, three well known constraint-based approaches for causal-network reconstruction in presence of latent confounders.
- M. Tsagris, “Bayesian Network Learning with the PC Algorithm: An Improved and Correct Variation,” Applied Artificial Intelligence, vol. 33, iss. 2, pp. 101-123, 2018.
PC is a prototypical constraint-based algorithm for learning Bayesian networks, a special case of directed acyclic graphs. An existing variant of it, in the R package pcalg, was developed to make the skeleton phase order independent. In return, it has notably increased execution time. In this paper, we clarify that the skeleton phase of the PC algorithm is indeed order independent. The modification we propose outperforms pcalg’s variant of PC: it returns correct networks of better quality, as it is less prone to errors, and in some cases it is computationally much cheaper. In addition, we show that pcalg’s variant does not return valid acyclic graphs.
- I. Tsamardinos, E. Greasidou, and G. Borboudakis, “Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation,” Machine Learning, vol. 107, iss. 12, pp. 1895-1922, 2018.
Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV’s main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822–829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we employ again the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training of models on new folds of inferior (with high probability) configurations. We name this method Bootstrap Bias Corrected with Dropping CV (BBCD-CV); it is both efficient and provides accurate performance estimates.
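The bootstrap idea in this abstract can be illustrated with a short Python sketch. It assumes the pooled out-of-sample predictions of every configuration are already available from an ordinary cross-validation run, and it uses accuracy as the metric; the function names `bbc_cv` and `accuracy` and the data layout `oos_preds[c][i]` are assumptions made for illustration, not the authors' code.

```python
import random

def accuracy(preds, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def bbc_cv(oos_preds, labels, metric=accuracy, n_boot=500, seed=0):
    """Bootstrap-corrected performance estimate of the selected configuration.

    oos_preds[c][i] is the out-of-sample prediction of configuration c
    for sample i, pooled over the cross-validation folds.
    """
    rng = random.Random(seed)
    n = len(labels)
    n_configs = len(oos_preds)

    def perf(c, idx):
        return metric([oos_preds[c][i] for i in idx], [labels[i] for i in idx])

    estimates = []
    for _ in range(n_boot):
        # Resample rows of the prediction matrix with replacement.
        in_bag = [rng.randrange(n) for _ in range(n)]
        in_set = set(in_bag)
        out_bag = [i for i in range(n) if i not in in_set]
        if not out_bag:
            continue  # extremely unlikely; skip degenerate resamples
        # Repeat the selection step on the in-bag rows only ...
        best = max(range(n_configs), key=lambda c: perf(c, in_bag))
        # ... and score the selected configuration on the held-out rows.
        estimates.append(perf(best, out_bag))
    return sum(estimates) / len(estimates)

if __name__ == "__main__":
    # Hypothetical demo: 50 configurations that all guess at random.
    rng = random.Random(42)
    labels = [rng.randint(0, 1) for _ in range(40)]
    configs = [[rng.randint(0, 1) for _ in labels] for _ in range(50)]
    naive = max(accuracy(c, labels) for c in configs)
    print("naive best CV accuracy:", naive)
    print("bootstrap-corrected estimate:", bbc_cv(configs, labels))
```

No model is retrained inside the loop: only stored predictions are resampled, which is why the correction is cheap compared with nested cross-validation.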
- I. Tsamardinos, G. Borboudakis, P. Katsogridakis, P. Pratikakis, and V. Christophides, “A greedy feature selection algorithm for Big Data of high dimensionality,” Machine Learning, vol. 108, iss. 2, pp. 149-202, 2018.
We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix both in terms of rows as well as columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be applied to other greedy FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
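To make the "p-values plus meta-analysis" idea concrete, the sketch below combines p-values computed independently on separate row partitions into one global p-value per feature. The abstract does not name the combination rule, so Fisher's combined probability test is used here as an assumed, standard choice; `fisher_combine`, `pick_best_feature` and the example numbers are hypothetical.

```python
import math

def fisher_combine(pvalues):
    """Fisher's combined probability test for k independent p-values.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom under the null; for even degrees of freedom the
    survival function has the closed form used below.
    """
    k = len(pvalues)
    # Guard against log(0) for p-values that underflowed to zero.
    half = -sum(math.log(max(p, 1e-300)) for p in pvalues)  # statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def pick_best_feature(local_pvalues):
    """Combine per-partition p-values and return (best feature, combined p-values)."""
    combined = {f: fisher_combine(ps) for f, ps in local_pvalues.items()}
    return min(combined, key=combined.get), combined

if __name__ == "__main__":
    # Hypothetical p-values from three row partitions for two features.
    local = {"f1": [0.01, 0.04, 0.03], "f2": [0.6, 0.2, 0.9]}
    print(pick_best_feature(local))
```

Each partition only has to send one number per feature to the coordinator, which is what keeps communication cost low in this kind of scheme.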
- M. Adamou, G. Antoniou, E. Greasidou, V. Lagani, P. Charonyktakis, I. Tsamardinos, and M. Doyle, “Toward Automatic Risk Assessment to Support Suicide Prevention,” Crisis, vol. 40, pp. 249-256, 2018.
Background: Suicide has been considered an important public health issue for years and is one of the main causes of death worldwide. Despite prevention strategies being applied, the rate of suicide has not changed substantially over the past decades. Suicide risk has proven extremely difficult to assess for medical specialists, and traditional methodologies deployed have been ineffective. Advances in machine learning make it possible to attempt to predict suicide with the analysis of relevant data aiming to inform clinical practice. Aims: We aimed to (a) test our artificial intelligence based, referral-centric methodology in the context of the National Health Service (NHS), (b) determine whether statistically relevant results can be derived from data related to previous suicides, and (c) develop ideas for various exploitation strategies. Method: The analysis used data of patients who died by suicide in the period 2013–2016 including both structured data and free-text medical notes, necessitating the deployment of state-of-the-art machine learning and text mining methods. Limitations: Sample size is a limiting factor for this study, along with the absence of non-suicide cases. Specific analytical solutions were adopted for addressing both issues. Results and Conclusion: The results of this pilot study indicate that machine learning shows promise for predicting within a specified period which people are most at risk of taking their own life at the time of referral to a mental health service.
- M. Tsagris, V. Lagani, and I. Tsamardinos, ” Feature selection for high-dimensional temporal data,” BMC Bioinformatics, vol. 19, iss. 17, pp. 1-14, 2018.
Feature selection is commonly employed for identifying collectively-predictive biomarkers and biosignatures; it facilitates the construction of small statistical models that are easier to verify, visualize, and comprehend while providing insight to the human expert. In this work, we extend established constraint-based feature selection methods to high-dimensional “omics” temporal data, where the number of measurements is orders of magnitude larger than the sample size. The extension required the development of conditional independence tests for temporal and/or static variables conditioned on a set of temporal variables. The algorithm is able to return multiple, equivalent solution subsets of variables, scale to tens of thousands of features, and outperform or be on par with existing methods depending on the analysis task specifics. The use of this algorithm is suggested for variable selection with high-dimensional temporal data.
- K. Lakiotaki, N. Vorniotakis, M. Tsagris, G. Georgakopoulos, and I. Tsamardinos, “BioDataome: a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology,” Database, vol. 2018, iss. bay011, pp. 1-14, 2018.
Biotechnology revolution generates a plethora of omics data with an exponential growth pace. Therefore, biological data mining demands automatic, ‘high quality’ curation efforts to organize biomedical knowledge into online databases. BioDataome is a database of uniformly preprocessed and disease-annotated omics data with the aim to promote and accelerate the reuse of public data. We followed the same preprocessing pipeline for each biological mart (microarray gene expression, RNA-Seq gene expression and DNA methylation) to produce ready for downstream analysis datasets and automatically annotated them with disease-ontology terms. We also designate datasets that share common samples and automatically discover control samples in case-control studies. Currently, BioDataome includes ∼5600 datasets, ∼260 000 samples spanning ∼500 diseases and can be easily used in large-scale massive experiments and meta-analysis. All datasets are publicly available for querying and downloading via BioDataome web application. We demonstrate BioDataome’s utility by presenting exploratory data analysis examples. We have also developed BioDataome R package found in: https://github.com/mensxmachina/BioDataome/. Database URL: http://dataome.mensxmachina.org/
- M. Tsagris, G. Borboudakis, V. Lagani, and I. Tsamardinos, “Constraint-based causal discovery with mixed data,” International Journal of Data Science and Analytics, vol. 6, iss. 1, pp. 19-30, 2018.
We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.
- M. Adamou, G. Antoniou, E. Greassidou, V. Lagani, P. Charonyktakis, and I. Tsamardinos, “Mining Free-Text Medical Notes for Suicide Risk Assessment,” in SETN ’18 Proceedings of the 10th Hellenic Conference on Artificial Intelligence, 2018.
Suicide has been considered an important public health issue for a very long time, and is one of the main causes of death worldwide. Despite suicide prevention strategies being applied, the rate of suicide has not changed substantially over the past decades. Advances in machine learning make it possible to attempt to predict suicide based on the analysis of relevant data to inform clinical practice. This paper reports on findings from the analysis of data of patients who died by suicide in the period 2013-2016 and made use of both structured data and free-text medical notes. We focus on examining various text-mining approaches to support risk assessment. The results show that using advanced machine learning and text-mining techniques, it is possible to predict within a specified period which people are most at risk of taking their own life at the time of referral to a mental health service.
- M. Markaki, I. Tsamardinos, A. Langhammer, V. Lagani, K. Hveem, and O. D. Røe, “A Validated Clinical Risk Prediction Model for Lung Cancer in Smokers of All Ages and Exposure Types: A HUNT Study,” EBioMedicine, vol. 31, pp. 34-46, 2018.
Lung cancer causes >1·6 million deaths annually, with early diagnosis being paramount to effective treatment. Here we present a validated risk assessment model for lung cancer screening. The prospective HUNT2 population study in Norway examined 65,237 people aged >20 years in 1995-97. After a median of 15·2 years, 583 lung cancer cases had been diagnosed; 552 (94·7%) ever-smokers and 31 (5·3%) never-smokers. We performed multivariable analyses of 36 candidate risk predictors, using multiple imputation of missing data and backwards feature selection with Cox regression. The resulting model was validated in an independent Norwegian prospective dataset of 45,341 ever-smokers, in which 675 lung cancers had been diagnosed after a median follow-up of 11·6 years. Our final HUNT Lung Cancer Model included age, pack-years, smoking intensity, years since smoking cessation, body mass index, daily cough, and hours of daily indoors exposure to smoke. External validation showed a 0·879 concordance index (95% CI 0·866-0·891) with an area under the curve of 0·87 (95% CI 0·85-0·89) within 6 years. Only 22% of ever-smokers would need screening to identify 81·85% of all lung cancers within 6 years. Our model of seven variables is simple, accurate, and useful for screening selection.
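For readers unfamiliar with the concordance index reported above, the following is a minimal sketch of Harrell's C for right-censored follow-up data. It is an illustration under stated assumptions, not the authors' code: the abstract does not say which estimator was used, and the function name and toy data are hypothetical.

```python
def concordance_index(times, events, risks):
    """Harrell's C: fraction of comparable pairs ranked correctly.

    times  -- follow-up time per subject
    events -- 1 if the event (e.g. diagnosis) was observed, 0 if censored
    risks  -- model risk score; a higher score should mean an earlier event
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's last follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): three subjects, the last one censored.
toy_times = [1.0, 2.0, 3.0]
toy_events = [1, 1, 0]
perfect = concordance_index(toy_times, toy_events, [3.0, 2.0, 1.0])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so the reported 0·879 indicates that the model ordered most comparable pairs correctly.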
- G. Borboudakis, T. Stergiannakos, M. Frysali, E. Klontzas, I. Tsamardinos, and G. E. Froudakis, “Chemically intuited, large-scale screening of MOFs by machine learning techniques,” NPJ Computational Materials, vol. 3, iss. 40, 2017.
A novel computational methodology for large-scale screening of MOFs is applied to gas storage with the use of machine learning technologies. This approach is a promising trade-off between the accuracy of ab initio methods and the speed of classical approaches, strategically combined with chemical intuition. The results demonstrate that the chemical properties of MOFs are indeed predictable (stochastically, not deterministically) using machine learning methods and automated analysis protocols, with the accuracy of predictions increasing with sample size. Our initial results indicate that this methodology is promising to apply not only to gas storage in MOFs but in many other material science projects.
- V. Lagani, G. Athineou, A. Farcomeni, M. Tsagris, and I. Tsamardinos, “Feature Selection with the R Package MXM: Discovering Statistically Equivalent Feature Subsets,” Journal of Statistical Software, vol. 80, iss. 7, 2017.
The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most of the currently available feature selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect the SES algorithm subsumes and extends previous feature selection algorithms, like the max-min parent children algorithm. The SES algorithm is implemented in an homonym function included in the R package MXM, standing for mens ex machina, meaning ‘mind from the machine’ in Latin. The MXM implementation of SES handles several data analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm, its implementation, and provide examples of use of the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real world data.
- G. Orfanoudaki, M. Markaki, K. Chatzi, I. Tsamardinos, and A. Economou, “MatureP: prediction of secreted proteins with exclusive information from their mature regions,” Nature Scientific Reports, vol. 7, iss. 1, p. 3263, 2017.
More than a third of the cellular proteome is non-cytoplasmic. Most secretory proteins use the Sec system for export and are targeted to membranes using signal peptides and mature domains. To specifically analyze bacterial mature domain features, we developed MatureP, a classifier that predicts secretory sequences through features exclusively computed from their mature domains. MatureP was trained using Just Add Data Bio, an automated machine learning tool. Mature domains are predicted efficiently with ~92% success, as measured by the Area Under the Receiver Operating Characteristic Curve (AUC). Predictions were validated using experimental datasets of mutated secretory proteins. The features selected by MatureP reveal prominent differences in amino acid content between secreted and cytoplasmic proteins. Amino-terminal mature domain sequences have enhanced disorder, more hydroxyl and polar residues and less hydrophobics. Cytoplasmic proteins have prominent amino-terminal hydrophobic stretches and charged regions downstream. Presumably, secretory mature domains comprise a distinct protein class. They balance properties that promote the necessary flexibility required for the maintenance of non-folded states during targeting and secretion with the ability of post-secretion folding. These findings provide novel insight in protein trafficking, sorting and folding mechanisms and may benefit protein secretion biotechnology.
- K. Siomos, E. Papadaki, I. Tsamardinos, K. Kerkentzes, M. Koygioylis, and C. Trakatelli, “Prothrombotic and Endothelial Inflammatory Markers in Greek Patients with Type 2 Diabetes Compared to Non-Diabetics,” Endocrinology & Metabolic Syndrome, vol. 6, iss. 1, 2017.
Objective: To evaluate specific factors of coagulation and endothelial inflammatory markers, namely thrombomodulin, soluble receptor of protein C (sEPCR), factor VIII, plasminogen activator inhibitor 1, von Willebrand factor, fibrinogen, fibrinogen dimers (d-dimers), high sensitivity C-reactive protein and homocysteine, in a subset of Greek subjects with and without Type 2 (T2) Diabetes. Design: 84 subjects, of whom 44 were patients with T2 diabetes, were included in the randomized comparative prospective cross-sectional study. The subjects were split into a T2 diabetics group and a group of healthy controls of similar age, anthropometric profiles and gender distribution. Results: A total of 47 variables and biomarkers, together with indicators for metabolic profiles, clinical history, as well as detailed anthropometric profiles and traditional risk factors, were evaluated. Dipeptidyl peptidase-4 (DPP4), insulin, use of sulfonylurea, high HbA1c and glucose levels were clearly statistically differentiated in the two groups, while no other biomarkers, including the new potential indicators, were found to be different. High values of thrombomodulin and homocysteine were correlated with a rise in creatinine and thus seem to affect renal function in the diabetic patients group, while in the non-diabetics group the correlations are different, with sEPCR having a relatively strong negative correlation with renal function as measured with the Modification of Diet in Renal Disease, in agreement with the latest international findings. Conclusions: The presence of T2 diabetes in conjunction with age clearly correlates with problems in renal function; thrombomodulin and homocysteine could serve as indicators for renal damage in diabetics but not in healthy individuals. sEPCR, on the other hand, could be a potential generic indicator for renal damage. Thrombomodulin and sEPCR, as prothrombotic agents, did not show any indication that they can be utilised as markers for the prevention and/or treatment of thrombotic complications in diabetic patients.
- G. Papoutsoglou, G. Athineou, V. Lagani, I. Xanthopoulos, A. Schmidt, S. Éliás, J. Tegnér, and I. Tsamardinos, “SCENERY: a web application for (causal) network reconstruction from cytometry data,” Nucleic Acids Research, vol. 45, p. W270-W275, 2017.
Flow and mass cytometry technologies can probe proteins as biological markers in thousands of individual cells simultaneously, providing unprecedented opportunities for reconstructing networks of protein interactions through machine learning algorithms. The network reconstruction (NR) problem has been well-studied by the machine learning community. However, the potentials of available methods remain largely unknown to the cytometry community, mainly due to their intrinsic complexity and the lack of comprehensive, powerful and easy-to-use NR software implementations specific for cytometry data. To bridge this gap, we present Single CEll NEtwork Reconstruction sYstem (SCENERY), a web server featuring several standard and advanced cytometry data analysis methods coupled with NR algorithms in a user-friendly, on-line environment. In SCENERY, users may upload their data and set their own study design. The server offers several data analysis options categorized into three classes of methods: data (pre)processing, statistical analysis and NR. The server also provides interactive visualization and download of results as ready-to-publish images or multimedia reports. Its core is modular and based on the widely-used and robust R platform allowing power users to extend its functionalities by submitting their own NR methods. SCENERY is available at scenery.csd.uoc.gr or http://mensxmachina.org/en/software/.
- S. Triantafillou, V. Lagani, C. Heinze-Deml, A. Schmidt, J. Tegner, and I. Tsamardinos, “Predicting Causal Relationships from Biological Data: Applying Automated Causal Discovery on Mass Cytometry Data of Human Immune Cells,” Nature Scientific Reports, vol. 7, iss. 12724, 2017.
Learning the causal relationships that define a molecular system allows us to predict how the system will respond to different interventions. Distinguishing causality from mere association typically requires randomized experiments. Methods for automated causal discovery from limited experiments exist, but have so far rarely been tested in systems biology applications. In this work, we apply state-of-the-art causal discovery methods on a large collection of public mass cytometry data sets, measuring intra-cellular signaling proteins of the human immune system and their response to several perturbations. We show how different experimental conditions can be used to facilitate causal discovery, and apply two fundamental methods that produce context-specific causal predictions. Causal predictions were reproducible across independent data sets from two different studies, but often disagreed with the KEGG pathway databases. Within this context, we discuss the caveats we need to overcome for automated causal discovery to become a part of the routine data analysis in systems biology.
Mens Ex Machina, Mind from the Machine or “Ο από Μηχανής Νους”, paraphrases the Latin expression Deus Ex Machina, God from the Machine. The name was suggested by Lucy Sofiadou, Prof. Tsamardinos’ wife.
We are a research group, founded in October 2006 and led by Professor Ioannis Tsamardinos, interested in Artificial Intelligence, Machine Learning, and Biomedical Informatics and affiliated with the Computer Science Department of the University of Crete. The aims of the group are to advance science and disseminate knowledge via educational activities and computer tools. Our group is involved in:
Theoretical, algorithmic, and applied research in all of the above areas; we are also involved in interdisciplinary collaborations with biologists, physicians and practitioners from other fields.
Educational activities, such as teaching university courses, tutorials, and summer schools, as well as supervising undergraduate dissertations, master's projects, and Ph.D. theses.
Systems and Software:
Implementation of tools, systems, and code libraries to aid the dissemination of the research results. Funding is provided from and through the University of Crete, often originating from European and International research grants.
Current research activities include, but are not limited to, the following:
- Causal discovery methods and the induction of causal models from observational studies. Specifically, we have recently introduced the problem of Integrative Causal Analysis (INCA).
- Feature selection (a.k.a. variable selection) for classification and regression.
- Induction of graphical models, such as Bayesian Networks from data.
- Analysis of biomedical data and applications of AI and Machine Learning methods to induce new biomedical knowledge.
- Activity recognition in Ambient Intelligent environments.
Professor, Department of Computer Science, University of Crete