Mens X Machina

Our software

Recent News






Projects



CAUSAL PATH

Next Generation Causal Analysis inspired by the induction of biological pathways from cytometry data

HUNT

STATEGRA

Statistical methods and tools for the integrative analysis of omics data

EPILOGEAS – ARISTEIA II


Our Team

Theo. Giakoumakis

Ph.D

Ph.D. student, Department of Computer Science, University of Crete, Hellas

J. Xanthopoulos

Msc

Postgraduate student, Department of Computer Science, University of Crete, Hellas

Konstantina Biza

Msc

Postgraduate student, Department of Computer Science, University of Crete, Hellas

Stefanos Fafalios

Msc

Postgraduate student, Department of Computer Science, University of Crete, Hellas

Myrto Krana

Msc

Postgraduate student, Department of Computer Science, University of Crete, Hellas

Ioulia Karagiannaki

Msc

Postgraduate student, Department of Computer Science, University of Crete, Hellas

Publications

2019

  • Y. Pantazis and I. Tsamardinos, “A Unified Approach for Sparse Dynamical System Inference from Temporal Measurements,” Bioinformatics, 2019 (to appear).

2018

  • [DOI] I. Tsamardinos, E. Greasidou, and G. Borboudakis, “Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation,” Machine Learning, vol. 107, iss. 12, pp. 1895-1922, 2018.
    [Summary]

    Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV’s main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822–829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we employ again the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training of models on new folds of inferior (with high probability) configurations. We name the method Bootstrap Bias Corrected with Dropping CV (BBCD-CV) that is both efficient and provides accurate performance estimates.
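
    The bootstrap step described in this summary can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `bbc_cv`, `oos_preds`, and `n_boot` are hypothetical, accuracy is assumed as the metric, and higher scores are assumed to be better. Each bootstrap round selects the best configuration on the resampled rows of the out-of-sample prediction matrix and scores that configuration on the rows left out of the resample, so no model is retrained.

```python
import random


def accuracy(preds, labels):
    """Fraction of predictions equal to the true labels."""
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def bbc_cv(oos_preds, labels, n_boot=200, metric=accuracy, seed=0):
    """Bootstrap-corrected estimate of the selected configuration's performance.

    oos_preds[i][c] is the pooled out-of-sample (cross-validated) prediction
    for sample i under configuration c. Higher metric values are assumed better.
    """
    rng = random.Random(seed)
    n = len(labels)
    n_configs = len(oos_preds[0])

    def score(config, rows):
        return metric([oos_preds[i][config] for i in rows],
                      [labels[i] for i in rows])

    estimates = []
    for _ in range(n_boot):
        # Resample rows with replacement; rows never drawn form the test set.
        in_bag = [rng.randrange(n) for _ in range(n)]
        drawn = set(in_bag)
        out_of_bag = [i for i in range(n) if i not in drawn]
        if not out_of_bag:
            continue  # extremely unlikely unless n is tiny
        # Select the winner on the resample, evaluate it on unseen rows.
        best = max(range(n_configs), key=lambda c: score(c, in_bag))
        estimates.append(score(best, out_of_bag))
    return sum(estimates) / len(estimates)
```

    Because selection and evaluation use disjoint rows in every round, the averaged score is not inflated by the selection step in the way the cross-validated score of the winning configuration is.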

  • [DOI] I. Tsamardinos, G. Borboudakis, P. Katsogridakis, P. Pratikakis, and V. Christophides, “A greedy feature selection algorithm for Big Data of high dimensionality,” Machine Learning, 2018.
    [Summary]

    We present the Parallel, Forward–Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix both in terms of rows as well as columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed to other greedy-type of FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
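
    The summary says that p-values computed locally on each partition are combined with meta-analysis techniques. One classical technique of that kind is Fisher's combined probability test; using it here is an assumption made for illustration, and the function name `fisher_combine` is hypothetical. The sketch assumes independent p-values in the interval (0, 1] and uses the closed form of the chi-square survival function for an even number of degrees of freedom.

```python
import math


def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom under the null, where k = len(p_values).
    For even degrees of freedom the survival function has the closed form
    exp(-x/2) * sum_{j<k} (x/2)^j / j!, so no external library is needed.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # statistic / 2
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


# Three partitions each report a p-value for the same feature.
combined = fisher_combine([0.04, 0.10, 0.07])
```

    Only one number per feature and partition has to be communicated under such a scheme, which is consistent with the low communication cost the summary emphasizes.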

  • [DOI] M. Adamou, G. Antoniou, E. Greasidou, V. Lagani, P. Charonyktakis, I. Tsamardinos, and M. Doyle, “Toward Automatic Risk Assessment to Support Suicide Prevention,” Crisis, pp. 1-8, 2018.
    [Summary]

    Background: Suicide has been considered an important public health issue for years and is one of the main causes of death worldwide. Despite prevention strategies being applied, the rate of suicide has not changed substantially over the past decades. Suicide risk has proven extremely difficult to assess for medical specialists, and traditional methodologies deployed have been ineffective. Advances in machine learning make it possible to attempt to predict suicide with the analysis of relevant data aiming to inform clinical practice. Aims: We aimed to (a) test our artificial intelligence based, referral-centric methodology in the context of the National Health Service (NHS), (b) determine whether statistically relevant results can be derived from data related to previous suicides, and (c) develop ideas for various exploitation strategies. Method: The analysis used data of patients who died by suicide in the period 2013–2016 including both structured data and free-text medical notes, necessitating the deployment of state-of-the-art machine learning and text mining methods. Limitations: Sample size is a limiting factor for this study, along with the absence of non-suicide cases. Specific analytical solutions were adopted for addressing both issues. Results and Conclusion: The results of this pilot study indicate that machine learning shows promise for predicting within a specified period which people are most at risk of taking their own life at the time of referral to a mental health service.

  • [DOI] K. Lakiotaki, N. Vorniotakis, M. Tsagris, G. Georgakopoulos, and I. Tsamardinos, “BioDataome: a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology,” Database, iss. bay011, 2018.
    [Summary]

    Biotechnology revolution generates a plethora of omics data with an exponential growth pace. Therefore, biological data mining demands automatic, ‘high quality’ curation efforts to organize biomedical knowledge into online databases. BioDataome is a database of uniformly preprocessed and disease-annotated omics data with the aim to promote and accelerate the reuse of public data. We followed the same preprocessing pipeline for each biological mart (microarray gene expression, RNA-Seq gene expression and DNA methylation) to produce ready for downstream analysis datasets and automatically annotated them with disease-ontology terms. We also designate datasets that share common samples and automatically discover control samples in case-control studies. Currently, BioDataome includes ∼5600 datasets, ∼260 000 samples spanning ∼500 diseases and can be easily used in large-scale massive experiments and meta-analysis. All datasets are publicly available for querying and downloading via BioDataome web application. We demonstrate BioDataome’s utility by presenting exploratory data analysis examples. We have also developed BioDataome R package found in: https://github.com/mensxmachina/BioDataome/. Database URL: http://dataome.mensxmachina.org/

  • [DOI] M. Markaki, I. Tsamardinos, A. Langhammer, V. Lagani, K. Hveem, and O. D. Røe, “A Validated Clinical Risk Prediction Model for Lung Cancer in Smokers of All Ages and Exposure Types: A HUNT Study,” EBioMedicine, 2018.
    [Summary]

    Lung cancer causes >1·6 million deaths annually, with early diagnosis being paramount to effective treatment. Here we present a validated risk assessment model for lung cancer screening. The prospective HUNT2 population study in Norway examined 65,237 people aged >20 years in 1995-97. After a median of 15·2 years, 583 lung cancer cases had been diagnosed; 552 (94·7%) ever-smokers and 31 (5·3%) never-smokers. We performed multivariable analyses of 36 candidate risk predictors, using multiple imputation of missing data and backwards feature selection with Cox regression. The resulting model was validated in an independent Norwegian prospective dataset of 45,341 ever-smokers, in which 675 lung cancers had been diagnosed after a median follow-up of 11·6 years. Our final HUNT Lung Cancer Model included age, pack-years, smoking intensity, years since smoking cessation, body mass index, daily cough, and hours of daily indoors exposure to smoke. External validation showed a 0·879 concordance index (95% CI 0·866-0·891) with an area under the curve of 0·87 (95% CI 0·85-0·89) within 6 years. Only 22% of ever-smokers would need screening to identify 81·85% of all lung cancers within 6 years. Our model of seven variables is simple, accurate, and useful for screening selection.

  • M. Panagopoulou, M. Karaglani, I. Balgkouranidou, V. Vasilakakis, E. Biziota, T. Koukaki, E. Karamitrousis, E. Nena, I. Tsamardinos, G. Kolios, E. Lianidou, S. Kakolyris, and E. Chatzaki, “Circulating cell free DNA in Breast cancer: size profiling, levels and methylation patterns lead to prognostic and predictive classifiers,” Oncogene, 2018 (to appear).
    [Summary]

    Blood circulating cell-free DNA (ccfDNA) is a suggested biosource of valuable clinical information for cancer, meeting the need for a minimally-invasive advancement in the route of precision medicine. In this paper, we evaluated the prognostic and predictive potential of ccfDNA parameters in early and advanced breast cancer. Groups consisted of 150 and 16 breast cancer patients under adjuvant and neoadjuvant therapy respectively, 34 patients with metastatic disease and 35 healthy volunteers. Direct quantification of ccfDNA in plasma revealed elevated concentrations correlated to the incidence of death, shorter PFS, and non-response to pharmacotherapy in the metastatic but not in the other groups. The methylation status of a panel of cancer-related genes chosen based on previous expression and epigenetic data (KLK10, SOX17, WNT5A, MSH2, GATA3) was assessed by quantitative methylation-specific PCR. All but the GATA3 gene was more frequently methylated in all the patient groups than in healthy individuals (all p < 0.05). The methylation of WNT5A was statistically significantly correlated to greater tumor size and poor prognosis characteristics and in advanced stage disease with shorter OS. In the metastatic group, also SOX17 methylation was significantly correlated to the incidence of death, shorter PFS, and OS. KLK10 methylation was significantly correlated to unfavorable clinicopathological characteristics and relapse, whereas in the adjuvant group to shorter DFI. Methylation of at least 3 or 4 genes was significantly correlated to shorter OS and no pharmacotherapy response, respectively. Classification analysis by a fully automated, machine learning software produced a single-parametric linear model using ccfDNA plasma concentration values, with great discriminating power to predict response to chemotherapy (AUC 0.803, 95% CI [0.606, 1.000]) in the metastatic group. Two more multi-parametric signatures were produced for the metastatic group, predicting survival and disease outcome. Finally, a multiple logistic regression model was constructed, discriminating between patient groups and healthy individuals. Overall, ccfDNA emerged as a highly potent predictive classifier in metastatic breast cancer. Upon prospective clinical evaluation, all the signatures produced could aid accurate prognosis.

  • [DOI] M. Tsagris, V. Lagani, and I. Tsamardinos, “Feature selection for high-dimensional temporal data,” BMC Bioinformatics, iss. 1, 2018.
    [Summary]

    Feature selection is commonly employed for identifying collectively-predictive biomarkers and biosignatures; it facilitates the construction of small statistical models that are easier to verify, visualize, and comprehend while providing insight to the human expert. In this work, we extend established constraint-based, feature-selection methods to high-dimensional “omics” temporal data, where the number of measurements is orders of magnitude larger than the sample size. The extension required the development of conditional independence tests for temporal and/or static variables conditioned on a set of temporal variables. The algorithm is able to return multiple, equivalent solution subsets of variables, scale to tens of thousands of features, and outperform or be on par with existing methods depending on the analysis task specifics. The use of this algorithm is suggested for variable selection with high-dimensional temporal data.

  • [DOI] M. Tsagris, G. Borboudakis, V. Lagani, and I. Tsamardinos, “Constraint-based causal discovery with mixed data,” International Journal of Data Science and Analytics, 2018.
    [Summary]

    We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.

  • [DOI] M. Adamou, G. Antoniou, E. Greasidou, V. Lagani, P. Charonyktakis, and I. Tsamardinos, “Mining Free-Text Medical Notes for Suicide Risk Assessment.” 2018.
    [Summary]

    Suicide has been considered an important public health issue for a very long time, and is one of the main causes of death worldwide. Despite suicide prevention strategies being applied, the rate of suicide has not changed substantially over the past decades. Advances in machine learning make it possible to attempt to predict suicide based on the analysis of relevant data to inform clinical practice. This paper reports on findings from the analysis of data of patients who died by suicide in the period 2013-2016 and made use of both structured data and free-text medical notes. We focus on examining various text-mining approaches to support risk assessment. The results show that using advanced machine learning and text-mining techniques, it is possible to predict within a specified period which people are most at risk of taking their own life at the time of referral to a mental health service.

2017

  • K. Tsirlis, V. Lagani, S. Triantafillou, and I. Tsamardinos, “On Scoring Maximal Ancestral Graphs with the Max-Min Hill Climbing Algorithm.” 2017.

  • M. Tsagris, G. Borboudakis, V. Lagani, and I. Tsamardinos, “Constraint-based Causal Discovery with Mixed Data.” 2017.

  • [DOI] S. Triantafillou, V. Lagani, C. Heinze-Deml, A. Schmidt, J. Tegner, and I. Tsamardinos, “Predicting Causal Relationships from Biological Data: Applying Automated Causal Discovery on Mass Cytometry Data of Human Immune Cells,” Scientific Reports, vol. 7, 12724, 2017.

  • [DOI] K. Siomos, E. Papadaki, I. Tsamardinos, K. Kerkentzes, M. Koygioylis, and C. Trakatelli, “Prothrombotic and Endothelial Inflammatory Markers in Greek Patients with Type 2 Diabetes Compared to Non-Diabetics,” Endocrinology & Metabolic Syndrome, iss. 1, 2017.

  • [DOI] G. Papoutsoglou, G. Athineou, V. Lagani, I. Xanthopoulos, A. Schmidt, S. Éliás, J. Tegnér, and I. Tsamardinos, “SCENERY: a web application for (causal) network reconstruction from cytometry data,” Nucleic Acids Research, 2017.

  • G. Orfanoudaki, M. Markaki, K. Chatzi, I. Tsamardinos, and A. Economou, “MatureP: prediction of secreted proteins with exclusive information from their mature regions,” Scientific Reports, iss. 1, 2017.

  • V. Lagani, G. Athineou, A. Farcomeni, M. Tsagris, and I. Tsamardinos, “Feature Selection with the R Package MXM: Discovering Multiple, Statistically-Equivalent, Predictive Feature Subsets,” Journal of Statistical Software, iss. 7, 2017.


About Us

Mens Ex Machina, Mind from the Machine or “Ο από Μηχανής Νους”, paraphrases the Latin expression Deus Ex Machina, God from the Machine. The name was suggested by Lucy Sofiadou, Prof. Tsamardinos’ wife.

We are a research group, founded in October 2006 and led by Associate Professor Ioannis Tsamardinos, with interests in Artificial Intelligence, Machine Learning, and Biomedical Informatics. The group is affiliated with the Computer Science Department of the University of Crete. Its aims are to advance science and disseminate knowledge via educational activities and computer tools. Our group is involved in:

Research:

Theoretical, algorithmic, and applied research in all of the above areas; we are also involved in interdisciplinary collaborations with biologists, physicians and practitioners from other fields.

Education:

Educational activities, such as teaching university courses, tutorials, and summer schools, as well as supervising undergraduate dissertations, master's projects, and Ph.D. theses.

Systems and Software:

Implementation of tools, systems, and code libraries to aid the dissemination of the research results.

Funding is provided through the University of Crete, often originating from European and international research grants.

Current research activities include, but are not limited to, the following:

        • Causal discovery methods and the induction of causal models from observational studies. Specifically, we have recently introduced the problem of Integrative Causal Analysis (INCA).

        • Feature selection (a.k.a. variable selection) for classification and regression.

        • Induction of graphical models, such as Bayesian networks, from data.

        • Analysis of biomedical data and applications of AI and Machine Learning methods to induce new biomedical knowledge.

        • Activity recognition in Ambient Intelligent environments.

       

Ioannis Tsamardinos
Associate Professor, Department of Computer Science, University of Crete