Mens X Machina

Projects



Huawei Technologies

Causal discovery and inference for surrogate-assisted optimization

CAUSAL PATH

Next Generation Causal Analysis inspired by the induction of biological pathways from cytometry data

HUNT

Our aim is to develop a blood test for screening smokers and asbestos-exposed individuals in order to detect and cure these cancers.

STATEGRA

Statistical methods and tools for the integrative analysis of omics data


Publications

2023

  • A. Ntroumpogiannis, M. Giannoulis, N. Myrtakis, V. Christophides, E. Simon, and I. Tsamardinos, “A Meta-level Analysis of Online Anomaly Detectors,” The VLDB Journal, 2023. doi:10.1007/s00778-022-00773-x
    @article{https://doi.org/10.1007/s00778-022-00773-x,
      added-at = {2023-03-07T22:49:53.000+0100},
      author = {Ntroumpogiannis, Antonios and Giannoulis, Michail and Myrtakis, Nikolaos and Christophides, Vassilis and Simon, Eric and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/2c6cd4b4041e3e546204b7a86899b350a/mensxmachina},
      copyright = {Creative Commons Attribution 4.0 International},
      doi = {10.1007/s00778-022-00773-x},
      interhash = {b686003d5f8fd9819551157d5e3123b2},
      intrahash = {c6cd4b4041e3e546204b7a86899b350a},
      journal = {The VLDB Journal},
      keywords = {anomalies learning machine},
      timestamp = {2023-03-07T22:49:53.000+0100},
      title = {A Meta-level Analysis of Online Anomaly Detectors},
      url = {https://link.springer.com/article/10.1007/s00778-022-00773-x},
      year = 2023
    }

2022

  • S. Bowler, G. Papoutsoglou, A. Karanikas, I. Tsamardinos, M. J. Corley, and L. C. Ndhlovu, “A machine learning approach utilizing DNA methylation as an accurate classifier of COVID-19 disease severity,” Scientific Reports, vol. 12, iss. 1, p. 17480, 2022. doi:10.1038/s41598-022-22201-4

    Since the onset of the COVID-19 pandemic, increasing cases with variable outcomes continue globally because of variants and despite vaccines and therapies. There is a need to identify at-risk individuals early that would benefit from timely medical interventions. DNA methylation provides an opportunity to identify an epigenetic signature of individuals at increased risk. We utilized machine learning to identify DNA methylation signatures of COVID-19 disease from data available through NCBI Gene Expression Omnibus. A training cohort of 460 individuals (164 COVID-19-infected and 296 non-infected) and an external validation dataset of 128 individuals (102 COVID-19-infected and 26 non-COVID-associated pneumonia) were reanalyzed. Data was processed using ChAMP and beta values were logit transformed. The JADBio AutoML platform was leveraged to identify a methylation signature associated with severe COVID-19 disease. We identified a random forest classification model from 4 unique methylation sites with the power to discern individuals with severe COVID-19 disease. The average area under the curve of receiver operator characteristic (AUC-ROC) of the model was 0.933 and the average area under the precision-recall curve (AUC-PRC) was 0.965. When applied to our external validation, this model produced an AUC-ROC of 0.898 and an AUC-PRC of 0.864. These results further our understanding of the utility of DNA methylation in COVID-19 disease pathology and serve as a platform to inform future COVID-19 related studies.

    @article{bowler2022machine,
      abstract = {Since the onset of the COVID-19 pandemic, increasing cases with variable outcomes continue globally because of variants and despite vaccines and therapies. There is a need to identify at-risk individuals early that would benefit from timely medical interventions. DNA methylation provides an opportunity to identify an epigenetic signature of individuals at increased risk. We utilized machine learning to identify DNA methylation signatures of COVID-19 disease from data available through NCBI Gene Expression Omnibus. A training cohort of 460 individuals (164 COVID-19-infected and 296 non-infected) and an external validation dataset of 128 individuals (102 COVID-19-infected and 26 non-COVID-associated pneumonia) were reanalyzed. Data was processed using ChAMP and beta values were logit transformed. The JADBio AutoML platform was leveraged to identify a methylation signature associated with severe COVID-19 disease. We identified a random forest classification model from 4 unique methylation sites with the power to discern individuals with severe COVID-19 disease. The average area under the curve of receiver operator characteristic (AUC-ROC) of the model was 0.933 and the average area under the precision-recall curve (AUC-PRC) was 0.965. When applied to our external validation, this model produced an AUC-ROC of 0.898 and an AUC-PRC of 0.864. These results further our understanding of the utility of DNA methylation in COVID-19 disease pathology and serve as a platform to inform future COVID-19 related studies.},
      added-at = {2023-03-07T22:52:39.000+0100},
      author = {Bowler, Scott and Papoutsoglou, Georgios and Karanikas, Aristides and Tsamardinos, Ioannis and Corley, Michael J. and Ndhlovu, Lishomwa C.},
      biburl = {https://www.bibsonomy.org/bibtex/224959130925e38210da9cab651bbaaaf/mensxmachina},
      doi = {10.1038/s41598-022-22201-4},
      interhash = {c95ccd60f041a590226ac5efad7c573c},
      intrahash = {24959130925e38210da9cab651bbaaaf},
      issn = {20452322},
      journal = {Scientific Reports},
      keywords = {DNA covid learning machine},
      number = 1,
      pages = {17480},
      refid = {Bowler2022},
      timestamp = {2023-03-07T22:52:39.000+0100},
      title = {A machine learning approach utilizing DNA methylation as an accurate classifier of COVID-19 disease severity},
      url = {https://doi.org/10.1038/s41598-022-22201-4},
      volume = 12,
      year = 2022
    }

  • M. Karaglani, M. Panagopoulou, C. Cheimonidi, I. Tsamardinos, E. Maltezos, N. Papanas, D. Papazoglou, G. Mastorakos, and E. Chatzaki, “Liquid Biopsy in Type 2 Diabetes Mellitus Management: Building Specific Biosignatures via Machine Learning,” Journal of Clinical Medicine, vol. 11, iss. 4, 2022. doi:10.3390/jcm11041045

    Background: The need for minimally invasive biomarkers for the early diagnosis of type 2 diabetes (T2DM) prior to the clinical onset and monitoring of β-pancreatic cell loss is emerging. Here, we focused on studying circulating cell-free DNA (ccfDNA) as a liquid biopsy biomaterial for accurate diagnosis/monitoring of T2DM. Methods: ccfDNA levels were directly quantified in sera from 96 T2DM patients and 71 healthy individuals via fluorometry, and then fragment DNA size profiling was performed by capillary electrophoresis. Following this, ccfDNA methylation levels of five β-cell-related genes were measured via qPCR. Data were analyzed by automated machine learning to build classifying predictive models. Results: ccfDNA levels were found to be similar between groups but indicative of apoptosis in T2DM. INS (Insulin), IAPP (Islet Amyloid Polypeptide-Amylin), GCK (Glucokinase), and KCNJ11 (Potassium Inwardly Rectifying Channel Subfamily J member 11) levels differed significantly between groups. AutoML analysis delivered biosignatures including GCK, IAPP and KCNJ11 methylation, with the highest ever reported discriminating performance of T2DM from healthy individuals (AUC 0.927). Conclusions: Our data unravel the value of ccfDNA as a minimally invasive biomaterial carrying important clinical information for T2DM. Upon prospective clinical evaluation, the built biosignature can be disruptive for T2DM clinical management.

    @article{jcm11041045,
      abstract = {Background: The need for minimally invasive biomarkers for the early diagnosis of type 2 diabetes (T2DM) prior to the clinical onset and monitoring of β-pancreatic cell loss is emerging. Here, we focused on studying circulating cell-free DNA (ccfDNA) as a liquid biopsy biomaterial for accurate diagnosis/monitoring of T2DM. Methods: ccfDNA levels were directly quantified in sera from 96 T2DM patients and 71 healthy individuals via fluorometry, and then fragment DNA size profiling was performed by capillary electrophoresis. Following this, ccfDNA methylation levels of five β-cell-related genes were measured via qPCR. Data were analyzed by automated machine learning to build classifying predictive models. Results: ccfDNA levels were found to be similar between groups but indicative of apoptosis in T2DM. INS (Insulin), IAPP (Islet Amyloid Polypeptide-Amylin), GCK (Glucokinase), and KCNJ11 (Potassium Inwardly Rectifying Channel Subfamily J member 11) levels differed significantly between groups. AutoML analysis delivered biosignatures including GCK, IAPP and KCNJ11 methylation, with the highest ever reported discriminating performance of T2DM from healthy individuals (AUC 0.927). Conclusions: Our data unravel the value of ccfDNA as a minimally invasive biomaterial carrying important clinical information for T2DM. Upon prospective clinical evaluation, the built biosignature can be disruptive for T2DM clinical management.},
      added-at = {2022-06-22T10:51:41.000+0200},
      article-number = {1045},
      author = {Karaglani, Makrina and Panagopoulou, Maria and Cheimonidi, Christina and Tsamardinos, Ioannis and Maltezos, Efstratios and Papanas, Nikolaos and Papazoglou, Dimitrios and Mastorakos, George and Chatzaki, Ekaterini},
      biburl = {https://www.bibsonomy.org/bibtex/2fa7bb5fb798e4e91d2532d3115dcbbef/mensxmachina},
      doi = {10.3390/jcm11041045},
      interhash = {f3820dbe8f6b53a53f1671c62d64dfaf},
      intrahash = {fa7bb5fb798e4e91d2532d3115dcbbef},
      issn = {2077-0383},
      journal = {Journal of Clinical Medicine},
      keywords = {biopsy diabetes learning machine mellitus},
      number = 4,
      pubmedid = {35207316},
      timestamp = {2022-06-22T10:51:41.000+0200},
      title = {Liquid Biopsy in Type 2 Diabetes Mellitus Management: Building Specific Biosignatures via Machine Learning},
      url = {https://www.mdpi.com/2077-0383/11/4/1045},
      volume = 11,
      year = 2022
    }

  • J. L. Marshall, B. N. Peshkin, T. Yoshino, J. Vowinckel, H. E. Danielsen, G. Melino, I. Tsamardinos, C. Haudenschild, D. J. Kerr, C. Sampaio, S. Y. Rha, K. T. FitzGerald, E. C. Holland, D. Gallagher, J. Garcia-Foncillas, and H. Juhl, “The Essentials of Multiomics,” The Oncologist, vol. 27, iss. 4, pp. 272-284, 2022. doi:10.1093/oncolo/oyab048

    Within the last decade, the science of molecular testing has evolved from single gene and single protein analysis to broad molecular profiling as a standard of care, quickly transitioning from research to practice. Terms such as genomics, transcriptomics, proteomics, circulating omics, and artificial intelligence are now commonplace, and this rapid evolution has left us with a significant knowledge gap within the medical community. In this paper, we attempt to bridge that gap and prepare the physician in oncology for multiomics, a group of technologies that have gone from looming on the horizon to become a clinical reality. The era of multiomics is here, and we must prepare ourselves for this exciting new age of cancer medicine.

    @article{10.1093/oncolo/oyab048,
      abstract = {{Within the last decade, the science of molecular testing has evolved from single gene and single protein analysis to broad molecular profiling as a standard of care, quickly transitioning from research to practice. Terms such as genomics, transcriptomics, proteomics, circulating omics, and artificial intelligence are now commonplace, and this rapid evolution has left us with a significant knowledge gap within the medical community. In this paper, we attempt to bridge that gap and prepare the physician in oncology for multiomics, a group of technologies that have gone from looming on the horizon to become a clinical reality. The era of multiomics is here, and we must prepare ourselves for this exciting new age of cancer medicine.}},
      added-at = {2022-06-22T10:50:12.000+0200},
      author = {Marshall, John L and Peshkin, Beth N and Yoshino, Takayuki and Vowinckel, Jakob and Danielsen, Håvard E and Melino, Gerry and Tsamardinos, Ioannis and Haudenschild, Christian and Kerr, David J and Sampaio, Carlos and Rha, Sun Young and FitzGerald, Kevin T and Holland, Eric C and Gallagher, David and Garcia-Foncillas, Jesus and Juhl, Hartmut},
      biburl = {https://www.bibsonomy.org/bibtex/24d888d87a990372de0d0a08a01774ad6/mensxmachina},
      doi = {10.1093/oncolo/oyab048},
      eprint = {https://academic.oup.com/oncolo/article-pdf/27/4/272/43287416/oyab048.pdf},
      interhash = {f0ee8d8b0e2acf63c050b1f6f58be762},
      intrahash = {4d888d87a990372de0d0a08a01774ad6},
      issn = {1083-7159},
      journal = {The Oncologist},
      keywords = {mensxmachina multi-omics},
      month = {02},
      number = 4,
      pages = {272--284},
      timestamp = {2022-06-22T10:50:12.000+0200},
      title = {{The Essentials of Multiomics}},
      url = {https://doi.org/10.1093/oncolo/oyab048},
      volume = 27,
      year = 2022
    }

2021

  • L. J. Marcos-Zambrano, K. Karaduzovic-Hadziabdic, T. Loncar Turukalo, P. Przymus, V. Trajkovik, O. Aasmets, M. Berland, A. Gruca, J. Hasic, K. Hron, T. Klammsteiner, M. Kolev, L. Lahti, M. B. Lopes, V. Moreno, I. Naskinova, E. Org, I. Paciência, G. Papoutsoglou, R. Shigdel, B. Stres, B. Vilne, M. Yousef, E. Zdravevski, I. Tsamardinos, E. Carrillo de Santa Pau, M. J. Claesson, I. Moreno-Indias, and J. Truu, “Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment,” Frontiers in Microbiology, vol. 12, 2021. doi:10.3389/fmicb.2021.634511

    The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
    The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.

    @article{10.3389/fmicb.2021.634511,
      abstract = {The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.},
      added-at = {2022-06-22T10:58:03.000+0200},
      author = {Marcos-Zambrano, Laura Judith and Karaduzovic-Hadziabdic, Kanita and Loncar Turukalo, Tatjana and Przymus, Piotr and Trajkovik, Vladimir and Aasmets, Oliver and Berland, Magali and Gruca, Aleksandra and Hasic, Jasminka and Hron, Karel and Klammsteiner, Thomas and Kolev, Mikhail and Lahti, Leo and Lopes, Marta B. and Moreno, Victor and Naskinova, Irina and Org, Elin and Paciência, Inês and Papoutsoglou, Georgios and Shigdel, Rajesh and Stres, Blaz and Vilne, Baiba and Yousef, Malik and Zdravevski, Eftim and Tsamardinos, Ioannis and Carrillo de Santa Pau, Enrique and Claesson, Marcus J. and Moreno-Indias, Isabel and Truu, Jaak},
      biburl = {https://www.bibsonomy.org/bibtex/2b27cd61df0c85a21e0dd04b0fc7dfc6e/mensxmachina},
      doi = {10.3389/fmicb.2021.634511},
      interhash = {9365312756fb3fb9714d2f38a30626eb},
      intrahash = {b27cd61df0c85a21e0dd04b0fc7dfc6e},
      issn = {1664-302X},
      journal = {Frontiers in Microbiology},
      keywords = {applications biomarker disease learning machine microbiome predictive},
      timestamp = {2022-06-22T10:58:03.000+0200},
      title = {Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment},
      url = {https://www.frontiersin.org/article/10.3389/fmicb.2021.634511},
      volume = 12,
      year = 2021
    }

  • G. Papoutsoglou, M. Karaglani, V. Lagani, N. Thomson, O. Røe, I. Tsamardinos, and E. Chatzaki, “Automated machine learning optimizes and accelerates predictive modeling from COVID-19 high throughput datasets,” Scientific Reports, vol. 11, 2021. doi:10.1038/s41598-021-94501-0
    @article{article,
      added-at = {2022-06-22T10:56:54.000+0200},
      author = {Papoutsoglou, Georgios and Karaglani, Makrina and Lagani, Vincenzo and Thomson, Naomi and Røe, Oluf and Tsamardinos, Ioannis and Chatzaki, Ekaterini},
      biburl = {https://www.bibsonomy.org/bibtex/232ca8367a87572429ee46be29bae66af/mensxmachina},
      doi = {10.1038/s41598-021-94501-0},
      interhash = {29657a11a3631c2933d03c8939af5f29},
      intrahash = {32ca8367a87572429ee46be29bae66af},
      journal = {Scientific Reports},
      keywords = {automl learning machine predictive},
      month = {07},
      timestamp = {2022-06-22T10:56:54.000+0200},
      title = {Automated machine learning optimizes and accelerates predictive modeling from COVID-19 high throughput datasets},
      volume = 11,
      year = 2021
    }

  • M. Papadogiorgaki, M. Venianaki, P. Charonyktakis, M. Antonakakis, I. Tsamardinos, M. E. Zervakis, and V. Sakkalis, “Heart Rate Classification Using ECG Signal Processing and Machine Learning Methods,” in 2021 IEEE 21st International Conference on Bioinformatics and Bioengineering (BIBE), 2021, pp. 1-6. doi:10.1109/BIBE52308.2021.9635462
    @inproceedings{9635462,
      added-at = {2022-06-22T10:55:58.000+0200},
      author = {Papadogiorgaki, Maria and Venianaki, Maria and Charonyktakis, Paulos and Antonakakis, Marios and Tsamardinos, Ioannis and Zervakis, Michalis E. and Sakkalis, Vangelis},
      biburl = {https://www.bibsonomy.org/bibtex/22feae72b255e7875c2643efa7e6ed788/mensxmachina},
      booktitle = {2021 IEEE 21st International Conference on Bioinformatics and Bioengineering (BIBE)},
      doi = {10.1109/BIBE52308.2021.9635462},
      interhash = {2eb66ce826c06fda1d9831644de642b5},
      intrahash = {2feae72b255e7875c2643efa7e6ed788},
      keywords = {classification ecg heart processing rate signal},
      pages = {1--6},
      timestamp = {2022-06-22T10:55:58.000+0200},
      title = {Heart Rate Classification Using ECG Signal Processing and Machine Learning Methods},
      year = 2021
    }

  • K. Rounis, D. Makrakis, C. Papadaki, A. Monastirioti, L. Vamvakas, K. Kalbakis, K. Gourlia, I. Xanthopoulos, I. Tsamardinos, D. Mavroudis, and S. Agelaki, “Prediction of outcome in patients with non-small cell lung cancer treated with second line PD-1/PDL-1 inhibitors based on clinical parameters: Results from a prospective, single institution study,” PLOS ONE, vol. 16, iss. 6, pp. 1-18, 2021. doi:10.1371/journal.pone.0252537

    Objective: We prospectively recorded clinical and laboratory parameters from patients with metastatic non-small cell lung cancer (NSCLC) treated with 2nd line PD-1/PD-L1 inhibitors in order to address their effect on treatment outcomes. Materials and methods: Clinicopathological information (age, performance status, smoking, body mass index, histology, organs with metastases), use and duration of proton pump inhibitors, steroids and antibiotics (ATB) and laboratory values [neutrophil/lymphocyte ratio, LDH, albumin] were prospectively collected. Steroid administration was defined as the use of > 10 mg prednisone equivalent for ≥ 10 days. Prolonged ATB administration was defined as ATB ≥ 14 days 30 days before or within the first 3 months of treatment. JADBio, a machine learning pipeline, was applied for further multivariate analysis. Results: Data from 66 pts with non-oncogenic driven metastatic NSCLC were analyzed; 15.2% experienced partial response (PR), 34.8% stable disease (SD) and 50% progressive disease (PD). Median overall survival (OS) was 6.77 months. ATB administration did not affect patient OS [HR = 1.35 (CI: 0.761–2.406, p = 0.304)], however, prolonged ATBs [HR = 2.95 (CI: 1.62–5.36, p = 0.0001)] and the presence of bone metastases [HR = 1.89 (CI: 1.02–3.51, p = 0.049)] independently predicted for shorter survival. Prolonged ATB administration, bone metastases, liver metastases and BMI < 25 kg/m² were selected by JADBio as the important features that were associated with increased probability of developing disease progression as response to treatment. The resulting algorithm that was created was able to predict the probability of disease stabilization (PR or SD) in a single individual with an AUC = 0.806 [95% CI: 0.714–0.889]. Conclusions: Our results demonstrate an adverse effect of prolonged ATBs on response and survival and underscore their importance along with the presence of bone metastases, liver metastases and low BMI in the individual prediction of outcomes in patients treated with immunotherapy.

    @article{10.1371/journal.pone.0252537,
      abstract = {Objective We prospectively recorded clinical and laboratory parameters from patients with metastatic non-small cell lung cancer (NSCLC) treated with 2nd line PD-1/PD-L1 inhibitors in order to address their effect on treatment outcomes.   Materials and methods Clinicopathological information (age, performance status, smoking, body mass index, histology, organs with metastases), use and duration of proton pump inhibitors, steroids and antibiotics (ATB) and laboratory values [neutrophil/lymphocyte ratio, LDH, albumin] were prospectively collected. Steroid administration was defined as the use of > 10 mg prednisone equivalent for ≥ 10 days. Prolonged ATB administration was defined as ATB ≥ 14 days 30 days before or within the first 3 months of treatment. JADBio, a machine learning pipeline was applied for further multivariate analysis.   Results Data from 66 pts with non-oncogenic driven metastatic NSCLC were analyzed; 15.2% experienced partial response (PR), 34.8% stable disease (SD) and 50% progressive disease (PD). Median overall survival (OS) was 6.77 months. ATB administration did not affect patient OS [HR = 1.35 (CI: 0.761–2.406, p = 0.304)], however, prolonged ATBs [HR = 2.95 (CI: 1.62–5.36, p = 0.0001)] and the presence of bone metastases [HR = 1.89 (CI: 1.02–3.51, p = 0.049)] independently predicted for shorter survival. Prolonged ATB administration, bone metastases, liver metastases and BMI < 25 kg/m2 were selected by JADbio as the important features that were associated with increased probability of developing disease progression as response to treatment. The resulting algorithm that was created was able to predict the probability of disease stabilization (PR or SD) in a single individual with an AUC = 0.806 [95% CI:0.714–0.889].   
Conclusions Our results demonstrate an adverse effect of prolonged ATBs on response and survival and underscore their importance along with the presence of bone metastases, liver metastases and low BMI in the individual prediction of outcomes in patients treated with immunotherapy.},
      added-at = {2021-06-04T09:18:00.000+0200},
      author = {Rounis, Konstantinos and Makrakis, Dimitrios and Papadaki, Chara and Monastirioti, Alexia and Vamvakas, Lambros and Kalbakis, Konstantinos and Gourlia, Krystallia and Xanthopoulos, Iordanis and Tsamardinos, Ioannis and Mavroudis, Dimitrios and Agelaki, Sofia},
      biburl = {https://www.bibsonomy.org/bibtex/2a0fda17bd6c2177cb4ce435c3559b648/mensxmachina},
      doi = {10.1371/journal.pone.0252537},
      interhash = {c53c8616653bdaaa2984cde14d27d241},
      intrahash = {a0fda17bd6c2177cb4ce435c3559b648},
      journal = {PLOS ONE},
      keywords = {imported},
      month = jun,
      number = 6,
      pages = {1--18},
      publisher = {Public Library of Science},
      timestamp = {2021-06-04T09:18:00.000+0200},
      title = {Prediction of outcome in patients with non-small cell lung cancer treated with second line PD-1/PDL-1 inhibitors based on clinical parameters: Results from a prospective, single institution study},
      url = {https://doi.org/10.1371/journal.pone.0252537},
      volume = 16,
      year = 2021
    }

  • G. Borboudakis and I. Tsamardinos, "Extending greedy feature selection algorithms to multiple solutions," Data Mining and Knowledge Discovery, 2021. doi:10.1007/s10618-020-00731-7
    [BibTeX] [Abstract] [Download PDF]

    Most feature selection methods identify only a single solution. This is acceptable for predictive purposes, but is not sufficient for knowledge discovery if multiple solutions exist. We propose a strategy to extend a class of greedy methods to efficiently identify multiple solutions, and show under which conditions it identifies all solutions. We also introduce a taxonomy of features that takes the existence of multiple solutions into account. Furthermore, we explore different definitions of statistical equivalence of solutions, as well as methods for testing equivalence. A novel algorithm for compactly representing and visualizing multiple solutions is also introduced. In experiments we show that (a) the proposed algorithm is significantly more computationally efficient than the TIE* algorithm, the only alternative approach with similar theoretical guarantees, while identifying similar solutions to it, and (b) that the identified solutions have similar predictive performance.

    @article{Borboudakis2021,
      abstract = {Most feature selection methods identify only a single solution. This is acceptable for predictive purposes, but is not sufficient for knowledge discovery if multiple solutions exist. We propose a strategy to extend a class of greedy methods to efficiently identify multiple solutions, and show under which conditions it identifies all solutions. We also introduce a taxonomy of features that takes the existence of multiple solutions into account. Furthermore, we explore different definitions of statistical equivalence of solutions, as well as methods for testing equivalence. A novel algorithm for compactly representing and visualizing multiple solutions is also introduced. In experiments we show that (a) the proposed algorithm is significantly more computationally efficient than the TIE* algorithm, the only alternative approach with similar theoretical guarantees, while identifying similar solutions to it, and (b) that the identified solutions have similar predictive performance.},
      added-at = {2021-05-10T09:37:57.000+0200},
      author = {Borboudakis, Giorgos and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/21a02e4b98901f0889375b61fbba306a2/mensxmachina},
      day = 01,
      doi = {10.1007/s10618-020-00731-7},
      interhash = {2111a54b383124f93dad8b9ebd26afb5},
      intrahash = {1a02e4b98901f0889375b61fbba306a2},
      issn = {1573-756X},
      journal = {Data Mining and Knowledge Discovery},
      keywords = {mxmcausalpath},
      month = may,
      timestamp = {2021-05-10T09:37:57.000+0200},
      title = {Extending greedy feature selection algorithms to multiple solutions},
      url = {https://doi.org/10.1007/s10618-020-00731-7},
      year = 2021
    }

  • M. Panagopoulou, M. Karaglani, V. G. Manolopoulos, I. Iliopoulos, I. Tsamardinos, and E. Chatzaki, "Deciphering the Methylation Landscape in Breast Cancer: Diagnostic and Prognostic Biosignatures through Automated Machine Learning," Cancers, vol. 13, iss. 7, p. 1677, 2021. doi:10.3390/cancers13071677
    [BibTeX] [Abstract] [Download PDF]

    DNA methylation plays an important role in breast cancer (BrCa) pathogenesis and could contribute to driving its personalized management. We performed a complete bioinformatic analysis in BrCa whole methylome datasets, analyzed using the Illumina methylation 450 bead-chip array. Differential methylation analysis vs. clinical end-points resulted in 11,176 to 27,786 differentially methylated genes (DMGs). Innovative automated machine learning (AutoML) was employed to construct signatures with translational value. Three highly performing and low-feature-number signatures were built: (1) A 5-gene signature discriminating BrCa patients from healthy individuals (area under the curve (AUC): 0.994 (0.982–1.000)). (2) A 3-gene signature identifying BrCa metastatic disease (AUC: 0.986 (0.921–1.000)). (3) Six equivalent 5-gene signatures diagnosing early disease (AUC: 0.973 (0.920–1.000)). Validation in independent patient groups verified performance. Bioinformatic tools for functional analysis and protein interaction prediction were also employed. All protein encoding features included in the signatures were associated with BrCa-related pathways. Functional analysis of DMGs highlighted the regulation of transcription as the main biological process, the nucleus as the main cellular component and transcription factor activity and sequence-specific DNA binding as the main molecular functions. Overall, three high-performance diagnostic/prognostic signatures were built and are readily available for improving BrCa precision management upon prospective clinical validation. Revisiting archived methylomes through novel bioinformatic approaches revealed significant clarifying knowledge for the contribution of gene methylation events in breast carcinogenesis.

    @article{Panagopoulou_2021,
      abstract = {DNA methylation plays an important role in breast cancer (BrCa) pathogenesis and could contribute to driving its personalized management. We performed a complete bioinformatic analysis in BrCa whole methylome datasets, analyzed using the Illumina methylation 450 bead-chip array. Differential methylation analysis vs. clinical end-points resulted in 11,176 to 27,786 differentially methylated genes (DMGs). Innovative automated machine learning (AutoML) was employed to construct signatures with translational value. Three highly performing and low-feature-number signatures were built: (1) A 5-gene signature discriminating BrCa patients from healthy individuals (area under the curve (AUC): 0.994 (0.982–1.000)). (2) A 3-gene signature identifying BrCa metastatic disease (AUC: 0.986 (0.921–1.000)). (3) Six equivalent 5-gene signatures diagnosing early disease (AUC: 0.973 (0.920–1.000)). Validation in independent patient groups verified performance. Bioinformatic tools for functional analysis and protein interaction prediction were also employed. All protein encoding features included in the signatures were associated with BrCa-related pathways. Functional analysis of DMGs highlighted the regulation of transcription as the main biological process, the nucleus as the main cellular component and transcription factor activity and sequence-specific DNA binding as the main molecular functions. Overall, three high-performance diagnostic/prognostic signatures were built and are readily available for improving BrCa precision management upon prospective clinical validation. Revisiting archived methylomes through novel bioinformatic approaches revealed significant clarifying knowledge for the contribution of gene methylation events in breast carcinogenesis.},
      added-at = {2021-04-05T10:25:29.000+0200},
      author = {Panagopoulou, Maria and Karaglani, Makrina and Manolopoulos, Vangelis G. and Iliopoulos, Ioannis and Tsamardinos, Ioannis and Chatzaki, Ekaterini},
      biburl = {https://www.bibsonomy.org/bibtex/25938c275248de01841423c461744c95c/mensxmachina},
      doi = {10.3390/cancers13071677},
      interhash = {9a46961bf0583786199d3b4d978bcb01},
      intrahash = {5938c275248de01841423c461744c95c},
      journal = {Cancers},
      keywords = {imported},
      month = apr,
      number = 7,
      pages = 1677,
      publisher = {{MDPI} {AG}},
      timestamp = {2021-04-05T10:25:29.000+0200},
      title = {Deciphering the Methylation Landscape in Breast Cancer: Diagnostic and Prognostic Biosignatures through Automated Machine Learning},
      url = {https://doi.org/10.3390%2Fcancers13071677},
      volume = 13,
      year = 2021
    }

  • G. Borboudakis and I. Tsamardinos, "Extending Greedy Feature Selection Algorithms to Multiple Solutions," Data Mining and Knowledge Discovery, to appear, 2021.
    [BibTeX]
    @article{borboudakis2021mining,
      added-at = {2021-03-17T12:12:52.000+0100},
      author = {Borboudakis, G and Tsamardinos, I},
      biburl = {https://www.bibsonomy.org/bibtex/295b55379724af7ef52054e5a33fd4745/mensxmachina},
      interhash = {2111a54b383124f93dad8b9ebd26afb5},
      intrahash = {95b55379724af7ef52054e5a33fd4745},
      journal = {Data Mining and Knowledge Discovery},
      keywords = {mxmcausalpath},
      note = {to appear},
      timestamp = {2021-03-18T10:07:49.000+0100},
      title = {Extending Greedy Feature Selection Algorithms to Multiple Solutions},
      year = 2021
    }

  • N. Myrtakis, I. Tsamardinos, and V. Christophides, "PROTEUS: Predictive Explanation of Anomalies," in IEEE 37th International Conference on Data Engineering (ICDE), 2021.
    [BibTeX] [Abstract]

    Numerous algorithms have been proposed for detecting anomalies (outliers, novelties) in an unsupervised manner. Unfortunately, it is not trivial, in general, to understand why a given sample (record) is labelled as an anomaly and thus diagnose its root causes. We propose the following reduced-dimensionality, surrogate model approach to explain detector decisions: approximate the detection model with another one that employs only a small subset of features. Subsequently, samples can be visualized in this low-dimensionality space for human understanding. To this end, we develop PROTEUS, an AutoML pipeline to produce the surrogate model, specifically designed for feature selection on imbalanced datasets. The PROTEUS surrogate model can not only explain the training data, but also the out-of-sample (unseen) data. In other words, PROTEUS produces predictive explanations by approximating the decision surface of an unsupervised detector. PROTEUS is designed to return an accurate estimate of out-of-sample predictive performance to serve as a metric of the quality of the approximation. Computational experiments confirm the efficacy of PROTEUS to produce predictive explanations for different families of detectors and to reliably estimate their predictive performance in unseen data. Unlike several ad-hoc feature importance methods, PROTEUS is robust to high-dimensional data.

    @conference{myrtakis2021proteus,
      abstract = {Numerous algorithms have been proposed for detecting anomalies (outliers, novelties) in an unsupervised manner. Unfortunately, it is not trivial, in general, to understand why a given sample (record) is labelled as an anomaly and thus diagnose its root causes. We propose the following reduced-dimensionality, surrogate model approach to explain detector decisions: approximate the detection model with another one that employs only a small subset of features. Subsequently, samples can be visualized in this low-dimensionality space for human understanding. To this end, we develop PROTEUS, an AutoML pipeline to produce the surrogate model, specifically designed for feature selection on imbalanced datasets. The PROTEUS surrogate model can not only explain the training data, but also the out-of-sample (unseen) data. In other words, PROTEUS produces predictive explanations by approximating the decision surface of an unsupervised detector. PROTEUS is designed to return an accurate estimate of out-of-sample predictive performance to serve as a metric of the quality of the approximation. Computational experiments confirm the efficacy of PROTEUS to produce predictive explanations for different families of detectors and to reliably estimate their predictive performance in unseen data. Unlike several ad-hoc feature importance methods, PROTEUS is robust to high-dimensional data.
    },
      added-at = {2021-02-10T09:57:44.000+0100},
      author = {Myrtakis, N and Tsamardinos, I and Christophides, V},
      biburl = {https://www.bibsonomy.org/bibtex/207bdf48e36b94f93849856e1a1ec258a/mensxmachina},
      interhash = {1be3182c1d6928ec21142b5f18a6ea20},
      intrahash = {07bdf48e36b94f93849856e1a1ec258a},
      keywords = {anomalies},
      timestamp = {2021-03-19T10:32:22.000+0100},
      title = {PROTEUS: Predictive Explanation of Anomalies},
      booktitle = {IEEE 37th International Conference on Data Engineering (ICDE)},
      year = 2021
    }

2020

  • A. Tsourtis, Y. Pantazis, and I. Tsamardinos, "Inference of Stochastic Dynamical Systems from Cross-Sectional Population Data," arXiv:2012.05055v1 [cs.LG], 9 Dec 2020.
    [BibTeX] [Abstract]

    Inferring the driving equations of a dynamical system from population or time-course data is important in several scientific fields such as biochemistry, epidemiology, financial mathematics and many others. Despite the existence of algorithms that learn the dynamics from trajectorial measurements there are few attempts to infer the dynamical system straight from population data. In this work, we deduce and then computationally estimate the Fokker-Planck equation which describes the evolution of the population’s probability density, based on stochastic differential equations. Then, following the USDL approach [22], we project the Fokker-Planck equation to a proper set of test functions, transforming it into a linear system of equations. Finally, we apply sparse inference methods to solve the latter system and thus induce the driving forces of the dynamical system. Our approach is illustrated in both synthetic and real data including non-linear, multimodal stochastic differential equations, biochemical reaction networks as well as mass cytometry biological measurements.

    @article{tsourtis2020inference,
      abstract = {Inferring the driving equations of a dynamical system from population or time-course data is important in several scientific fields such as biochemistry, epidemiology, financial mathematics and many others. Despite the existence of algorithms that learn the dynamics from trajectorial measurements there are few attempts to infer the dynamical system straight from population data. In this work, we deduce and then computationally estimate the Fokker-Planck equation which describes the evolution of the population’s probability density, based on stochastic differential equations. Then, following the USDL approach [22], we project the Fokker-Planck equation to a proper set of test functions, transforming it into a linear system of equations. Finally, we apply sparse inference methods to solve the latter system and thus induce the driving forces of the dynamical system. Our approach is illustrated in both synthetic and real data including non-linear, multimodal stochastic differential equations, biochemical reaction networks as well as mass cytometry biological measurements.},
      added-at = {2021-03-24T10:32:03.000+0100},
      author = {Tsourtis, A and Pantazis, Y and Tsamardinos, I},
      biburl = {https://www.bibsonomy.org/bibtex/2f3d7571025e47ab9693c1b8a5876702d/mensxmachina},
      eprint = {arXiv:2012.05055v1 [cs.LG]},
      interhash = {1dd0cba1cddecc67bc714ff55e2fa939},
      intrahash = {f3d7571025e47ab9693c1b8a5876702d},
      journal = {arXiv preprint arXiv:2012.05055v1 [cs.LG], 9 Dec 2020},
      keywords = {mxmcausalpath},
      timestamp = {2021-03-24T10:32:03.000+0100},
      title = {Inference of Stochastic Dynamical Systems from Cross-Sectional Population Data},
      year = 2020
    }

  • M. Tsagris, Z. Papadovasilakis, K. Lakiotaki, and I. Tsamardinos, "The γ-OMP algorithm for feature selection with application to gene expression data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2020. doi:10.1109/TCBB.2020.3029952
    [BibTeX] [Abstract] [Download PDF]

    Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To apply to molecular data, feature selection algorithms need to be scalable to tens of thousands of features. In this paper, we propose γ-OMP, a generalisation of the highly-scalable Orthogonal Matching Pursuit feature selection algorithm. γ-OMP can handle (a) various types of outcomes, such as continuous, binary, nominal, time-to-event, (b) discrete (categorical) features, (c) different statistical-based stopping criteria, (d) several predictive models (e.g., linear or logistic regression), (e) various types of residuals, and (f) different types of association. We compare γ-OMP against LASSO, a prototypical, widely used algorithm for high-dimensional data. On both simulated data and several real gene expression datasets, γ-OMP is on par, or outperforms LASSO in binary classification (case-control data), regression (quantified outcomes), and time-to-event data (censored survival times). γ-OMP is based on simple statistical ideas, it is easy to implement and to extend, and our extensive evaluation shows that it is also effective in bioinformatics analysis settings.

    @article{tsagris2020algorithm,
      abstract = {Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To apply to molecular data, feature selection algorithms need to be scalable to tens of thousands of features. In this paper, we propose γ-OMP, a generalisation of the highly-scalable Orthogonal Matching Pursuit feature selection algorithm. γ-OMP can handle (a) various types of outcomes, such as continuous, binary, nominal, time-to-event, (b) discrete (categorical) features, (c) different statistical-based stopping criteria, (d) several predictive models (e.g., linear or logistic regression), (e) various types of residuals, and (f) different types of association. We compare γ-OMP against LASSO, a prototypical, widely used algorithm for high-dimensional data. On both simulated data and several real gene expression datasets, γ-OMP is on par, or outperforms LASSO in binary classification (case-control data), regression (quantified outcomes), and time-to-event data (censored survival times). γ-OMP is based on simple statistical ideas, it is easy to implement and to extend, and our extensive evaluation shows that it is also effective in bioinformatics analysis settings.},
      added-at = {2021-03-22T13:27:44.000+0100},
      author = {Tsagris, Michail and Papadovasilakis, Zacharias and Lakiotaki, Kleanthi and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/2372b4dd105cf55a3c32ca0d937888f2e/mensxmachina},
      doi = {10.1109/TCBB.2020.3029952},
      interhash = {9bef7f59658d9a4a2f82cba160e276e4},
      intrahash = {372b4dd105cf55a3c32ca0d937888f2e},
      journal = {IEEE/ACM Transactions on Computational Biology and Bioinformatics},
      keywords = {mxmcausalpath},
      timestamp = {2021-03-22T13:27:44.000+0100},
      title = {The γ-OMP algorithm for feature selection with application to gene expression data},
      url = {https://ieeexplore.ieee.org/document/9219177/authors#authors},
      year = 2020
    }

  • Y. Pantazis, C. Tselas, K. Lakiotaki, V. Lagani, and I. Tsamardinos, "Latent Feature Representations for Human Gene Expression Data Improve Phenotypic Predictions," IEEE, 2020. doi:10.1109/BIBM49941.2020.9313286
    [BibTeX] [Abstract] [Download PDF]

    High-throughput technologies such as microarrays and RNA-sequencing (RNA-seq) allow to precisely quantify transcriptomic profiles, generating datasets that are inevitably high-dimensional. In this work, we investigate whether the whole human transcriptome can be represented in a compressed, low dimensional latent space without losing relevant information. We thus constructed low-dimensional latent feature spaces of the human genome, by utilizing three dimensionality reduction approaches and a diverse set of curated datasets. We applied standard Principal Component Analysis (PCA), kernel PCA and Autoencoder Neural Networks on 1360 datasets from four different measurement technologies. The latent feature spaces are tested for their ability to (a) reconstruct the original data and (b) improve predictive performance on validation datasets not used during the creation of the feature space. While linear techniques show better reconstruction performance, nonlinear approaches, particularly, neural-based models seem to be able to capture non-additive interaction effects, and thus enjoy stronger predictive capabilities. Despite the limited sample size of each dataset and the biological / technological heterogeneity across studies, our results show that low dimensional representations of the human transcriptome can be achieved by integrating hundreds of datasets. The created space is two to three orders of magnitude smaller compared to the raw data, offering the ability of capturing a large portion of the original data variability and eventually reducing computational time for downstream analyses.

    @article{pantazis2020latent,
      abstract = {High-throughput technologies such as microarrays and RNA-sequencing (RNA-seq) allow to precisely quantify transcriptomic profiles, generating datasets that are inevitably high-dimensional. In this work, we investigate whether the whole human transcriptome can be represented in a compressed, low dimensional latent space without losing relevant information. We thus constructed low-dimensional latent feature spaces of the human genome, by utilizing three dimensionality reduction approaches and a diverse set of curated datasets. We applied standard Principal Component Analysis (PCA), kernel PCA and Autoencoder Neural Networks on 1360 datasets from four different measurement technologies. The latent feature spaces are tested for their ability to (a) reconstruct the original data and (b) improve predictive performance on validation datasets not used during the creation of the feature space. While linear techniques show better reconstruction performance, nonlinear approaches, particularly, neural-based models seem to be able to capture non-additive interaction effects, and thus enjoy stronger predictive capabilities. Despite the limited sample size of each dataset and the biological / technological heterogeneity across studies, our results show that low dimensional representations of the human transcriptome can be achieved by integrating hundreds of datasets. The created space is two to three orders of magnitude smaller compared to the raw data, offering the ability of capturing a large portion of the original data variability and eventually reducing computational time for downstream analyses.},
      added-at = {2021-01-27T08:25:38.000+0100},
      author = {Pantazis, Yannis and Tselas, Christos and Lakiotaki, Kleanthi and Lagani, Vincenzo and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/22e00727d34af38370524ab45428d1935/mensxmachina},
      doi = {10.1109/BIBM49941.2020.9313286},
      interhash = {85456c0fc077102f3eca5cd7f7dfc749},
      intrahash = {2e00727d34af38370524ab45428d1935},
      journal = {IEEE},
      keywords = {mxmcausalpath},
      timestamp = {2021-03-08T12:07:50.000+0100},
      title = {Latent Feature Representations for Human Gene Expression Data Improve Phenotypic Predictions},
      url = {https://ieeexplore.ieee.org/document/9313286},
      year = 2020
    }

  • N. Planell, V. Lagani, P. Sebastian-Leon, F. Van der Kloet, E. Ewing, N. Karathanasis, A. Urdangarin, I. Arozarena, M. Jagodic, I. Tsamardinos, S. Tarazona, A. Conesa, J. Tegner, and D. Gomez-Cabrero, "STATegra: Multi-omics data integration - A conceptual scheme and a bioinformatics pipeline," Frontiers in Genetics, to appear, 2020. doi:10.1101/2020.11.20.391045
    [BibTeX] [Abstract] [Download PDF]

    Technologies for profiling samples using different omics platforms have been at the forefront since the human genome project. Large-scale multi-omics data hold the promise of deciphering different regulatory layers. Yet, while there is a myriad of bioinformatics tools, each multi-omics analysis appears to start from scratch with an arbitrary decision over which tools to use and how to combine them. It is therefore an unmet need to conceptualize how to integrate such data and to implement and validate pipelines in different cases. We have designed a conceptual framework (STATegra), aiming it to be as generic as possible for multi-omics analysis, combining machine learning component analysis, non-parametric data combination and a multi-omics exploratory analysis in a step-wise manner. While in several studies we have previously combined those integrative tools, here we provide a systematic description of the STATegra framework and its validation using two TCGA case studies. For both, the Glioblastoma and the Skin Cutaneous Melanoma cases, we demonstrate an enhanced capacity to identify features in comparison to single-omics analysis. Such an integrative multi-omics analysis framework for the identification of features and components facilitates the discovery of new biology. Finally, we provide several options for applying the STATegra framework when parametric assumptions are fulfilled, and for the case when not all the samples are profiled for all omics. The STATegra framework is built using several tools, which are being integrated step-by-step as OpenSource in the STATegRa Bioconductor package https://bioconductor.org/packages/release/bioc/html/STATegra.html.

    @article{noauthororeditor,
      abstract = {Technologies for profiling samples using different omics platforms have been at the forefront since the human genome project. Large-scale multi-omics data hold the promise of deciphering different regulatory layers. Yet, while there is a myriad of bioinformatics tools, each multi-omics analysis appears to start from scratch with an arbitrary decision over which tools to use and how to combine them. It is therefore an unmet need to conceptualize how to integrate such data and to implement and validate pipelines in different cases. We have designed a conceptual framework (STATegra), aiming it to be as generic as possible for multi-omics analysis, combining machine learning component analysis, non-parametric data combination and a multi-omics exploratory analysis in a step-wise manner. While in several studies we have previously combined those integrative tools, here we provide a systematic description of the STATegra framework and its validation using two TCGA case studies. For both, the Glioblastoma and the Skin Cutaneous Melanoma cases, we demonstrate an enhanced capacity to identify features in comparison to single-omics analysis. Such an integrative multi-omics analysis framework for the identification of features and components facilitates the discovery of new biology. Finally, we provide several options for applying the STATegra framework when parametric assumptions are fulfilled, and for the case when not all the samples are profiled for all omics. The STATegra framework is built using several tools, which are being integrated step-by-step as OpenSource in the STATegRa Bioconductor package https://bioconductor.org/packages/release/bioc/html/STATegra.html.},
      added-at = {2021-01-25T08:02:51.000+0100},
      author = {Planell, Nuria and Lagani, Vincenzo and Sebastian-Leon, Patricia and Van der Kloet, Frans and Ewing, Ewoud and Karathanasis, Nestoras and Urdangarin, Arantxa and Arozarena, Imanol and Jagodic, Maja and Tsamardinos, Ioannis and Tarazona, Sonia and Conesa, Ana and Tegner, Jesper and Gomez-Cabrero, David},
      biburl = {https://www.bibsonomy.org/bibtex/213d5658c490ee48b134629c33979e700/mensxmachina},
      doi = {10.1101/2020.11.20.391045},
      interhash = {84dd53162ecf2659ffb75f1329f0aaad},
      intrahash = {13d5658c490ee48b134629c33979e700},
      journal = {Frontiers in Genetics},
      keywords = {data multi-omics},
      timestamp = {2021-01-25T08:02:51.000+0100},
      title = {STATegra: Multi-omics data integration - A conceptual scheme and a bioinformatics pipeline},
      url = {https://www.biorxiv.org/content/10.1101/2020.11.20.391045v1},
      note = {to appear},
      year = 2020
    }

  • K. I. Karstoft, I. Tsamardinos, K. Eskelund, S. B. Andersen, and L. R. Nissen, "Applicability of an Automated Model and Parameter Selection in the Prediction of Screening-Level PTSD in Danish Soldiers Following Deployment: Development Study of Transferable Predictive Models Using Automated Machine Learning," JMIR Medical Informatics, vol. 8, iss. 7, 2020. doi:10.2196/17119
    [BibTeX] [Abstract] [Download PDF]

    Background: Posttraumatic stress disorder (PTSD) is a relatively common consequence of deployment to war zones. Early postdeployment screening with the aim of identifying those at risk for PTSD in the years following deployment will help deliver interventions to those in need but have so far proved unsuccessful. Objective: This study aimed to test the applicability of automated model selection and the ability of automated machine learning prediction models to transfer across cohorts and predict screening-level PTSD 2.5 years and 6.5 years after deployment. Methods: Automated machine learning was applied to data routinely collected 6-8 months after return from deployment from 3 different cohorts of Danish soldiers deployed to Afghanistan in 2009 (cohort 1, N=287 or N=261 depending on the timing of the outcome assessment), 2010 (cohort 2, N=352), and 2013 (cohort 3, N=232). Results: Models transferred well between cohorts. For screening-level PTSD 2.5 and 6.5 years after deployment, random forest models provided the highest accuracy as measured by area under the receiver operating characteristic curve (AUC): 2.5 years, AUC=0.77, 95% CI 0.71-0.83; 6.5 years, AUC=0.78, 95% CI 0.73-0.83. Linear models performed equally well. Military rank, hyperarousal symptoms, and total level of PTSD symptoms were highly predictive. Conclusions: Automated machine learning provided validated models that can be readily implemented in future deployment cohorts in the Danish Defense with the aim of targeting postdeployment support interventions to those at highest risk for developing PTSD, provided the cohorts are deployed on similar missions.

    @article{karstoft2020applicability,
      abstract = {Background: Posttraumatic stress disorder (PTSD) is a relatively common consequence of deployment to war zones. Early postdeployment screening with the aim of identifying those at risk for PTSD in the years following deployment will help deliver interventions to those in need but have so far proved unsuccessful.
    
    Objective: This study aimed to test the applicability of automated model selection and the ability of automated machine learning prediction models to transfer across cohorts and predict screening-level PTSD 2.5 years and 6.5 years after deployment.
    
    Methods: Automated machine learning was applied to data routinely collected 6-8 months after return from deployment from 3 different cohorts of Danish soldiers deployed to Afghanistan in 2009 (cohort 1, N=287 or N=261 depending on the timing of the outcome assessment), 2010 (cohort 2, N=352), and 2013 (cohort 3, N=232).
    
    Results: Models transferred well between cohorts. For screening-level PTSD 2.5 and 6.5 years after deployment, random forest models provided the highest accuracy as measured by area under the receiver operating characteristic curve (AUC): 2.5 years, AUC=0.77, 95% CI 0.71-0.83; 6.5 years, AUC=0.78, 95% CI 0.73-0.83. Linear models performed equally well. Military rank, hyperarousal symptoms, and total level of PTSD symptoms were highly predictive.
    
    Conclusions: Automated machine learning provided validated models that can be readily implemented in future deployment cohorts in the Danish Defense with the aim of targeting postdeployment support interventions to those at highest risk for developing PTSD, provided the cohorts are deployed on similar missions.},
      added-at = {2020-11-04T15:45:03.000+0100},
      author = {Karstoft, KI and Tsamardinos, I and Eskelund, K and Andersen, SB and Nissen, LR},
      biburl = {https://www.bibsonomy.org/bibtex/2b3c6a7c433dc0137e177a389e93373d6/mensxmachina},
      doi = {10.2196/17119},
      interhash = {e4d28b268e9ea645b86d7488930824cc},
      intrahash = {b3c6a7c433dc0137e177a389e93373d6},
      journal = {JMIR Medical Informatics},
      keywords = {AutoML Automated Learning Machine application models parameter predictive selection study transferable},
      month = {July},
      number = 7,
      timestamp = {2020-11-04T15:46:26.000+0100},
      title = {Applicability of an Automated Model and Parameter Selection in the Prediction of Screening-Level PTSD in Danish Soldiers Following Deployment: Development Study of Transferable Predictive Models Using Automated Machine Learning},
      url = {https://europepmc.org/article/pmc/pmc7407253},
      volume = 8,
      year = 2020
    }

2017

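  • M. Tsagris, G. Borboudakis, V. Lagani, and I. Tsamardinos, "Constraint-based Causal Discovery with Mixed Data," in 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD), 2017.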
    [BibTeX] [Abstract] [Download PDF]

    We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.

    @conference{noauthororeditor2017constraintbased,
      abstract = {We address the problem of constraint-based
    causal discovery with mixed data types, such as (but
    not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on
    appropriate regression models, and show how to derive
    symmetric conditional independence tests. Such tests
    can then be directly used by existing constraint-based
    methods with mixed data, such as the PC and FCI
    algorithms for learning Bayesian networks and maximal
    ancestral graphs respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm
    with different conditional independence tests for mixed
    data, and show that the proposed approach outperforms
    alternatives in terms of learning accuracy.},
      added-at = {2021-03-10T10:58:29.000+0100},
      author = {Tsagris, M and Borboudakis, G and Lagani, V and Tsamardinos, I},
      biburl = {https://www.bibsonomy.org/bibtex/2892378444240fee14d62fd58362e856a/mensxmachina},
      interhash = {87d6a33d891429260e644392ddcba508},
      intrahash = {892378444240fee14d62fd58362e856a},
      keywords = {mxmcausalpath},
      booktitle = {23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD)},
      timestamp = {2021-03-10T10:58:29.000+0100},
      title = {Constraint-based Causal Discovery with Mixed Data},
      url = {http://nugget.unisa.edu.au/CD2017/papersonly/constraint-based-causal-r1.pdf},
      year = 2017
    }

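  • K. Tsirlis, V. Lagani, S. Triantafillou, and I. Tsamardinos, "On Scoring Maximal Ancestral Graphs with the Max-Min Hill Climbing Algorithm," in 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD), 2017.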
    [BibTeX] [Abstract] [Download PDF]

    We consider the problem of causal structure learning in the presence of latent confounders. We propose a hybrid method, MAG Max-Min Hill-Climbing (M3HC), that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best-fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. We show on simulated data that the proposed algorithm greatly improves on GSMAG, and compares positively against FCI and cFCI, two well-known constraint-based approaches for causal-network reconstruction in the presence of latent confounders.

    @conference{tsirlis2017scoring,
      abstract = {We consider the problem of causal structure learning in the presence of latent confounders. We propose a hybrid method, MAG Max-Min Hill-Climbing
    (M3HC), that takes as input a data set of continuous
    variables, assumed to follow a multivariate Gaussian
    distribution, and outputs the best-fitting maximal ancestral graph. M3HC builds upon a previously proposed
    method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. We show on simulated data that
    the proposed algorithm greatly improves on GSMAG,
    and compares positively against FCI and cFCI, two well-known
    constraint-based approaches for causal-network
    reconstruction in the presence of latent confounders.},
      added-at = {2021-03-10T10:55:47.000+0100},
      author = {Tsirlis, K and Lagani, V and Triantafillou, S and Tsamardinos, I},
      biburl = {https://www.bibsonomy.org/bibtex/251782ff3d0021d9ae7b7229b39a55d75/mensxmachina},
      interhash = {4731b83fe8b2f1f60eed63d178912109},
      intrahash = {51782ff3d0021d9ae7b7229b39a55d75},
      keywords = {mxmcausalpath},
      booktitle = {23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD)},
      timestamp = {2021-03-10T10:55:47.000+0100},
      title = {On Scoring Maximal Ancestral Graphs with the Max-Min Hill Climbing Algorithm},
      url = {http://nugget.unisa.edu.au/CD2017/papersonly/maxmin-r0.pdf},
      year = 2017
    }

About Us

Mens Ex Machina, Mind from the Machine or “Ο από Μηχανής Νους”, paraphrases the Latin expression Deus Ex Machina, God from the Machine. The name was suggested by Lucy Sofiadou, Prof. Tsamardinos’ wife.

We are a research group, founded in October 2006 and led by Professor Ioannis Tsamardinos, interested in Artificial Intelligence, Machine Learning, and Biomedical Informatics, and affiliated with the Computer Science Department of the University of Crete. The aims of the group are to advance science and disseminate knowledge via educational activities and computer tools. Our group is involved in:

Research:

Theoretical, algorithmic, and applied research in all of the above areas; we are also involved in interdisciplinary collaborations with biologists, physicians and practitioners from other fields.

Education:

Educational activities, such as teaching university courses, tutorials, and summer schools, as well as supervising undergraduate dissertations, master's projects, and Ph.D. theses.

Systems and Software:

Implementation of tools, systems, and code libraries to aid the dissemination of the research results. Funding is provided from and through the University of Crete, often originating from European and International research grants.

Current research activities include, but are not limited to, the following:

  • Causal discovery methods and the induction of causal models from observational studies. Specifically, we have recently introduced the problem of Integrative Causal Analysis (INCA).
  • Feature selection (a.k.a. variable selection) for classification and regression.
  • Induction of graphical models, such as Bayesian Networks, from data.
  • Analysis of biomedical data and applications of AI and Machine Learning methods to induce new biomedical knowledge.
  • Activity recognition in Ambient Intelligent environments.

Ioannis Tsamardinos

Professor, Department of Computer Science, University of Crete