Mens X Machina

Projects



Huawei Technologies

Causal discovery and inference for surrogate-assisted optimization

CAUSAL PATH

Next Generation Causal Analysis inspired by the induction of biological pathways from cytometry data

HUNT

Our aim is to develop a blood test for screening smokers and asbestos-exposed individuals, so that the associated cancers can be detected early and cured.

STATEGRA

Statistical methods and tools for the integrative analysis of omics data



Publications

2023

  • A. Ntroumpogiannis, M. Giannoulis, N. Myrtakis, V. Christophides, E. Simon, and I. Tsamardinos, “A Meta-level Analysis of Online Anomaly Detectors,” The VLDB Journal, 2023. doi:10.1007/s00778-022-00773-x
    [BibTeX] [Download PDF]
    @article{https://doi.org/10.1007/s00778-022-00773-x,
      added-at = {2023-03-07T22:49:53.000+0100},
      author = {Ntroumpogiannis, Antonios and Giannoulis, Michail and Myrtakis, Nikolaos and Christophides, Vassilis and Simon, Eric and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/2c6cd4b4041e3e546204b7a86899b350a/mensxmachina},
      copyright = {Creative Commons Attribution 4.0 International},
      doi = {10.1007/s00778-022-00773-x},
      interhash = {b686003d5f8fd9819551157d5e3123b2},
      intrahash = {c6cd4b4041e3e546204b7a86899b350a},
      keywords = {anomalies learning machine},
      journal = {The VLDB Journal},
      timestamp = {2023-03-07T22:49:53.000+0100},
      title = {A Meta-level Analysis of Online Anomaly Detectors},
      url = {https://link.springer.com/article/10.1007/s00778-022-00773-x},
      year = 2023
    }

2022

  • S. Bowler, G. Papoutsoglou, A. Karanikas, I. Tsamardinos, M. J. Corley, and L. C. Ndhlovu, “A machine learning approach utilizing DNA methylation as an accurate classifier of COVID-19 disease severity,” Scientific Reports, vol. 12, iss. 1, p. 17480, 2022. doi:10.1038/s41598-022-22201-4
    [BibTeX] [Abstract] [Download PDF]

    Since the onset of the COVID-19 pandemic, increasing cases with variable outcomes continue globally because of variants and despite vaccines and therapies. There is a need to identify at-risk individuals early that would benefit from timely medical interventions. DNA methylation provides an opportunity to identify an epigenetic signature of individuals at increased risk. We utilized machine learning to identify DNA methylation signatures of COVID-19 disease from data available through NCBI Gene Expression Omnibus. A training cohort of 460 individuals (164 COVID-19-infected and 296 non-infected) and an external validation dataset of 128 individuals (102 COVID-19-infected and 26 non-COVID-associated pneumonia) were reanalyzed. Data was processed using ChAMP and beta values were logit transformed. The JADBio AutoML platform was leveraged to identify a methylation signature associated with severe COVID-19 disease. We identified a random forest classification model from 4 unique methylation sites with the power to discern individuals with severe COVID-19 disease. The average area under the curve of receiver operator characteristic (AUC-ROC) of the model was 0.933 and the average area under the precision-recall curve (AUC-PRC) was 0.965. When applied to our external validation, this model produced an AUC-ROC of 0.898 and an AUC-PRC of 0.864. These results further our understanding of the utility of DNA methylation in COVID-19 disease pathology and serve as a platform to inform future COVID-19 related studies.

    @article{bowler2022machine,
      abstract = {Since the onset of the COVID-19 pandemic, increasing cases with variable outcomes continue globally because of variants and despite vaccines and therapies. There is a need to identify at-risk individuals early that would benefit from timely medical interventions. DNA methylation provides an opportunity to identify an epigenetic signature of individuals at increased risk. We utilized machine learning to identify DNA methylation signatures of COVID-19 disease from data available through NCBI Gene Expression Omnibus. A training cohort of 460 individuals (164 COVID-19-infected and 296 non-infected) and an external validation dataset of 128 individuals (102 COVID-19-infected and 26 non-COVID-associated pneumonia) were reanalyzed. Data was processed using ChAMP and beta values were logit transformed. The JADBio AutoML platform was leveraged to identify a methylation signature associated with severe COVID-19 disease. We identified a random forest classification model from 4 unique methylation sites with the power to discern individuals with severe COVID-19 disease. The average area under the curve of receiver operator characteristic (AUC-ROC) of the model was 0.933 and the average area under the precision-recall curve (AUC-PRC) was 0.965. When applied to our external validation, this model produced an AUC-ROC of 0.898 and an AUC-PRC of 0.864. These results further our understanding of the utility of DNA methylation in COVID-19 disease pathology and serve as a platform to inform future COVID-19 related studies.},
      added-at = {2023-03-07T22:52:39.000+0100},
      author = {Bowler, Scott and Papoutsoglou, Georgios and Karanikas, Aristides and Tsamardinos, Ioannis and Corley, Michael J. and Ndhlovu, Lishomwa C.},
      biburl = {https://www.bibsonomy.org/bibtex/224959130925e38210da9cab651bbaaaf/mensxmachina},
      doi = {10.1038/s41598-022-22201-4},
      interhash = {c95ccd60f041a590226ac5efad7c573c},
      intrahash = {24959130925e38210da9cab651bbaaaf},
      issn = {20452322},
      journal = {Scientific Reports},
      keywords = {DNA covid learning machine},
      number = 1,
      pages = {17480},
      refid = {Bowler2022},
      timestamp = {2023-03-07T22:52:39.000+0100},
      title = {A machine learning approach utilizing DNA methylation as an accurate classifier of COVID-19 disease severity},
      url = {https://doi.org/10.1038/s41598-022-22201-4},
      volume = 12,
      year = 2022
    }
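
    A minimal, hypothetical sketch of the workflow this abstract describes: logit-transforming methylation beta values, then scoring a random-forest classifier by AUC-ROC and AUC-PRC. The synthetic data, feature count, and use of scikit-learn are illustrative assumptions; the study itself used the JADBio AutoML platform.

    # Sketch only: synthetic beta values, not the study's data or pipeline.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import average_precision_score, roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n_samples, n_sites = 460, 4                                 # e.g., 4 CpG sites, as in the abstract
    beta = rng.uniform(0.01, 0.99, size=(n_samples, n_sites))   # methylation beta values in (0, 1)
    severe = rng.integers(0, 2, size=n_samples)                 # hypothetical severity labels

    m_values = np.log(beta / (1 - beta))                        # logit transform of the beta values

    X_tr, X_te, y_tr, y_te = train_test_split(m_values, severe, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    scores = clf.predict_proba(X_te)[:, 1]

    # With random labels these hover around chance; the point is the metric pair used.
    print("AUC-ROC:", roc_auc_score(y_te, scores))
    print("AUC-PRC:", average_precision_score(y_te, scores))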

  • M. Karaglani, M. Panagopoulou, C. Cheimonidi, I. Tsamardinos, E. Maltezos, N. Papanas, D. Papazoglou, G. Mastorakos, and E. Chatzaki, “Liquid Biopsy in Type 2 Diabetes Mellitus Management: Building Specific Biosignatures via Machine Learning,” Journal of Clinical Medicine, vol. 11, iss. 4, 2022. doi:10.3390/jcm11041045
    [BibTeX] [Abstract] [Download PDF]

    Background: The need for minimally invasive biomarkers for the early diagnosis of type 2 diabetes (T2DM) prior to the clinical onset and monitoring of β-pancreatic cell loss is emerging. Here, we focused on studying circulating cell-free DNA (ccfDNA) as a liquid biopsy biomaterial for accurate diagnosis/monitoring of T2DM. Methods: ccfDNA levels were directly quantified in sera from 96 T2DM patients and 71 healthy individuals via fluorometry, and then fragment DNA size profiling was performed by capillary electrophoresis. Following this, ccfDNA methylation levels of five β-cell-related genes were measured via qPCR. Data were analyzed by automated machine learning to build classifying predictive models. Results: ccfDNA levels were found to be similar between groups but indicative of apoptosis in T2DM. INS (Insulin), IAPP (Islet Amyloid Polypeptide-Amylin), GCK (Glucokinase), and KCNJ11 (Potassium Inwardly Rectifying Channel Subfamily J member 11) levels differed significantly between groups. AutoML analysis delivered biosignatures including GCK, IAPP and KCNJ11 methylation, with the highest ever reported discriminating performance of T2DM from healthy individuals (AUC 0.927). Conclusions: Our data unravel the value of ccfDNA as a minimally invasive biomaterial carrying important clinical information for T2DM. Upon prospective clinical evaluation, the built biosignature can be disruptive for T2DM clinical management.

    @article{jcm11041045,
      abstract = {Background: The need for minimally invasive biomarkers for the early diagnosis of type 2 diabetes (T2DM) prior to the clinical onset and monitoring of β-pancreatic cell loss is emerging. Here, we focused on studying circulating cell-free DNA (ccfDNA) as a liquid biopsy biomaterial for accurate diagnosis/monitoring of T2DM. Methods: ccfDNA levels were directly quantified in sera from 96 T2DM patients and 71 healthy individuals via fluorometry, and then fragment DNA size profiling was performed by capillary electrophoresis. Following this, ccfDNA methylation levels of five β-cell-related genes were measured via qPCR. Data were analyzed by automated machine learning to build classifying predictive models. Results: ccfDNA levels were found to be similar between groups but indicative of apoptosis in T2DM. INS (Insulin), IAPP (Islet Amyloid Polypeptide-Amylin), GCK (Glucokinase), and KCNJ11 (Potassium Inwardly Rectifying Channel Subfamily J member 11) levels differed significantly between groups. AutoML analysis delivered biosignatures including GCK, IAPP and KCNJ11 methylation, with the highest ever reported discriminating performance of T2DM from healthy individuals (AUC 0.927). Conclusions: Our data unravel the value of ccfDNA as a minimally invasive biomaterial carrying important clinical information for T2DM. Upon prospective clinical evaluation, the built biosignature can be disruptive for T2DM clinical management.},
      added-at = {2022-06-22T10:51:41.000+0200},
      article-number = {1045},
      author = {Karaglani, Makrina and Panagopoulou, Maria and Cheimonidi, Christina and Tsamardinos, Ioannis and Maltezos, Efstratios and Papanas, Nikolaos and Papazoglou, Dimitrios and Mastorakos, George and Chatzaki, Ekaterini},
      biburl = {https://www.bibsonomy.org/bibtex/2fa7bb5fb798e4e91d2532d3115dcbbef/mensxmachina},
      doi = {10.3390/jcm11041045},
      interhash = {f3820dbe8f6b53a53f1671c62d64dfaf},
      intrahash = {fa7bb5fb798e4e91d2532d3115dcbbef},
      issn = {2077-0383},
      journal = {Journal of Clinical Medicine},
      keywords = {biopsy diabetes learning machine mellitus},
      number = 4,
      pubmedid = {35207316},
      timestamp = {2022-06-22T10:51:41.000+0200},
      title = {Liquid Biopsy in Type 2 Diabetes Mellitus Management: Building Specific Biosignatures via Machine Learning},
      url = {https://www.mdpi.com/2077-0383/11/4/1045},
      volume = 11,
      year = 2022
    }

  • J. L. Marshall, B. N. Peshkin, T. Yoshino, J. Vowinckel, H. E. Danielsen, G. Melino, I. Tsamardinos, C. Haudenschild, D. J. Kerr, C. Sampaio, S. Y. Rha, K. T. FitzGerald, E. C. Holland, D. Gallagher, J. Garcia-Foncillas, and H. Juhl, “The Essentials of Multiomics,” The Oncologist, vol. 27, iss. 4, pp. 272-284, 2022. doi:10.1093/oncolo/oyab048
    [BibTeX] [Abstract] [Download PDF]

    Within the last decade, the science of molecular testing has evolved from single gene and single protein analysis to broad molecular profiling as a standard of care, quickly transitioning from research to practice. Terms such as genomics, transcriptomics, proteomics, circulating omics, and artificial intelligence are now commonplace, and this rapid evolution has left us with a significant knowledge gap within the medical community. In this paper, we attempt to bridge that gap and prepare the physician in oncology for multiomics, a group of technologies that have gone from looming on the horizon to become a clinical reality. The era of multiomics is here, and we must prepare ourselves for this exciting new age of cancer medicine.

    @article{10.1093/oncolo/oyab048,
      abstract = {{Within the last decade, the science of molecular testing has evolved from single gene and single protein analysis to broad molecular profiling as a standard of care, quickly transitioning from research to practice. Terms such as genomics, transcriptomics, proteomics, circulating omics, and artificial intelligence are now commonplace, and this rapid evolution has left us with a significant knowledge gap within the medical community. In this paper, we attempt to bridge that gap and prepare the physician in oncology for multiomics, a group of technologies that have gone from looming on the horizon to become a clinical reality. The era of multiomics is here, and we must prepare ourselves for this exciting new age of cancer medicine.}},
      added-at = {2022-06-22T10:50:12.000+0200},
      author = {Marshall, John L and Peshkin, Beth N and Yoshino, Takayuki and Vowinckel, Jakob and Danielsen, Håvard E and Melino, Gerry and Tsamardinos, Ioannis and Haudenschild, Christian and Kerr, David J and Sampaio, Carlos and Rha, Sun Young and FitzGerald, Kevin T and Holland, Eric C and Gallagher, David and Garcia-Foncillas, Jesus and Juhl, Hartmut},
      biburl = {https://www.bibsonomy.org/bibtex/24d888d87a990372de0d0a08a01774ad6/mensxmachina},
      doi = {10.1093/oncolo/oyab048},
      eprint = {https://academic.oup.com/oncolo/article-pdf/27/4/272/43287416/oyab048.pdf},
      interhash = {f0ee8d8b0e2acf63c050b1f6f58be762},
      intrahash = {4d888d87a990372de0d0a08a01774ad6},
      issn = {1083-7159},
      journal = {The Oncologist},
      keywords = {mensxmachina multi-omics},
      month = {02},
      number = 4,
      pages = {272-284},
      timestamp = {2022-06-22T10:50:12.000+0200},
      title = {{The Essentials of Multiomics}},
      url = {https://doi.org/10.1093/oncolo/oyab048},
      volume = 27,
      year = 2022
    }

2021

  • L. J. Marcos-Zambrano, K. Karaduzovic-Hadziabdic, T. Loncar Turukalo, P. Przymus, V. Trajkovik, O. Aasmets, M. Berland, A. Gruca, J. Hasic, K. Hron, T. Klammsteiner, M. Kolev, L. Lahti, M. B. Lopes, V. Moreno, I. Naskinova, E. Org, I. Paciência, G. Papoutsoglou, R. Shigdel, B. Stres, B. Vilne, M. Yousef, E. Zdravevski, I. Tsamardinos, E. Carrillo de Santa Pau, M. J. Claesson, I. Moreno-Indias, and J. Truu, “Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment,” Frontiers in Microbiology, vol. 12, 2021. doi:10.3389/fmicb.2021.634511
    [BibTeX] [Abstract] [Download PDF]

    The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.

    @article{10.3389/fmicb.2021.634511,
      abstract = {The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.},
      added-at = {2022-06-22T10:58:03.000+0200},
      author = {Marcos-Zambrano, Laura Judith and Karaduzovic-Hadziabdic, Kanita and Loncar Turukalo, Tatjana and Przymus, Piotr and Trajkovik, Vladimir and Aasmets, Oliver and Berland, Magali and Gruca, Aleksandra and Hasic, Jasminka and Hron, Karel and Klammsteiner, Thomas and Kolev, Mikhail and Lahti, Leo and Lopes, Marta B. and Moreno, Victor and Naskinova, Irina and Org, Elin and Paciência, Inês and Papoutsoglou, Georgios and Shigdel, Rajesh and Stres, Blaz and Vilne, Baiba and Yousef, Malik and Zdravevski, Eftim and Tsamardinos, Ioannis and Carrillo de Santa Pau, Enrique and Claesson, Marcus J. and Moreno-Indias, Isabel and Truu, Jaak},
      biburl = {https://www.bibsonomy.org/bibtex/2b27cd61df0c85a21e0dd04b0fc7dfc6e/mensxmachina},
      doi = {10.3389/fmicb.2021.634511},
      interhash = {9365312756fb3fb9714d2f38a30626eb},
      intrahash = {b27cd61df0c85a21e0dd04b0fc7dfc6e},
      issn = {1664-302X},
      journal = {Frontiers in Microbiology},
      keywords = {applications biomarker disease learning machine microbiome predictive},
      timestamp = {2022-06-22T10:58:03.000+0200},
      title = {Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment},
      url = {https://www.frontiersin.org/article/10.3389/fmicb.2021.634511},
      volume = 12,
      year = 2021
    }

  • G. Papoutsoglou, M. Karaglani, V. Lagani, N. Thomson, O. Røe, I. Tsamardinos, and E. Chatzaki, “Automated machine learning optimizes and accelerates predictive modeling from COVID-19 high throughput datasets,” Scientific Reports, vol. 11, 2021. doi:10.1038/s41598-021-94501-0
    [BibTeX]
    @article{papoutsoglou2021automated,
      added-at = {2022-06-22T10:56:54.000+0200},
      author = {Papoutsoglou, Georgios and Karaglani, Makrina and Lagani, Vincenzo and Thomson, Naomi and Røe, Oluf and Tsamardinos, Ioannis and Chatzaki, Ekaterini},
      biburl = {https://www.bibsonomy.org/bibtex/232ca8367a87572429ee46be29bae66af/mensxmachina},
      doi = {10.1038/s41598-021-94501-0},
      interhash = {29657a11a3631c2933d03c8939af5f29},
      intrahash = {32ca8367a87572429ee46be29bae66af},
      journal = {Scientific Reports},
      keywords = {automl learning machine predictive},
      month = {07},
      timestamp = {2022-06-22T10:56:54.000+0200},
      title = {Automated machine learning optimizes and accelerates predictive modeling from COVID-19 high throughput datasets},
      volume = 11,
      year = 2021
    }

  • M. Papadogiorgaki, M. Venianaki, P. Charonyktakis, M. Antonakakis, I. Tsamardinos, M. E. Zervakis, and V. Sakkalis, “Heart Rate Classification Using ECG Signal Processing and Machine Learning Methods,” in 2021 IEEE 21st International Conference on Bioinformatics and Bioengineering (BIBE), 2021, pp. 1-6. doi:10.1109/BIBE52308.2021.9635462
    [BibTeX]
    @inproceedings{9635462,
      added-at = {2022-06-22T10:55:58.000+0200},
      author = {Papadogiorgaki, Maria and Venianaki, Maria and Charonyktakis, Paulos and Antonakakis, Marios and Tsamardinos, Ioannis and Zervakis, Michalis E. and Sakkalis, Vangelis},
      biburl = {https://www.bibsonomy.org/bibtex/22feae72b255e7875c2643efa7e6ed788/mensxmachina},
      booktitle = {2021 IEEE 21st International Conference on Bioinformatics and Bioengineering (BIBE)},
      doi = {10.1109/BIBE52308.2021.9635462},
      interhash = {2eb66ce826c06fda1d9831644de642b5},
      intrahash = {2feae72b255e7875c2643efa7e6ed788},
      keywords = {classification ecg heart processing rate signal},
      pages = {1-6},
      timestamp = {2022-06-22T10:55:58.000+0200},
      title = {Heart Rate Classification Using ECG Signal Processing and Machine Learning Methods},
      year = 2021
    }

  • K. Rounis, D. Makrakis, C. Papadaki, A. Monastirioti, L. Vamvakas, K. Kalbakis, K. Gourlia, I. Xanthopoulos, I. Tsamardinos, D. Mavroudis, and S. Agelaki, “Prediction of outcome in patients with non-small cell lung cancer treated with second line PD-1/PDL-1 inhibitors based on clinical parameters: Results from a prospective, single institution study,” PLOS ONE, vol. 16, iss. 6, pp. 1-18, 2021. doi:10.1371/journal.pone.0252537
    [BibTeX] [Abstract] [Download PDF]

    Objective We prospectively recorded clinical and laboratory parameters from patients with metastatic non-small cell lung cancer (NSCLC) treated with 2nd line PD-1/PD-L1 inhibitors in order to address their effect on treatment outcomes. Materials and methods Clinicopathological information (age, performance status, smoking, body mass index, histology, organs with metastases), use and duration of proton pump inhibitors, steroids and antibiotics (ATB) and laboratory values [neutrophil/lymphocyte ratio, LDH, albumin] were prospectively collected. Steroid administration was defined as the use of > 10 mg prednisone equivalent for ≥ 10 days. Prolonged ATB administration was defined as ATB ≥ 14 days 30 days before or within the first 3 months of treatment. JADBio, a machine learning pipeline was applied for further multivariate analysis. Results Data from 66 pts with non-oncogenic driven metastatic NSCLC were analyzed; 15.2% experienced partial response (PR), 34.8% stable disease (SD) and 50% progressive disease (PD). Median overall survival (OS) was 6.77 months. ATB administration did not affect patient OS [HR = 1.35 (CI: 0.761–2.406, p = 0.304)], however, prolonged ATBs [HR = 2.95 (CI: 1.62–5.36, p = 0.0001)] and the presence of bone metastases [HR = 1.89 (CI: 1.02–3.51, p = 0.049)] independently predicted for shorter survival. Prolonged ATB administration, bone metastases, liver metastases and BMI < 25 kg/m2 were selected by JADbio as the important features that were associated with increased probability of developing disease progression as response to treatment. The resulting algorithm that was created was able to predict the probability of disease stabilization (PR or SD) in a single individual with an AUC = 0.806 [95% CI:0.714–0.889]. Conclusions Our results demonstrate an adverse effect of prolonged ATBs on response and survival and underscore their importance along with the presence of bone metastases, liver metastases and low BMI in the individual prediction of outcomes in patients treated with immunotherapy.

    @article{10.1371/journal.pone.0252537,
      abstract = {Objective We prospectively recorded clinical and laboratory parameters from patients with metastatic non-small cell lung cancer (NSCLC) treated with 2nd line PD-1/PD-L1 inhibitors in order to address their effect on treatment outcomes.   Materials and methods Clinicopathological information (age, performance status, smoking, body mass index, histology, organs with metastases), use and duration of proton pump inhibitors, steroids and antibiotics (ATB) and laboratory values [neutrophil/lymphocyte ratio, LDH, albumin] were prospectively collected. Steroid administration was defined as the use of > 10 mg prednisone equivalent for ≥ 10 days. Prolonged ATB administration was defined as ATB ≥ 14 days 30 days before or within the first 3 months of treatment. JADBio, a machine learning pipeline was applied for further multivariate analysis.   Results Data from 66 pts with non-oncogenic driven metastatic NSCLC were analyzed; 15.2% experienced partial response (PR), 34.8% stable disease (SD) and 50% progressive disease (PD). Median overall survival (OS) was 6.77 months. ATB administration did not affect patient OS [HR = 1.35 (CI: 0.761–2.406, p = 0.304)], however, prolonged ATBs [HR = 2.95 (CI: 1.62–5.36, p = 0.0001)] and the presence of bone metastases [HR = 1.89 (CI: 1.02–3.51, p = 0.049)] independently predicted for shorter survival. Prolonged ATB administration, bone metastases, liver metastases and BMI < 25 kg/m2 were selected by JADbio as the important features that were associated with increased probability of developing disease progression as response to treatment. The resulting algorithm that was created was able to predict the probability of disease stabilization (PR or SD) in a single individual with an AUC = 0.806 [95% CI:0.714–0.889].   Conclusions Our results demonstrate an adverse effect of prolonged ATBs on response and survival and underscore their importance along with the presence of bone metastases, liver metastases and low BMI in the individual prediction of outcomes in patients treated with immunotherapy.},
      added-at = {2021-06-04T09:18:00.000+0200},
      author = {Rounis, Konstantinos and Makrakis, Dimitrios and Papadaki, Chara and Monastirioti, Alexia and Vamvakas, Lambros and Kalbakis, Konstantinos and Gourlia, Krystallia and Xanthopoulos, Iordanis and Tsamardinos, Ioannis and Mavroudis, Dimitrios and Agelaki, Sofia},
      biburl = {https://www.bibsonomy.org/bibtex/2a0fda17bd6c2177cb4ce435c3559b648/mensxmachina},
      doi = {10.1371/journal.pone.0252537},
      interhash = {c53c8616653bdaaa2984cde14d27d241},
      intrahash = {a0fda17bd6c2177cb4ce435c3559b648},
      journal = {PLOS ONE},
      keywords = {imported},
      month = {06},
      number = 6,
      pages = {1-18},
      publisher = {Public Library of Science},
      timestamp = {2021-06-04T09:18:00.000+0200},
      title = {Prediction of outcome in patients with non-small cell lung cancer treated with second line PD-1/PDL-1 inhibitors based on clinical parameters: Results from a prospective, single institution study},
      url = {https://doi.org/10.1371/journal.pone.0252537},
      volume = 16,
      year = 2021
    }

  • G. Borboudakis and I. Tsamardinos, "Extending greedy feature selection algorithms to multiple solutions," Data Mining and Knowledge Discovery, 2021. doi:10.1007/s10618-020-00731-7
    [BibTeX] [Abstract] [Download PDF]

    Most feature selection methods identify only a single solution. This is acceptable for predictive purposes, but is not sufficient for knowledge discovery if multiple solutions exist. We propose a strategy to extend a class of greedy methods to efficiently identify multiple solutions, and show under which conditions it identifies all solutions. We also introduce a taxonomy of features that takes the existence of multiple solutions into account. Furthermore, we explore different definitions of statistical equivalence of solutions, as well as methods for testing equivalence. A novel algorithm for compactly representing and visualizing multiple solutions is also introduced. In experiments we show that (a) the proposed algorithm is significantly more computationally efficient than the TIE* algorithm, the only alternative approach with similar theoretical guarantees, while identifying similar solutions to it, and (b) that the identified solutions have similar predictive performance.

    @article{Borboudakis2021,
      abstract = {Most feature selection methods identify only a single solution. This is acceptable for predictive purposes, but is not sufficient for knowledge discovery if multiple solutions exist. We propose a strategy to extend a class of greedy methods to efficiently identify multiple solutions, and show under which conditions it identifies all solutions. We also introduce a taxonomy of features that takes the existence of multiple solutions into account. Furthermore, we explore different definitions of statistical equivalence of solutions, as well as methods for testing equivalence. A novel algorithm for compactly representing and visualizing multiple solutions is also introduced. In experiments we show that (a) the proposed algorithm is significantly more computationally efficient than the TIE* algorithm, the only alternative approach with similar theoretical guarantees, while identifying similar solutions to it, and (b) that the identified solutions have similar predictive performance.},
      added-at = {2021-05-10T09:37:57.000+0200},
      author = {Borboudakis, Giorgos and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/21a02e4b98901f0889375b61fbba306a2/mensxmachina},
      day = 01,
      doi = {10.1007/s10618-020-00731-7},
      interhash = {2111a54b383124f93dad8b9ebd26afb5},
      intrahash = {1a02e4b98901f0889375b61fbba306a2},
      issn = {1573-756X},
      journal = {Data Mining and Knowledge Discovery},
      keywords = {mxmcausalpath},
      month = may,
      timestamp = {2021-05-10T09:37:57.000+0200},
      title = {Extending greedy feature selection algorithms to multiple solutions},
      url = {https://doi.org/10.1007/s10618-020-00731-7},
      year = 2021
    }
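
    A minimal sketch of the general idea in this abstract: greedy forward selection that branches into multiple solutions whenever several candidate features are near-equivalent at a step. The scoring function, tolerance, and synthetic data are illustrative assumptions; this is not the authors' algorithm or the TIE* procedure.

    # Sketch only: branch-on-ties greedy forward selection over synthetic data.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    def cv_score(X, y, feats):
        """Cross-validated R^2 of a linear model; 0.0 for the empty feature set."""
        if not feats:
            return 0.0
        return cross_val_score(LinearRegression(), X[:, feats], y, cv=5).mean()

    def greedy_multi(X, y, max_feats=3, tol=0.01):
        """Return feature subsets; score ties within `tol` spawn alternative branches."""
        solutions, frontier = [], [[]]
        while frontier:
            feats = frontier.pop()
            if len(feats) == max_feats:
                solutions.append(tuple(sorted(feats)))
                continue
            base = cv_score(X, y, feats)
            gains = {j: cv_score(X, y, feats + [j]) - base
                     for j in range(X.shape[1]) if j not in feats}
            best = max(gains.values())
            if best <= 0:                              # no candidate improves the score
                solutions.append(tuple(sorted(feats)))
                continue
            for j, g in gains.items():                 # keep every near-best candidate
                if best - g <= tol:
                    frontier.append(feats + [j])
        return sorted(set(solutions))                  # de-duplicate subsets

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 8))
    X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)    # feature 1 nearly duplicates feature 0
    y = X[:, 0] + X[:, 2] + 0.1 * rng.normal(size=200)
    print(greedy_multi(X, y))                          # expect solutions using feature 0 or feature 1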

  • M. Panagopoulou, M. Karaglani, V. G. Manolopoulos, I. Iliopoulos, I. Tsamardinos, and E. Chatzaki, "Deciphering the Methylation Landscape in Breast Cancer: Diagnostic and Prognostic Biosignatures through Automated Machine Learning," Cancers, vol. 13, iss. 7, p. 1677, 2021. doi:10.3390/cancers13071677
    [BibTeX] [Abstract] [Download PDF]

    DNA methylation plays an important role in breast cancer (BrCa) pathogenesis and could contribute to driving its personalized management. We performed a complete bioinformatic analysis in BrCa whole methylome datasets, analyzed using the Illumina methylation 450 bead-chip array. Differential methylation analysis vs. clinical end-points resulted in 11,176 to 27,786 differentially methylated genes (DMGs). Innovative automated machine learning (AutoML) was employed to construct signatures with translational value. Three highly performing and low-feature-number signatures were built: (1) A 5-gene signature discriminating BrCa patients from healthy individuals (area under the curve (AUC): 0.994 (0.982–1.000)). (2) A 3-gene signature identifying BrCa metastatic disease (AUC: 0.986 (0.921–1.000)). (3) Six equivalent 5-gene signatures diagnosing early disease (AUC: 0.973 (0.920–1.000)). Validation in independent patient groups verified performance. Bioinformatic tools for functional analysis and protein interaction prediction were also employed. All protein encoding features included in the signatures were associated with BrCa-related pathways. Functional analysis of DMGs highlighted the regulation of transcription as the main biological process, the nucleus as the main cellular component and transcription factor activity and sequence-specific DNA binding as the main molecular functions. Overall, three high-performance diagnostic/prognostic signatures were built and are readily available for improving BrCa precision management upon prospective clinical validation. Revisiting archived methylomes through novel bioinformatic approaches revealed significant clarifying knowledge for the contribution of gene methylation events in breast carcinogenesis.

    @article{Panagopoulou_2021,
      abstract = {DNA methylation plays an important role in breast cancer (BrCa) pathogenesis and could contribute to driving its personalized management. We performed a complete bioinformatic analysis in BrCa whole methylome datasets, analyzed using the Illumina methylation 450 bead-chip array. Differential methylation analysis vs. clinical end-points resulted in 11,176 to 27,786 differentially methylated genes (DMGs). Innovative automated machine learning (AutoML) was employed to construct signatures with translational value. Three highly performing and low-feature-number signatures were built: (1) A 5-gene signature discriminating BrCa patients from healthy individuals (area under the curve (AUC): 0.994 (0.982–1.000)). (2) A 3-gene signature identifying BrCa metastatic disease (AUC: 0.986 (0.921–1.000)). (3) Six equivalent 5-gene signatures diagnosing early disease (AUC: 0.973 (0.920–1.000)). Validation in independent patient groups verified performance. Bioinformatic tools for functional analysis and protein interaction prediction were also employed. All protein encoding features included in the signatures were associated with BrCa-related pathways. Functional analysis of DMGs highlighted the regulation of transcription as the main biological process, the nucleus as the main cellular component and transcription factor activity and sequence-specific DNA binding as the main molecular functions. Overall, three high-performance diagnostic/prognostic signatures were built and are readily available for improving BrCa precision management upon prospective clinical validation. Revisiting archived methylomes through novel bioinformatic approaches revealed significant clarifying knowledge for the contribution of gene methylation events in breast carcinogenesis.},
      added-at = {2021-04-05T10:25:29.000+0200},
      author = {Panagopoulou, Maria and Karaglani, Makrina and Manolopoulos, Vangelis G. and Iliopoulos, Ioannis and Tsamardinos, Ioannis and Chatzaki, Ekaterini},
      biburl = {https://www.bibsonomy.org/bibtex/25938c275248de01841423c461744c95c/mensxmachina},
      doi = {10.3390/cancers13071677},
      interhash = {9a46961bf0583786199d3b4d978bcb01},
      intrahash = {5938c275248de01841423c461744c95c},
      journal = {Cancers},
      keywords = {imported},
      month = apr,
      number = 7,
      pages = 1677,
      publisher = {{MDPI} {AG}},
      timestamp = {2021-04-05T10:25:29.000+0200},
      title = {Deciphering the Methylation Landscape in Breast Cancer: Diagnostic and Prognostic Biosignatures through Automated Machine Learning},
      url = {https://doi.org/10.3390%2Fcancers13071677},
      volume = 13,
      year = 2021
    }

  • N. Myrtakis, I. Tsamardinos, and V. Christophides, "PROTEUS: Predictive Explanation of Anomalies," in IEEE 37th International Conference on Data Engineering (ICDE), 2021.
    [BibTeX] [Abstract]

    Numerous algorithms have been proposed for detecting anomalies (outliers, novelties) in an unsupervised manner. Unfortunately, it is not trivial, in general, to understand why a given sample (record) is labelled as an anomaly and thus diagnose its root causes. We propose the following reduced-dimensionality, surrogate model approach to explain detector decisions: approximate the detection model with another one that employs only a small subset of features. Subsequently, samples can be visualized in this low-dimensionality space for human understanding. To this end, we develop PROTEUS, an AutoML pipeline to produce the surrogate model, specifically designed for feature selection on imbalanced datasets. The PROTEUS surrogate model can not only explain the training data, but also the out-of-sample (unseen) data. In other words, PROTEUS produces predictive explanations by approximating the decision surface of an unsupervised detector. PROTEUS is designed to return an accurate estimate of out-of-sample predictive performance to serve as a metric of the quality of the approximation. Computational experiments confirm the efficacy of PROTEUS to produce predictive explanations for different families of detectors and to reliably estimate their predictive performance in unseen data. Unlike several ad-hoc feature importance methods, PROTEUS is robust to high-dimensional data.

    @conference{myrtakis2021proteus,
      abstract = {Numerous algorithms have been proposed for detecting anomalies (outliers, novelties) in an unsupervised manner. Unfortunately, it is not trivial, in general, to understand why a given sample (record) is labelled as an anomaly and thus diagnose its root causes. We propose the following reduced-dimensionality, surrogate model approach to explain detector decisions: approximate the detection model with another one that employs only a small subset of features. Subsequently, samples can be visualized in this low-dimensionality space for human understanding. To this end, we develop PROTEUS, an AutoML pipeline to produce the surrogate model, specifically designed for feature selection on imbalanced datasets. The PROTEUS surrogate model can not only explain the training data, but also the out-of-sample (unseen) data. In other words, PROTEUS produces predictive explanations by approximating the decision surface of an unsupervised detector. PROTEUS is designed to return an accurate estimate of out-of-sample predictive performance to serve as a metric of the quality of the approximation. Computational experiments confirm the efficacy of PROTEUS to produce predictive explanations for different families of detectors and to reliably estimate their predictive performance in unseen data. Unlike several ad-hoc feature importance methods, PROTEUS is robust to high-dimensional data.
    },
      added-at = {2021-02-10T09:57:44.000+0100},
      author = {Myrtakis, N and Tsamardinos, I and Christophides, V},
      biburl = {https://www.bibsonomy.org/bibtex/207bdf48e36b94f93849856e1a1ec258a/mensxmachina},
      interhash = {1be3182c1d6928ec21142b5f18a6ea20},
      intrahash = {07bdf48e36b94f93849856e1a1ec258a},
      keywords = {anomalies},
      timestamp = {2021-03-19T10:32:22.000+0100},
      title = {PROTEUS: Predictive Explanation of Anomalies},
      booktitle = {IEEE 37th International Conference on Data Engineering (ICDE)},
      year = 2021
    }
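
    A minimal sketch of the surrogate-model idea described in this abstract: fit an unsupervised anomaly detector, then approximate its decisions with a supervised model restricted to a few features, and estimate how faithfully that surrogate predicts the detector's labels on unseen data. The detector, feature selector, and classifier choices below are illustrative assumptions, not the PROTEUS pipeline.

    # Sketch only: a low-dimensional surrogate approximating an unsupervised detector.
    import numpy as np
    from sklearn.ensemble import IsolationForest
    from sklearn.feature_selection import SelectKBest, mutual_info_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    X[:40, :2] += 6.0                                         # anomalies live in the first two features

    detector = IsolationForest(random_state=0).fit(X)
    pseudo_labels = (detector.predict(X) == -1).astype(int)   # 1 = flagged as anomaly

    X_tr, X_te, y_tr, y_te = train_test_split(
        X, pseudo_labels, test_size=0.3, stratify=pseudo_labels, random_state=0)

    # The surrogate uses only 2 features, so its decisions are easy to visualize and explain.
    surrogate = make_pipeline(SelectKBest(mutual_info_classif, k=2),
                              LogisticRegression(max_iter=1000)).fit(X_tr, y_tr)

    fidelity = roc_auc_score(y_te, surrogate.predict_proba(X_te)[:, 1])
    print(f"out-of-sample fidelity to the detector (AUC): {fidelity:.3f}")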

2020

  • A. Tsourtis, Y. Pantazis, and I. Tsamardinos, "Inference of Stochastic Dynamical Systems from Cross-Sectional Population Data," arXiv preprint arXiv:2012.05055 [cs.LG], 2020.
    [BibTeX] [Abstract]

    Inferring the driving equations of a dynamical system from population or time-course data is important in several scientific fields such as biochemistry, epidemiology, financial mathematics and many others. Despite the existence of algorithms that learn the dynamics from trajectorial measurements there are few attempts to infer the dynamical system straight from population data. In this work, we deduce and then computationally estimate the Fokker-Planck equation which describes the evolution of the population’s probability density, based on stochastic differential equations. Then, following the USDL approach [22], we project the Fokker-Planck equation to a proper set of test functions, transforming it into a linear system of equations. Finally, we apply sparse inference methods to solve the latter system and thus induce the driving forces of the dynamical system. Our approach is illustrated in both synthetic and real data including non-linear, multimodal stochastic differential equations, biochemical reaction networks as well as mass cytometry biological measurements.

    @article{tsourtis2020inference,
      abstract = {Inferring the driving equations of a dynamical system from population or time-course data is important in several scientific fields such as biochemistry, epidemiology, financial mathematics and many
    others. Despite the existence of algorithms that learn the dynamics from trajectorial measurements
    there are few attempts to infer the dynamical system straight from population data. In this work, we
    deduce and then computationally estimate the Fokker-Planck equation which describes the evolution
    of the population’s probability density, based on stochastic differential equations. Then, following
    the USDL approach [22], we project the Fokker-Planck equation to a proper set of test functions,
    transforming it into a linear system of equations. Finally, we apply sparse inference methods to
    solve the latter system and thus induce the driving forces of the dynamical system. Our approach
    is illustrated in both synthetic and real data including non-linear, multimodal stochastic differential
    equations, biochemical reaction networks as well as mass cytometry biological measurements.},
      added-at = {2021-03-24T10:32:03.000+0100},
      author = {Tsourtis, A and Pantazis, Y and Tsamardinos, I},
      biburl = {https://www.bibsonomy.org/bibtex/2f3d7571025e47ab9693c1b8a5876702d/mensxmachina},
      archiveprefix = {arXiv},
      eprint = {2012.05055v1},
      primaryclass = {cs.LG},
      interhash = {1dd0cba1cddecc67bc714ff55e2fa939},
      intrahash = {f3d7571025e47ab9693c1b8a5876702d},
      journal = {arXiv preprint arXiv:2012.05055},
      keywords = {mxmcausalpath},
      timestamp = {2021-03-24T10:32:03.000+0100},
      title = {Inference of Stochastic Dynamical Systems from Cross-Sectional Population Data},
      year = 2020
    }

  • M. Tsagris, Z. Papadovasilakis, K. Lakiotaki, and I. Tsamardinos, "The γ-OMP algorithm for feature selection with application to gene expression data," IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2020. doi:10.1109/TCBB.2020.3029952
    [BibTeX] [Abstract] [Download PDF]

    Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To apply to molecular data, feature selection algorithms need to be scalable to tens of thousands of features. In this paper, we propose γ-OMP, a generalisation of the highly-scalable Orthogonal Matching Pursuit feature selection algorithm. γ-OMP can handle (a) various types of outcomes, such as continuous, binary, nominal, time-to-event, (b) discrete (categorical) features, (c) different statistical-based stopping criteria, (d) several predictive models (e.g., linear or logistic regression), (e) various types of residuals, and (f) different types of association. We compare γ-OMP against LASSO, a prototypical, widely used algorithm for high-dimensional data. On both simulated data and several real gene expression datasets, γ-OMP is on par, or outperforms LASSO in binary classification (case-control data), regression (quantified outcomes), and time-to-event data (censored survival times). γ-OMP is based on simple statistical ideas, it is easy to implement and to extend, and our extensive evaluation shows that it is also effective in bioinformatics analysis settings.

    @article{tsagris2020algorithm,
      abstract = {Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To apply to molecular data, feature selection algorithms need to be scalable to tens of thousands of features. In this paper, we propose γ-OMP, a generalisation of the highly-scalable Orthogonal Matching Pursuit feature selection algorithm. γ-OMP can handle (a) various types of outcomes, such as continuous, binary, nominal, time-to-event, (b) discrete (categorical) features, (c) different statistical-based stopping criteria, (d) several predictive models (e.g., linear or logistic regression), (e) various types of residuals, and (f) different types of association. We compare γ-OMP against LASSO, a prototypical, widely used algorithm for high-dimensional data. On both simulated data and several real gene expression datasets, γ-OMP is on par, or outperforms LASSO in binary classification (case-control data), regression (quantified outcomes), and time-to-event data (censored survival times). γ-OMP is based on simple statistical ideas, it is easy to implement and to extend, and our extensive evaluation shows that it is also effective in bioinformatics analysis settings.},
      added-at = {2021-03-22T13:27:44.000+0100},
      author = {Tsagris, Michail and Papadovasilakis, Zacharias and Lakiotaki, Kleanthi and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/2372b4dd105cf55a3c32ca0d937888f2e/mensxmachina},
      doi = {10.1109/TCBB.2020.3029952},
      interhash = {9bef7f59658d9a4a2f82cba160e276e4},
      intrahash = {372b4dd105cf55a3c32ca0d937888f2e},
      journal = {IEEE/ACM Transactions on Computational Biology and Bioinformatics},
      keywords = {mxmcausalpath},
      timestamp = {2021-03-22T13:27:44.000+0100},
      title = {The γ-OMP algorithm for feature selection with application to gene expression data},
      url = {https://ieeexplore.ieee.org/document/9219177/authors#authors},
      year = 2020
    }
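
    γ-OMP generalises Orthogonal Matching Pursuit, so a minimal sketch of plain OMP-based feature selection on high-dimensional synthetic data gives the flavour of the base algorithm. The scikit-learn estimator and the synthetic "gene expression" data below are illustrative assumptions; they do not cover γ-OMP's extensions to other outcome types, residuals, and stopping criteria.

    # Sketch only: plain OMP recovering a small true support among 5000 features.
    import numpy as np
    from sklearn.linear_model import OrthogonalMatchingPursuit

    rng = np.random.default_rng(0)
    n_samples, n_genes = 120, 5000                     # far more features than samples
    X = rng.normal(size=(n_samples, n_genes))
    true_support = [10, 250, 4000]                     # outcome driven by 3 "genes"
    y = X[:, true_support] @ np.array([2.0, -1.5, 1.0]) + 0.1 * rng.normal(size=n_samples)

    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=3).fit(X, y)
    print("selected features:", np.flatnonzero(omp.coef_))   # ideally [10, 250, 4000]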

  • Y. Pantazis, C. Tselas, K. Lakiotaki, V. Lagani, and I. Tsamardinos, "Latent Feature Representations for Human Gene Expression Data Improve Phenotypic Predictions," in 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2020. doi:10.1109/BIBM49941.2020.9313286
    [BibTeX] [Abstract] [Download PDF]

    High-throughput technologies such as microarrays and RNA-sequencing (RNA-seq) allow to precisely quantify transcriptomic profiles, generating datasets that are inevitably high-dimensional. In this work, we investigate whether the whole human transcriptome can be represented in a compressed, low dimensional latent space without losing relevant information. We thus constructed low-dimensional latent feature spaces of the human genome, by utilizing three dimensionality reduction approaches and a diverse set of curated datasets. We applied standard Principal Component Analysis (PCA), kernel PCA and Autoencoder Neural Networks on 1360 datasets from four different measurement technologies. The latent feature spaces are tested for their ability to (a) reconstruct the original data and (b) improve predictive performance on validation datasets not used during the creation of the feature space. While linear techniques show better reconstruction performance, nonlinear approaches, particularly, neural-based models seem to be able to capture non-additive interaction effects, and thus enjoy stronger predictive capabilities. Despite the limited sample size of each dataset and the biological / technological heterogeneity across studies, our results show that low dimensional representations of the human transcriptome can be achieved by integrating hundreds of datasets. The created space is two to three orders of magnitude smaller compared to the raw data, offering the ability of capturing a large portion of the original data variability and eventually reducing computational time for downstream analyses.

    @article{pantazis2020latent,
      abstract = {High-throughput technologies such as microarrays and RNA-sequencing (RNA-seq) allow to precisely quantify transcriptomic profiles, generating datasets that are inevitably high-dimensional. In this work, we investigate whether the whole human transcriptome can be represented in a compressed, low dimensional latent space without loosing relevant information. We thus constructed low-dimensional latent feature spaces of the human genome, by utilizing three dimensionality reduction approaches and a diverse set of curated datasets. We applied standard Principal Component Analysis (PCA), kernel PCA and Autoencoder Neural Networks on 1360 datasets from four different measurement technologies. The latent feature spaces are tested for their ability to (a) reconstruct the original data and (b) improve predictive performance on validation datasets not used during the creation of the feature space. While linear techniques show better reconstruction performance, nonlinear approaches, particularly, neural-based models seem to be able to capture non-additive interaction effects, and thus enjoy stronger predictive capabilities. Despite the limited sample size of each dataset and the biological / technological heterogeneity across studies, our results show that low dimensional representations of the human transcriptome can be achieved by integrating hundreds of datasets. The created space is two to three orders of magnitude smaller compared to the raw data, offering the ability of capturing a large portion of the original data variability and eventually reducing computational time for downstream analyses.},
      added-at = {2021-01-27T08:25:38.000+0100},
      author = {Pantazis, Yannis and Tselas, Christos and Lakiotaki, Kleanthi and Lagani, Vincenzo and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/22e00727d34af38370524ab45428d1935/mensxmachina},
      doi = {10.1109/BIBM49941.2020.9313286},
      interhash = {85456c0fc077102f3eca5cd7f7dfc749},
      intrahash = {2e00727d34af38370524ab45428d1935},
      journal = {IEEE},
      keywords = {mxmcausalpath},
      timestamp = {2021-03-08T12:07:50.000+0100},
      title = {Latent Feature Representations for Human Gene Expression Data Improve Phenotypic Predictions},
      url = {https://ieeexplore.ieee.org/document/9313286},
      year = 2020
    }
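
    The reconstruction comparison described in the entry above can be illustrated with a minimal sketch: linear PCA versus RBF kernel PCA on a synthetic "expression-like" matrix, measured by reconstruction error. The dimensions, kernel width and data below are illustrative assumptions only, not the authors' setup (which also includes autoencoder networks and real curated datasets).

    # Minimal sketch (not the authors' code): how well do linear PCA and kernel PCA
    # reconstruct a high-dimensional matrix from a low-dimensional latent space?
    import numpy as np
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.metrics import mean_squared_error

    rng = np.random.default_rng(0)
    n_samples, n_genes, n_latent = 200, 1000, 10

    # Synthetic data with a true 10-dimensional structure plus noise.
    latent = rng.normal(size=(n_samples, n_latent))
    loadings = rng.normal(size=(n_latent, n_genes))
    X = latent @ loadings + 0.1 * rng.normal(size=(n_samples, n_genes))

    # Linear PCA reconstruction.
    pca = PCA(n_components=n_latent).fit(X)
    X_pca = pca.inverse_transform(pca.transform(X))

    # RBF kernel PCA reconstruction (needs fit_inverse_transform=True).
    kpca = KernelPCA(n_components=n_latent, kernel="rbf", gamma=1e-3,
                     fit_inverse_transform=True).fit(X)
    X_kpca = kpca.inverse_transform(kpca.transform(X))

    print("PCA reconstruction MSE:       ", mean_squared_error(X, X_pca))
    print("kernel PCA reconstruction MSE:", mean_squared_error(X, X_kpca))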

  • N. Phanell, V. Lagani, P. Sebastian-Leon, F. Van der Kloet, E. Ewing, N. Karathanasis, A. Urdangarin, I. Arozarena, M. Jagodic, I. Tsamardinos, S. Tarazona, A. Conesa, J. Tegner, and D. Gomez-Cabrero, "STATegra: Multi-omics data integration - A conceptual scheme and a bioinformatics pipeline," Frontiers in Genetics, vol. to appear, 2020. doi:10.1101/2020.11.20.391045
    [BibTeX] [Abstract] [Download PDF]

    Technologies for profiling samples using different omics platforms have been at the forefront since the human genome project. Large-scale multi-omics data hold the promise of deciphering different regulatory layers. Yet, while there is a myriad of bioinformatics tools, each multi-omics analysis appears to start from scratch with an arbitrary decision over which tools to use and how to combine them. It is therefore an unmet need to conceptualize how to integrate such data and to implement and validate pipelines in different cases. We have designed a conceptual framework (STATegra), aiming it to be as generic as possible for multi-omics analysis, combining machine learning component analysis, non-parametric data combination and a multi-omics exploratory analysis in a step-wise manner. While in several studies we have previously combined those integrative tools, here we provide a systematic description of the STATegra framework and its validation using two TCGA case studies. For both, the Glioblastoma and the Skin Cutaneous Melanoma cases, we demonstrate an enhanced capacity to identify features in comparison to single-omics analysis. Such an integrative multi-omics analysis framework for the identification of features and components facilitates the discovery of new biology. Finally, we provide several options for applying the STATegra framework when parametric assumptions are fulfilled, and for the case when not all the samples are profiled for all omics. The STATegra framework is built using several tools, which are being integrated step-by-step as OpenSource in the STATegRa Bioconductor package https://bioconductor.org/packages/release/bioc/html/STATegra.html.

    @article{phanell2020stategra,
      abstract = {Technologies for profiling samples using different omics platforms have been at the forefront since the human genome project. Large-scale multi-omics data hold the promise of deciphering different regulatory layers. Yet, while there is a myriad of bioinformatics tools, each multi-omics analysis appears to start from scratch with an arbitrary decision over which tools to use and how to combine them. It is therefore an unmet need to conceptualize how to integrate such data and to implement and validate pipelines in different cases. We have designed a conceptual framework (STATegra), aiming it to be as generic as possible for multi-omics analysis, combining machine learning component analysis, non-parametric data combination and a multi-omics exploratory analysis in a step-wise manner. While in several studies we have previously combined those integrative tools, here we provide a systematic description of the STATegra framework and its validation using two TCGA case studies. For both, the Glioblastoma and the Skin Cutaneous Melanoma cases, we demonstrate an enhanced capacity to identify features in comparison to single-omics analysis. Such an integrative multi-omics analysis framework for the identification of features and components facilitates the discovery of new biology. Finally, we provide several options for applying the STATegra framework when parametric assumptions are fulfilled, and for the case when not all the samples are profiled for all omics. The STATegra framework is built using several tools, which are being integrated step-by-step as OpenSource in the STATegRa Bioconductor package https://bioconductor.org/packages/release/bioc/html/STATegra.html.},
      added-at = {2021-01-25T08:02:51.000+0100},
      author = {Phanell, Nuria and Lagani, Vincenzo and Sebastian-Leon, Patricia and Van der Kloet, Frans and Ewing, Ewoud and Karathanasis, Nestoras and Urdangarin, Arantxa and Arozarena, Imanol and Jagodic, Maja and Tsamardinos, Ioannis and Tarazona, Sonia and Conesa, Ana and Tegner, Jesper and Gomez-Cabrero, David},
      biburl = {https://www.bibsonomy.org/bibtex/213d5658c490ee48b134629c33979e700/mensxmachina},
      doi = {https://doi.org/10.1101/2020.11.20.391045},
      interhash = {84dd53162ecf2659ffb75f1329f0aaad},
      intrahash = {13d5658c490ee48b134629c33979e700},
      journal = {Frontiers in Genetics },
      keywords = {data multi-omics},
      timestamp = {2021-01-25T08:02:51.000+0100},
      title = {STATegra: Multi-omics data integration - A conceptual scheme and a bioinformatics pipeline},
      url = {https://www.biorxiv.org/content/10.1101/2020.11.20.391045v1},
      volume = {to appear },
      year = 2020
    }

  • K. I. Karstoft, I. Tsamardinos, K. Eskelund, S. B. Andersen, and L. R. Nissen, "Applicability of an Automated Model and Parameter Selection in the Prediction of Screening-Level PTSD in Danish Soldiers Following Deployment: Development Study of Transferable Predictive Models Using Automated Machine Learning," JMIR Medical Informatics, vol. 8, iss. 7, 2020. doi:10.2196/17119
    [BibTeX] [Abstract] [Download PDF]

    Background: Posttraumatic stress disorder (PTSD) is a relatively common consequence of deployment to war zones. Early postdeployment screening with the aim of identifying those at risk for PTSD in the years following deployment will help deliver interventions to those in need but have so far proved unsuccessful. Objective: This study aimed to test the applicability of automated model selection and the ability of automated machine learning prediction models to transfer across cohorts and predict screening-level PTSD 2.5 years and 6.5 years after deployment. Methods: Automated machine learning was applied to data routinely collected 6-8 months after return from deployment from 3 different cohorts of Danish soldiers deployed to Afghanistan in 2009 (cohort 1, N=287 or N=261 depending on the timing of the outcome assessment), 2010 (cohort 2, N=352), and 2013 (cohort 3, N=232). Results: Models transferred well between cohorts. For screening-level PTSD 2.5 and 6.5 years after deployment, random forest models provided the highest accuracy as measured by area under the receiver operating characteristic curve (AUC): 2.5 years, AUC=0.77, 95% CI 0.71-0.83; 6.5 years, AUC=0.78, 95% CI 0.73-0.83. Linear models performed equally well. Military rank, hyperarousal symptoms, and total level of PTSD symptoms were highly predictive. Conclusions: Automated machine learning provided validated models that can be readily implemented in future deployment cohorts in the Danish Defense with the aim of targeting postdeployment support interventions to those at highest risk for developing PTSD, provided the cohorts are deployed on similar missions.

    @article{karstoft2020applicability,
      abstract = {Background: Posttraumatic stress disorder (PTSD) is a relatively common consequence of deployment to war zones. Early postdeployment screening with the aim of identifying those at risk for PTSD in the years following deployment will help deliver interventions to those in need but have so far proved unsuccessful.
    
    Objective: This study aimed to test the applicability of automated model selection and the ability of automated machine learning prediction models to transfer across cohorts and predict screening-level PTSD 2.5 years and 6.5 years after deployment.
    
    Methods: Automated machine learning was applied to data routinely collected 6-8 months after return from deployment from 3 different cohorts of Danish soldiers deployed to Afghanistan in 2009 (cohort 1, N=287 or N=261 depending on the timing of the outcome assessment), 2010 (cohort 2, N=352), and 2013 (cohort 3, N=232).
    
    Results: Models transferred well between cohorts. For screening-level PTSD 2.5 and 6.5 years after deployment, random forest models provided the highest accuracy as measured by area under the receiver operating characteristic curve (AUC): 2.5 years, AUC=0.77, 95% CI 0.71-0.83; 6.5 years, AUC=0.78, 95% CI 0.73-0.83. Linear models performed equally well. Military rank, hyperarousal symptoms, and total level of PTSD symptoms were highly predictive.
    
    Conclusions: Automated machine learning provided validated models that can be readily implemented in future deployment cohorts in the Danish Defense with the aim of targeting postdeployment support interventions to those at highest risk for developing PTSD, provided the cohorts are deployed on similar missions.},
      added-at = {2020-11-04T15:45:03.000+0100},
      author = {"Karstoft, KI" and "Tsamardinos, I" and "Eskelund, K" and "Andersen.SB" and "Nissen, LR"},
      biburl = {https://www.bibsonomy.org/bibtex/2b3c6a7c433dc0137e177a389e93373d6/mensxmachina},
      doi = {10.2196/17119},
      interhash = {e4d28b268e9ea645b86d7488930824cc},
      intrahash = {b3c6a7c433dc0137e177a389e93373d6},
      journal = {JMIR Medical Informatics},
      keywords = {AutoML Automated Learning Machine application models parameter predictive selection study transferable},
      month = {July},
      number = 7,
      timestamp = {2020-11-04T15:46:26.000+0100},
      title = {Applicability of an Automated Model and Parameter Selection in the Prediction of Screening-Level PTSD in Danish Soldiers Following Deployment: Development Study of Transferable Predictive Models Using Automated Machine Learning},
      url = {https://europepmc.org/article/pmc/pmc7407253},
      volume = 8,
      year = 2020
    }
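
    The transfer-across-cohorts evaluation summarised above can be sketched, under heavy simplification, as fitting a random forest on one synthetic "cohort" and scoring an external one, reporting the AUC with a bootstrap confidence interval. The cohorts, features and class balance below are invented; the study itself used routinely collected deployment data and an AutoML protocol.

    # Toy sketch of cross-cohort validation with a bootstrap CI on the external AUC.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(1)
    X1, y1 = make_classification(n_samples=300, n_features=20, weights=[0.85], random_state=1)
    X2, y2 = make_classification(n_samples=250, n_features=20, weights=[0.85], random_state=2)

    model = RandomForestClassifier(n_estimators=500, random_state=0).fit(X1, y1)
    scores = model.predict_proba(X2)[:, 1]
    auc = roc_auc_score(y2, scores)

    # Simple bootstrap over the external cohort.
    boot = []
    for _ in range(1000):
        idx = rng.integers(0, len(y2), len(y2))
        if len(np.unique(y2[idx])) == 2:          # AUC needs both classes present
            boot.append(roc_auc_score(y2[idx], scores[idx]))
    print(f"external AUC={auc:.2f}, "
          f"95% CI=({np.percentile(boot, 2.5):.2f}, {np.percentile(boot, 97.5):.2f})")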

  • M. Karaglani, K. Gourlia, I. Tsamardinos, and E. Chatzaki, "Accurate Blood-Based Diagnostic Biosignatures for Alzheimer’s Disease via Automated Machine Learning," Journal of Clinical Medicine, vol. 9, p. 3016, 2020. doi:10.3390/jcm9093016
    [BibTeX] [Abstract] [Download PDF]

    Alzheimer’s disease (AD) is the most common form of neurodegenerative dementia and its timely diagnosis remains a major challenge in biomarker discovery. In the present study, we analyzed publicly available high-throughput low-sample -omics datasets from studies in AD blood, by the AutoML technology Just Add Data Bio (JADBIO), to construct accurate predictive models for use as diagnostic biosignatures. Considering data from AD patients and age–sex matched cognitively healthy individuals, we produced three best performing diagnostic biosignatures specific for the presence of AD: A. A 506-feature transcriptomic dataset from 48 AD and 22 controls led to a miRNA-based biosignature via Support Vector Machines with three miRNA predictors (AUC 0.975 (0.906, 1.000)), B. A 38,327-feature transcriptomic dataset from 134 AD and 100 controls led to six mRNA-based statistically equivalent signatures via Classification Random Forests with 25 mRNA predictors (AUC 0.846 (0.778, 0.905)) and C. A 9483-feature proteomic dataset from 25 AD and 37 controls led to a protein-based biosignature via Ridge Logistic Regression with seven protein predictors (AUC 0.921 (0.849, 0.972)). These performance metrics were also validated through the JADBIO pipeline confirming stability. In conclusion, using the automated machine learning tool JADBIO, we produced accurate predictive biosignatures extrapolating available low sample -omics data. These results offer options for minimally invasive blood-based diagnostic tests for AD, awaiting clinical validation based on respective laboratory assays. They also highlight the value of AutoML in biomarker discovery

    @article{karaglani2020accurate,
      abstract = {Alzheimer’s disease (AD) is the most common form of neurodegenerative dementia and its timely diagnosis remains a major challenge in biomarker discovery. In the present study, we analyzed publicly available high-throughput low-sample -omics datasets from studies in AD blood, by the AutoML technology Just Add Data Bio (JADBIO), to construct accurate predictive models for use as diagnostic biosignatures. Considering data from AD patients and age–sex matched cognitively healthy individuals, we produced three best performing diagnostic biosignatures specific for the presence of AD: A. A 506-feature transcriptomic dataset from 48 AD and 22 controls led to a miRNA-based biosignature via Support Vector Machines with three miRNA predictors (AUC 0.975 (0.906, 1.000)), B. A 38,327-feature transcriptomic dataset from 134 AD and 100 controls led to six mRNA-based statistically equivalent signatures via Classification Random Forests with 25 mRNA predictors (AUC 0.846 (0.778, 0.905)) and C. A 9483-feature proteomic dataset from 25 AD and 37 controls led to a protein-based biosignature via Ridge Logistic Regression with seven protein predictors (AUC 0.921 (0.849, 0.972)). These performance metrics were also validated through the JADBIO pipeline confirming stability. In conclusion, using the automated machine learning tool JADBIO, we produced accurate predictive biosignatures extrapolating available low sample -omics data. These results offer options for minimally invasive blood-based diagnostic tests for AD, awaiting clinical validation based on respective laboratory assays. They also highlight the value of AutoML in biomarker discovery},
      added-at = {2020-09-21T09:50:10.000+0200},
      author = {Karaglani, Makrina and Gourlia, Krystallia and Tsamardinos, Ioannis and Chatzaki, Ekaterini},
      biburl = {https://www.bibsonomy.org/bibtex/2e4b7bd7e2db5b045f13fdb58de2c0b3c/mensxmachina},
      doi = {10.3390/jcm9093016},
      interhash = {9805a10159c6a371c134c15364c30b15},
      intrahash = {e4b7bd7e2db5b045f13fdb58de2c0b3c},
      journal = {Journal of Clinical  Medicine },
      keywords = {Alzheimer’s blood classifier disease learning machine model predictive},
      pages = 3016,
      timestamp = {2021-03-18T08:42:28.000+0100},
      title = {Accurate Blood-Based Diagnostic Biosignatures for Alzheimer’s Disease via Automated Machine Learning},
      url = {https://www.mdpi.com/2077-0383/9/9/3016},
      volume = 9,
      year = 2020
    }

  • K. Biza, I. Tsamardinos, and S. Triantafillou, "Tuning Causal Discovery Algorithms," Proceedings of the Tenth International Conference on Probabilistic Graphical Models, in PMLR, 2020.
    [BibTeX] [Abstract] [Download PDF]

    There are numerous algorithms proposed in the literature for learning causal graphical probabilistic models. Each one of them is typically equipped with one or more tuning hyper-parameters. The choice of optimal algorithm and hyper-parameter values is not universal; it depends on the size of the network, the density of the true causal structure, the sample size, as well as the metric of quality of learning a causal structure. Thus, the challenge to a practitioner is how to “tune” these choices, given that the true graph is unknown and the learning task is unsupervised. In the paper, we evaluate two previously proposed methods for tuning, one based on stability of the learned structure under perturbations (bootstrapping) of the input data and the other based on balancing the in-sample fitting of the model with the model complexity. We propose and comparatively evaluate a new method that treats a causal model as a set of predictive models: one for each node given its Markov Blanket. It then tunes the choices using out-of-sample protocols for supervised methods such as cross-validation. The proposed method performs on par or better than the previous methods for most metrics.

    @article{biza2020tuning,
      abstract = {There are numerous algorithms proposed in the literature for learning causal graphical probabilistic
    models. Each one of them is typically equipped with one or more tuning hyper-parameters. The
    choice of optimal algorithm and hyper-parameter values is not universal; it depends on the size
    of the network, the density of the true causal structure, the sample size, as well as the metric of
    quality of learning a causal structure. Thus, the challenge to a practitioner is how to “tune” these
    choices, given that the true graph is unknown and the learning task is unsupervised. In the paper,
    we evaluate two previously proposed methods for tuning, one based on stability of the learned
    structure under perturbations (bootstrapping) of the input data and the other based on balancing the
    in-sample fitting of the model with the model complexity. We propose and comparatively evaluate
    a new method that treats a causal model as a set of predictive models: one for each node given its
    Markov Blanket. It then tunes the choices using out-of-sample protocols for supervised methods
    such as cross-validation. The proposed method performs on par or better than the previous methods
    for most metrics.},
      added-at = {2020-09-08T10:21:24.000+0200},
      author = {Biza, K. and Tsamardinos, I. and Triantafillou, S.},
      biburl = {https://www.bibsonomy.org/bibtex/227a19344432d6e2831ae6fac806fe077/mensxmachina},
      interhash = {8a79a42946abf9f62c4d45b8110b6a94},
      intrahash = {27a19344432d6e2831ae6fac806fe077},
      journal = {Proceedings of the Tenth International Conference on Probabilistic Graphical Models, in PMLR},
      keywords = {mxmcausalpath},
      timestamp = {2021-03-08T12:02:38.000+0100},
      title = {Tuning Causal Discovery Algorithms},
      url = {https://pgm2020.cs.aau.dk/wp-content/uploads/2020/09/biza20.pdf},
      year = 2020
    }
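
    The core idea evaluated above, scoring a candidate causal structure by how well each node is predicted out-of-sample from its hypothesised Markov blanket, can be sketched as follows. This is a simplified stand-in (linear-Gaussian toy data, linear regression, plain cross-validation), not the authors' implementation.

    # Score candidate Markov-blanket assignments by cross-validated predictiveness.
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 500
    # Toy linear-Gaussian system: A -> B -> C, with D independent of the rest.
    A = rng.normal(size=n)
    B = 0.8 * A + rng.normal(scale=0.5, size=n)
    C = 0.9 * B + rng.normal(scale=0.5, size=n)
    D = rng.normal(size=n)
    data = {"A": A, "B": B, "C": C, "D": D}

    def structure_score(blankets, data):
        """Average cross-validated R^2 of predicting each node from its blanket."""
        scores = []
        for node, mb in blankets.items():
            if not mb:                      # no predictors -> score 0 by convention
                scores.append(0.0)
                continue
            X = np.column_stack([data[v] for v in mb])
            y = data[node]
            scores.append(cross_val_score(LinearRegression(), X, y, cv=5).mean())
        return float(np.mean(scores))

    good = {"A": ["B"], "B": ["A", "C"], "C": ["B"], "D": []}
    bad  = {"A": ["D"], "B": ["D"], "C": ["A", "D"], "D": ["A"]}
    print("plausible structure:  ", round(structure_score(good, data), 3))
    print("implausible structure:", round(structure_score(bad, data), 3))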

  • I. Karagiannaki, Y. Pantazis, E. Chatzaki, and I. Tsamardinos, "Pathway Activity Score Learning for Dimensionality Reduction of Gene Expression Data," Discovery Science. DS 2020. Lecture Notes in Computer Science, vol. 12323, pp. 246-261, 2020. doi:10.1007/978-3-030-61527-7_17
    [BibTeX] [Abstract] [Download PDF]

    Molecular gene-expression datasets consist of samples with tens of thousands of measured quantities (e.g., high dimensional data). However, there exist lower-dimensional representations that retain the useful information. We present a novel algorithm for such dimensionality reduction called Pathway Activity Score Learning (PASL). The major novelty of PASL is that the constructed features directly correspond to known molecular pathways and can be interpreted as pathway activity scores. Hence, unlike PCA and similar methods, PASL’s latent space has a relatively straight-forward biological interpretation. As a use-case, PASL is applied on two collections of breast cancer and leukemia gene expression datasets. We show that PASL does retain the predictive information for disease classification on new, unseen datasets, as well as outperforming PLIER, a recently proposed competitive method. We also show that differential activation pathway analysis provides complementary information to standard gene set enrichment analysis. The code is available at https://github.com/mensxmachina/PASL.

    @article{karagiannaki2020pathway,
      abstract = {Molecular gene-expression datasets consist of samples with
    tens of thousands of measured quantities (e.g., high dimensional data).
    However, there exist lower-dimensional representations that retain the
    useful information. We present a novel algorithm for such dimensionality
    reduction called Pathway Activity Score Learning (PASL). The major
    novelty of PASL is that the constructed features directly correspond to
    known molecular pathways and can be interpreted as pathway activity
    scores. Hence, unlike PCA and similar methods, PASL’s latent space
    has a relatively straight-forward biological interpretation. As a use-case,
    PASL is applied on two collections of breast cancer and leukemia gene
    expression datasets. We show that PASL does retain the predictive information for disease classification on new, unseen datasets, as well as
    outperforming PLIER, a recently proposed competitive method. We also
    show that differential activation pathway analysis provides complementary information to standard gene set enrichment analysis. The code is
    available at https://github.com/mensxmachina/PASL.},
      added-at = {2020-09-07T12:31:50.000+0200},
      author = {Karagiannaki, Ioulia and Pantazis, Yannis and Chatzaki, Ekaterini and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/25aa1c97303026e34c4c5d8a76d116652/mensxmachina},
      doi = {https://doi.org/10.1007/978-3-030-61527-7_17},
      editor = {"Tsoumakas, G" and "Manolopoulos, Y" and "Matwin, S"},
      interhash = {250e1c55d999f5493581587cf0627a28},
      intrahash = {5aa1c97303026e34c4c5d8a76d116652},
      journal = {Discovery Science. DS 2020. Lecture Notes in Computer Science},
      keywords = {mxmcausalpath},
      pages = {246-261},
      timestamp = {2021-03-18T08:43:09.000+0100},
      title = {Pathway Activity Score Learning for Dimensionality Reduction of Gene Expression Data},
      url = {https://link.springer.com/chapter/10.1007%2F978-3-030-61527-7_17},
      volume = 12323,
      year = 2020
    }
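
    A much-simplified illustration of pathway-level features follows: each (made-up) gene set is summarised by the first principal component of its member genes, giving one interpretable feature per pathway. PASL itself learns a pathway-aligned dictionary and is considerably more involved; the sketch only conveys the general idea.

    # Summarise each gene set by the first PC of its member genes (toy data).
    import numpy as np
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    n_samples, n_genes = 100, 50
    expr = rng.normal(size=(n_samples, n_genes))
    gene_names = [f"g{i}" for i in range(n_genes)]
    pathways = {"pathway_A": ["g0", "g1", "g2", "g3"],
                "pathway_B": ["g10", "g11", "g12"],
                "pathway_C": ["g20", "g21", "g22", "g23", "g24"]}

    index = {g: i for i, g in enumerate(gene_names)}
    scores = {}
    for name, genes in pathways.items():
        sub = expr[:, [index[g] for g in genes]]
        scores[name] = PCA(n_components=1).fit_transform(sub).ravel()

    activity = np.column_stack([scores[p] for p in pathways])   # samples x pathways
    print("pathway activity matrix shape:", activity.shape)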

  • I. Tsamardinos, G. Fanourgakis, E. Greasidou, E. Klontzas, K. Gkagkas, and G. Froudakis, "An Automated Machine Learning architecture for the accelerated prediction of Metal-Organic Frameworks performance in energy and environmental applications," Microporous and Mesoporous Materials, vol. 300, 2020. doi:10.1016/j.micromeso.2020.110160
    [BibTeX] [Abstract] [Download PDF]

    Due to their exceptional host-guest properties, Metal-Organic Frameworks (MOFs) are promising materials for storage of various gases with environmental and technological interest. Molecular modeling and simulations are invaluable tools, extensively used over the last two decades for the study of various properties of MOFs. In particular, Monte Carlo simulation techniques have been employed for the study of the gas uptake capacity of several MOFs at a wide range of different thermodynamic conditions. Despite the accurate predictions of molecular simulations, the accurate characterization and the high-throughput screening of the enormous number of MOFs that can be potentially synthesized by combining various structural building blocks is beyond present computer capabilities. In this work, we propose and demonstrate the use of an alternative approach, namely one based on an Automated Machine Learning (AutoML) architecture that is capable of training machine learning and statistical predictive models for MOFs’ chemical properties and estimate their predictive performance with confidence intervals. The architecture tries numerous combinations of different machine learning (ML) algorithms, tunes their hyper-parameters, and conservatively estimates performance of the final model. We demonstrate that it correctly estimates performance even with few samples (<100) and that it provides improved predictions over trying a single standard method, like Random Forests. The AutoML pipeline democratizes ML to non-expert material-science practitioners that may not know which algorithms to use on a given problem, how to tune them, and how to correctly estimate their predictive performance, dramatically improving productivity and avoiding common analysis pitfalls. A demonstration on the prediction of the carbon dioxide and methane uptake at various thermodynamic conditions is used as a showcase sharable at https://app.jadbio.com/share/86477fd7-d467-464d-ac41-fcbb0475444b.

    @article{tsamardinos2020automated,
      abstract = {Due to their exceptional host-guest properties, Metal-Organic Frameworks (MOFs) are promising materials for storage of various gases with environmental and technological interest. Molecular modeling and simulations are invaluable tools, extensively used over the last two decades for the study of various properties of MOFs. In particular, Monte Carlo simulation techniques have been employed for the study of the gas uptake capacity of several MOFs at a wide range of different thermodynamic conditions. Despite the accurate predictions of molecular simulations, the accurate characterization and the high-throughput screening of the enormous number of MOFs that can be potentially synthesized by combining various structural building blocks is beyond present computer capabilities. In this work, we propose and demonstrate the use of an alternative approach, namely one based on an Automated Machine Learning (AutoML) architecture that is capable of training machine learning and statistical predictive models for MOFs’ chemical properties and estimate their predictive performance with confidence intervals. The architecture tries numerous combinations of different machine learning (ML) algorithms, tunes their hyper-parameters, and conservatively estimates performance of the final model. We demonstrate that it correctly estimates performance even with few samples (<100) and that it provides improved predictions over trying a single standard method, like Random Forests. The AutoML pipeline democratizes ML to non-expert material-science practitioners that may not know which algorithms to use on a given problem, how to tune them, and how to correctly estimate their predictive performance, dramatically improving productivity and avoiding common analysis pitfalls. A demonstration on the prediction of the carbon dioxide and methane uptake at various thermodynamic conditions is used as a showcase sharable at https://app.jadbio.com/share/86477fd7-d467-464d-ac41-fcbb0475444b.},
      added-at = {2020-04-15T10:00:20.000+0200},
      author = {Tsamardinos, Ioannis and Fanourgakis, George and Greasidou, Elissavet and Klontzas, Emmanuel and Gkagkas, Konstantinos and Froudakis, George},
      biburl = {https://www.bibsonomy.org/bibtex/27ca892254c8e863256291ecf21aa1ba8/mensxmachina},
      doi = {https://doi.org/10.1016/j.micromeso.2020.110160},
      interhash = {38928b1e64d735cbd81ebdb04cf9f6a0},
      intrahash = {7ca892254c8e863256291ecf21aa1ba8},
      journal = {Microporous and Mesoporous Materials},
      keywords = {mxmcausalpath},
      timestamp = {2021-03-08T12:13:54.000+0100},
      title = {An Automated Machine Learning architecture for the accelerated prediction of Metal-Organic Frameworks performance in energy and environmental applications},
      url = {https://www.sciencedirect.com/science/article/abs/pii/S1387181120301633},
      volume = 300,
      year = 2020
    }
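
    The concern addressed by the architecture above, trying many algorithm and hyper-parameter combinations while still reporting an honest performance estimate, can be sketched generically with nested cross-validation in scikit-learn. The estimator, grid and data are placeholders; the paper's system explores its own configuration space and uses a conservative estimation protocol.

    # Generic nested cross-validation: tune in the inner loop, estimate in the outer.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = make_regression(n_samples=80, n_features=30, noise=10.0, random_state=0)

    inner = GridSearchCV(RandomForestRegressor(random_state=0),
                         param_grid={"n_estimators": [100, 300],
                                     "max_depth": [None, 5]},
                         cv=3)
    outer_scores = cross_val_score(inner, X, y, cv=5)   # R^2 per outer fold
    print("nested-CV R^2: %.3f +/- %.3f" % (outer_scores.mean(), outer_scores.std()))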

  • K. Verrou, I. Tsamardinos, and G. Papoutsoglou, "Learning Pathway Dynamics from Single‐Cell Proteomic Data: A Comparative Study," Cytometry Part A, Special Issue: Machine Learning for Single Cell Data, vol. 97, iss. 3, 2020. doi:10.1002/cyto.a.23976
    [BibTeX] [Abstract] [Download PDF]

    Single‐cell platforms provide statistically large samples of snapshot observations capable of resolving intercellular heterogeneity. Currently, there is a growing literature on algorithms that exploit this attribute in order to infer the trajectory of biological mechanisms, such as cell proliferation and differentiation. Despite the efforts, the trajectory inference methodology has not yet been used for addressing the challenging problem of learning the dynamics of protein signaling systems. In this work, we assess this prospect by testing the performance of this class of algorithms on four proteomic temporal datasets. To evaluate the learning quality, we design new general‐purpose evaluation metrics that are able to quantify performance on (i) the biological meaning of the output, (ii) the consistency of the inferred trajectory, (iii) the algorithm robustness, (iv) the correlation of the learning output with the initial dataset, and (v) the roughness of the cell parameter levels through the inferred trajectory. We show that experimental time alone is insufficient to provide knowledge about the order of proteins during signal transduction. Accordingly, we show that the inferred trajectories provide richer information about the underlying dynamics. We learn that established methods tested on high‐dimensional data with small sample size, slow dynamics, and complex structures (e.g. bifurcations) cannot always work in the signaling setting. Among the methods we evaluate, Scorpius and a newly introduced approach that combines Diffusion Maps and Principal Curves were found to perform adequately in recovering the progression of signal transduction, although their performance on some metrics varies from one dataset to another. The novel metrics we devise highlight that it is difficult to conclude which method is universally applicable for the task. Arguably, there are still many challenges and open problems to resolve. © 2020 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.

    @article{verrou2020learning,
      abstract = {Single‐cell platforms provide statistically large samples of snapshot observations capable of resolving intrercellular heterogeneity. Currently, there is a growing literature on algorithms that exploit this attribute in order to infer the trajectory of biological mechanisms, such as cell proliferation and differentiation. Despite the efforts, the trajectory inference methodology has not yet been used for addressing the challenging problem of learning the dynamics of protein signaling systems. In this work, we assess this prospect by testing the performance of this class of algorithms on four proteomic temporal datasets. To evaluate the learning quality, we design new general‐purpose evaluation metrics that are able to quantify performance on (i) the biological meaning of the output, (ii) the consistency of the inferred trajectory, (iii) the algorithm robustness, (iv) the correlation of the learning output with the initial dataset, and (v) the roughness of the cell parameter levels though the inferred trajectory. We show that experimental time alone is insufficient to provide knowledge about the order of proteins during signal transduction. Accordingly, we show that the inferred trajectories provide richer information about the underlying dynamics. We learn that established methods tested on high‐dimensional data with small sample size, slow dynamics, and complex structures (e.g. bifurcations) cannot always work in the signaling setting. Among the methods we evaluate, Scorpius and a newly introduced approach that combines Diffusion Maps and Principal Curves were found to perform adequately in recovering the progression of signal transduction although their performance on some metrics varies from one dataset to another. The novel metrics we devise highlight that it is difficult to conclude, which one method is universally applicable for the task. Arguably, there are still many challenges and open problems to resolve. © 2020 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.},
      added-at = {2020-04-15T09:49:24.000+0200},
      author = {Verrou, Klio-Maria and Tsamardinos, Ioannis and Papoutsoglou, Georgios},
      biburl = {https://www.bibsonomy.org/bibtex/2e3bfa8becd8b4b2537754b410e035264/mensxmachina},
      doi = {https://doi.org/10.1002/cyto.a.23976},
      interhash = {cfaefe697d477e338b1b5b57bc0e7335},
      intrahash = {e3bfa8becd8b4b2537754b410e035264},
      journal = {Cytometry part A, Special Issue: Machine Learning for Single Cell Data,},
      keywords = {mxmcausalpath},
      number = 3,
      timestamp = {2021-03-08T12:03:01.000+0100},
      title = {Learning Pathway Dynamics from Single‐Cell Proteomic Data: A Comparative Study},
      url = {https://onlinelibrary.wiley.com/doi/full/10.1002/cyto.a.23976},
      volume = 97,
      year = 2020
    }
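
    As a rough stand-in for trajectory inference on snapshot data, the sketch below embeds synthetic "protein" readouts with a spectral embedding and treats the ranking along the first coordinate as a pseudotime. The paper benchmarks dedicated methods (e.g., Scorpius, Diffusion Maps combined with Principal Curves) with purpose-built metrics; none of that machinery is reproduced here.

    # Crude pseudotime: spectral embedding of cells, ordered by the first coordinate.
    import numpy as np
    from sklearn.manifold import SpectralEmbedding

    rng = np.random.default_rng(0)
    t_true = np.sort(rng.uniform(0, 1, 300))               # hidden progression
    signal = np.column_stack([np.sin(2 * t_true), t_true ** 2, np.cos(t_true)])
    X = signal + 0.05 * rng.normal(size=(300, 3))          # noisy "protein" readouts

    emb = SpectralEmbedding(n_components=1, n_neighbors=15).fit_transform(X).ravel()
    pseudotime = np.argsort(np.argsort(emb)) / (len(emb) - 1)

    # Correlation with the hidden progression (the embedding's sign is arbitrary).
    print("|corr| with true order:",
          abs(np.corrcoef(pseudotime, t_true)[0, 1]).round(3))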

  • A. Agrapetidou, P. Charonyktakis, P. Gogas, T. Papadimitriou, and I. Tsamardinos, "An AutoML application to forecasting bank failures," Applied Economics Letters, 2020. doi:10.1080/13504851.2020.1725230
    [BibTeX] [Abstract] [Download PDF]

    We investigate the performance of an automated machine learning (AutoML) methodology in forecasting bank failures, called Just Add Data (JAD). We include all failed U.S. banks for 2007–2013 and twice as many healthy ones. An automated feature selection procedure in JAD identifies the most significant forecasters and a bootstrapping methodology provides conservative estimates of performance generalization and confidence intervals. The best performing model yields an AUC 0.985. The current work provides evidence that JAD, and AutoML tools in general, could increase the productivity of financial data analysts, shield against methodological statistical errors, and provide models at par with state-of-the-art manual analysis.

    @article{agrapetidou2020automl,
      abstract = {We investigate the performance of an automated machine learning (AutoML) methodology in forecasting bank failures, called Just Add Data (JAD). We include all failed U.S. banks for 2007–2013 and twice as many healthy ones. An automated feature selection procedure in JAD identifies the most significant forecasters and a bootstrapping methodology provides conservative estimates of performance generalization and confidence intervals. The best performing model yields an AUC 0.985. The current work provides evidence that JAD, and AutoML tools in general, could increase the productivity of financial data analysts, shield against methodological statistical errors, and provide models at par with state-of-the-art manual analysis.},
      added-at = {2020-04-15T09:44:58.000+0200},
      author = {Agrapetidou, Anna and Charonyktakis, Paulos and Gogas, Periklis and Papadimitriou, Theofilos and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/2485506a39ec5bcb0d027c9ceaaffd99f/mensxmachina},
      doi = {https://doi.org/10.1080/13504851.2020.1725230},
      interhash = {681df752dae5354ddadb9b582760dccc},
      intrahash = {485506a39ec5bcb0d027c9ceaaffd99f},
      journal = {Applied Economics Letters },
      keywords = {automl},
      timestamp = {2020-04-15T09:44:58.000+0200},
      title = {An AutoML application to forecasting bank failures},
      url = {https://www.tandfonline.com/doi/citedby/10.1080/13504851.2020.1725230?scroll=top&needAccess=true},
      year = 2020
    }

  • I. Xanthopoulos, I. Tsamardinos, V. Christophides, E. Simon, and A. Salinger, "Putting the Human Back in the AutoML Loop," in EDBT/ICDT Workshops, 2020.
    [BibTeX] [Abstract] [Download PDF]

    Automated Machine Learning (AutoML) is a rapidly rising sub-field of Machine Learning. AutoML aims to fully automate the machine learning process end-to-end, democratizing Machine Learning to non-experts and drastically increasing the productivity of expert analysts. So far, most comparisons of AutoML systems focus on quantitative criteria such as predictive performance and execution time. In this paper, we examine AutoML services for predictive modeling tasks from a user's perspective, going beyond predictive performance. We present a wide palette of criteria and dimensions on which to evaluate and compare these services as a user. This qualitative comparative methodology is applied on seven AutoML systems, namely Auger.AI, BigML, H2O's Driverless AI, Darwin, Just Add Data Bio, Rapid-Miner, and Watson. The comparison indicates the strengths and weaknesses of each service, the needs that it covers, the segment of users that is most appropriate for, and the possibilities for improvements.

    @inproceedings{conf/edbt/XanthopoulosTCS20,
      abstract = {Automated Machine Learning (AutoML) is a rapidly rising sub-field of Machine Learning. AutoML aims to fully automate the machine learning process end-to-end, democratizing Machine Learning to non-experts and drastically increasing the productivity of expert analysts. So far, most comparisons of AutoML systems focus on quantitative criteria such as predictive performance and execution time. In this paper, we examine AutoML services for predictive modeling tasks from a user's perspective, going beyond predictive performance. We present a wide palette of criteria and dimensions on which to evaluate and compare these services as a user. This qualitative comparative methodology is applied on seven AutoML systems, namely Auger.AI, BigML, H2O's Driverless AI, Darwin, Just Add Data Bio, Rapid-Miner, and Watson. The comparison indicates the strengths and weaknesses of each service, the needs that it covers, the segment of users that is most appropriate for, and the possibilities for improvements.},
      added-at = {2020-04-10T12:29:09.000+0200},
      author = {Xanthopoulos, Iordanis and Tsamardinos, Ioannis and Christophides, Vassilis and Simon, Eric and Salinger, Alejandro},
      biburl = {https://www.bibsonomy.org/bibtex/24a1699e69e6518a1e50bc2ebea5da825/mensxmachina},
      booktitle = {EDBT/ICDT Workshops},
      crossref = {conf/edbt/2020w},
      editor = {Poulovassilis, Alexandra and Auber, David and Bikakis, Nikos and Chrysanthis, Panos K. and Papastefanatos, George and Sharaf, Mohamed and Pelekis, Nikos and Renso, Chiara and Theodoridis, Yannis and Zeitouni, Karine and Cerquitelli, Tania and Chiusano, Silvia and Vargas-Solar, Genoveva and Omidvar-Tehrani, Behrooz and Morik, Katharina and Renders, Jean-Michel and Firmani, Donatella and Tanca, Letizia and Mottin, Davide and Lissandrini, Matteo and Velegrakis, Yannis},
      ee = {http://ceur-ws.org/Vol-2578/ETMLP5.pdf},
      interhash = {80875b3bcb7ce5f7b780107cb86039fa},
      intrahash = {4a1699e69e6518a1e50bc2ebea5da825},
      keywords = {mxmcausalpath},
      publisher = {CEUR-WS.org},
      series = {CEUR Workshop Proceedings},
      timestamp = {2021-03-08T12:15:34.000+0100},
      title = {Putting the Human Back in the AutoML Loop.},
      url = {http://ceur-ws.org/Vol-2578/ETMLP5.pdf},
      volume = 2578,
      year = 2020
    }

  • N. Malliaraki, K. Lakiotaki, R. Vamvoukaki, G. Notas, I. Tsamardinos, M. Kampa, and E. Castanas, "Translating vitamin D transcriptomics to clinical evidence: Analysis of data in asthma and chronic obstructive pulmonary disease, followed by clinical data meta-analysis," The Journal of Steroid Biochemistry and Molecular Biology, vol. 197, pp. 1-14, 2020. doi:10.1016/j.jsbmb.2019.105505
    [BibTeX] [Abstract] [Download PDF]

    Vitamin D (VitD) continues to trigger intense scientific controversy, regarding both its biological targets and its supplementation doses and regimens. In an effort to resolve this dispute, we mapped VitD transcriptome-wide events in humans, in order to unveil shared patterns or mechanisms with diverse pathologies/tissue profiles and reveal causal effects between VitD actions and specific human diseases, using a recently developed bioinformatics methodology. Using the similarities in analyzed transcriptome data (c-SKL method), we validated our methodology with osteoporosis as an example and further analyzed two other strong hits, specifically chronic obstructive pulmonary disease (COPD) and asthma. The latter revealed no impact of VitD on known molecular pathways. In accordance with this finding, review and meta-analysis of published data, based on an objective measure (Forced Expiratory Volume in one second, FEV1%), did not further reveal any significant effect of VitD on the objective amelioration of either condition. This study may, therefore, be regarded as the first one to explore, in an objective, unbiased and unsupervised manner, the impact of VitD levels and/or interventions in a number of human pathologies.

    @article{malliaraki2020translating,
      abstract = {Vitamin D (VitD) continues to trigger intense scientific controversy, regarding both its bi ological targets and its supplementation doses and regimens. In an effort to resolve this dispute, we mapped VitD transcriptome-wide events in humans, in order to unveil shared patterns or mechanisms with diverse pathologies/tissue profiles and reveal causal effects between VitD actions and specific human diseases, using a recently developed bioinformatics methodology. Using the similarities in analyzed transcriptome data (c-SKL method), we validated our methodology with osteoporosis as an example and further analyzed two other strong hits, specifically chronic obstructive pulmonary disease (COPD) and asthma. The latter revealed no impact of VitD on known molecular pathways. In accordance to this finding, review and meta-analysis of published data, based on an objective measure (Forced Expiratory Volume at one second, FEV1%) did not further reveal any significant effect of VitD on the objective amelioration of either condition. This study may, therefore, be regarded as the first one to explore, in an objective, unbiased and unsupervised manner, the impact of VitD levels and/or interventions in a number of human pathologies.},
      added-at = {2019-12-20T11:38:36.000+0100},
      author = {Malliaraki, Niki and Lakiotaki, Kleanthi and Vamvoukaki, Rodanthi and Notas, George and Tsamardinos, Ioannis and Kampa, Marilena and Castanas, Elias},
      biburl = {https://www.bibsonomy.org/bibtex/28cfb82678ba09a59927a27a148d1959f/mensxmachina},
      doi = {https://doi.org/10.1016/j.jsbmb.2019.105505},
      interhash = {5288b5fd5d047654c87ff505dfaa1814},
      intrahash = {8cfb82678ba09a59927a27a148d1959f},
      journal = {The Journal of Steroid Biochemistry and Molecular Biology},
      keywords = {mxmcausalpath},
      pages = {1-14},
      timestamp = {2021-03-18T08:44:15.000+0100},
      title = {Translating vitamin D transcriptomics to clinical evidence: Analysis of data in asthma and chronic obstructive pulmonary disease, followed by clinical data meta-analysis},
      url = {https://reader.elsevier.com/reader/sd/pii/S096007601930398X?token=BDFDFB0A2D6C3BCB2D6140BFCADFC9742EF3D905A0F5CFB518B320F4235CDDC6C6CF2A14B2FB25CB266333CBB3E631ED},
      volume = 197,
      year = 2020
    }
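
    The kind of pooling referred to above, a meta-analysis over an objective measure such as FEV1%, can be sketched with a toy fixed-effect, inverse-variance-weighted model. The per-study effects and standard errors below are invented and unrelated to the paper's data.

    # Toy fixed-effect meta-analysis with inverse-variance weighting.
    import numpy as np

    effects = np.array([0.8, -0.3, 0.5, 0.1])       # per-study mean differences
    ses     = np.array([0.6, 0.5, 0.7, 0.4])        # their standard errors

    w = 1.0 / ses**2                                 # inverse-variance weights
    pooled = np.sum(w * effects) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    ci = (pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se)
    print(f"pooled effect = {pooled:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")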

2019

  • O. D. Røe, M. Markaki, I. Tsamardinos, V. Lagani, O. T. D. Nguyen, J. H. Pedersen, Z. Saghir, and H. G. Ashraf, "‘Reduced’ HUNT model outperforms NLST and NELSON study criteria in predicting lung cancer in the Danish screening trial," BMJ Open Respiratory Research, vol. 6, iss. 1, 2019. doi:10.1136/bmjresp-2019-000512
    [BibTeX] [Abstract] [Download PDF]

    Hypothesis We hypothesise that the validated HUNT Lung Cancer Risk Model would perform better than the NLST (USA) and the NELSON (Dutch‐Belgian) criteria in the Danish Lung Cancer Screening Trial (DLCST). Methods The DLCST measured only five out of the seven variables included in validated HUNT Lung Cancer Model. Therefore a ‘Reduced’ model was retrained in the Norwegian HUNT2-cohort using the same statistical methodology as in the original HUNT model but based only on age, pack years, smoking intensity, quit time and body mass index (BMI), adjusted for sex. The model was applied on the DLCST-cohort and contrasted against the NLST and NELSON criteria. Results Among the 4051 smokers in the DLCST with 10 years follow-up, median age was 57.6, BMI 24.75, pack years 33.8, cigarettes per day 20 and most were current smokers. For the same number of individuals selected for screening, the performance of the ‘Reduced’ HUNT was increased in all metrics compared with both the NLST and the NELSON criteria. In addition, to achieve the same sensitivity, one would need to screen fewer people by the ‘Reduced’ HUNT model versus using either the NLST or the NELSON criteria (709 vs 918, p=1.02e-11 and 1317 vs 1668, p=2.2e-16, respectively). Conclusions The ‘Reduced’ HUNT model is superior in predicting lung cancer to both the NLST and NELSON criteria in a cost-effective way. This study supports the use of the HUNT Lung Cancer Model for selection based on risk ranking rather than age, pack year and quit time cut-off values. When we know how to rank personal risk, it will be up to the medical community and lawmakers to decide which risk threshold will be set for screening.

    @article{roe2019reduced,
      abstract = {Hypothesis We hypothesise that the validated HUNT Lung Cancer Risk Model would perform better than the NLST (USA) and the NELSON (Dutch‐Belgian) criteria in the Danish Lung Cancer Screening Trial (DLCST).
    
    Methods The DLCST measured only five out of the seven variables included in validated HUNT Lung Cancer Model. Therefore a ‘Reduced’ model was retrained in the Norwegian HUNT2-cohort using the same statistical methodology as in the original HUNT model but based only on age, pack years, smoking intensity, quit time and body mass index (BMI), adjusted for sex. The model was applied on the DLCST-cohort and contrasted against the NLST and NELSON criteria.
    
    Results Among the 4051 smokers in the DLCST with 10 years follow-up, median age was 57.6, BMI 24.75, pack years 33.8, cigarettes per day 20 and most were current smokers. For the same number of individuals selected for screening, the performance of the ‘Reduced’ HUNT was increased in all metrics compared with both the NLST and the NELSON criteria. In addition, to achieve the same sensitivity, one would need to screen fewer people by the ‘Reduced’ HUNT model versus using either the NLST or the NELSON criteria (709 vs 918, p=1.02e-11 and 1317 vs 1668, p=2.2e-16, respectively).
    
    Conclusions The ‘Reduced’ HUNT model is superior in predicting lung cancer to both the NLST and NELSON criteria in a cost-effective way. This study supports the use of the HUNT Lung Cancer Model for selection based on risk ranking rather than age, pack year and quit time cut-off values. When we know how to rank personal risk, it will be up to the medical community and lawmakers to decide which risk threshold will be set for screening.},
      added-at = {2019-11-13T10:18:46.000+0100},
      author = {Røe, Oluf Dimitri and Markaki, Maria and Tsamardinos, Ioannis and Lagani, Vincenzo and Nguyen, Olav Toai Duc and Pedersen, Jesper Holst and Saghir, Zaigham and Ashraf, Haseem Gary},
      biburl = {https://www.bibsonomy.org/bibtex/2b526991a742c19df51bba5671c8e2015/mensxmachina},
      doi = {10.1136/bmjresp-2019-000512},
      interhash = {d35f4774e9052e723730411fd1234172},
      intrahash = {b526991a742c19df51bba5671c8e2015},
      journal = {BMJ Open Respiratory Research },
      keywords = {cancer lung},
      number = 1,
      timestamp = {2019-11-13T10:27:39.000+0100},
      title = {‘Reduced’ HUNT model outperforms NLST and NELSON study criteria in predicting lung cancer in the Danish screening trial },
      url = {https://bmjopenrespres.bmj.com/content/bmjresp/6/1/e000512.full.pdf},
      volume = 6,
      year = 2019
    }
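
    A hedged sketch of a "reduced" risk model of this general form appears below: a logistic regression on age, pack-years, smoking intensity, quit time, BMI and sex, used to rank individuals by predicted risk rather than applying hard eligibility cut-offs. All data, coefficients and variable names are synthetic; this is not the HUNT model.

    # Illustrative risk-ranking model on synthetic smoking/demographic variables.
    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    n = 4000
    df = pd.DataFrame({
        "age": rng.uniform(50, 70, n),
        "pack_years": rng.gamma(4, 8, n),
        "cigs_per_day": rng.uniform(5, 40, n),
        "quit_years": rng.exponential(3, n),
        "bmi": rng.normal(25, 4, n),
        "sex": rng.integers(0, 2, n),
    })
    # Synthetic outcome loosely driven by age and pack-years.
    logit = -8 + 0.06 * df["age"] + 0.03 * df["pack_years"]
    y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

    model = LogisticRegression(max_iter=1000).fit(df, y)
    df["risk"] = model.predict_proba(df)[:, 1]
    top = df.sort_values("risk", ascending=False).head(10)   # highest-risk individuals
    print(top[["age", "pack_years", "risk"]].round(2))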

  • G. Papoutsoglou, V. Lagani, A. Schmidt, K. Tsirlis, D. Gomez-Cabrero, J. Tegner, and I. Tsamardinos, "Challenges in the Multivariate Analysis of Mass Cytometry Data: The Effect of Randomization," Cytometry Part A, 2019. doi:10.1002/cyto.a.23908
    [BibTeX] [Abstract] [Download PDF]

    Cytometry by time‐of‐flight (CyTOF) has emerged as a high‐throughput single cell technology able to provide large samples of protein readouts. Already, there exists a large pool of advanced high‐dimensional analysis algorithms that explore the observed heterogeneous distributions making intriguing biological inferences. A fact largely overlooked by these methods, however, is the effect of the established data preprocessing pipeline to the distributions of the measured quantities. In this article, we focus on randomization, a transformation used for improving data visualization, which can negatively affect multivariate data analysis methods such as dimensionality reduction, clustering, and network reconstruction algorithms. Our results indicate that randomization should be used only for visualization purposes, but not in conjunction with high‐dimensional analytical tools. © 2019 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.

    @article{papoutsoglou2019challenges,
      abstract = {Cytometry by time‐of‐flight (CyTOF) has emerged as a high‐throughput single cell technology able to provide large samples of protein readouts. Already, there exists a large pool of advanced high‐dimensional analysis algorithms that explore the observed heterogeneous distributions making intriguing biological inferences. A fact largely overlooked by these methods, however, is the effect of the established data preprocessing pipeline to the distributions of the measured quantities. In this article, we focus on randomization, a transformation used for improving data visualization, which can negatively affect multivariate data analysis methods such as dimensionality reduction, clustering, and network reconstruction algorithms. Our results indicate that randomization should be used only for visualization purposes, but not in conjunction with high‐dimensional analytical tools. © 2019 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.},
      added-at = {2019-11-06T12:21:03.000+0100},
      author = {Papoutsoglou, Georgios and Lagani, Vincenzo and Schmidt, Angelika and Tsirlis, Konstantinos and Gomez-Cabrero, David and Tegner, Jesper and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/2ec8d495b1d35604f30b6fccbbb888292/mensxmachina},
      doi = {https://doi.org/10.1002/cyto.a.23908},
      interhash = {59ca510bf7d57dfbb97a59313975672e},
      intrahash = {ec8d495b1d35604f30b6fccbbb888292},
      journal = {Cytometry Part A},
      keywords = {mxmcausalpath},
      timestamp = {2021-03-08T12:18:30.000+0100},
      title = {Challenges in the Multivariate Analysis of Mass Cytometry Data: The Effect of Randomization},
      url = {https://onlinelibrary.wiley.com/doi/full/10.1002/cyto.a.23908},
      year = 2019
    }
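
    The preprocessing step discussed above can be mimicked on synthetic counts: here "randomization" adds uniform noise drawn from [-1, 0) to integer ion counts (an assumption made for illustration), and the sketch simply compares a correlation computed before and after the transformation. The paper's analyses of dimensionality reduction, clustering and network reconstruction go far beyond this.

    # Effect of count randomization on a simple statistic (synthetic Poisson counts).
    import numpy as np

    rng = np.random.default_rng(0)
    n_cells = 5000
    base = rng.poisson(1.0, n_cells)                  # a dim, low-count channel
    coupled = rng.poisson(base + 0.5)                 # a channel correlated with it

    def randomize(counts, rng):
        # Add uniform noise to integer counts, as done for visualization purposes.
        return counts + rng.uniform(-1.0, 0.0, size=counts.shape)

    r_raw = np.corrcoef(base, coupled)[0, 1]
    r_rand = np.corrcoef(randomize(base, rng), randomize(coupled, rng))[0, 1]
    print(f"correlation on raw counts:       {r_raw:.3f}")
    print(f"correlation after randomization: {r_rand:.3f}")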

  • D. Gomez-Cabrero, S. Tarazona, I. Ferreirós-Vidal, R. N. Ramirez, C. Company, A. Schmidt, T. Reijmers, V. von Saint Paul, F. Marabita, J. Rodríguez-Ubreva, A. Garcia-Gomez, T. Carroll, L. Cooper, Z. Liang, G. Dharmalingam, F. van der Kloet, A. C. Harms, L. Balzano-Nogueira, V. Lagani, I. Tsamardinos, M. Lappe, D. Maier, J. A. Westerhuis, T. Hankemeier, A. Imhof, E. Ballestar, A. Mortazavi, M. Merkenschlager, J. Tegner, and A. Conesa, "STATegra, a comprehensive multi-omics dataset of B-cell differentiation in mouse," Scientific Data, vol. 6, iss. 1, 2019. doi:10.1038/s41597-019-0202-7
    [BibTeX] [Abstract] [Download PDF]

    Multi-omics approaches use a diversity of high-throughput technologies to profile the different molecular layers of living cells. Ideally, the integration of this information should result in comprehensive systems models of cellular physiology and regulation. However, most multi-omics projects still include a limited number of molecular assays and there have been very few multi-omic studies that evaluate dynamic processes such as cellular growth, development and adaptation. Hence, we lack formal analysis methods and comprehensive multi-omics datasets that can be leveraged to develop true multi-layered models for dynamic cellular systems. Here we present the STATegra multi-omics dataset that combines measurements from up to 10 different omics technologies applied to the same biological system, namely the well-studied mouse pre-B-cell differentiation. STATegra includes high-throughput measurements of chromatin structure, gene expression, proteomics and metabolomics, and it is complemented with single-cell data. To our knowledge, the STATegra collection is the most diverse multi-omics dataset describing a dynamic biological system.

    @article{Gomez_Cabrero_2019,
      abstract = {Multi-omics approaches use a diversity of high-throughput technologies to profile the different molecular layers of living cells. Ideally, the integration of this information should result in comprehensive systems models of cellular physiology and regulation. However, most multi-omics projects still include a limited number of molecular assays and there have been very few multi-omic studies that evaluate dynamic processes such as cellular growth, development and adaptation. Hence, we lack formal analysis methods and comprehensive multi-omics datasets that can be leveraged to develop true multi-layered models for dynamic cellular systems. Here we present the STATegra multi-omics dataset that combines measurements from up to 10 different omics technologies applied to the same biological system, namely the well-studied mouse pre-B-cell differentiation. STATegra includes high-throughput measurements of chromatin structure, gene expression, proteomics and metabolomics, and it is complemented with single-cell data. To our knowledge, the STATegra collection is the most diverse multi-omics dataset describing a dynamic biological system.},
      added-at = {2019-11-04T10:47:39.000+0100},
      author = {Gomez-Cabrero, David and Tarazona, Sonia and Ferreir{\'{o}}s-Vidal, Isabel and Ramirez, Ricardo N. and Company, Carlos and Schmidt, Andreas and Reijmers, Theo and von Saint Paul, Veronica and Marabita, Francesco and Rodr{\'{\i}}guez-Ubreva, Javier and Garcia-Gomez, Antonio and Carroll, Thomas and Cooper, Lee and Liang, Ziwei and Dharmalingam, Gopuraja and van der Kloet, Frans and Harms, Amy C. and Balzano-Nogueira, Leandro and Lagani, Vincenzo and Tsamardinos, Ioannis and Lappe, Michael and Maier, Dieter and Westerhuis, Johan A. and Hankemeier, Thomas and Imhof, Axel and Ballestar, Esteban and Mortazavi, Ali and Merkenschlager, Matthias and Tegner, Jesper and Conesa, Ana},
      biburl = {https://www.bibsonomy.org/bibtex/26352a4655343755af6e0855281f6943f/mensxmachina},
      doi = {10.1038/s41597-019-0202-7},
      interhash = {1b1997f9cf4a39e4886fd0fb6384ac41},
      intrahash = {6352a4655343755af6e0855281f6943f},
      journal = {Scientific Data},
      keywords = {mmm},
      month = oct,
      number = 1,
      publisher = {Springer Science and Business Media {LLC}},
      timestamp = {2019-11-04T10:47:39.000+0100},
      title = {{STATegra}, a comprehensive multi-omics dataset of B-cell differentiation in mouse},
      url = {https://doi.org/10.1038%2Fs41597-019-0202-7},
      volume = 6,
      year = 2019
    }

  • K. Lakiotaki, G. Georgakopoulos, E. Castanas, O. D. Røe, G. Borboudakis, and I. Tsamardinos, "A data driven approach reveals disease similarity on a molecular level," npj Systems Biology and Applications, vol. 5, iss. 39, pp. 1-10, 2019. doi:10.1038/s41540-019-0117-0
    [BibTeX] [Abstract] [Download PDF]

    Could there be unexpected similarities between different studies, diseases, or treatments, on a molecular level due to common biological mechanisms involved? To answer this question, we develop a method for computing similarities between empirical, statistical distributions of high-dimensional, low-sample datasets, and apply it on hundreds of -omics studies. The similarities lead to dataset-to-dataset networks visualizing the landscape of a large portion of biological data. Potentially interesting similarities connecting studies of different diseases are assembled in a disease-to-disease network. Exploring it, we discover numerous non-trivial connections between Alzheimer’s disease and schizophrenia, asthma and psoriasis, or liver cancer and obesity, to name a few. We then present a method that identifies the molecular quantities and pathways that contribute the most to the identified similarities and could point to novel drug targets or provide biological insights. The proposed method acts as a “statistical telescope” providing a global view of the constellation of biological data; readers can peek through it at: http://datascope.csd.uoc.gr:25000/.

    @article{lakiotaki2019data,
      abstract = {Could there be unexpected similarities between different studies, diseases, or treatments, on a molecular level due to common biological mechanisms involved? To answer this question, we develop a method for computing similarities between empirical, statistical distributions of high-dimensional, low-sample datasets, and apply it on hundreds of -omics studies. The similarities lead to dataset-to-dataset networks visualizing the landscape of a large portion of biological data. Potentially interesting similarities connecting studies of different diseases are assembled in a disease-to-disease network. Exploring it, we discover numerous non-trivial connections between Alzheimer’s disease and schizophrenia, asthma and psoriasis, or liver cancer and obesity, to name a few. We then present a method that identifies the molecular quantities and pathways that contribute the most to the identified similarities and could point to novel drug targets or provide biological insights. The proposed method acts as a “statistical telescope” providing a global view of the constellation of biological data; readers can peek through it at: http://datascope.csd.uoc.gr:25000/.},
      added-at = {2019-10-29T11:29:09.000+0100},
      author = {Lakiotaki, Kleanthi and Georgakopoulos, George and Castanas, Elias and Røe, Oluf Dimitri and Borboudakis, Giorgos and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/23c018b31e2b3f946bde052b414e4ea82/mensxmachina},
      doi = {10.1038/s41540-019-0117-0},
      interhash = {e48ead7f0f6f503fe7647117214a3059},
      intrahash = {3c018b31e2b3f946bde052b414e4ea82},
      journal = {npj Systems Biology and Applications},
      keywords = {mxmcausalpath},
      month = oct,
      number = 39,
      pages = {1-10},
      timestamp = {2021-03-08T12:20:05.000+0100},
      title = {A data driven approach reveals disease similarity on a molecular level},
      url = {https://www.nature.com/articles/s41540-019-0117-0},
      volume = 5,
      year = 2019
    }
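
    For a feel of the kind of dataset-to-dataset comparison the paper describes, the toy Python sketch below scores the similarity of two studies measured on the same features by averaging two-sample Kolmogorov-Smirnov statistics over columns. This is a generic stand-in, not the paper's method, and the function and variable names are ours.

    import numpy as np
    from scipy.stats import ks_2samp

    def dataset_distance(A, B):
        """Crude distance between two datasets measured on the same features:
        the mean two-sample KS statistic over columns (0 = identical marginals)."""
        assert A.shape[1] == B.shape[1], "datasets must share the same features"
        return float(np.mean([ks_2samp(A[:, j], B[:, j]).statistic
                              for j in range(A.shape[1])]))

    # toy usage: two small "studies" over 50 shared genes
    rng = np.random.default_rng(0)
    study1 = rng.normal(size=(30, 50))
    study2 = rng.normal(loc=0.5, size=(25, 50))
    print(dataset_distance(study1, study2))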

  • M. Tsagris and I. Tsamardinos, "Feature selection with the R package MXM," F1000Research, vol. 7, p. 1505, 2019. doi:10.12688/f1000research.16216.2
    [BibTeX] [Abstract]

    Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only a few have been implemented in R and made publicly available as R packages while offering few options. The R package MXM offers a variety of feature selection algorithms, and has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc; b) it contains a variety of regression models that can be plugged into the feature selection algorithms (for example with time to event data the user can choose among Cox, Weibull, log logistic or exponential regression); c) it includes an algorithm for detecting multiple solutions (many sets of statistically equivalent features, plain speaking, two features can carry statistically equivalent information when substituting one with the other does not affect the inference or the conclusions); and d) it includes memory efficient algorithms for high volume data, data that cannot be loaded into R (In a 16GB RAM terminal for example, R cannot directly load data of 16GB size. By utilizing the proper package, we load the data and then perform feature selection.). In this paper, we qualitatively compare MXM with other relevant feature selection packages and discuss its advantages and disadvantages. Further, we provide a demonstration of MXM’s algorithms using real high-dimensional data from various applications.

    @article{tsagris2019feature,
      abstract = {Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only few have been implemented in R and made publicly available R as packages while offering few options. The R package MXM offers a variety of feature selection algorithms, and has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc; b) it contains a variety of regression models that can be plugged into the feature selection algorithms (for example with time to event data the user can choose among Cox, Weibull, log logistic or exponential regression); c) it includes an algorithm for detecting multiple solutions (many sets of statistically equivalent features, plain speaking, two features can carry statistically equivalent information when substituting one with the other does not effect the inference or the conclusions); and d) it includes memory efficient algorithms for high volume data, data that cannot be loaded into R (In a 16GB RAM terminal for example, R cannot directly load data of 16GB size. By utilizing the proper package, we load the data and then perform feature selection.). In this paper, we qualitatively compare MXM with other relevant feature selection packages and discuss its advantages and disadvantages. Further, we provide a demonstration of MXM’s algorithms using real high-dimensional data from various applications.
    },
      added-at = {2019-10-15T10:34:55.000+0200},
      author = {Tsagris, M and Tsamardinos, I},
      biburl = {https://www.bibsonomy.org/bibtex/2e781d8b1b5f054e4b44da2aa2439fa94/mensxmachina},
      doi = {10.12688/f1000research.16216.2},
      interhash = {eecad4526b5bcd1e6f1ea321c636118f},
      intrahash = {e781d8b1b5f054e4b44da2aa2439fa94},
      journal = {F1000Research},
      keywords = {mxmcausalpath},
      pages = 1505,
      timestamp = {2021-03-08T12:27:47.000+0100},
      title = {Feature selection with the R package MXM},
      volume = 7,
      year = 2019
    }
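
    The notion of statistically equivalent features mentioned in point (c) above can be illustrated with a crude check: two features are roughly interchangeable for predicting y if each becomes non-significant once the other is already in the model. The Python sketch below is only a toy version of that idea with a linear model; it is not MXM's algorithm, and all names are illustrative.

    import numpy as np
    import statsmodels.api as sm

    def roughly_equivalent(y, xa, xb, alpha=0.05):
        """Naive check of 'statistical equivalence' for predicting y: each feature
        becomes non-significant once the other one is already in a linear model."""
        def p_last(x_test, x_given):
            Z = sm.add_constant(np.column_stack([x_given, x_test]))
            return sm.OLS(y, Z).fit().pvalues[-1]   # p-value of x_test's coefficient
        return p_last(xa, xb) > alpha and p_last(xb, xa) > alpha

    # toy usage: xb is a near-copy of xa, so either one can stand in for the other
    rng = np.random.default_rng(1)
    xa = rng.normal(size=200)
    xb = xa + 0.02 * rng.normal(size=200)
    y = 2 * xa + rng.normal(size=200)
    print(roughly_equivalent(y, xa, xb))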

  • D. Kyriakis, A. Kanterakis, T. Manousaki, A. Tsakogiannis, M. Tsagris, I. Tsamardinos, L. Papaharisis, D. Chatziplis, G. Potamias, and C. Tsigenopoulos, "Scanning of Genetic Variants and Genetic Mapping of Phenotypic Traits in Gilthead Sea Bream Through ddRAD Sequencing," Frontiers in Genetics, vol. 10, p. 675, 2019. doi:10.3389/fgene.2019.00675
    [BibTeX] [Abstract]

    Gilthead sea bream (Sparus aurata) is a teleost of considerable economic importance in Southern European aquaculture. The aquaculture industry shows a growing interest in the application of genetic methods that can locate phenotype–genotype associations with high economic impact. Through selective breeding, the aquaculture industry can exploit this information to maximize the financial yield. Here, we present a Genome Wide Association Study (GWAS) of 112 samples belonging to seven different sea bream families collected from a Greek commercial aquaculture company. Through double digest Random Amplified DNA (ddRAD) Sequencing, we generated a per-sample genetic profile consisting of 2,258 high-quality Single Nucleotide Polymorphisms (SNPs). These profiles were tested for association with four phenotypes of major financial importance: Fat, Weight, Tag Weight, and the Length to Width ratio. We applied two methods of association analysis. The first is the typical single-SNP to phenotype test, and the second is a feature selection (FS) method through two novel algorithms that are employed for the first time in aquaculture genomics and produce groups with multiple SNPs associated to a phenotype. In total, we identified 9 single SNPs and 6 groups of SNPs associated with weight-related phenotypes (Weight and Tag Weight), 2 groups associated with Fat, and 16 groups associated with the Length to Width ratio. Six identified loci (Chr4:23265532, Chr6:12617755, Chr:8:11613979, Chr13:1098152, Chr15:3260819, and Chr22:14483563) were present in genes associated with growth in other teleosts or even mammals, such as semaphorin-3A and neurotrophin-3. These loci are strong candidates for future studies that will help us unveil the genetic mechanisms underlying growth and improve the sea bream aquaculture productivity by providing genomic anchors for selection programs.

    @article{kyriakis2019scanning,
      abstract = {Gilthead sea bream (Sparus aurata) is a teleost of considerable economic importance in Southern European aquaculture. The aquaculture industry shows a growing interest in the application of genetic methods that can locate phenotype–genotype associations with high economic impact. Through selective breeding, the aquaculture industry can exploit this information to maximize the financial yield. Here, we present a Genome Wide Association Study (GWAS) of 112 samples belonging to seven different sea bream families collected from a Greek commercial aquaculture company. Through double digest Random Amplified DNA (ddRAD) Sequencing, we generated a per-sample genetic profile consisting of 2,258 high-quality Single Nucleotide Polymorphisms (SNPs). These profiles were tested for association with four phenotypes of major financial importance: Fat, Weight, Tag Weight, and the Length to Width ratio. We applied two methods of association analysis. The first is the typical single-SNP to phenotype test, and the second is a feature selection (FS) method through two novel algorithms that are employed for the first time in aquaculture genomics and produce groups with multiple SNPs associated to a phenotype. In total, we identified 9 single SNPs and 6 groups of SNPs associated with weight-related phenotypes (Weight and Tag Weight), 2 groups associated with Fat, and 16 groups associated with the Length to Width ratio. Six identified loci (Chr4:23265532, Chr6:12617755, Chr:8:11613979, Chr13:1098152, Chr15:3260819, and Chr22:14483563) were present in genes associated with growth in other teleosts or even mammals, such as semaphorin-3A and neurotrophin-3. These loci are strong candidates for future studies that will help us unveil the genetic mechanisms underlying growth and improve the sea bream aquaculture productivity by providing genomic anchors for selection programs.
    },
      added-at = {2019-10-15T10:30:00.000+0200},
      author = {Kyriakis, Dimitrios and Kanterakis, Alexandros and Manousaki, Tereza and Tsakogiannis, Alexandros and Tsagris, Michalis and Tsamardinos, Ioannis and Papaharisis, Leonidas and Chatziplis, Dimitris and Potamias, George and Tsigenopoulos, Costas},
      biburl = {https://www.bibsonomy.org/bibtex/2dfc6576873fefd21bfc0dd0d979249fc/mensxmachina},
      doi = {10.3389/fgene.2019.00675},
      interhash = {09fd2ec870d7a4ee7ebcaf4a3d934a96},
      intrahash = {dfc6576873fefd21bfc0dd0d979249fc},
      journal = {Frontiers in Genetics},
      keywords = {dddd},
      pages = 675,
      timestamp = {2019-10-29T11:37:42.000+0100},
      title = {Scanning of Genetic Variants and Genetic Mapping of Phenotypic Traits in Gilthead Sea Bream Through ddRAD Sequencing},
      volume = 10,
      year = 2019
    }
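
    The "typical single-SNP to phenotype test" mentioned in the abstract is, in its simplest form, a per-SNP regression of the phenotype on genotype dosage followed by a multiple-testing correction. The sketch below shows only that generic baseline; it is not the authors' pipeline, and the simulated data and names are made up.

    import numpy as np
    import statsmodels.api as sm

    def single_snp_scan(genotypes, phenotype):
        """Per-SNP association p-values: phenotype ~ intercept + genotype dosage (0/1/2)."""
        pvals = []
        for j in range(genotypes.shape[1]):
            X = sm.add_constant(genotypes[:, j].astype(float))
            pvals.append(sm.OLS(phenotype, X).fit().pvalues[1])
        return np.asarray(pvals)

    # toy usage with a Bonferroni threshold
    rng = np.random.default_rng(2)
    G = rng.integers(0, 3, size=(112, 500))            # 112 fish, 500 SNPs
    weight = 0.8 * G[:, 7] + rng.normal(size=112)      # SNP 7 truly affects the trait
    p = single_snp_scan(G, weight)
    print(np.where(p < 0.05 / len(p))[0])              # indices passing Bonferroni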

  • J. Fernandes Sunja, H. Morikawa, E. Ewing, S. Ruhrmann, N. Joshi Rubin, V. Lagani, N. Karathanasis, M. Khademi, N. Planell, A. Schmidt, I. Tsamardinos, T. Olsson, F. Piehl, I. Kockum, M. Jagodic, J. Tegnér, and D. Gomez-Cabrero, "Non-parametric combination analysis of multiple data types enables detection of novel regulatory mechanisms in T cells of multiple sclerosis patients," Nature Scientific Reports, vol. 9, iss. 11996, 2019. doi:10.1038/s41598-019-48493-7
    [BibTeX] [Abstract] [Download PDF]

    Multiple Sclerosis (MS) is an autoimmune disease of the central nervous system with prominent neurodegenerative components. The triggering and progression of MS is associated with transcriptional and epigenetic alterations in several tissues, including peripheral blood. The combined influence of transcriptional and epigenetic changes associated with MS has not been assessed in the same individuals. Here we generated paired transcriptomic (RNA-seq) and DNA methylation (Illumina 450 K array) profiles of CD4+ and CD8+ T cells (CD4, CD8), using clinically accessible blood from healthy donors and MS patients in the initial relapsing-remitting and subsequent secondary-progressive stage. By integrating the output of a differential expression test with a permutation-based non-parametric combination methodology, we identified 149 differentially expressed (DE) genes in both CD4 and CD8 cells collected from MS patients. Moreover, by leveraging the methylation-dependent regulation of gene expression, we identified the gene SH3YL1, which displayed significant correlated expression and methylation changes in MS patients. Importantly, silencing of SH3YL1 in primary human CD4 cells demonstrated its influence on T cell activation. Collectively, our strategy based on paired sampling of several cell-types provides a novel approach to increase sensitivity for identifying shared mechanisms altered in CD4 and CD8 cells of relevance in MS in small sized clinical materials.

    @article{jude2019nonparametric,
      abstract = {Multiple Sclerosis (MS) is an autoimmune disease of the central nervous system with prominent neurodegenerative components. the triggering and progression of MS is associated with transcriptional and epigenetic alterations in several tissues, including peripheral blood. The combined influence of transcriptional and epigenetic changes associated with MS has not been assessed in the same individuals. Here we generated paired transcriptomic (RNA-seq) and DNA methylation (Illumina 450 K array) profiles of CD4+ and CD8+ T cells (CD4, CD8), using clinically accessible blood from healthy donors and MS patients in the initial relapsing-remitting and subsequent secondary-progressive stage. By integrating the output of a differential expression test with a permutation-based non-parametric combination methodology, we identified 149 differentially expressed (DE) genes in both CD4 and CD8 cells collected from MS patients. Moreover, by leveraging the methylation-dependent regulation of gene expression, we identified the gene SH3YL1, which displayed significant correlated expression and methylation changes in MS patients. Importantly, silencing of SH3YL1 in primary human CD4 cells demonstrated its influence on T cell activation. Collectively, our strategy based on paired sampling of several cell-types provides a novel approach to increase sensitivity for identifying shared mechanisms altered in CD4 and CD8 cells of relevance in MS in small sized clinical materials.},
      added-at = {2019-09-26T12:00:57.000+0200},
      author = {Fernandes Sunja, Jude and Morikawa, Hiromasa and Ewing, Ewoud and Ruhrmann, Sabrina and Joshi Rubin, Narayan and Lagani, Vincenzo and Karathanasis, Nestoras and Khademi, Mohsen and Planell, Nuria and Schmidt, Angelika and Tsamardinos, Ioannis and Olsson, Tomas and Piehl, Fredrik and Kockum, Ingrid and Jagodic, Maja and Tegnér, Jesper and Gomez-Cabrero, David},
      biburl = {https://www.bibsonomy.org/bibtex/24e8d52cdff48c449b171f359ee3961d7/mensxmachina},
      doi = {10.1038/s41598-019-48493-7},
      interhash = {da7a8c5930f294e6881f967fca95fe53},
      intrahash = {4e8d52cdff48c449b171f359ee3961d7},
      journal = {Nature Scientific Reports},
      keywords = {mxmcausalpath},
      month = {August},
      number = 11996,
      timestamp = {2021-03-10T08:58:55.000+0100},
      title = {Non-parametric combination analysis of multiple data types enables detection of novel regulatory mechanisms in T cells of multiple sclerosis patients},
      url = {https://www.nature.com/articles/s41598-019-48493-7},
      volume = 9,
      year = 2019
    }
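
    A generic illustration of the permutation-based non-parametric combination idea referred to above: one joint relabelling is applied to every data type, partial permutation p-values are computed per type, and Fisher's combining function yields a single global p-value. This is a textbook-style sketch, not the authors' code; stat_fn, the toy data and all names are our own.

    import numpy as np

    def npc_fisher(y, views, stat_fn, n_perm=999, rng=None):
        """Generic non-parametric combination: one joint permutation of the labels is
        applied to every data type ('view'), partial permutation p-values are computed
        per view, and Fisher's combining function gives a single global p-value."""
        rng = np.random.default_rng(rng)
        T = np.empty((n_perm + 1, len(views)))
        T[0] = [stat_fn(v, y) for v in views]                  # observed statistics
        for b in range(1, n_perm + 1):
            yp = rng.permutation(y)                            # same relabelling for all views
            T[b] = [stat_fn(v, yp) for v in views]
        P = np.mean(T[None, :, :] >= T[:, None, :], axis=1)    # partial p-values, all rows
        C = -2.0 * np.log(np.clip(P, 1e-12, 1.0)).sum(axis=1)  # Fisher combining function
        return float(np.mean(C >= C[0]))                       # global permutation p-value

    # toy usage: two 'omics' views, only the first carries a group difference
    rng = np.random.default_rng(3)
    labels = np.repeat([0, 1], 20)
    view1 = rng.normal(size=(40, 5)) + labels[:, None] * 0.8
    view2 = rng.normal(size=(40, 5))
    diff = lambda v, y: abs(v[y == 1].mean() - v[y == 0].mean())
    print(npc_fisher(labels, [view1, view2], diff))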

  • E. Ewing, L. Kular, S. J. Fernandes, N. Karathanasis, V. Lagani, S. Ruhrmann, I. Tsamardinos, J. Tegner, F. Piehl, D. Gomez-Cabrero, and M. Jagodic, "Combining evidence from four immune cell types identifies DNA methylation patterns that implicate functionally distinct pathways during Multiple Sclerosis progression," EBioMedicine, vol. 43, pp. 411-423, 2019. doi:10.1016/j.ebiom.2019.04.042
    [BibTeX] [Abstract] [Download PDF]

    Background: Multiple Sclerosis (MS) is a chronic inflammatory disease and a leading cause of progressive neurological disability among young adults. DNA methylation, which intersects genes and environment to control cellular functions on a molecular level, may provide insights into MS pathogenesis. Methods: We measured DNA methylation in CD4+ T cells (n = 31), CD8+ T cells (n = 28), CD14+ monocytes (n = 35) and CD19+ B cells (n = 27) from relapsing-remitting (RRMS), secondary progressive (SPMS) patients and healthy controls (HC) using Infinium HumanMethylation450 arrays. Monocyte (n = 25) and whole blood (n = 275) cohorts were used for validations. Findings: B cells from MS patients displayed most significant differentially methylated positions (DMPs), followed by monocytes, while only few DMPs were detected in T cells. We implemented a non-parametric combination framework (omicsNPC) to increase discovery power by combining evidence from all four cell types. Identified shared DMPs co-localized at MS risk loci and clustered into distinct groups. Functional exploration of changes discriminating RRMS and SPMS from HC implicated lymphocyte signaling, T cell activation and migration. SPMS-specific changes, on the other hand, implicated myeloid cell functions and metabolism. Interestingly, neuronal and neurodegenerative genes and pathways were also specifically enriched in the SPMS cluster. Interpretation: We utilized a statistical framework (omicsNPC) that combines multiple layers of evidence to identify DNA methylation changes that provide new insights into MS pathogenesis in general, and disease progression, in particular. Fund: This work was supported by the Swedish Research Council, Stockholm County Council, AstraZeneca, European Research Council, Karolinska Institutet and Margaretha af Ugglas Foundation.

    @article{Ewing_2019,
      abstract = {Background
    
    Multiple Sclerosis (MS) is a chronic inflammatory disease and a leading cause of progressive neurological disability among young adults. DNA methylation, which intersects genes and environment to control cellular functions on a molecular level, may provide insights into MS pathogenesis.
    Methods
    
    We measured DNA methylation in CD4+ T cells (n = 31), CD8+ T cells (n = 28), CD14+ monocytes (n = 35) and CD19+ B cells (n = 27) from relapsing-remitting (RRMS), secondary progressive (SPMS) patients and healthy controls (HC) using Infinium HumanMethylation450 arrays. Monocyte (n = 25) and whole blood (n = 275) cohorts were used for validations.
    Findings
    
    B cells from MS patients displayed most significant differentially methylated positions (DMPs), followed by monocytes, while only few DMPs were detected in T cells. We implemented a non-parametric combination framework (omicsNPC) to increase discovery power by combining evidence from all four cell types. Identified shared DMPs co-localized at MS risk loci and clustered into distinct groups. Functional exploration of changes discriminating RRMS and SPMS from HC implicated lymphocyte signaling, T cell activation and migration. SPMS-specific changes, on the other hand, implicated myeloid cell functions and metabolism. Interestingly, neuronal and neurodegenerative genes and pathways were also specifically enriched in the SPMS cluster.
    Interpretation
    
    We utilized a statistical framework (omicsNPC) that combines multiple layers of evidence to identify DNA methylation changes that provide new insights into MS pathogenesis in general, and disease progression, in particular.
    Fund
    
    This work was supported by the Swedish Research Council, Stockholm County Council, AstraZeneca, European Research Council, Karolinska Institutet and Margaretha af Ugglas Foundation.},
      added-at = {2019-08-21T10:15:49.000+0200},
      author = {Ewing, Ewoud and Kular, Lara and Fernandes, Sunjay J. and Karathanasis, Nestoras and Lagani, Vincenzo and Ruhrmann, Sabrina and Tsamardinos, Ioannis and Tegner, Jesper and Piehl, Fredrik and Gomez-Cabrero, David and Jagodic, Maja},
      biburl = {https://www.bibsonomy.org/bibtex/2248e805997be2f24632112b442ef4e4b/mensxmachina},
      doi = {10.1016/j.ebiom.2019.04.042},
      interhash = {d968457cbd123773e47fe5018137eaa2},
      intrahash = {248e805997be2f24632112b442ef4e4b},
      journal = {{EBioMedicine}},
      keywords = {mxmcausalpath},
      month = may,
      pages = {411--423},
      publisher = {Elsevier {BV}},
      timestamp = {2021-03-10T09:00:03.000+0100},
      title = {Combining evidence from four immune cell types identifies {DNA} methylation patterns that implicate functionally distinct pathways during Multiple Sclerosis progression},
      url = {https://doi.org/10.1016%2Fj.ebiom.2019.04.042},
      volume = 43,
      year = 2019
    }

  • M. S. Loos, R. Ramakrishnan, W. Vranken, A. Tsirigotaki, E. Tsare, V. Zorzini, J. D. Geyter, B. Yuan, I. Tsamardinos, M. Klappa, J. Schymkowitz, F. Rousseau, S. Karamanou, and A. Economou, "Structural Basis of the Subcellular Topology Landscape of Escherichia coli," Frontiers in Microbiology, vol. 10, 2019. doi:10.3389/fmicb.2019.01670
    [BibTeX] [Abstract] [Download PDF]

    Cellular proteomes are distributed in multiple compartments: on DNA, ribosomes, on and inside membranes, or they become secreted. Structural properties that allow polypeptides to occupy subcellular niches, particularly after crossing membranes, remain unclear. We compared intrinsic and extrinsic features in cytoplasmic and secreted polypeptides of the Escherichia coli K-12 proteome. Structural features between the cytoplasmome and secretome are sharply distinct, such that a signal peptide-agnostic machine learning tool distinguishes cytoplasmic from secreted proteins with 95.5% success. Cytoplasmic polypeptides are enriched in aliphatic, aromatic, charged and hydrophobic residues, unique folds and higher early folding propensities. Secretory polypeptides are enriched in polar/small amino acids, β folds, have higher backbone dynamics, higher disorder and contact order and are more often intrinsically disordered. These non-random distributions and experimental evidence imply that evolutionary pressure selected enhanced secretome flexibility, slow folding and looser structures, placing the secretome in a distinct protein class. These adaptations protect the secretome from premature folding during its cytoplasmic transit, optimize its lipid bilayer crossing and allowed it to acquire cell envelope specific chemistries. The latter may favor promiscuous multi-ligand binding, sensing of stress and cell envelope structure changes. In conclusion, enhanced flexibility, slow folding, looser structures and unique folds differentiate the secretome from the cytoplasmome. These findings have wide implications on the structural diversity and evolution of modern proteomes and the protein folding problem.

    @article{Loos_2019,
      abstract = {Cellular proteomes are distributed in multiple compartments: on DNA, ribosomes, on and inside membranes, or they become secreted. Structural properties that allow polypeptides to occupy subcellular niches, particularly to after crossing membranes, remain unclear. We compared intrinsic and extrinsic features in cytoplasmic and secreted polypeptides of the Escherichia coli K-12 proteome. Structural features between the cytoplasmome and secretome are sharply distinct, such that a signal peptide-agnostic machine learning tool distinguishes cytoplasmic from secreted proteins with 95.5% success. Cytoplasmic polypeptides are enriched in aliphatic, aromatic, charged and hydrophobic residues, unique folds and higher early folding propensities. Secretory polypeptides are enriched in polar/small amino acids, β folds, have higher backbone dynamics, higher disorder and contact order and are more often intrinsically disordered. These non-random distributions and experimental evidence imply that evolutionary pressure selected enhanced secretome flexibility, slow folding and looser structures, placing the secretome in a distinct protein class. These adaptations protect the secretome from premature folding during its cytoplasmic transit, optimize its lipid bilayer crossing and allowed it to acquire cell envelope specific chemistries. The latter may favor promiscuous multi-ligand binding, sensing of stress and cell envelope structure changes. In conclusion, enhanced flexibility, slow folding, looser structures and unique folds differentiate the secretome from the cytoplasmome. These findings have wide implications on the structural diversity and evolution of modern proteomes and the protein folding problem.},
      added-at = {2019-08-21T10:13:49.000+0200},
      author = {Loos, Maria S. and Ramakrishnan, Reshmi and Vranken, Wim and Tsirigotaki, Alexandra and Tsare, Evrydiki-Pandora and Zorzini, Valentina and Geyter, Jozefien De and Yuan, Biao and Tsamardinos, Ioannis and Klappa, Maria and Schymkowitz, Joost and Rousseau, Frederic and Karamanou, Spyridoula and Economou, Anastassios},
      biburl = {https://www.bibsonomy.org/bibtex/252185c42a364f2d694e2ab73919ae419/mensxmachina},
      doi = {10.3389/fmicb.2019.01670},
      interhash = {01b19ff85dd7e5afe5d2443997f749a5},
      intrahash = {52185c42a364f2d694e2ab73919ae419},
      journal = {Frontiers in Microbiology},
      keywords = {mxmcausalpath},
      month = jul,
      publisher = {Frontiers Media {SA}},
      timestamp = {2021-03-10T09:01:43.000+0100},
      title = {Structural Basis of the Subcellular Topology Landscape of Escherichia coli},
      url = {https://doi.org/10.3389%2Ffmicb.2019.01670},
      volume = 10,
      year = 2019
    }

  • I. Ferreirós-Vidal, T. Carroll, T. Zhang, V. Lagani, R. N. Ramirez, E. Ing-Simmons, A. Garcia, L. Cooper, Z. Liang, G. Papoutsoglou, G. Dharmalingam, Y. Guo, S. Tarazona, S. J. Fernandes, P. Noori, G. Silberberg, A. G. Fisher, I. Tsamardinos, A. Mortazavi, B. Lenhard, A. Conesa, J. Tegner, M. Merkenschlager, and D. Gomez-Cabrero, "Feedforward regulation of Myc coordinates lineage-specific with housekeeping gene expression during B cell progenitor cell differentiation," PLOS Biology, vol. 17, iss. 4, pp. 1-28, 2019. doi:10.1371/journal.pbio.2006506
    [BibTeX] [Abstract] [Download PDF]

    The human body is made from billions of cells comprising many specialized cell types. All of these cells ultimately come from a single fertilized oocyte in a process that has two key features: proliferation, which expands cell numbers, and differentiation, which diversifies cell types. Here, we have examined the transition from proliferation to differentiation using B lymphocytes as an example. We find that the transition from proliferation to differentiation involves changes in the expression of genes, which can be categorized into cell-type–specific genes and broadly expressed “housekeeping” genes. The expression of many housekeeping genes is controlled by the gene regulatory factor Myc, whereas the expression of many B lymphocyte–specific genes is controlled by the Ikaros family of gene regulatory proteins. Myc is repressed by Ikaros, which means that changes in housekeeping and tissue-specific gene expression are coordinated during the transition from proliferation to differentiation.

    @article{10.1371/journal.pbio.2006506,
      abstract = {The human body is made from billions of cells comprizing many specialized cell types. All of these cells ultimately come from a single fertilized oocyte in a process that has two key features: proliferation, which expands cell numbers, and differentiation, which diversifies cell types. Here, we have examined the transition from proliferation to differentiation using B lymphocytes as an example. We find that the transition from proliferation to differentiation involves changes in the expression of genes, which can be categorized into cell-type–specific genes and broadly expressed “housekeeping” genes. The expression of many housekeeping genes is controlled by the gene regulatory factor Myc, whereas the expression of many B lymphocyte–specific genes is controlled by the Ikaros family of gene regulatory proteins. Myc is repressed by Ikaros, which means that changes in housekeeping and tissue-specific gene expression are coordinated during the transition from proliferation to differentiation.},
      added-at = {2019-04-15T10:34:19.000+0200},
      author = {Ferreirós-Vidal, Isabel and Carroll, Thomas and Zhang, Tianyi and Lagani, Vincenzo and Ramirez, Ricardo N. and Ing-Simmons, Elizabeth and Garcia, Alicia and Cooper, Lee and Liang, Ziwei and Papoutsoglou, Georgios and Dharmalingam, Gopuraja and Guo, Ya and Tarazona, Sonia and Fernandes, Sunjay J. and Noori, Peri and Silberberg, Gilad and Fisher, Amanda G. and Tsamardinos, Ioannis and Mortazavi, Ali and Lenhard, Boris and Conesa, Ana and Tegner, Jesper and Merkenschlager, Matthias and Gomez-Cabrero, David},
      biburl = {https://www.bibsonomy.org/bibtex/2bd3e0f1a5421ea097c5f5c72221afddf/mensxmachina},
      doi = {10.1371/journal.pbio.2006506},
      interhash = {47806e111971adfc2e5769052393b71e},
      intrahash = {bd3e0f1a5421ea097c5f5c72221afddf},
      journal = {PLOS Biology},
      keywords = {mxmcausalpath},
      month = {04},
      number = 4,
      pages = {1-28},
      publisher = {Public Library of Science},
      timestamp = {2021-03-10T09:19:14.000+0100},
      title = {Feedforward regulation of Myc coordinates lineage-specific with housekeeping gene expression during B cell progenitor cell differentiation},
      url = {https://doi.org/10.1371/journal.pbio.2006506},
      volume = 17,
      year = 2019
    }

  • Y. Pantazis and I. Tsamardinos, "A unified approach for sparse dynamical system inference from temporal measurements," Bioinformatics, 2019. doi:10.1093/bioinformatics/btz065
    [BibTeX] [Abstract] [Download PDF]

    Temporal variations in biological systems and more generally in natural sciences are typically modeled as a set of ordinary, partial or stochastic differential or difference equations. Algorithms for learning the structure and the parameters of a dynamical system are distinguished based on whether time is discrete or continuous, observations are time-series or time-course and whether the system is deterministic or stochastic; however, there is no approach able to handle the various types of dynamical systems simultaneously. In this paper, we present a unified approach to infer both the structure and the parameters of non-linear dynamical systems of any type under the restriction of being linear with respect to the unknown parameters. Our approach, which is named Unified Sparse Dynamics Learning (USDL), consists of two steps. First, an atemporal system of equations is derived through the application of the weak formulation. Then, assuming a sparse representation for the dynamical system, we show that the inference problem can be expressed as a sparse signal recovery problem, allowing the application of an extensive body of algorithms and theoretical results. Results on simulated data demonstrate the efficacy and superiority of the USDL algorithm under multiple interventions and/or stochasticity. Additionally, USDL’s accuracy significantly correlates with theoretical metrics such as the exact recovery coefficient. On real single-cell data, the proposed approach is able to induce high-confidence subgraphs of the signaling pathway. Source code is available at Bioinformatics online. The USDL algorithm has also been integrated in SCENERY (http://scenery.csd.uoc.gr/), an online tool for single-cell mass cytometry analytics. Supplementary data are available at Bioinformatics online.

    @article{10.1093/bioinformatics/btz065,
      abstract = {Temporal variations in biological systems and more generally in natural sciences are typically modeled as a set of ordinary, partial or stochastic differential or difference equations. Algorithms for learning the structure and the parameters of a dynamical system are distinguished based on whether time is discrete or continuous, observations are time-series or time-course and whether the system is deterministic or stochastic, however, there is no approach able to handle the various types of dynamical systems simultaneously.In this paper, we present a unified approach to infer both the structure and the parameters of non-linear dynamical systems of any type under the restriction of being linear with respect to the unknown parameters. Our approach, which is named Unified Sparse Dynamics Learning (USDL), constitutes of two steps. First, an atemporal system of equations is derived through the application of the weak formulation. Then, assuming a sparse representation for the dynamical system, we show that the inference problem can be expressed as a sparse signal recovery problem, allowing the application of an extensive body of algorithms and theoretical results. Results on simulated data demonstrate the efficacy and superiority of the USDL algorithm under multiple interventions and/or stochasticity. Additionally, USDL’s accuracy significantly correlates with theoretical metrics such as the exact recovery coefficient. On real single-cell data, the proposed approach is able to induce high-confidence subgraphs of the signaling pathway.Source code is available at Bioinformatics online. USDL algorithm has been also integrated in SCENERY (http://scenery.csd.uoc.gr/); an online tool for single-cell mass cytometry analytics.Supplementary data are available at Bioinformatics online.},
      added-at = {2019-03-06T13:27:40.000+0100},
      author = {Pantazis, Yannis and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/2662a8f26ef15e0b79a88593ddc0574fd/mensxmachina},
      doi = {10.1093/bioinformatics/btz065},
      eprint = {http://oup.prod.sis.lan/bioinformatics/advance-article-pdf/doi/10.1093/bioinformatics/btz065/27980298/btz065.pdf},
      interhash = {da2952f9b07f650ea261d6ba83859a43},
      intrahash = {662a8f26ef15e0b79a88593ddc0574fd},
      journal = {Bioinformatics},
      keywords = {mxmcausalpath},
      month = {01},
      timestamp = {2021-03-10T09:20:39.000+0100},
      title = {A unified approach for sparse dynamical system inference from temporal measurements},
      url = {https://dx.doi.org/10.1093/bioinformatics/btz065},
      year = 2019
    }
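
    To make the "sparse signal recovery" step concrete, the sketch below fits a sparse linear combination of candidate library terms to estimated derivatives. It deliberately uses crude finite differences instead of the paper's weak formulation, so it is only in the spirit of USDL, not the USDL algorithm itself; the tiny polynomial library and all names are ours.

    import numpy as np
    from sklearn.linear_model import Lasso

    def sparse_dynamics_fit(t, X, alpha=1e-3):
        """Toy sparse identification of dx/dt = Theta(x) @ W from one trajectory:
        build a small polynomial library, approximate derivatives by finite
        differences, and solve one sparse regression per state variable."""
        dXdt = np.gradient(X, t, axis=0)                      # crude derivative estimate
        cols = [np.ones(len(t))] + [X[:, i] for i in range(X.shape[1])]
        cols += [X[:, i] * X[:, j] for i in range(X.shape[1]) for j in range(i, X.shape[1])]
        Theta = np.column_stack(cols)                         # library: 1, x_i, x_i*x_j
        return np.column_stack([
            Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(Theta, dXdt[:, k]).coef_
            for k in range(X.shape[1])
        ])

    # toy usage: a damped linear system dx/dt = -0.5 x observed on a coarse grid;
    # the coefficient matrix should be dominated by the linear x term (close to -0.5)
    t = np.linspace(0, 10, 200)
    X = np.exp(-0.5 * t)[:, None]
    print(sparse_dynamics_fit(t, X).round(2))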

  • G. Borboudakis and I. Tsamardinos, "Forward-Backward Selection with Early Dropping," Journal of Machine Learning Research, vol. 20, iss. 8, pp. 1-39, 2019.
    [BibTeX] [Abstract] [Download PDF]

    Forward-backward selection is one of the most basic and commonly-used feature selection algorithms available. It is also general and conceptually applicable to many different types of data. In this paper, we propose a heuristic that significantly improves its running time, while preserving predictive performance. The idea is to temporarily discard the variables that are conditionally independent with the outcome given the selected variable set. Depending on how those variables are reconsidered and reintroduced, this heuristic gives rise to a family of algorithms with increasingly stronger theoretical guarantees. In distributions that can be faithfully represented by Bayesian networks or maximal ancestral graphs, members of this algorithmic family are able to correctly identify the Markov blanket in the sample limit. In experiments we show that the proposed heuristic increases computational efficiency by about 1-2 orders of magnitude, while selecting fewer or the same number of variables and retaining predictive performance. Furthermore, we show that the proposed algorithm and feature selection with LASSO perform similarly when restricted to select the same number of variables, making the proposed algorithm an attractive alternative for problems where no (efficient) algorithm for LASSO exists.

    @article{guyon2019forwardbackward,
      abstract = {Forward-backward selection is one of the most basic and commonly-used feature selection algorithms available. It is also general and conceptually applicable to many different types of data. In this paper, we propose a heuristic that significantly improves its running time, while preserving predictive performance. The idea is to temporarily discard the variables that are conditionally independent with the outcome given the selected variable set. Depending on how those variables are reconsidered and reintroduced, this heuristic gives rise to a family of algorithms with increasingly stronger theoretical guarantees. In distributions that can be faithfully represented by Bayesian networks or maximal ancestral graphs, members of this algorithmic family are able to correctly identify the Markov blanket in the sample limit. In experiments we show that the proposed heuristic increases computational efficiency by about 1-2 orders of magnitude, while selecting fewer or the same number of variables and retaining predictive performance. Furthermore, we show that the proposed algorithm and feature selection with LASSO perform similarly when restricted to select the same number of variables, making the proposed algorithm an attractive alternative for problems where no (efficient) algorithm for LASSO exists.},
      added-at = {2019-03-06T13:21:10.000+0100},
      author = {Borboudakis, Giorgos and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/20379540925c4602d54d5acccd0268113/mensxmachina},
      editor = {Guyon, Isabelle},
      interhash = {af84c72c0490ecd3738da71550eaec18},
      intrahash = {0379540925c4602d54d5acccd0268113},
      journal = {Journal of Machine Learning Research},
      keywords = {mxmcausalpath},
      month = {January},
      number = 8,
      pages = {1-39},
      pdf = {http://jmlr.org/papers/volume20/17-334/17-334.pdf},
      timestamp = {2021-03-10T09:21:04.000+0100},
      title = {Forward-Backward Selection with Early Dropping},
      url = {http://jmlr.org/papers/volume20/17-334/17-334.pdf},
      volume = 20,
      year = 2019
    }
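
    A compact sketch of the Early Dropping idea described above: at every forward iteration, candidates that test conditionally independent of the outcome given the current selection are discarded from further consideration. This simplified version uses a Fisher-z partial-correlation test and never reconsiders dropped variables, so it corresponds only loosely to the most aggressive member of the family; it is not the authors' implementation, and all names are ours.

    import numpy as np
    from scipy import stats

    def fisher_z_pvalue(x, y, Z):
        """p-value of the partial correlation of x and y given the columns of Z."""
        n = len(y)
        if Z.size:
            Zc = np.column_stack([np.ones(n), Z])
            x = x - Zc @ np.linalg.lstsq(Zc, x, rcond=None)[0]
            y = y - Zc @ np.linalg.lstsq(Zc, y, rcond=None)[0]
        r = np.clip(np.corrcoef(x, y)[0, 1], -0.9999999, 0.9999999)
        z = 0.5 * np.log((1 + r) / (1 - r)) * np.sqrt(n - (Z.shape[1] if Z.size else 0) - 3)
        return 2 * stats.norm.sf(abs(z))

    def forward_selection_early_dropping(X, y, alpha=0.05):
        """Forward selection where candidates that test conditionally independent of y
        given the current selection are dropped from further consideration."""
        n, p = X.shape
        selected, remaining = [], list(range(p))
        while remaining:
            Z = X[:, selected] if selected else np.empty((n, 0))
            pvals = {j: fisher_z_pvalue(X[:, j], y, Z) for j in remaining}
            remaining = [j for j in remaining if pvals[j] <= alpha]   # Early Dropping
            if not remaining:
                break
            best = min(remaining, key=pvals.get)
            selected.append(best)
            remaining.remove(best)
        return selected

    # toy usage: 3 relevant features out of 100; typically recovers {0, 1, 2}
    rng = np.random.default_rng(4)
    X = rng.normal(size=(300, 100))
    y = X[:, [0, 1, 2]] @ np.array([1.0, -1.0, 0.5]) + rng.normal(size=300)
    print(sorted(forward_selection_early_dropping(X, y)))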

2018

  • K. Tsirlis, V. Lagani, S. Triantafillou, and I. Tsamardinos, "On scoring Maximal Ancestral Graphs with the Max–Min Hill Climbing algorithm," International Journal of Approximate Reasoning, vol. 102, pp. 74-85, 2018. doi:10.1016/j.ijar.2018.08.002
    [BibTeX] [Abstract] [Download PDF]

    We consider the problem of causal structure learning in presence of latent confounders. We propose a hybrid method, MAG Max–Min Hill-Climbing (M3HC) that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. On a large scale experimentation we show that the proposed algorithm greatly improves on GSMAG in all comparisons, and over a set of known networks from the literature it compares positively against FCI and cFCI as well as competitively against GFCI, three well known constraint-based approaches for causal-network reconstruction in presence of latent confounders.

    @article{Tsirlis_2018,
      abstract = {We consider the problem of causal structure learning in presence of latent confounders. We propose a hybrid method, MAG Max–Min Hill-Climbing (M3HC) that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. On a large scale experimentation we show that the proposed algorithm greatly improves on GSMAG in all comparisons, and over a set of known networks from the literature it compares positively against FCI and cFCI as well as competitively against GFCI, three well known constraint-based approaches for causal-network reconstruction in presence of latent confounders.},
      added-at = {2019-02-01T13:46:22.000+0100},
      author = {Tsirlis, Konstantinos and Lagani, Vincenzo and Triantafillou, Sofia and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/2215e0151afb6ceec0a37601e3028e865/mensxmachina},
      doi = {10.1016/j.ijar.2018.08.002},
      interhash = {7f00d0fcad8c52b8cce71b771e7cb3e5},
      intrahash = {215e0151afb6ceec0a37601e3028e865},
      journal = {International Journal of Approximate Reasoning},
      keywords = {mxmcausalpath},
      month = {November},
      pages = {74-85},
      publisher = {Elsevier {BV}},
      timestamp = {2021-03-10T09:26:09.000+0100},
      title = {On scoring Maximal Ancestral Graphs with the Max{\textendash}Min Hill Climbing algorithm},
      url = {https://doi.org/10.1016%2Fj.ijar.2018.08.002},
      volume = 102,
      year = 2018
    }

  • M. Tsagris, "Bayesian Network Learning with the PC Algorithm: An Improved and Correct Variation," Applied Artificial Intelligence , vol. 33, iss. 2, pp. 101-123, 2018. doi:10.1080/08839514.2018.1526760
    [BibTeX] [Abstract] [Download PDF]

    PC is a prototypical constraint-based algorithm for learning Bayesian networks, a special case of directed acyclic graphs. An existing variant of it, in the R package pcalg, was developed to make the skeleton phase order independent. In return, it has notably increased execution time. In this paper, we clarify that the skeleton phase of the PC algorithm is indeed order independent. The modification we propose outperforms pcalg’s variant of PC by returning correct networks of better quality, as it is less prone to errors, and in some cases it is considerably cheaper computationally. In addition, we show that pcalg’s variant does not return valid acyclic graphs.

    @article{michail2018bayesian,
      abstract = {PC is a prototypical constraint-based algorithm for learning Bayesian networks, a special case of directed acyclic graphs. An existing variant of it, in the R package pcalg, was developed to make the skeleton phase order independent. In return, it has notably increased execution time. In this paper, we clarify that the PC algorithm the skeleton phase of PC is indeed order independent. The modification we propose outperforms pcalg’s variant of the PC in terms of returning correct networks of better quality as is less prone to errors and in some cases it is a lot more computationally cheaper. In addition, we show that pcalg’s variant does not return valid acyclic graphs.},
      added-at = {2019-02-01T12:44:24.000+0100},
      author = {Tsagris, Michail},
      biburl = {https://www.bibsonomy.org/bibtex/2af15fddd4692e8ba7df0f00f5de6fd23/mensxmachina},
      doi = {10.1080/08839514.2018.1526760},
      interhash = {6c22e7e09e9aa7a24536cef5b953528e},
      intrahash = {af15fddd4692e8ba7df0f00f5de6fd23},
      journal = {Applied Artificial Intelligence},
      keywords = {mxmcausalpath},
      number = 2,
      pages = {101-123},
      timestamp = {2021-03-10T09:26:33.000+0100},
      title = {Bayesian Network Learning with the PC Algorithm: An Improved and Correct Variation},
      url = {https://www.researchgate.net/profile/Michail_Tsagris/publication/327884019_Bayesian_Network_Learning_with_the_PC_Algorithm_An_Improved_and_Correct_Variation/links/5bab44c945851574f7e65688/Bayesian-Network-Learning-with-the-PC-Algorithm-An-Improved-and-Correct-Variation.pdf},
      volume = 33,
      year = 2018
    }

  • I. Tsamardinos, E. Greasidou, and G. Borboudakis, "Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation," Machine Learning, vol. 107, iss. 12, pp. 1895-1922, 2018. doi:10.1007/s10994-018-5714-4
    [BibTeX] [Abstract] [Download PDF]

    Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822--829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we employ again the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training of models on new folds of inferior (with high probability) configurations. We name the method Bootstrap Bias Corrected with Dropping CV (BBCD-CV) that is both efficient and provides accurate performance estimates.

    @article{Tsamardinos2018,
      abstract = {Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822--829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we employ again the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training of models on new folds of inferior (with high probability) configurations. We name the method Bootstrap Bias Corrected with Dropping CV (BBCD-CV) that is both efficient and provides accurate performance estimates.},
      added-at = {2019-01-18T12:36:47.000+0100},
      author = {Tsamardinos, Ioannis and Greasidou, Elissavet and Borboudakis, Giorgos},
      biburl = {https://www.bibsonomy.org/bibtex/2a7b174604425057fa831058ace0c969a/mensxmachina},
      day = 01,
      doi = {10.1007/s10994-018-5714-4},
      interhash = {97a8dcbb7f6259a554fb36d14f44bc47},
      intrahash = {a7b174604425057fa831058ace0c969a},
      issn = {1573-0565},
      journal = {Machine Learning},
      keywords = {mxmcausalpath},
      month = dec,
      number = 12,
      pages = {1895--1922},
      timestamp = {2021-03-10T09:27:16.000+0100},
      title = {Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation},
      url = {https://doi.org/10.1007/s10994-018-5714-4},
      volume = 107,
      year = 2018
    }
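
    The bootstrap step described in the abstract can be sketched in a few lines: pool the out-of-sample predictions of all configurations, repeatedly bootstrap the samples, pick the configuration that wins on each bootstrap sample, and score it on the left-out samples. The code below is a minimal reading of that description, not the authors' implementation; the higher-is-better metric convention and all names are ours.

    import numpy as np

    def bbc_cv_estimate(oos_preds, y, metric, B=500, rng=None):
        """Bias-corrected performance estimate from pooled out-of-sample predictions.

        oos_preds: (n_samples, n_configurations) out-of-sample predictions from CV
        y: true targets; metric: callable(y_true, y_pred) -> higher-is-better score"""
        rng = np.random.default_rng(rng)
        n = len(y)
        scores = []
        for _ in range(B):
            idx = rng.integers(0, n, n)                 # bootstrap sample of row indices
            oob = np.setdiff1d(np.arange(n), idx)       # left-out (out-of-bag) rows
            if oob.size == 0:
                continue
            # pick the configuration that wins on the bootstrap sample ...
            best = np.argmax([metric(y[idx], oos_preds[idx, c])
                              for c in range(oos_preds.shape[1])])
            # ... and score it on the held-out rows
            scores.append(metric(y[oob], oos_preds[oob, best]))
        return float(np.mean(scores))

    # toy usage: 30 random "configurations" on random binary labels; picking the best
    # raw accuracy looks better than chance, while the corrected estimate stays near 0.5
    rng = np.random.default_rng(5)
    y = rng.integers(0, 2, 300)
    preds = rng.integers(0, 2, (300, 30))
    acc = lambda yt, yp: np.mean(yt == yp)
    print(max(acc(y, preds[:, c]) for c in range(30)), bbc_cv_estimate(preds, y, acc))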

  • I. Tsamardinos, G. Borboudakis, P. Katsogridakis, P. Pratikakis, and V. Christophides, "A greedy feature selection algorithm for Big Data of high dimensionality," Machine Learning, vol. 108, iss. 2, pp. 149-202, 2018. doi:10.1007/s10994-018-5748-7
    [BibTeX] [Abstract] [Download PDF]

    We present the Parallel, Forward--Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix both in terms of rows as well as columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed to other greedy-type of FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.

    @article{Tsamardinos2018greedy,
      abstract = {We present the Parallel, Forward--Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix both in terms of rows as well as columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size, linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed to other greedy-type of FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.},
      added-at = {2019-01-18T12:33:10.000+0100},
      author = {Tsamardinos, Ioannis and Borboudakis, Giorgos and Katsogridakis, Pavlos and Pratikakis, Polyvios and Christophides, Vassilis},
      biburl = {https://www.bibsonomy.org/bibtex/252633119492a482314134f28a7639e3b/mensxmachina},
      day = 07,
      doi = {10.1007/s10994-018-5748-7},
      interhash = {758062c5debf18afdf43d46ce6105a72},
      intrahash = {52633119492a482314134f28a7639e3b},
      issn = {1573-0565},
      journal = {Machine Learning},
      keywords = {mxmcausalpath},
      month = {August},
      number = 2,
      pages = {149-202},
      timestamp = {2021-03-10T09:27:38.000+0100},
      title = {A greedy feature selection algorithm for Big Data of high dimensionality},
      url = {https://doi.org/10.1007/s10994-018-5748-7},
      volume = 108,
      year = 2018
    }
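
    PFBP combines p-values of conditional independence tests computed on different data partitions using meta-analysis techniques. Purely as an illustration of that ingredient, the snippet below shows one standard combiner (Stouffer's method); it is not taken from PFBP, and the optional weights are an assumption of this example.

    import numpy as np
    from scipy.stats import norm

    def stouffer_combine(pvalues, weights=None):
        """Stouffer's method: combine one-sided p-values from independent data partitions."""
        p = np.clip(np.asarray(pvalues, dtype=float), 1e-300, 1 - 1e-16)
        z = norm.isf(p)                                   # per-partition z-scores
        w = np.ones_like(z) if weights is None else np.asarray(weights, dtype=float)
        return float(norm.sf((w * z).sum() / np.sqrt((w ** 2).sum())))

    # e.g. three partitions of the data give p-values 0.04, 0.20 and 0.01 for the same test
    print(stouffer_combine([0.04, 0.20, 0.01]))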

2017

  • M. Tsagris, G. Borboudakis, V. Lagani, and I. Tsamardinos, "Constraint-based Causal Discovery with Mixed Data," 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD), 2017.
    [BibTeX] [Abstract] [Download PDF]

    We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.

    @conference{noauthororeditor2017constraintbased,
      abstract = {We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.},
      added-at = {2021-03-10T10:58:29.000+0100},
      author = {Tsagris, M and Borboudakis, G and Lagani, V and Tsamardinos, I},
      biburl = {https://www.bibsonomy.org/bibtex/2892378444240fee14d62fd58362e856a/mensxmachina},
      interhash = {87d6a33d891429260e644392ddcba508},
      intrahash = {892378444240fee14d62fd58362e856a},
      keywords = {mxmcausalpath},
      publisher = {23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD)},
      timestamp = {2021-03-10T10:58:29.000+0100},
      title = {Constraint-based Causal Discovery with Mixed Data},
      url = {http://nugget.unisa.edu.au/CD2017/papersonly/constraint-based-causal-r1.pdf},
      year = 2017
    }
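
    The likelihood-ratio construction described above can be illustrated for one concrete case, a continuous outcome modelled with OLS: fit nested regressions with and without the tested variable and compare log-likelihoods against a chi-square. The paper covers many more outcome types and shows how to make such tests symmetric, which this toy sketch does not attempt; all names are ours.

    import numpy as np
    import statsmodels.api as sm
    from scipy.stats import chi2

    def lr_ci_test(y, x, Z):
        """Likelihood-ratio test of 'x adds nothing to Z for predicting y' (OLS, continuous y).
        Small p-values suggest x and y are dependent given Z."""
        Z0 = sm.add_constant(np.asarray(Z, dtype=float))
        Z1 = np.column_stack([Z0, x])
        ll0 = sm.OLS(y, Z0).fit().llf          # log-likelihood of the reduced model
        ll1 = sm.OLS(y, Z1).fit().llf          # log-likelihood of the full model
        return float(chi2.sf(2.0 * (ll1 - ll0), df=1))

    # toy usage: y depends on x only through z, so the p-value is typically not significant
    rng = np.random.default_rng(6)
    z = rng.normal(size=500)
    x = z + 0.3 * rng.normal(size=500)
    y = z + 0.3 * rng.normal(size=500)
    print(lr_ci_test(y, x, z[:, None]))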

  • K. Tsirlis, V. Lagani, S. Triantafillou, and I. Tsamardinos, "On Scoring Maximal Ancestral Graphs with the Max-Min Hill Climbing Algorithm," 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD), 2017.
    [BibTeX] [Abstract] [Download PDF]

    We consider the problem of causal structure learning in presence of latent confounders. We propose a hybrid method, MAG Max-Min Hill-Climbing (M3HC) that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. We show on simulated data that the proposed algorithm greatly improves on GSMAG, and compares positively against FCI and cFCI, two well known constraint-based approaches for causal-network reconstruction in presence of latent confounders.

    @conference{tsirlis2017scoring,
      abstract = {We consider the problem of causal structure learning in presence of latent confounders. We propose a hybrid method, MAG Max-Min Hill-Climbing (M3HC) that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. We show on simulated data that the proposed algorithm greatly improves on GSMAG, and compares positively against FCI and cFCI, two well known constraint-based approaches for causal-network reconstruction in presence of latent confounders.},
      added-at = {2021-03-10T10:55:47.000+0100},
      author = {Tsirlis, K and Lagani, V and Triantafillou, S and Tsamardinos, I},
      biburl = {https://www.bibsonomy.org/bibtex/251782ff3d0021d9ae7b7229b39a55d75/mensxmachina},
      interhash = {4731b83fe8b2f1f60eed63d178912109},
      intrahash = {51782ff3d0021d9ae7b7229b39a55d75},
      keywords = {mxmcausalpath},
      publisher = {23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD)},
      timestamp = {2021-03-10T10:55:47.000+0100},
      title = {On Scoring Maximal Ancestral Graphs with the Max-Min Hill Climbing Algorithm},
      url = {http://nugget.unisa.edu.au/CD2017/papersonly/maxmin-r0.pdf},
      year = 2017
    }

  • G. Borboudakis, T. Stergiannakos, M. Frysali, E. Klontzas, I. Tsamardinos, and G. E. Froudakis, "Chemically intuited, large-scale screening of MOFs by machine learning techniques," NPJ Computational Materials, vol. 3, iss. 40, 2017. doi:10.1038/s41524-017-0045-8
    [BibTeX] [Abstract] [Download PDF]

    A novel computational methodology for large-scale screening of MOFs is applied to gas storage with the use of machine learning technologies. This approach is a promising trade-off between the accuracy of ab initio methods and the speed of classical approaches, strategically combined with chemical intuition. The results demonstrate that the chemical properties of MOFs are indeed predictable (stochastically, not deterministically) using machine learning methods and automated analysis protocols, with the accuracy of predictions increasing with sample size. Our initial results indicate that this methodology is promising to apply not only to gas storage in MOFs but in many other material science projects.

    @article{borboudakis2017chemically,
      abstract = {A novel computational methodology for large-scale screening of MOFs is applied to gas storage with the use of machine learning technologies. This approach is a promising trade-off between the accuracy of ab initio methods and the speed of classical approaches, strategically combined with chemical intuition. The results demonstrate that the chemical properties of MOFs are indeed predictable (stochastically, not deterministically) using machine learning methods and automated analysis protocols, with the accuracy of predictions increasing with sample size. Our initial results indicate that this methodology is promising to apply not only to gas storage in MOFs but in many other material science projects.},
      added-at = {2019-09-26T16:47:50.000+0200},
      author = {Borboudakis, Giorgos and Stergiannakos, Taxiarchis and Frysali, Maria and Klontzas, Emmanuel and Tsamardinos, Ioannis and Froudakis, George E.},
      biburl = {https://www.bibsonomy.org/bibtex/25bde5694eb139306e6b6619b48a22be7/mensxmachina},
      doi = {10.1038/s41524-017-0045-8},
      interhash = {1bc83725a35ce15678399a739e9c76bf},
      intrahash = {5bde5694eb139306e6b6619b48a22be7},
      journal = {NPJ Computational Materials},
      keywords = {MOFs learning machine},
      month = {October},
      number = 40,
      timestamp = {2019-09-26T16:47:50.000+0200},
      title = {Chemically intuited, large-scale screening of MOFs by machine learning techniques},
      url = {https://doi.org/10.1038/s41524-017-0045-8},
      volume = 3,
      year = 2017
    }
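
    As a rough, self-contained illustration of the screening idea in the entry above (train a surrogate model on a small, expensively labelled subset of structures and use it to rank a much larger library), here is a hedged Python sketch on synthetic data; the descriptors, target, and sample sizes are invented, and it does not reproduce the authors' pipeline.

      import numpy as np
      from sklearn.ensemble import RandomForestRegressor
      from sklearn.model_selection import cross_val_score

      rng = np.random.default_rng(0)
      X = rng.normal(size=(2000, 12))                          # hypothetical chemical descriptors, one row per MOF
      y = X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=2000)   # stand-in for a computed gas-uptake target

      labelled = np.arange(500)                                # pretend only 500 structures got expensive calculations
      model = RandomForestRegressor(n_estimators=200, random_state=0)
      r2 = cross_val_score(model, X[labelled], y[labelled], cv=5, scoring="r2").mean()
      print(f"surrogate CV R^2 on the labelled subset: {r2:.2f}")

      model.fit(X[labelled], y[labelled])
      predicted = model.predict(X[500:])                       # cheap predictions for the remaining library
      top10 = 500 + np.argsort(predicted)[::-1][:10]           # candidates to prioritise for accurate calculations
      print("top candidate indices:", top10)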

  • V. Lagani, G. Athineou, A. Farcomeni, M. Tsagris, and I. Tsamardinos, "Feature Selection with the R Package MXM: Discovering Statistically Equivalent Feature Subsets," Journal of Statistical Software, vol. 80, iss. 7, 2017. doi:10.18637/jss.v080.i07
    [BibTeX] [Abstract] [Download PDF]

    The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most of the currently available feature selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect the SES algorithm subsumes and extends previous feature selection algorithms, like the max-min parent children algorithm. The SES algorithm is implemented in an homonym function included in the R package MXM, standing for mens ex machina, meaning 'mind from the machine' in Latin. The MXM implementation of SES handles several data analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm, its implementation, and provide examples of use of the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real world data.

    @article{Lagani_2017,
      abstract = {The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most of the currently available feature selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect the SES algorithm subsumes and extends previous feature selection algorithms, like the max-min parent children algorithm. The SES algorithm is implemented in an homonym function included in the R package MXM, standing for mens ex machina, meaning 'mind from the machine' in Latin. The MXM implementation of SES handles several data analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm, its implementation, and provide examples of use of the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real world data.},
      added-at = {2019-02-01T14:04:22.000+0100},
      author = {Lagani, Vincenzo and Athineou, Giorgos and Farcomeni, Alessio and Tsagris, Michail and Tsamardinos, Ioannis},
      biburl = {https://www.bibsonomy.org/bibtex/203594f6365b0d67aa39e3ba38b4b2289/mensxmachina},
      doi = {10.18637/jss.v080.i07},
      interhash = {2d6c6cbe4da60ea0a19269dad768d0d4},
      intrahash = {03594f6365b0d67aa39e3ba38b4b2289},
      journal = {Journal of Statistical Software},
      keywords = {mxmcausalpath},
      number = 7,
      publisher = {Foundation for Open Access Statistic},
      timestamp = {2021-03-10T09:22:09.000+0100},
      title = {Feature Selection with the R Package {MXM}: Discovering Statistically Equivalent Feature Subsets},
      url = {https://doi.org/10.18637%2Fjss.v080.i07},
      volume = 80,
      year = 2017
    }
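
    The central claim of this entry, that several different feature subsets can be close to equally predictive, can be illustrated outside of R as well. The Python sketch below is not the MXM/SES algorithm; the dataset and the two candidate subsets are chosen arbitrarily for illustration. It compares the cross-validated AUCs of two small, disjoint feature subsets and applies a paired test to them; a large p-value is consistent with the subsets being equally predictive.

      import numpy as np
      from scipy import stats
      from sklearn.datasets import load_breast_cancer
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import cross_val_score
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler

      X, y = load_breast_cancer(return_X_y=True)
      subset_a = [0, 2, 3]     # mean radius, mean perimeter, mean area (hypothetical signature A)
      subset_b = [20, 22, 23]  # worst radius, worst perimeter, worst area (hypothetical signature B)

      def cv_auc(columns):
          clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
          return cross_val_score(clf, X[:, columns], y, cv=10, scoring="roc_auc")

      auc_a, auc_b = cv_auc(subset_a), cv_auc(subset_b)
      _, p = stats.ttest_rel(auc_a, auc_b)   # paired over the same folds; a crude check, not SES's equivalence test
      print(f"AUC(A) = {auc_a.mean():.3f}, AUC(B) = {auc_b.mean():.3f}, paired p = {p:.2f}")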

  • G. Orfanoudaki, M. Markaki, K. Chatzi, I. Tsamardinos, and A. Economou, "MatureP: prediction of secreted proteins with exclusive information from their mature regions," Scientific Reports, vol. 7, iss. 1, p. 3263, 2017. doi:10.1038/s41598-017-03557-4
    [BibTeX] [Abstract] [Download PDF]

    More than a third of the cellular proteome is non-cytoplasmic. Most secretory proteins use the Sec system for export and are targeted to membranes using signal peptides and mature domains. To specifically analyze bacterial mature domain features, we developed MatureP, a classifier that predicts secretory sequences through features exclusively computed from their mature domains. MatureP was trained using Just Add Data Bio, an automated machine learning tool. Mature domains are predicted efficiently with ~92% success, as measured by the Area Under the Receiver Operating Characteristic Curve (AUC). Predictions were validated using experimental datasets of mutated secretory proteins. The features selected by MatureP reveal prominent differences in amino acid content between secreted and cytoplasmic proteins. Amino-terminal mature domain sequences have enhanced disorder, more hydroxyl and polar residues and less hydrophobics. Cytoplasmic proteins have prominent amino-terminal hydrophobic stretches and charged regions downstream. Presumably, secretory mature domains comprise a distinct protein class. They balance properties that promote the necessary flexibility required for the maintenance of non-folded states during targeting and secretion with the ability of post-secretion folding. These findings provide novel insight in protein trafficking, sorting and folding mechanisms and may benefit protein secretion biotechnology.

    @article{orfanoudaki2017maturep,
      abstract = {More than a third of the cellular proteome is non-cytoplasmic. Most secretory proteins use the Sec system for export and are targeted to membranes using signal peptides and mature domains. To specifically analyze bacterial mature domain features, we developed MatureP, a classifier that predicts secretory sequences through features exclusively computed from their mature domains. MatureP was trained using Just Add Data Bio, an automated machine learning tool. Mature domains are predicted efficiently with ~92% success, as measured by the Area Under the Receiver Operating Characteristic Curve (AUC). Predictions were validated using experimental datasets of mutated secretory proteins. The features selected by MatureP reveal prominent differences in amino acid content between secreted and cytoplasmic proteins. Amino-terminal mature domain sequences have enhanced disorder, more hydroxyl and polar residues and less hydrophobics. Cytoplasmic proteins have prominent amino-terminal hydrophobic stretches and charged regions downstream. Presumably, secretory mature domains comprise a distinct protein class. They balance properties that promote the necessary flexibility required for the maintenance of non-folded states during targeting and secretion with the ability of post-secretion folding. These findings provide novel insight in protein trafficking, sorting and folding mechanisms and may benefit protein secretion biotechnology.},
      added-at = {2019-02-01T14:01:30.000+0100},
      author = {Orfanoudaki, Georgia and Markaki, Maria and Chatzi, Katerina and Tsamardinos, Ioannis and Economou, Anastassios},
      biburl = {https://www.bibsonomy.org/bibtex/2951371b60898141f3bcdfc2141b70336/mensxmachina},
      doi = {10.1038/s41598-017-03557-4},
      interhash = {45976adff24808a1e1f78e330732b5ff},
      intrahash = {951371b60898141f3bcdfc2141b70336},
      issn = {20452322},
      journal = {Scientific Reports},
      keywords = {mxmcausalpath},
      month = {June},
      number = 1,
      pages = 3263,
      refid = {Orfanoudaki2017},
      timestamp = {2021-03-10T09:25:44.000+0100},
      title = {MatureP: prediction of secreted proteins with exclusive information from their mature regions},
      url = {https://doi.org/10.1038/s41598-017-03557-4},
      volume = 7,
      year = 2017
    }
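
    For a concrete sense of the kind of sequence-derived features discussed in the MatureP entry above (for example, amino-acid content of the mature domain), the short Python sketch below computes amino-acid composition fractions for a protein sequence; the sequence is made up and this is a toy illustration, not the MatureP feature set.

      from collections import Counter

      AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

      def composition_features(sequence: str) -> dict:
          # Fraction of each of the 20 standard amino acids in the sequence.
          counts = Counter(sequence.upper())
          length = max(len(sequence), 1)
          return {aa: counts.get(aa, 0) / length for aa in AMINO_ACIDS}

      toy_mature_domain = "AQTDGSKKLLNNPEESTTVVGG"   # invented sequence, for illustration only
      print(composition_features(toy_mature_domain))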


About Us

Mens Ex Machina ("Mind from the Machine", or “Ο από Μηχανής Νους” in Greek) paraphrases the Latin expression Deus Ex Machina, "God from the Machine". The name was suggested by Lucy Sofiadou, Prof. Tsamardinos’ wife.

We are a research group founded in October 2006 and led by Professor Ioannis Tsamardinos. We work on Artificial Intelligence, Machine Learning, and Biomedical Informatics, and we are affiliated with the Computer Science Department of the University of Crete. The aims of the group are to advance science and to disseminate knowledge through educational activities and computer tools. Our group is active in three areas:

Research:

Theoretical, algorithmic, and applied research in all of the above areas; we are also involved in interdisciplinary collaborations with biologists, physicians, and practitioners from other fields.

Education:

Educational activities, such as teaching university courses, tutorials, and summer schools, as well as supervising undergraduate dissertations, master's projects, and Ph.D. theses.

Systems and Software:

Implementation of tools, systems, and code libraries that help disseminate our research results. Funding is provided through the University of Crete, often originating from European and international research grants.

Current research activities include, but are not limited to, the following:

  • Causal discovery methods and the induction of causal models from observational studies. Specifically, we have recently introduced the problem of Integrative Causal Analysis (INCA).
  • Feature selection (a.k.a. variable selection) for classification and regression.
  • Induction of graphical models, such as Bayesian networks, from data.
  • Analysis of biomedical data and applications of AI and Machine Learning methods to induce new biomedical knowledge.
  • Activity recognition in Ambient Intelligence environments.

Ioannis Tsamardinos

Professor, Department of Computer Science, University of Crete