CAUSALPATH – PUBLICATIONS
2021
- Borboudakis, Giorgos, and Ioannis Tsamardinos. “Extending Greedy Feature Selection Algorithms to Multiple Solutions”. Data Mining and Knowledge Discovery. doi:10.1007/s10618-020-00731-7.
  Most feature selection methods identify only a single solution. This is acceptable for predictive purposes, but is not sufficient for knowledge discovery if multiple solutions exist. We propose a strategy to extend a class of greedy methods to efficiently identify multiple solutions, and show under which conditions it identifies all solutions. We also introduce a taxonomy of features that takes the existence of multiple solutions into account. Furthermore, we explore different definitions of statistical equivalence of solutions, as well as methods for testing equivalence. A novel algorithm for compactly representing and visualizing multiple solutions is also introduced. In experiments we show that (a) the proposed algorithm is significantly more computationally efficient than the TIE* algorithm, the only alternative approach with similar theoretical guarantees, while identifying similar solutions to it, and (b) that the identified solutions have similar predictive performance.
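  Illustrative sketch (not the authors' algorithm): the central idea, branching a greedy forward search whenever several candidate features are statistically equivalent, can be mimicked in a few lines. A simple tolerance on the score gain stands in for the paper's equivalence tests, and the names `forward_multi` and `r2` are made up for this example.

  ```python
  # Toy sketch: greedy forward selection extended to multiple solutions by
  # branching whenever several candidates improve the score by (near-)equal
  # amounts. A gain tolerance stands in for statistical equivalence testing.
  import numpy as np

  def forward_multi(score, n_features, tol=1e-3, max_size=2):
      """score(list_of_indices) -> float, higher is better."""
      solutions = set()

      def step(selected):
          if len(selected) == max_size:
              solutions.add(tuple(sorted(selected)))
              return
          base = score(selected)
          gains = {f: score(selected + [f]) - base
                   for f in range(n_features) if f not in selected}
          best = max(gains.values())
          if best <= 0:                          # nothing improves: a solution
              solutions.add(tuple(sorted(selected)))
              return
          for f, g in gains.items():             # branch on equivalent winners
              if best - g <= tol:
                  step(selected + [f])

      step([])
      return sorted(solutions)

  # Example: feature 3 duplicates feature 0, so two equivalent solutions exist.
  rng = np.random.default_rng(0)
  X = rng.standard_normal((200, 5)); X[:, 3] = X[:, 0]
  y = X[:, 0] + X[:, 1]

  def r2(subset):
      if not subset:
          return 0.0
      Z = X[:, subset]
      beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
      return 1 - ((y - Z @ beta) ** 2).mean() / y.var()

  print(forward_multi(r2, 5))                    # e.g. [(0, 1), (1, 3)]
  ```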
2020
- Tsagris, Michail, Zacharias Papadovasilakis, Kleanthi Lakiotaki, and Ioannis Tsamardinos. “The γ-OMP Algorithm for Feature Selection with Application to Gene Expression Data”. IEEE/ACM Transactions on Computational Biology and Bioinformatics. doi:10.1109/TCBB.2020.3029952.
  Feature selection for predictive analytics is the problem of identifying a minimal-size subset of features that is maximally predictive of an outcome of interest. To apply to molecular data, feature selection algorithms need to be scalable to tens of thousands of features. In this paper, we propose γ-OMP, a generalisation of the highly-scalable Orthogonal Matching Pursuit feature selection algorithm. γ-OMP can handle (a) various types of outcomes, such as continuous, binary, nominal, time-to-event, (b) discrete (categorical) features, (c) different statistics-based stopping criteria, (d) several predictive models (e.g., linear or logistic regression), (e) various types of residuals, and (f) different types of association. We compare γ-OMP against LASSO, a prototypical, widely used algorithm for high-dimensional data. On both simulated data and several real gene expression datasets, γ-OMP is on par with or outperforms LASSO in binary classification (case-control data), regression (quantified outcomes), and time-to-event data (censored survival times). γ-OMP is based on simple statistical ideas, it is easy to implement and to extend, and our extensive evaluation shows that it is also effective in bioinformatics analysis settings.
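  For intuition, a minimal residual-based Orthogonal Matching Pursuit loop for a continuous outcome is sketched below; γ-OMP generalizes precisely these two ingredients, the residual definition and the stopping criterion, to other outcome types and tests. The threshold and function name are illustrative assumptions, not the paper's implementation.

  ```python
  # Minimal OMP for a continuous outcome: repeatedly pick the feature most
  # associated with the current residual, refit, and recompute the residual.
  import numpy as np

  def omp(X, y, max_k=10, tol=1e-3):
      n, p = X.shape
      Xs = (X - X.mean(0)) / X.std(0)          # standardized features
      yc = y - y.mean()
      residual, selected = yc.copy(), []
      for _ in range(max_k):
          assoc = np.abs(Xs.T @ residual) / n  # association with residual
          if selected:
              assoc[selected] = -np.inf        # never re-pick a feature
          j = int(np.argmax(assoc))
          if assoc[j] < tol:                   # stopping criterion, here a
              break                            # plain threshold
          selected.append(j)
          Z = Xs[:, selected]
          beta, *_ = np.linalg.lstsq(Z, yc, rcond=None)
          residual = yc - Z @ beta             # orthogonalize against selection
      return selected
  ```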
- Tsourtis, A., Y. Pantazis, and I. Tsamardinos. “Inference of Stochastic Dynamical Systems from Cross-Sectional Population Data”. arXiv:2012.05055 [cs.LG], 9 Dec 2020.
  Inferring the driving equations of a dynamical system from population or time-course data is important in several scientific fields such as biochemistry, epidemiology, financial mathematics and many others. Despite the existence of algorithms that learn the dynamics from trajectorial measurements there are few attempts to infer the dynamical system straight from population data. In this work, we deduce and then computationally estimate the Fokker-Planck equation which describes the evolution of the population’s probability density, based on stochastic differential equations. Then, following the USDL approach [22], we project the Fokker-Planck equation to a proper set of test functions, transforming it into a linear system of equations. Finally, we apply sparse inference methods to solve the latter system and thus induce the driving forces of the dynamical system. Our approach is illustrated in both synthetic and real data including non-linear, multimodal stochastic differential equations, biochemical reaction networks as well as mass cytometry biological measurements.
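  A heavily hedged sketch of the weak-form construction: for a one-dimensional SDE dX = f(X) dt + s dW, each test function φ satisfies d/dt E[φ(X_t)] = E[f(X_t) φ'(X_t)] + (s²/2) E[φ''(X_t)]. Expanding f in a monomial dictionary and replacing expectations with snapshot sample means yields a linear system that sparse regression can solve. Every choice below (sine test functions, known diffusion s, Lasso, the name `infer_drift`) is a simplifying assumption, not the authors' code.

  ```python
  # Weak-form drift inference from population snapshots (toy 1-D version).
  import numpy as np
  from sklearn.linear_model import Lasso

  def infer_drift(snapshots, times, s=1.0, deg=3, n_test=5):
      """snapshots[i]: 1-D array of samples observed at times[i]."""
      phis   = [lambda x, k=k: np.sin(k * x)         for k in range(1, n_test + 1)]
      dphis  = [lambda x, k=k: k * np.cos(k * x)     for k in range(1, n_test + 1)]
      d2phis = [lambda x, k=k: -k**2 * np.sin(k * x) for k in range(1, n_test + 1)]
      rows, rhs = [], []
      for i in range(len(times) - 1):
          x0, x1 = snapshots[i], snapshots[i + 1]
          dt = times[i + 1] - times[i]
          for phi, dphi, d2phi in zip(phis, dphis, d2phis):
              lhs  = (phi(x1).mean() - phi(x0).mean()) / dt   # d/dt E[phi(X_t)]
              diff = 0.5 * s**2 * d2phi(x0).mean()            # diffusion term
              # drift term with f(x) = sum_k c_k x^k: E[x^k * phi'(x)] per k
              rows.append([(x0**k * dphi(x0)).mean() for k in range(deg + 1)])
              rhs.append(lhs - diff)
      lasso = Lasso(alpha=1e-3, fit_intercept=False)
      lasso.fit(np.array(rows), np.array(rhs))
      return lasso.coef_                      # sparse dictionary coefficients
  ```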
- Pantazis, Yannis, Christos Tselas, Kleanthi Lakiotaki, Vincenzo Lagani, and Ioannis Tsamardinos. “Latent Feature Representations for Human Gene Expression Data Improve Phenotypic Predictions”. 2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). doi:10.1109/BIBM49941.2020.9313286.
  High-throughput technologies such as microarrays and RNA-sequencing (RNA-seq) allow precise quantification of transcriptomic profiles, generating datasets that are inevitably high-dimensional. In this work, we investigate whether the whole human transcriptome can be represented in a compressed, low dimensional latent space without losing relevant information. We thus constructed low-dimensional latent feature spaces of the human genome, by utilizing three dimensionality reduction approaches and a diverse set of curated datasets. We applied standard Principal Component Analysis (PCA), kernel PCA and Autoencoder Neural Networks on 1360 datasets from four different measurement technologies. The latent feature spaces are tested for their ability to (a) reconstruct the original data and (b) improve predictive performance on validation datasets not used during the creation of the feature space. While linear techniques show better reconstruction performance, nonlinear approaches, particularly, neural-based models seem to be able to capture non-additive interaction effects, and thus enjoy stronger predictive capabilities. Despite the limited sample size of each dataset and the biological / technological heterogeneity across studies, our results show that low dimensional representations of the human transcriptome can be achieved by integrating hundreds of datasets. The created space is two to three orders of magnitude smaller compared to the raw data, offering the ability of capturing a large portion of the original data variability and eventually reducing computational time for downstream analyses.
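  The evaluation protocol described in the abstract, learn a latent space on reference datasets and then test reconstruction and downstream prediction on an unseen dataset, can be sketched as follows, with PCA standing in for the kernel-PCA and autoencoder variants; the function name, dimension, and metrics are illustrative assumptions.

  ```python
  # Fit a latent space on reference data; score (a) reconstruction error and
  # (b) cross-validated predictive performance on a held-out dataset.
  import numpy as np
  from sklearn.decomposition import PCA
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  def evaluate_latent_space(X_ref, X_new, y_new, dim=100):
      pca = PCA(n_components=dim).fit(X_ref)   # learn space on reference data
      Z = pca.transform(X_new)                 # embed the unseen dataset
      recon_err = np.mean((X_new - pca.inverse_transform(Z)) ** 2)
      auc = cross_val_score(LogisticRegression(max_iter=1000), Z, y_new,
                            cv=5, scoring="roc_auc").mean()
      return recon_err, auc
  ```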
- Biza, K., I. Tsamardinos, and S. Triantafillou. “Tuning Causal Discovery Algorithms”. Proceedings of the Tenth International Conference on Probabilistic Graphical Models, PMLR. https://pgm2020.cs.aau.dk/wp-content/uploads/2020/09/biza20.pdf.
  There are numerous algorithms proposed in the literature for learning causal graphical probabilistic models. Each one of them is typically equipped with one or more tuning hyper-parameters. The choice of optimal algorithm and hyper-parameter values is not universal; it depends on the size of the network, the density of the true causal structure, the sample size, as well as the metric of quality of learning a causal structure. Thus, the challenge to a practitioner is how to “tune” these choices, given that the true graph is unknown and the learning task is unsupervised. In the paper, we evaluate two previously proposed methods for tuning, one based on stability of the learned structure under perturbations (bootstrapping) of the input data and the other based on balancing the in-sample fitting of the model with the model complexity. We propose and comparatively evaluate a new method that treats a causal model as a set of predictive models: one for each node given its Markov Blanket. It then tunes the choices using out-of-sample protocols for supervised methods such as cross-validation. The proposed method performs on par or better than the previous methods for most metrics.
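  A hedged sketch of the proposed objective: score each candidate configuration by how well every node of the learned graph is predicted out-of-sample from its Markov blanket. Here `blankets` is assumed to come from whatever causal discovery call produced the graph, and `run_algo` in the usage comment is hypothetical; the linear model and R² metric are simplifications of the paper's protocol.

  ```python
  # Out-of-sample score for a candidate causal graph: cross-validated
  # prediction of each node from its Markov blanket, averaged over nodes.
  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.model_selection import cross_val_score

  def graph_cv_score(data, blankets):
      """data: (n, p) array; blankets: {node index: list of blanket indices}."""
      scores = []
      for node, mb in blankets.items():
          if not mb:                      # isolated node: no predictive signal
              continue
          r2 = cross_val_score(LinearRegression(), data[:, mb], data[:, node],
                               cv=5, scoring="r2").mean()
          scores.append(r2)
      return float(np.mean(scores))       # higher = better configuration

  # Hypothetical usage over a set of algorithm/hyper-parameter configurations:
  # best = max(configs, key=lambda c: graph_cv_score(data, run_algo(data, c)))
  ```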
- Karagiannaki, Ioulia, Yannis Pantazis, Ekaterini Chatzaki, and Ioannis Tsamardinos. “Pathway Activity Score Learning for Dimensionality Reduction of Gene Expression Data”. In Discovery Science (DS 2020), edited by G. Tsoumakas, Y. Manolopoulos, and S. Matwin. Lecture Notes in Computer Science 12323: 246-261. doi:10.1007/978-3-030-61527-7_17.
  Molecular gene-expression datasets consist of samples with tens of thousands of measured quantities (e.g., high dimensional data). However, there exist lower-dimensional representations that retain the useful information. We present a novel algorithm for such dimensionality reduction called Pathway Activity Score Learning (PASL). The major novelty of PASL is that the constructed features directly correspond to known molecular pathways and can be interpreted as pathway activity scores. Hence, unlike PCA and similar methods, PASL’s latent space has a relatively straightforward biological interpretation. As a use-case, PASL is applied on two collections of breast cancer and leukemia gene expression datasets. We show that PASL does retain the predictive information for disease classification on new, unseen datasets, as well as outperforming PLIER, a recently proposed competitive method. We also show that differential activation pathway analysis provides complementary information to standard gene set enrichment analysis. The code is available at https://github.com/mensxmachina/PASL.
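  For intuition only, the simplest possible pathway-score baseline is sketched below: average expression over each pathway's member genes. PASL itself learns sparse dictionary atoms constrained to pathway gene sets (see the linked repository); this sketch merely shows the kind of sample-by-pathway representation such a method outputs.

  ```python
  # Mean-activation pathway scores: one column per pathway, one row per sample.
  import numpy as np

  def pathway_scores(X, gene_names, pathways):
      """X: (samples, genes); pathways: {pathway name: set of gene names}."""
      idx = {g: i for i, g in enumerate(gene_names)}
      cols, names = [], []
      for name, genes in pathways.items():
          members = [idx[g] for g in genes if g in idx]
          if members:
              cols.append(X[:, members].mean(axis=1))  # score per sample
              names.append(name)
      return np.column_stack(cols), names
  ```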
- Tsamardinos, Ioannis, George Fanourgakis, Elissavet Greasidou, Emmanuel Klontzas, Konstantinos Gkagkas, and George Froudakis. “An Automated Machine Learning Architecture for the Accelerated Prediction of Metal-Organic Frameworks Performance in Energy and Environmental Applications”. Microporous and Mesoporous Materials 300. doi:10.1016/j.micromeso.2020.110160.
  Due to their exceptional host-guest properties, Metal-Organic Frameworks (MOFs) are promising materials for storage of various gases with environmental and technological interest. Molecular modeling and simulations are invaluable tools, extensively used over the last two decades for the study of various properties of MOFs. In particular, Monte Carlo simulation techniques have been employed for the study of the gas uptake capacity of several MOFs at a wide range of different thermodynamic conditions. Despite the accurate predictions of molecular simulations, the accurate characterization and the high-throughput screening of the enormous number of MOFs that can be potentially synthesized by combining various structural building blocks is beyond present computer capabilities. In this work, we propose and demonstrate the use of an alternative approach, namely one based on an Automated Machine Learning (AutoML) architecture that is capable of training machine learning and statistical predictive models for MOFs’ chemical properties and estimate their predictive performance with confidence intervals. The architecture tries numerous combinations of different machine learning (ML) algorithms, tunes their hyper-parameters, and conservatively estimates performance of the final model. We demonstrate that it correctly estimates performance even with few samples (<100) and that it provides improved predictions over trying a single standard method, like Random Forests. The AutoML pipeline democratizes ML to non-expert material-science practitioners that may not know which algorithms to use on a given problem, how to tune them, and how to correctly estimate their predictive performance, dramatically improving productivity and avoiding common analysis pitfalls. A demonstration on the prediction of the carbon dioxide and methane uptake at various thermodynamic conditions is used as a showcase sharable at https://app.jadbio.com/share/86477fd7-d467-464d-ac41-fcbb0475444b.
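  A toy stand-in for the AutoML loop the abstract describes: try several algorithm and hyper-parameter combinations, select by cross-validation, and report held-out performance. The candidate set and function name are arbitrary assumptions; the actual architecture additionally produces conservative performance estimates with confidence intervals.

  ```python
  # Minimal "AutoML" loop: model selection by CV, then a held-out estimate.
  from sklearn.ensemble import RandomForestRegressor
  from sklearn.linear_model import Ridge
  from sklearn.model_selection import GridSearchCV, train_test_split

  def tiny_automl(X, y):
      candidates = [
          (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
          (RandomForestRegressor(random_state=0), {"n_estimators": [100, 300]}),
      ]
      X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
      best = max((GridSearchCV(est, grid, cv=5).fit(X_tr, y_tr)
                  for est, grid in candidates),
                 key=lambda gs: gs.best_score_)
      return best.best_estimator_, best.score(X_te, y_te)  # model, held-out R^2
  ```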
- Verrou, Klio-Maria, Ioannis Tsamardinos, and Georgios Papoutsoglou. “Learning Pathway Dynamics from Single-Cell Proteomic Data: A Comparative Study”. Cytometry Part A (Special Issue: Machine Learning for Single Cell Data) 97, no. 3. doi:10.1002/cyto.a.23976.
  Single-cell platforms provide statistically large samples of snapshot observations capable of resolving intercellular heterogeneity. Currently, there is a growing literature on algorithms that exploit this attribute in order to infer the trajectory of biological mechanisms, such as cell proliferation and differentiation. Despite the efforts, the trajectory inference methodology has not yet been used for addressing the challenging problem of learning the dynamics of protein signaling systems. In this work, we assess this prospect by testing the performance of this class of algorithms on four proteomic temporal datasets. To evaluate the learning quality, we design new general-purpose evaluation metrics that are able to quantify performance on (i) the biological meaning of the output, (ii) the consistency of the inferred trajectory, (iii) the algorithm robustness, (iv) the correlation of the learning output with the initial dataset, and (v) the roughness of the cell parameter levels through the inferred trajectory. We show that experimental time alone is insufficient to provide knowledge about the order of proteins during signal transduction. Accordingly, we show that the inferred trajectories provide richer information about the underlying dynamics. We learn that established methods tested on high-dimensional data with small sample size, slow dynamics, and complex structures (e.g. bifurcations) cannot always work in the signaling setting. Among the methods we evaluate, Scorpius and a newly introduced approach that combines Diffusion Maps and Principal Curves were found to perform adequately in recovering the progression of signal transduction, although their performance on some metrics varies from one dataset to another. The novel metrics we devise highlight that it is difficult to conclude which method is universally applicable for the task. Arguably, there are still many challenges and open problems to resolve.
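  A sketch of the diffusion-map half of the “Diffusion Maps and Principal Curves” combination mentioned above: embed cells with a diffusion map and read a provisional pseudotime ordering off the first non-trivial eigenvector (the paper's approach then fits a principal curve through the embedding). The bandwidth heuristic and function name are assumptions.

  ```python
  # Diffusion-map pseudotime ordering (toy version; no principal-curve step).
  import numpy as np
  from scipy.spatial.distance import pdist, squareform

  def diffusion_pseudotime(X, eps=None):
      D2 = squareform(pdist(X, "sqeuclidean"))
      eps = eps or np.median(D2)                # bandwidth heuristic
      K = np.exp(-D2 / eps)                     # Gaussian kernel
      P = K / K.sum(axis=1, keepdims=True)      # row-normalized Markov matrix
      vals, vecs = np.linalg.eig(P)
      order = np.argsort(-vals.real)
      psi1 = vecs[:, order[1]].real             # first non-trivial eigenvector
      return np.argsort(psi1)                   # cell ordering = pseudotime
  ```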
- Xanthopoulos, Iordanis, Ioannis Tsamardinos, Vassilis Christophides, Eric Simon, and Alejandro Salinger. “Putting the Human Back in the AutoML Loop”. In EDBT/ICDT Workshops 2020, edited by Alexandra Poulovassilis et al. CEUR Workshop Proceedings 2578, CEUR-WS.org. http://ceur-ws.org/Vol-2578/ETMLP5.pdf.
  Automated Machine Learning (AutoML) is a rapidly rising sub-field of Machine Learning. AutoML aims to fully automate the machine learning process end-to-end, democratizing Machine Learning to non-experts and drastically increasing the productivity of expert analysts. So far, most comparisons of AutoML systems focus on quantitative criteria such as predictive performance and execution time. In this paper, we examine AutoML services for predictive modeling tasks from a user's perspective, going beyond predictive performance. We present a wide palette of criteria and dimensions on which to evaluate and compare these services as a user. This qualitative comparative methodology is applied on seven AutoML systems, namely Auger.AI, BigML, H2O's Driverless AI, Darwin, Just Add Data Bio, RapidMiner, and Watson. The comparison indicates the strengths and weaknesses of each service, the needs that it covers, the segment of users that is most appropriate for, and the possibilities for improvements.
- Malliaraki, Niki, Kleanthi Lakiotaki, Rodanthi Vamvoukaki, George Notas, Ioannis Tsamardinos, Marilena Kampa, and Elias Castanas. “Translating Vitamin D Transcriptomics to Clinical Evidence: Analysis of Data in Asthma and Chronic Obstructive Pulmonary Disease, Followed by Clinical Data Meta-Analysis”. The Journal of Steroid Biochemistry and Molecular Biology 197: 1-14. doi:10.1016/j.jsbmb.2019.105505.
  Vitamin D (VitD) continues to trigger intense scientific controversy, regarding both its biological targets and its supplementation doses and regimens. In an effort to resolve this dispute, we mapped VitD transcriptome-wide events in humans, in order to unveil shared patterns or mechanisms with diverse pathologies/tissue profiles and reveal causal effects between VitD actions and specific human diseases, using a recently developed bioinformatics methodology. Using the similarities in analyzed transcriptome data (c-SKL method), we validated our methodology with osteoporosis as an example and further analyzed two other strong hits, specifically chronic obstructive pulmonary disease (COPD) and asthma. The latter revealed no impact of VitD on known molecular pathways. In accordance with this finding, review and meta-analysis of published data, based on an objective measure (Forced Expiratory Volume at one second, FEV1%), did not further reveal any significant effect of VitD on the objective amelioration of either condition. This study may, therefore, be regarded as the first one to explore, in an objective, unbiased and unsupervised manner, the impact of VitD levels and/or interventions in a number of human pathologies.
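  Since the clinical part of the study rests on a standard meta-analysis of FEV1% outcomes, a generic inverse-variance fixed-effect pooling routine is sketched below. It is purely illustrative and unrelated to the authors' actual analysis code or data.

  ```python
  # Fixed-effect (inverse-variance) meta-analysis of per-study effect sizes.
  import numpy as np
  from scipy.stats import norm

  def fixed_effect_meta(effects, ses):
      """effects: per-study effect estimates; ses: their standard errors."""
      effects = np.asarray(effects)
      w = 1.0 / np.asarray(ses) ** 2             # inverse-variance weights
      est = np.sum(w * effects) / np.sum(w)      # pooled estimate
      se = np.sqrt(1.0 / np.sum(w))              # pooled standard error
      p = 2 * norm.sf(abs(est / se))             # two-sided z-test
      return est, se, p

  print(fixed_effect_meta([0.10, -0.05, 0.02], [0.08, 0.06, 0.07]))
  ```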
2019
- Ewing, Ewoud, Lara Kular, Sunjay J. Fernandes, Nestoras Karathanasis, Vincenzo Lagani, Sabrina Ruhrmann, Ioannis Tsamardinos, et al. “Combining Evidence from Four Immune Cell Types Identifies DNA Methylation Patterns That Implicate Functionally Distinct Pathways During Multiple Sclerosis Progression”. EBioMedicine 43: 411-423. doi:10.1016/j.ebiom.2019.04.042.
  Background Multiple Sclerosis (MS) is a chronic inflammatory disease and a leading cause of progressive neurological disability among young adults. DNA methylation, which intersects genes and environment to control cellular functions on a molecular level, may provide insights into MS pathogenesis. Methods We measured DNA methylation in CD4+ T cells (n = 31), CD8+ T cells (n = 28), CD14+ monocytes (n = 35) and CD19+ B cells (n = 27) from relapsing-remitting (RRMS), secondary progressive (SPMS) patients and healthy controls (HC) using Infinium HumanMethylation450 arrays. Monocyte (n = 25) and whole blood (n = 275) cohorts were used for validations. Findings B cells from MS patients displayed most significant differentially methylated positions (DMPs), followed by monocytes, while only few DMPs were detected in T cells. We implemented a non-parametric combination framework (omicsNPC) to increase discovery power by combining evidence from all four cell types. Identified shared DMPs co-localized at MS risk loci and clustered into distinct groups. Functional exploration of changes discriminating RRMS and SPMS from HC implicated lymphocyte signaling, T cell activation and migration. SPMS-specific changes, on the other hand, implicated myeloid cell functions and metabolism. Interestingly, neuronal and neurodegenerative genes and pathways were also specifically enriched in the SPMS cluster. Interpretation We utilized a statistical framework (omicsNPC) that combines multiple layers of evidence to identify DNA methylation changes that provide new insights into MS pathogenesis in general, and disease progression, in particular. Fund This work was supported by the Swedish Research Council, Stockholm County Council, AstraZeneca, European Research Council, Karolinska Institutet and Margaretha af Ugglas Foundation.
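  The non-parametric combination (NPC) step behind omicsNPC can be pictured as permutation-calibrated Fisher combining of per-cell-type p-values. The sketch below assumes the per-permutation p-values have already been computed in a way that preserves the dependence structure across datasets; it is a simplified illustration, not the omicsNPC implementation.

  ```python
  # NPC with Fisher's combining function, calibrated by permutations.
  import numpy as np

  def npc_fisher(p_obs, p_perm):
      """p_obs: (m, k) observed p-values, m features x k cell types;
      p_perm: (B, m, k) the same statistics under B label permutations."""
      t_obs  = -2 * np.log(p_obs).sum(axis=1)    # Fisher combining function
      t_perm = -2 * np.log(p_perm).sum(axis=2)   # shape (B, m)
      return (t_perm >= t_obs).mean(axis=0)      # combined p-value per feature
  ```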
@article{Ewing_2019,
abstract = {Background Multiple Sclerosis (MS) is a chronic inflammatory disease and a leading cause of progressive neurological disability among young adults. DNA methylation, which intersects genes and environment to control cellular functions on a molecular level, may provide insights into MS pathogenesis. Methods We measured DNA methylation in CD4+ T cells (n = 31), CD8+ T cells (n = 28), CD14+ monocytes (n = 35) and CD19+ B cells (n = 27) from relapsing-remitting (RRMS), secondary progressive (SPMS) patients and healthy controls (HC) using Infinium HumanMethylation450 arrays. Monocyte (n = 25) and whole blood (n = 275) cohorts were used for validations. Findings B cells from MS patients displayed most significant differentially methylated positions (DMPs), followed by monocytes, while only few DMPs were detected in T cells. We implemented a non-parametric combination framework (omicsNPC) to increase discovery power by combining evidence from all four cell types. Identified shared DMPs co-localized at MS risk loci and clustered into distinct groups. Functional exploration of changes discriminating RRMS and SPMS from HC implicated lymphocyte signaling, T cell activation and migration. SPMS-specific changes, on the other hand, implicated myeloid cell functions and metabolism. Interestingly, neuronal and neurodegenerative genes and pathways were also specifically enriched in the SPMS cluster. Interpretation We utilized a statistical framework (omicsNPC) that combines multiple layers of evidence to identify DNA methylation changes that provide new insights into MS pathogenesis in general, and disease progression, in particular. Fund This work was supported by the Swedish Research Council, Stockholm County Council, AstraZeneca, European Research Council, Karolinska Institutet and Margaretha af Ugglas Foundation.},
author = {Ewing, Ewoud and Kular, Lara and Fernandes, Sunjay J. and Karathanasis, Nestoras and Lagani, Vincenzo and Ruhrmann, Sabrina and Tsamardinos, Ioannis and Tegner, Jesper and Piehl, Fredrik and Gomez-Cabrero, David and Jagodic, Maja},
journal = {EBioMedicine},
keywords = {mxmcausalpath},
month = {may},
pages = {411--423},
publisher = {Elsevier BV},
title = {Combining evidence from four immune cell types identifies DNA methylation patterns that implicate functionally distinct pathways during Multiple Sclerosis progression},
volume = 43,
year = 2019
}%0 Journal Article
%1 Ewing_2019
%A Ewing, Ewoud
%A Kular, Lara
%A Fernandes, Sunjay J.
%A Karathanasis, Nestoras
%A Lagani, Vincenzo
%A Ruhrmann, Sabrina
%A Tsamardinos, Ioannis
%A Tegner, Jesper
%A Piehl, Fredrik
%A Gomez-Cabrero, David
%A Jagodic, Maja
%D 2019
%I Elsevier BV
%J EBioMedicine
%P 411--423
%R 10.1016/j.ebiom.2019.04.042
%T Combining evidence from four immune cell types identifies DNA methylation patterns that implicate functionally distinct pathways during Multiple Sclerosis progression
%U https://doi.org/10.1016/j.ebiom.2019.04.042
%V 43
%X Background Multiple Sclerosis (MS) is a chronic inflammatory disease and a leading cause of progressive neurological disability among young adults. DNA methylation, which intersects genes and environment to control cellular functions on a molecular level, may provide insights into MS pathogenesis. Methods We measured DNA methylation in CD4+ T cells (n = 31), CD8+ T cells (n = 28), CD14+ monocytes (n = 35) and CD19+ B cells (n = 27) from relapsing-remitting (RRMS) and secondary progressive (SPMS) patients and healthy controls (HC) using Infinium HumanMethylation450 arrays. Monocyte (n = 25) and whole blood (n = 275) cohorts were used for validations. Findings B cells from MS patients displayed the most significant differentially methylated positions (DMPs), followed by monocytes, while only a few DMPs were detected in T cells. We implemented a non-parametric combination framework (omicsNPC) to increase discovery power by combining evidence from all four cell types. Identified shared DMPs co-localized at MS risk loci and clustered into distinct groups. Functional exploration of changes discriminating RRMS and SPMS from HC implicated lymphocyte signaling, T cell activation and migration. SPMS-specific changes, on the other hand, implicated myeloid cell functions and metabolism. Interestingly, neuronal and neurodegenerative genes and pathways were also specifically enriched in the SPMS cluster. Interpretation We utilized a statistical framework (omicsNPC) that combines multiple layers of evidence to identify DNA methylation changes that provide new insights into MS pathogenesis in general, and disease progression in particular. Fund This work was supported by the Swedish Research Council, Stockholm County Council, AstraZeneca, European Research Council, Karolinska Institutet and Margaretha af Ugglas Foundation.
- Borboudakis, Giorgos, and Ioannis Tsamardinos. “Forward-Backward Selection With Early Dropping”. Edited by Isabelle Guyon. Journal Of Machine Learning Research 20, no. 8: 1-39. http://jmlr.org/papers/volume20/17-334/17-334.pdf. Forward-backward selection is one of the most basic and commonly-used feature selection algorithms available. It is also general and conceptually applicable to many different types of data. In this paper, we propose a heuristic that significantly improves its running time, while preserving predictive performance. The idea is to temporarily discard the variables that are conditionally independent of the outcome given the selected variable set. Depending on how those variables are reconsidered and reintroduced, this heuristic gives rise to a family of algorithms with increasingly stronger theoretical guarantees. In distributions that can be faithfully represented by Bayesian networks or maximal ancestral graphs, members of this algorithmic family are able to correctly identify the Markov blanket in the sample limit. In experiments we show that the proposed heuristic increases computational efficiency by about 1-2 orders of magnitude, while selecting fewer or the same number of variables and retaining predictive performance. Furthermore, we show that the proposed algorithm and feature selection with LASSO perform similarly when restricted to select the same number of variables, making the proposed algorithm an attractive alternative for problems where no (efficient) algorithm for LASSO exists.
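As a rough illustration of the early-dropping heuristic described in the abstract above, here is a compact sketch using a Fisher-z partial-correlation test for continuous data. The function names and the choice of test are mine; the published algorithm also specifies how dropped variables may be reconsidered in later runs, which this sketch omits.

# Sketch of forward selection with Early Dropping plus a backward pass,
# using a Fisher-z partial-correlation test for continuous data.
import numpy as np
from scipy import stats

def pcor_pvalue(x, y, Z):
    """p-value for the partial correlation of x and y given columns of Z."""
    n = len(x)
    if Z.shape[1] > 0:
        Q, _ = np.linalg.qr(np.column_stack([np.ones(n), Z]))
        x = x - Q @ (Q.T @ x)          # residualize x and y on Z
        y = y - Q @ (Q.T @ y)
    r = np.corrcoef(x, y)[0, 1]
    z = np.arctanh(np.clip(r, -0.999999, 0.999999))
    return 2 * stats.norm.sf(abs(z) * np.sqrt(n - Z.shape[1] - 3))

def fbed_sketch(X, y, alpha=0.05):
    selected, remaining = [], list(range(X.shape[1]))
    changed = True
    while changed and remaining:       # forward phase with Early Dropping
        changed = False
        pvals = {j: pcor_pvalue(X[:, j], y, X[:, selected]) for j in remaining}
        best = min(pvals, key=pvals.get)
        if pvals[best] < alpha:
            selected.append(best)
            changed = True
        # Early Dropping: discard variables judged conditionally independent
        remaining = [j for j in remaining if j != best and pvals[j] < alpha]
    for j in list(selected):           # backward phase: prune redundancies
        rest = [k for k in selected if k != j]
        if pcor_pvalue(X[:, j], y, X[:, rest]) >= alpha:
            selected.remove(j)
    return selected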
@article{guyon2019forwardbackward,
abstract = {Forward-backward selection is one of the most basic and commonly-used feature selection algorithms available. It is also general and conceptually applicable to many different types of data. In this paper, we propose a heuristic that significantly improves its running time, while preserving predictive performance. The idea is to temporarily discard the variables that are conditionally independent of the outcome given the selected variable set. Depending on how those variables are reconsidered and reintroduced, this heuristic gives rise to a family of algorithms with increasingly stronger theoretical guarantees. In distributions that can be faithfully represented by Bayesian networks or maximal ancestral graphs, members of this algorithmic family are able to correctly identify the Markov blanket in the sample limit. In experiments we show that the proposed heuristic increases computational efficiency by about 1-2 orders of magnitude, while selecting fewer or the same number of variables and retaining predictive performance. Furthermore, we show that the proposed algorithm and feature selection with LASSO perform similarly when restricted to select the same number of variables, making the proposed algorithm an attractive alternative for problems where no (efficient) algorithm for LASSO exists.},
author = {Borboudakis, Giorgos and Tsamardinos, Ioannis},
editor = {Guyon, Isabelle},
journal = {Journal of Machine Learning Research},
keywords = {mxmcausalpath},
month = {jan},
number = 8,
pages = {1-39},
title = {Forward-Backward Selection with Early Dropping},
volume = 20,
year = 2019
}%0 Journal Article
%1 guyon2019forwardbackward
%A Borboudakis, Giorgos
%A Tsamardinos, Ioannis
%D 2019
%E Guyon, Isabelle
%J Journal of Machine Learning Research
%N 8
%P 1-39
%T Forward-Backward Selection with Early Dropping
%U http://jmlr.org/papers/volume20/17-334/17-334.pdf
%V 20
%X Forward-backward selection is one of the most basic and commonly-used feature selection algorithms available. It is also general and conceptually applicable to many different types of data. In this paper, we propose a heuristic that significantly improves its running time, while preserving predictive performance. The idea is to temporarily discard the variables that are conditionally independent of the outcome given the selected variable set. Depending on how those variables are reconsidered and reintroduced, this heuristic gives rise to a family of algorithms with increasingly stronger theoretical guarantees. In distributions that can be faithfully represented by Bayesian networks or maximal ancestral graphs, members of this algorithmic family are able to correctly identify the Markov blanket in the sample limit. In experiments we show that the proposed heuristic increases computational efficiency by about 1-2 orders of magnitude, while selecting fewer or the same number of variables and retaining predictive performance. Furthermore, we show that the proposed algorithm and feature selection with LASSO perform similarly when restricted to select the same number of variables, making the proposed algorithm an attractive alternative for problems where no (efficient) algorithm for LASSO exists.
- Pantazis, Yannis, and Ioannis Tsamardinos. “A Unified Approach For Sparse Dynamical System Inference From Temporal Measurements”. Bioinformatics. doi:10.1093/bioinformatics/btz065. Temporal variations in biological systems and more generally in natural sciences are typically modeled as a set of ordinary, partial or stochastic differential or difference equations. Algorithms for learning the structure and the parameters of a dynamical system are distinguished based on whether time is discrete or continuous, observations are time-series or time-course and whether the system is deterministic or stochastic; however, there is no approach able to handle the various types of dynamical systems simultaneously. In this paper, we present a unified approach to infer both the structure and the parameters of non-linear dynamical systems of any type under the restriction of being linear with respect to the unknown parameters. Our approach, which is named Unified Sparse Dynamics Learning (USDL), consists of two steps. First, an atemporal system of equations is derived through the application of the weak formulation. Then, assuming a sparse representation for the dynamical system, we show that the inference problem can be expressed as a sparse signal recovery problem, allowing the application of an extensive body of algorithms and theoretical results. Results on simulated data demonstrate the efficacy and superiority of the USDL algorithm under multiple interventions and/or stochasticity. Additionally, USDL’s accuracy significantly correlates with theoretical metrics such as the exact recovery coefficient. On real single-cell data, the proposed approach is able to induce high-confidence subgraphs of the signaling pathway. Source code is available at Bioinformatics online. The USDL algorithm has also been integrated in SCENERY (http://scenery.csd.uoc.gr/), an online tool for single-cell mass cytometry analytics. Supplementary data are available at Bioinformatics online.
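To make the sparse-recovery step concrete: the sketch below fits each state derivative as a sparse combination of candidate library terms by iterated least squares with thresholding. It replaces USDL's weak formulation with crude finite-difference derivatives and only illustrates the general recipe, not the authors' method; all names are mine.

# Sketch of sparse dynamical-system inference: fit dx/dt as a sparse
# combination of candidate terms. USDL's weak formulation is replaced
# here by finite-difference derivatives for brevity.
import numpy as np

def sparse_dynamics(t, X, candidates, names, threshold=0.02, iters=10):
    """t: (T,) time points; X: (T, d) states; candidates: functions
    mapping X to one (T,) column of the term library each."""
    dXdt = np.gradient(X, t, axis=0)                 # (T, d)
    Theta = np.column_stack([f(X) for f in candidates])
    coefs = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(iters):                           # iterated thresholding
        small = np.abs(coefs) < threshold
        coefs[small] = 0.0
        for j in range(X.shape[1]):                  # refit each equation
            keep = ~small[:, j]
            if keep.any():
                coefs[keep, j] = np.linalg.lstsq(
                    Theta[:, keep], dXdt[:, j], rcond=None)[0]
    for j in range(X.shape[1]):
        terms = [f"{coefs[i, j]:+.3f}*{names[i]}"
                 for i in range(len(names)) if coefs[i, j] != 0]
        print(f"dx{j}/dt =", " ".join(terms) or "0")
    return coefs

# Example: recover the damped oscillator x' = -0.05x + y, y' = -x - 0.05y.
t = np.linspace(0, 20, 2000)
x, y = np.cos(t) * np.exp(-0.05 * t), -np.sin(t) * np.exp(-0.05 * t)
X = np.column_stack([x, y])
lib = [lambda X: X[:, 0], lambda X: X[:, 1], lambda X: X[:, 0] * X[:, 1]]
sparse_dynamics(t, X, lib, ["x", "y", "x*y"])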
@article{10.1093/bioinformatics/btz065,
abstract = {Temporal variations in biological systems and more generally in natural sciences are typically modeled as a set of ordinary, partial or stochastic differential or difference equations. Algorithms for learning the structure and the parameters of a dynamical system are distinguished based on whether time is discrete or continuous, observations are time-series or time-course and whether the system is deterministic or stochastic; however, there is no approach able to handle the various types of dynamical systems simultaneously. In this paper, we present a unified approach to infer both the structure and the parameters of non-linear dynamical systems of any type under the restriction of being linear with respect to the unknown parameters. Our approach, which is named Unified Sparse Dynamics Learning (USDL), consists of two steps. First, an atemporal system of equations is derived through the application of the weak formulation. Then, assuming a sparse representation for the dynamical system, we show that the inference problem can be expressed as a sparse signal recovery problem, allowing the application of an extensive body of algorithms and theoretical results. Results on simulated data demonstrate the efficacy and superiority of the USDL algorithm under multiple interventions and/or stochasticity. Additionally, USDL’s accuracy significantly correlates with theoretical metrics such as the exact recovery coefficient. On real single-cell data, the proposed approach is able to induce high-confidence subgraphs of the signaling pathway. Source code is available at Bioinformatics online. The USDL algorithm has also been integrated in SCENERY (http://scenery.csd.uoc.gr/), an online tool for single-cell mass cytometry analytics. Supplementary data are available at Bioinformatics online.},
author = {Pantazis, Yannis and Tsamardinos, Ioannis},
journal = {Bioinformatics},
keywords = {mxmcausalpath},
month = {jan},
title = {A unified approach for sparse dynamical system inference from temporal measurements},
year = 2019
}%0 Journal Article
%1 10.1093/bioinformatics/btz065
%A Pantazis, Yannis
%A Tsamardinos, Ioannis
%D 2019
%J Bioinformatics
%R 10.1093/bioinformatics/btz065
%T A unified approach for sparse dynamical system inference from temporal measurements
%U https://doi.org/10.1093/bioinformatics/btz065
%X Temporal variations in biological systems and more generally in natural sciences are typically modeled as a set of ordinary, partial or stochastic differential or difference equations. Algorithms for learning the structure and the parameters of a dynamical system are distinguished based on whether time is discrete or continuous, observations are time-series or time-course and whether the system is deterministic or stochastic; however, there is no approach able to handle the various types of dynamical systems simultaneously. In this paper, we present a unified approach to infer both the structure and the parameters of non-linear dynamical systems of any type under the restriction of being linear with respect to the unknown parameters. Our approach, which is named Unified Sparse Dynamics Learning (USDL), consists of two steps. First, an atemporal system of equations is derived through the application of the weak formulation. Then, assuming a sparse representation for the dynamical system, we show that the inference problem can be expressed as a sparse signal recovery problem, allowing the application of an extensive body of algorithms and theoretical results. Results on simulated data demonstrate the efficacy and superiority of the USDL algorithm under multiple interventions and/or stochasticity. Additionally, USDL’s accuracy significantly correlates with theoretical metrics such as the exact recovery coefficient. On real single-cell data, the proposed approach is able to induce high-confidence subgraphs of the signaling pathway. Source code is available at Bioinformatics online. The USDL algorithm has also been integrated in SCENERY (http://scenery.csd.uoc.gr/), an online tool for single-cell mass cytometry analytics. Supplementary data are available at Bioinformatics online.
- Ferreirós-Vidal, Isabel, Thomas Carroll, Tianyi Zhang, Vincenzo Lagani, Ricardo N. Ramirez, Elizabeth Ing-Simmons, Alicia Garcia, et al. “Feedforward Regulation Of Myc Coordinates Lineage-Specific With Housekeeping Gene Expression During B Cell Progenitor Cell Differentiation”. PLOS Biology 17, no. 4: 1-28. doi:10.1371/journal.pbio.2006506. The human body is made from billions of cells comprising many specialized cell types. All of these cells ultimately come from a single fertilized oocyte in a process that has two key features: proliferation, which expands cell numbers, and differentiation, which diversifies cell types. Here, we have examined the transition from proliferation to differentiation using B lymphocytes as an example. We find that the transition from proliferation to differentiation involves changes in the expression of genes, which can be categorized into cell-type–specific genes and broadly expressed “housekeeping” genes. The expression of many housekeeping genes is controlled by the gene regulatory factor Myc, whereas the expression of many B lymphocyte–specific genes is controlled by the Ikaros family of gene regulatory proteins. Myc is repressed by Ikaros, which means that changes in housekeeping and tissue-specific gene expression are coordinated during the transition from proliferation to differentiation.
@article{10.1371/journal.pbio.2006506,
abstract = {The human body is made from billions of cells comprising many specialized cell types. All of these cells ultimately come from a single fertilized oocyte in a process that has two key features: proliferation, which expands cell numbers, and differentiation, which diversifies cell types. Here, we have examined the transition from proliferation to differentiation using B lymphocytes as an example. We find that the transition from proliferation to differentiation involves changes in the expression of genes, which can be categorized into cell-type–specific genes and broadly expressed “housekeeping” genes. The expression of many housekeeping genes is controlled by the gene regulatory factor Myc, whereas the expression of many B lymphocyte–specific genes is controlled by the Ikaros family of gene regulatory proteins. Myc is repressed by Ikaros, which means that changes in housekeeping and tissue-specific gene expression are coordinated during the transition from proliferation to differentiation.},
author = {Ferreirós-Vidal, Isabel and Carroll, Thomas and Zhang, Tianyi and Lagani, Vincenzo and Ramirez, Ricardo N. and Ing-Simmons, Elizabeth and Garcia, Alicia and Cooper, Lee and Liang, Ziwei and Papoutsoglou, Georgios and Dharmalingam, Gopuraja and Guo, Ya and Tarazona, Sonia and Fernandes, Sunjay J. and Noori, Peri and Silberberg, Gilad and Fisher, Amanda G. and Tsamardinos, Ioannis and Mortazavi, Ali and Lenhard, Boris and Conesa, Ana and Tegner, Jesper and Merkenschlager, Matthias and Gomez-Cabrero, David},
journal = {PLOS Biology},
keywords = {mxmcausalpath},
month = {apr},
number = 4,
pages = {1-28},
publisher = {Public Library of Science},
title = {Feedforward regulation of Myc coordinates lineage-specific with housekeeping gene expression during B cell progenitor cell differentiation},
volume = 17,
year = 2019
}%0 Journal Article
%1 10.1371/journal.pbio.2006506
%A Ferreirós-Vidal, Isabel
%A Carroll, Thomas
%A Zhang, Tianyi
%A Lagani, Vincenzo
%A Ramirez, Ricardo N.
%A Ing-Simmons, Elizabeth
%A Garcia, Alicia
%A Cooper, Lee
%A Liang, Ziwei
%A Papoutsoglou, Georgios
%A Dharmalingam, Gopuraja
%A Guo, Ya
%A Tarazona, Sonia
%A Fernandes, Sunjay J.
%A Noori, Peri
%A Silberberg, Gilad
%A Fisher, Amanda G.
%A Tsamardinos, Ioannis
%A Mortazavi, Ali
%A Lenhard, Boris
%A Conesa, Ana
%A Tegner, Jesper
%A Merkenschlager, Matthias
%A Gomez-Cabrero, David
%D 2019
%I Public Library of Science
%J PLOS Biology
%N 4
%P 1-28
%R 10.1371/journal.pbio.2006506
%T Feedforward regulation of Myc coordinates lineage-specific with housekeeping gene expression during B cell progenitor cell differentiation
%U https://doi.org/10.1371/journal.pbio.2006506
%V 17
%X The human body is made from billions of cells comprising many specialized cell types. All of these cells ultimately come from a single fertilized oocyte in a process that has two key features: proliferation, which expands cell numbers, and differentiation, which diversifies cell types. Here, we have examined the transition from proliferation to differentiation using B lymphocytes as an example. We find that the transition from proliferation to differentiation involves changes in the expression of genes, which can be categorized into cell-type–specific genes and broadly expressed “housekeeping” genes. The expression of many housekeeping genes is controlled by the gene regulatory factor Myc, whereas the expression of many B lymphocyte–specific genes is controlled by the Ikaros family of gene regulatory proteins. Myc is repressed by Ikaros, which means that changes in housekeeping and tissue-specific gene expression are coordinated during the transition from proliferation to differentiation.
- Loos, Maria S., Reshmi Ramakrishnan, Wim Vranken, Alexandra Tsirigotaki, Evrydiki-Pandora Tsare, Valentina Zorzini, Jozefien De Geyter, et al. “Structural Basis Of The Subcellular Topology Landscape Of Escherichia Coli”. Frontiers In Microbiology 10. doi:10.3389/fmicb.2019.01670. Cellular proteomes are distributed in multiple compartments: on DNA, ribosomes, on and inside membranes, or they become secreted. Structural properties that allow polypeptides to occupy subcellular niches, particularly after crossing membranes, remain unclear. We compared intrinsic and extrinsic features in cytoplasmic and secreted polypeptides of the Escherichia coli K-12 proteome. Structural features between the cytoplasmome and secretome are sharply distinct, such that a signal peptide-agnostic machine learning tool distinguishes cytoplasmic from secreted proteins with 95.5% success. Cytoplasmic polypeptides are enriched in aliphatic, aromatic, charged and hydrophobic residues, unique folds and higher early folding propensities. Secretory polypeptides are enriched in polar/small amino acids, β folds, have higher backbone dynamics, higher disorder and contact order and are more often intrinsically disordered. These non-random distributions and experimental evidence imply that evolutionary pressure selected enhanced secretome flexibility, slow folding and looser structures, placing the secretome in a distinct protein class. These adaptations protect the secretome from premature folding during its cytoplasmic transit, optimize its lipid bilayer crossing and allow it to acquire cell envelope-specific chemistries. The latter may favor promiscuous multi-ligand binding, sensing of stress and cell envelope structure changes. In conclusion, enhanced flexibility, slow folding, looser structures and unique folds differentiate the secretome from the cytoplasmome. These findings have wide implications on the structural diversity and evolution of modern proteomes and the protein folding problem.
@article{Loos_2019,
abstract = {Cellular proteomes are distributed in multiple compartments: on DNA, ribosomes, on and inside membranes, or they become secreted. Structural properties that allow polypeptides to occupy subcellular niches, particularly after crossing membranes, remain unclear. We compared intrinsic and extrinsic features in cytoplasmic and secreted polypeptides of the Escherichia coli K-12 proteome. Structural features between the cytoplasmome and secretome are sharply distinct, such that a signal peptide-agnostic machine learning tool distinguishes cytoplasmic from secreted proteins with 95.5% success. Cytoplasmic polypeptides are enriched in aliphatic, aromatic, charged and hydrophobic residues, unique folds and higher early folding propensities. Secretory polypeptides are enriched in polar/small amino acids, β folds, have higher backbone dynamics, higher disorder and contact order and are more often intrinsically disordered. These non-random distributions and experimental evidence imply that evolutionary pressure selected enhanced secretome flexibility, slow folding and looser structures, placing the secretome in a distinct protein class. These adaptations protect the secretome from premature folding during its cytoplasmic transit, optimize its lipid bilayer crossing and allow it to acquire cell envelope-specific chemistries. The latter may favor promiscuous multi-ligand binding, sensing of stress and cell envelope structure changes. In conclusion, enhanced flexibility, slow folding, looser structures and unique folds differentiate the secretome from the cytoplasmome. These findings have wide implications on the structural diversity and evolution of modern proteomes and the protein folding problem.},
author = {Loos, Maria S. and Ramakrishnan, Reshmi and Vranken, Wim and Tsirigotaki, Alexandra and Tsare, Evrydiki-Pandora and Zorzini, Valentina and Geyter, Jozefien De and Yuan, Biao and Tsamardinos, Ioannis and Klappa, Maria and Schymkowitz, Joost and Rousseau, Frederic and Karamanou, Spyridoula and Economou, Anastassios},
journal = {Frontiers in Microbiology},
keywords = {mxmcausalpath},
month = {jul},
publisher = {Frontiers Media SA},
title = {Structural Basis of the Subcellular Topology Landscape of Escherichia coli},
volume = 10,
year = 2019
}%0 Journal Article
%1 Loos_2019
%A Loos, Maria S.
%A Ramakrishnan, Reshmi
%A Vranken, Wim
%A Tsirigotaki, Alexandra
%A Tsare, Evrydiki-Pandora
%A Zorzini, Valentina
%A Geyter, Jozefien De
%A Yuan, Biao
%A Tsamardinos, Ioannis
%A Klappa, Maria
%A Schymkowitz, Joost
%A Rousseau, Frederic
%A Karamanou, Spyridoula
%A Economou, Anastassios
%D 2019
%I Frontiers Media SA
%J Frontiers in Microbiology
%R 10.3389/fmicb.2019.01670
%T Structural Basis of the Subcellular Topology Landscape of Escherichia coli
%U https://doi.org/10.3389/fmicb.2019.01670
%V 10
%X Cellular proteomes are distributed in multiple compartments: on DNA, ribosomes, on and inside membranes, or they become secreted. Structural properties that allow polypeptides to occupy subcellular niches, particularly after crossing membranes, remain unclear. We compared intrinsic and extrinsic features in cytoplasmic and secreted polypeptides of the Escherichia coli K-12 proteome. Structural features between the cytoplasmome and secretome are sharply distinct, such that a signal peptide-agnostic machine learning tool distinguishes cytoplasmic from secreted proteins with 95.5% success. Cytoplasmic polypeptides are enriched in aliphatic, aromatic, charged and hydrophobic residues, unique folds and higher early folding propensities. Secretory polypeptides are enriched in polar/small amino acids, β folds, have higher backbone dynamics, higher disorder and contact order and are more often intrinsically disordered. These non-random distributions and experimental evidence imply that evolutionary pressure selected enhanced secretome flexibility, slow folding and looser structures, placing the secretome in a distinct protein class. These adaptations protect the secretome from premature folding during its cytoplasmic transit, optimize its lipid bilayer crossing and allow it to acquire cell envelope-specific chemistries. The latter may favor promiscuous multi-ligand binding, sensing of stress and cell envelope structure changes. In conclusion, enhanced flexibility, slow folding, looser structures and unique folds differentiate the secretome from the cytoplasmome. These findings have wide implications on the structural diversity and evolution of modern proteomes and the protein folding problem.
- Lakiotaki, Kleanthi, George Georgakopoulos, Elias Castanas, Oluf Dimitri Røe, Giorgos Borboudakis, and Ioannis Tsamardinos. “A Data Driven Approach Reveals Disease Similarity On A Molecular Level”. npj Systems Biology and Applications 5, no. 39: 1-10. doi:10.1038/s41540-019-0117-0. Could there be unexpected similarities between different studies, diseases, or treatments, on a molecular level due to common biological mechanisms involved? To answer this question, we develop a method for computing similarities between empirical, statistical distributions of high-dimensional, low-sample datasets, and apply it on hundreds of -omics studies. The similarities lead to dataset-to-dataset networks visualizing the landscape of a large portion of biological data. Potentially interesting similarities connecting studies of different diseases are assembled in a disease-to-disease network. Exploring it, we discover numerous non-trivial connections between Alzheimer’s disease and schizophrenia, asthma and psoriasis, or liver cancer and obesity, to name a few. We then present a method that identifies the molecular quantities and pathways that contribute the most to the identified similarities and could point to novel drug targets or provide biological insights. The proposed method acts as a “statistical telescope” providing a global view of the constellation of biological data; readers can peek through it at: http://datascope.csd.uoc.gr:25000/.
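The abstract does not spell out the similarity measure, so the sketch below merely illustrates one plausible construction of a dataset-to-dataset network: summarize each study by the empirical distribution of a per-gene statistic and compare studies by a distributional distance. Every choice here (coefficient of variation, Kolmogorov-Smirnov distance, the cutoff) is an assumption made for illustration, not the paper's method.

# Illustrative dataset-to-dataset similarity: represent each study by the
# empirical distribution of a per-gene summary and compare distributions.
# One plausible construction, not the paper's actual measure.
import numpy as np
from scipy import stats

def study_signature(X):
    """Per-gene coefficient of variation as a simple distributional summary.
    X: (n_samples, n_genes) expression matrix."""
    mu = np.abs(X.mean(axis=0))
    return X.std(axis=0) / np.where(mu < 1e-12, 1e-12, mu)

def similarity_matrix(studies):
    """studies: list of expression matrices (genes may differ across studies).
    S[i, j] = 1 - Kolmogorov-Smirnov distance between signature distributions."""
    sigs = [study_signature(X) for X in studies]
    k = len(studies)
    S = np.eye(k)
    for i in range(k):
        for j in range(i + 1, k):
            d = stats.ks_2samp(sigs[i], sigs[j]).statistic
            S[i, j] = S[j, i] = 1.0 - d
    return S

def network_edges(S, cutoff=0.8):
    """Edges of the dataset-to-dataset network: pairs above a similarity cutoff."""
    return [(i, j) for i in range(len(S)) for j in range(i + 1, len(S))
            if S[i, j] >= cutoff]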
@article{Lakiotaki2019,
abstract = {Could there be unexpected similarities between different studies, diseases, or treatments, on a molecular level due to common biological mechanisms involved? To answer this question, we develop a method for computing similarities between empirical, statistical distributions of high-dimensional, low-sample datasets, and apply it on hundreds of -omics studies. The similarities lead to dataset-to-dataset networks visualizing the landscape of a large portion of biological data. Potentially interesting similarities connecting studies of different diseases are assembled in a disease-to-disease network. Exploring it, we discover numerous non-trivial connections between Alzheimer’s disease and schizophrenia, asthma and psoriasis, or liver cancer and obesity, to name a few. We then present a method that identifies the molecular quantities and pathways that contribute the most to the identified similarities and could point to novel drug targets or provide biological insights. The proposed method acts as a “statistical telescope” providing a global view of the constellation of biological data; readers can peek through it at: http://datascope.csd.uoc.gr:25000/.},
author = {Lakiotaki, Kleanthi and Georgakopoulos, George and Castanas, Elias and Røe, Oluf Dimitri and Borboudakis, Giorgos and Tsamardinos, Ioannis},
journal = {npj Systems Biology and Applications},
keywords = {mxmcausalpath},
month = {oct},
number = 39,
pages = {1-10},
title = {A data driven approach reveals disease similarity on a molecular level},
volume = 5,
year = 2019
}%0 Journal Article
%1 Lakiotaki2019
%A Lakiotaki, Kleanthi
%A Georgakopoulos, George
%A Castanas, Elias
%A Røe, Oluf Dimitri
%A Borboudakis, Giorgos
%A Tsamardinos, Ioannis
%D 2019
%J npj Systems Biology and Applications
%N 39
%P 1-10
%R 10.1038/s41540-019-0117-0
%T A data driven approach reveals disease similarity on a molecular level
%U https://www.nature.com/articles/s41540-019-0117-0
%V 5
%X Could there be unexpected similarities between different studies, diseases, or treatments, on a molecular level due to common biological mechanisms involved? To answer this question, we develop a method for computing similarities between empirical, statistical distributions of high-dimensional, low-sample datasets, and apply it on hundreds of -omics studies. The similarities lead to dataset-to-dataset networks visualizing the landscape of a large portion of biological data. Potentially interesting similarities connecting studies of different diseases are assembled in a disease-to-disease network. Exploring it, we discover numerous non-trivial connections between Alzheimer’s disease and schizophrenia, asthma and psoriasis, or liver cancer and obesity, to name a few. We then present a method that identifies the molecular quantities and pathways that contribute the most to the identified similarities and could point to novel drug targets or provide biological insights. The proposed method acts as a “statistical telescope” providing a global view of the constellation of biological data; readers can peek through it at: http://datascope.csd.uoc.gr:25000/.
- Fernandes, Sunjay J., Hiromasa Morikawa, Ewoud Ewing, Sabrina Ruhrmann, Narayan Joshi Rubin, Vincenzo Lagani, Nestoras Karathanasis, et al. “Non-Parametric Combination Analysis Of Multiple Data Types Enables Detection Of Novel Regulatory Mechanisms In T Cells Of Multiple Sclerosis Patients”. Scientific Reports 9, no. 11996. doi:10.1038/s41598-019-48493-7. Multiple Sclerosis (MS) is an autoimmune disease of the central nervous system with prominent neurodegenerative components. The triggering and progression of MS are associated with transcriptional and epigenetic alterations in several tissues, including peripheral blood. The combined influence of transcriptional and epigenetic changes associated with MS has not been assessed in the same individuals. Here we generated paired transcriptomic (RNA-seq) and DNA methylation (Illumina 450 K array) profiles of CD4+ and CD8+ T cells (CD4, CD8), using clinically accessible blood from healthy donors and MS patients in the initial relapsing-remitting and subsequent secondary-progressive stage. By integrating the output of a differential expression test with a permutation-based non-parametric combination methodology, we identified 149 differentially expressed (DE) genes in both CD4 and CD8 cells collected from MS patients. Moreover, by leveraging the methylation-dependent regulation of gene expression, we identified the gene SH3YL1, which displayed significant correlated expression and methylation changes in MS patients. Importantly, silencing of SH3YL1 in primary human CD4 cells demonstrated its influence on T cell activation. Collectively, our strategy based on paired sampling of several cell-types provides a novel approach to increase sensitivity for identifying shared mechanisms altered in CD4 and CD8 cells of relevance in MS in small-sized clinical materials.
@article{jude2019nonparametric,
abstract = {Multiple Sclerosis (MS) is an autoimmune disease of the central nervous system with prominent neurodegenerative components. The triggering and progression of MS are associated with transcriptional and epigenetic alterations in several tissues, including peripheral blood. The combined influence of transcriptional and epigenetic changes associated with MS has not been assessed in the same individuals. Here we generated paired transcriptomic (RNA-seq) and DNA methylation (Illumina 450 K array) profiles of CD4+ and CD8+ T cells (CD4, CD8), using clinically accessible blood from healthy donors and MS patients in the initial relapsing-remitting and subsequent secondary-progressive stage. By integrating the output of a differential expression test with a permutation-based non-parametric combination methodology, we identified 149 differentially expressed (DE) genes in both CD4 and CD8 cells collected from MS patients. Moreover, by leveraging the methylation-dependent regulation of gene expression, we identified the gene SH3YL1, which displayed significant correlated expression and methylation changes in MS patients. Importantly, silencing of SH3YL1 in primary human CD4 cells demonstrated its influence on T cell activation. Collectively, our strategy based on paired sampling of several cell-types provides a novel approach to increase sensitivity for identifying shared mechanisms altered in CD4 and CD8 cells of relevance in MS in small-sized clinical materials.},
author = {Fernandes, Sunjay J. and Morikawa, Hiromasa and Ewing, Ewoud and Ruhrmann, Sabrina and Joshi Rubin, Narayan and Lagani, Vincenzo and Karathanasis, Nestoras and Khademi, Mohsen and Planell, Nuria and Schmidt, Angelika and Tsamardinos, Ioannis and Olsson, Tomas and Piehl, Fredrik and Kockum, Ingrid and Jagodic, Maja and Tegnér, Jesper and Gomez-Cabrero, David},
journal = {Scientific Reports},
keywords = {mxmcausalpath},
month = {aug},
number = 11996,
title = {Non-parametric combination analysis of multiple data types enables detection of novel regulatory mechanisms in T cells of multiple sclerosis patients},
volume = 9,
year = 2019
}%0 Journal Article
%1 jude2019nonparametric
%A Fernandes, Sunjay J.
%A Morikawa, Hiromasa
%A Ewing, Ewoud
%A Ruhrmann, Sabrina
%A Joshi Rubin, Narayan
%A Lagani, Vincenzo
%A Karathanasis, Nestoras
%A Khademi, Mohsen
%A Planell, Nuria
%A Schmidt, Angelika
%A Tsamardinos, Ioannis
%A Olsson, Tomas
%A Piehl, Fredrik
%A Kockum, Ingrid
%A Jagodic, Maja
%A Tegnér, Jesper
%A Gomez-Cabrero, David
%D 2019
%J Scientific Reports
%N 11996
%R 10.1038/s41598-019-48493-7
%T Non-parametric combination analysis of multiple data types enables detection of novel regulatory mechanisms in T cells of multiple sclerosis patients
%U https://www.nature.com/articles/s41598-019-48493-7
%V 9
%X Multiple Sclerosis (MS) is an autoimmune disease of the central nervous system with prominent neurodegenerative components. The triggering and progression of MS are associated with transcriptional and epigenetic alterations in several tissues, including peripheral blood. The combined influence of transcriptional and epigenetic changes associated with MS has not been assessed in the same individuals. Here we generated paired transcriptomic (RNA-seq) and DNA methylation (Illumina 450 K array) profiles of CD4+ and CD8+ T cells (CD4, CD8), using clinically accessible blood from healthy donors and MS patients in the initial relapsing-remitting and subsequent secondary-progressive stage. By integrating the output of a differential expression test with a permutation-based non-parametric combination methodology, we identified 149 differentially expressed (DE) genes in both CD4 and CD8 cells collected from MS patients. Moreover, by leveraging the methylation-dependent regulation of gene expression, we identified the gene SH3YL1, which displayed significant correlated expression and methylation changes in MS patients. Importantly, silencing of SH3YL1 in primary human CD4 cells demonstrated its influence on T cell activation. Collectively, our strategy based on paired sampling of several cell-types provides a novel approach to increase sensitivity for identifying shared mechanisms altered in CD4 and CD8 cells of relevance in MS in small-sized clinical materials.
- Tsagris, M, and I Tsamardinos. “Feature Selection With The R Package MXM”. F1000Research 7: 1505. doi:10.12688/f1000research.16216.2. Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only a few have been implemented in R and made publicly available as packages, while offering few options. The R package MXM offers a variety of feature selection algorithms, and has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc; b) it contains a variety of regression models that can be plugged into the feature selection algorithms (for example, with time to event data the user can choose among Cox, Weibull, log logistic or exponential regression); c) it includes an algorithm for detecting multiple solutions (many sets of statistically equivalent features; plainly speaking, two features can carry statistically equivalent information when substituting one for the other does not affect the inference or the conclusions); and d) it includes memory-efficient algorithms for high volume data, i.e., data that cannot be loaded into R (in a terminal with 16GB RAM, for example, R cannot directly load data of 16GB size; by utilizing the proper package, we load the data and then perform feature selection). In this paper, we qualitatively compare MXM with other relevant feature selection packages and discuss its advantages and disadvantages. Further, we provide a demonstration of MXM’s algorithms using real high-dimensional data from various applications.
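Point (c), multiple statistically equivalent solutions, can be illustrated with a toy conditional-independence criterion: a candidate feature f can substitute a selected feature s when each is rendered redundant for the outcome given the other. The sketch below uses a Fisher-z test for continuous data; it illustrates the concept only and is not MXM's SES code, and all names are mine.

# Toy illustration of "statistically equivalent features": a candidate f
# can stand in for a selected feature s if each is redundant for the
# outcome given the other (plus the rest of the selected set).
import numpy as np
from scipy import stats

def ci_pvalue(x, y, Z):
    """Fisher-z partial-correlation test of x independent of y given Z."""
    n = len(x)
    if Z.shape[1] > 0:
        Q, _ = np.linalg.qr(np.column_stack([np.ones(n), Z]))
        x, y = x - Q @ (Q.T @ x), y - Q @ (Q.T @ y)
    r = np.corrcoef(x, y)[0, 1]
    z = np.arctanh(np.clip(r, -0.999999, 0.999999))
    return 2 * stats.norm.sf(abs(z) * np.sqrt(n - Z.shape[1] - 3))

def equivalent_features(X, y, selected, alpha=0.05):
    """For each selected feature s, list candidates f that could replace it."""
    equiv = {}
    others = [j for j in range(X.shape[1]) if j not in selected]
    for s in selected:
        rest = [k for k in selected if k != s]
        subs = []
        for f in others:
            # f adds nothing once the full selected set is known, and
            # s adds nothing once f stands in for it
            if (ci_pvalue(X[:, f], y, X[:, selected]) >= alpha and
                    ci_pvalue(X[:, s], y, X[:, rest + [f]]) >= alpha):
                subs.append(f)
        equiv[s] = subs
    return equiv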
@article{Tsagris2019,
abstract = {Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only a few have been implemented in R and made publicly available as packages, while offering few options. The R package MXM offers a variety of feature selection algorithms, and has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc; b) it contains a variety of regression models that can be plugged into the feature selection algorithms (for example, with time to event data the user can choose among Cox, Weibull, log logistic or exponential regression); c) it includes an algorithm for detecting multiple solutions (many sets of statistically equivalent features; plainly speaking, two features can carry statistically equivalent information when substituting one for the other does not affect the inference or the conclusions); and d) it includes memory-efficient algorithms for high volume data, i.e., data that cannot be loaded into R (in a terminal with 16GB RAM, for example, R cannot directly load data of 16GB size; by utilizing the proper package, we load the data and then perform feature selection). In this paper, we qualitatively compare MXM with other relevant feature selection packages and discuss its advantages and disadvantages. Further, we provide a demonstration of MXM’s algorithms using real high-dimensional data from various applications.},
author = {Tsagris, M and Tsamardinos, I},
journal = {F1000Research},
keywords = {mxmcausalpath},
pages = 1505,
title = {Feature selection with the R package MXM},
volume = 7,
year = 2019
}%0 Journal Article
%1 Tsagris2019
%A Tsagris, M
%A Tsamardinos, I
%D 2019
%J F1000Research
%P 1505
%R 10.12688/f1000research.16216.2
%T Feature selection with the R package MXM
%V 7
%X Feature (or variable) selection is the process of identifying the minimal set of features with the highest predictive performance on the target variable of interest. Numerous feature selection algorithms have been developed over the years, but only a few have been implemented in R and made publicly available as packages, while offering few options. The R package MXM offers a variety of feature selection algorithms, and has unique features that make it advantageous over its competitors: a) it contains feature selection algorithms that can treat numerous types of target variables, including continuous, percentages, time to event (survival), binary, nominal, ordinal, clustered, counts, left censored, etc; b) it contains a variety of regression models that can be plugged into the feature selection algorithms (for example, with time to event data the user can choose among Cox, Weibull, log logistic or exponential regression); c) it includes an algorithm for detecting multiple solutions (many sets of statistically equivalent features; plainly speaking, two features can carry statistically equivalent information when substituting one for the other does not affect the inference or the conclusions); and d) it includes memory-efficient algorithms for high volume data, i.e., data that cannot be loaded into R (in a terminal with 16GB RAM, for example, R cannot directly load data of 16GB size; by utilizing the proper package, we load the data and then perform feature selection). In this paper, we qualitatively compare MXM with other relevant feature selection packages and discuss its advantages and disadvantages. Further, we provide a demonstration of MXM’s algorithms using real high-dimensional data from various applications.
- Papoutsoglou, Georgios, Vincenzo Lagani, Angelika Schmidt, Konstantinos Tsirlis, David Gomez-Cabrero, Jesper Tegner, and Ioannis Tsamardinos. “Challenges In The Multivariate Analysis Of Mass Cytometry Data: The Effect Of Randomization”. Cytometry Part A. doi:10.1002/cyto.a.23908. Cytometry by time‐of‐flight (CyTOF) has emerged as a high‐throughput single cell technology able to provide large samples of protein readouts. Already, there exists a large pool of advanced high‐dimensional analysis algorithms that explore the observed heterogeneous distributions, making intriguing biological inferences. A fact largely overlooked by these methods, however, is the effect of the established data preprocessing pipeline on the distributions of the measured quantities. In this article, we focus on randomization, a transformation used for improving data visualization, which can negatively affect multivariate data analysis methods such as dimensionality reduction, clustering, and network reconstruction algorithms. Our results indicate that randomization should be used only for visualization purposes, but not in conjunction with high‐dimensional analytical tools. © 2019 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.
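A small numerical illustration of the point made above: adding the uniform noise used for count "randomization" (here drawn from [-1, 0), one common convention) perturbs multivariate estimates such as correlations, and the effect grows as counts get lower. The simulated data and the noise convention below are assumptions made purely for the demonstration.

# Demonstration of how count "randomization" (uniform noise added to
# integer ion counts for visualization) attenuates correlation estimates.
import numpy as np

rng = np.random.default_rng(1)
n = 5000
latent = rng.poisson(2.0, size=n)               # shared low-count signal
counts = np.column_stack([latent + rng.poisson(0.5, n),
                          latent + rng.poisson(0.5, n)]).astype(float)

randomized = counts + rng.uniform(-1.0, 0.0, size=counts.shape)

print("corr raw       :", np.corrcoef(counts.T)[0, 1].round(3))
print("corr randomized:", np.corrcoef(randomized.T)[0, 1].round(3))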
@article{papoutsoglou2019challenges,
abstract = {Cytometry by time‐of‐flight (CyTOF) has emerged as a high‐throughput single cell technology able to provide large samples of protein readouts. Already, there exists a large pool of advanced high‐dimensional analysis algorithms that explore the observed heterogeneous distributions, making intriguing biological inferences. A fact largely overlooked by these methods, however, is the effect of the established data preprocessing pipeline on the distributions of the measured quantities. In this article, we focus on randomization, a transformation used for improving data visualization, which can negatively affect multivariate data analysis methods such as dimensionality reduction, clustering, and network reconstruction algorithms. Our results indicate that randomization should be used only for visualization purposes, but not in conjunction with high‐dimensional analytical tools. © 2019 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.},
author = {Papoutsoglou, Georgios and Lagani, Vincenzo and Schmidt, Angelika and Tsirlis, Konstantinos and Gomez-Cabrero, David and Tegner, Jesper and Tsamardinos, Ioannis},
journal = {Cytometry Part A},
keywords = {mxmcausalpath},
title = {Challenges in the Multivariate Analysis of Mass Cytometry Data: The Effect of Randomization},
year = 2019
}%0 Journal Article
%1 papoutsoglou2019challenges
%A Papoutsoglou, Georgios
%A Lagani, Vincenzo
%A Schmidt, Angelika
%A Tsirlis, Konstantinos
%A Gomez-Cabrero, David
%A Tegner, Jesper
%A Tsamardinos, Ioannis
%D 2019
%J Cytometry Part A
%R 10.1002/cyto.a.23908
%T Challenges in the Multivariate Analysis of Mass Cytometry Data: The Effect of Randomization
%U https://onlinelibrary.wiley.com/doi/full/10.1002/cyto.a.23908
%X Cytometry by time‐of‐flight (CyTOF) has emerged as a high‐throughput single cell technology able to provide large samples of protein readouts. Already, there exists a large pool of advanced high‐dimensional analysis algorithms that explore the observed heterogeneous distributions, making intriguing biological inferences. A fact largely overlooked by these methods, however, is the effect of the established data preprocessing pipeline on the distributions of the measured quantities. In this article, we focus on randomization, a transformation used for improving data visualization, which can negatively affect multivariate data analysis methods such as dimensionality reduction, clustering, and network reconstruction algorithms. Our results indicate that randomization should be used only for visualization purposes, but not in conjunction with high‐dimensional analytical tools. © 2019 The Authors. Cytometry Part A published by Wiley Periodicals, Inc. on behalf of International Society for Advancement of Cytometry.
2018
- Tsagris, Michail, Giorgos Borboudakis, Vincenzo Lagani, and Ioannis Tsamardinos. “Constraint-Based Causal Discovery With Mixed Data”. International Journal Of Data Science And Analytics 6, no. 1: 19-30. doi:10.1007/s41060-018-0097-y. We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs, respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.
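The following sketch shows one way to build a symmetric conditional independence test from likelihood-ratio tests of nested regressions, in the spirit described above: test x given Z against x given Z plus y, repeat with the roles swapped, and combine the two directional p-values (taking the maximum is just one simple combination). Gaussian linear models stand in for the per-type regression families of the paper; all names are illustrative.

# Symmetric conditional-independence test from likelihood-ratio tests of
# nested regressions. Gaussian linear models only; the paper plugs in a
# regression family appropriate to each variable's type.
import numpy as np
from scipy import stats

def gaussian_lrt(target, Z_small, Z_big):
    """p-value that the extra columns in Z_big improve a linear model.
    Z_small, Z_big: (n, k) arrays, k possibly 0, with nested columns."""
    n = len(target)
    def rss(Z):
        A = np.column_stack([np.ones(n), Z])     # intercept plus predictors
        resid = target - A @ np.linalg.lstsq(A, target, rcond=None)[0]
        return float(resid @ resid)
    # LR statistic for nested Gaussian models: n * log(RSS_small / RSS_big)
    lr = n * np.log(rss(Z_small) / rss(Z_big))
    return stats.chi2.sf(lr, Z_big.shape[1] - Z_small.shape[1])

def symmetric_ci_test(x, y, Z):
    """Symmetrized test of x independent of y given Z: both directions,
    combined by taking the maximum p-value."""
    p_xy = gaussian_lrt(x, Z, np.column_stack([Z, y]))
    p_yx = gaussian_lrt(y, Z, np.column_stack([Z, x]))
    return max(p_xy, p_yx)

# Tiny usage example with an empty conditioning set:
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + rng.normal(size=200)
print(symmetric_ci_test(x, y, np.empty((200, 0))))   # small p: dependent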
@article{Tsagris2018,
abstract = {We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs, respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.},
author = {Tsagris, Michail and Borboudakis, Giorgos and Lagani, Vincenzo and Tsamardinos, Ioannis},
journal = {International Journal of Data Science and Analytics},
keywords = {mxmcausalpath},
month = {aug},
number = 1,
pages = {19-30},
title = {Constraint-based causal discovery with mixed data},
volume = 6,
year = 2018
}%0 Journal Article
%1 Tsagris2018
%A Tsagris, Michail
%A Borboudakis, Giorgos
%A Lagani, Vincenzo
%A Tsamardinos, Ioannis
%D 2018
%J International Journal of Data Science and Analytics
%N 1
%P 19-30
%R 10.1007/s41060-018-0097-y
%T Constraint-based causal discovery with mixed data
%U https://doi.org/10.1007/s41060-018-0097-y
%V 6
%X We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs, respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.
- Tsagris, Michail, Vincenzo Lagani, and Ioannis Tsamardinos. “Feature Selection For High-Dimensional Temporal Data”. BMC Bioinformatics 19, no. 17: 1-14. doi:10.1186/s12859-018-2023-7. Feature selection is commonly employed for identifying collectively-predictive biomarkers and biosignatures; it facilitates the construction of small statistical models that are easier to verify, visualize, and comprehend while providing insight to the human expert. In this work, we extend established constraint-based, feature-selection methods to high-dimensional “omics” temporal data, where the number of measurements is orders of magnitude larger than the sample size. The extension required the development of conditional independence tests for temporal and/or static variables conditioned on a set of temporal variables. The algorithm is able to return multiple, equivalent solution subsets of variables, scale to tens of thousands of features, and outperform or be on par with existing methods depending on the analysis task specifics. The use of this algorithm is suggested for variable selection with high-dimensional temporal data.
@article{Tsagris2018a,
abstract = {Feature selection is commonly employed for identifying collectively-predictive biomarkers and biosignatures; it facilitates the construction of small statistical models that are easier to verify, visualize, and comprehend while providing insight to the human expert. In this work, we extend established constraint-based, feature-selection methods to high-dimensional “omics” temporal data, where the number of measurements is orders of magnitude larger than the sample size. The extension required the development of conditional independence tests for temporal and/or static variables conditioned on a set of temporal variables. The algorithm is able to return multiple, equivalent solution subsets of variables, scale to tens of thousands of features, and outperform or be on par with existing methods depending on the analysis task specifics. The use of this algorithm is suggested for variable selection with high-dimensional temporal data.},
author = {Tsagris, Michail and Lagani, Vincenzo and Tsamardinos, Ioannis},
journal = {BMC Bioinformatics},
keywords = {mxmcausalpath},
month = {jan},
number = 17,
pages = {1-14},
title = {Feature selection for high-dimensional temporal data},
volume = 19,
year = 2018
}%0 Journal Article
%1 Tsagris2018a
%A Tsagris, Michail
%A Lagani, Vincenzo
%A Tsamardinos, Ioannis
%D 2018
%J BMC Bioinformatics
%N 17
%P 1-14
%R 10.1186/s12859-018-2023-7
%T Feature selection for high-dimensional temporal data
%U https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2023-7
%V 19
%X Feature selection is commonly employed for identifying collectively-predictive biomarkers and biosignatures; it facilitates the construction of small statistical models that are easier to verify, visualize, and comprehend while providing insight to the human expert. In this work, we extend established constraint-based, feature-selection methods to high-dimensional “omics” temporal data, where the number of measurements is orders of magnitude larger than the sample size. The extension required the development of conditional independence tests for temporal and/or static variables conditioned on a set of temporal variables. The algorithm is able to return multiple, equivalent solution subsets of variables, scale to tens of thousands of features, and outperform or be on par with existing methods depending on the analysis task specifics. The use of this algorithm is suggested for variable selection with high-dimensional temporal data.
- Tsirlis, Konstantinos, Vincenzo Lagani, Sofia Triantafillou, and Ioannis Tsamardinos. “On Scoring Maximal Ancestral Graphs With The Max–Min Hill Climbing Algorithm”. International Journal Of Approximate Reasoning 102: 74-85. doi:10.1016/j.ijar.2018.08.002. We consider the problem of causal structure learning in the presence of latent confounders. We propose a hybrid method, MAG Max–Min Hill-Climbing (M3HC), that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. In a large-scale experimental evaluation we show that the proposed algorithm greatly improves on GSMAG in all comparisons, and over a set of known networks from the literature it compares positively against FCI and cFCI, as well as competitively against GFCI, three well-known constraint-based approaches for causal-network reconstruction in the presence of latent confounders.
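The hybrid recipe, a constraint-based restriction followed by greedy score-based search, can be sketched generically. For brevity the sketch below hill-climbs over DAGs with a Gaussian BIC score and only adds edges, whereas M3HC scores maximal ancestral graphs and a full hill-climber also deletes and reverses edges; the `allowed` edge set would come from a skeleton phase such as the PC sketch further down. All names are illustrative.

# Generic "restrict then search" skeleton behind hybrid methods like M3HC:
# a constraint-based pass prunes candidate edges, then greedy hill climbing
# adds the best-scoring admissible edge until no addition improves the score.
import numpy as np

def gauss_bic(X, node, parents):
    """BIC of the linear-Gaussian local model node ~ parents (higher is better)."""
    n = X.shape[0]
    A = np.column_stack([np.ones(n)] + [X[:, p] for p in parents])
    resid = X[:, node] - A @ np.linalg.lstsq(A, X[:, node], rcond=None)[0]
    rss = float(resid @ resid) + 1e-12
    return -0.5 * n * np.log(rss / n) - 0.5 * A.shape[1] * np.log(n)

def creates_cycle(parents, frm, to):
    """Adding frm -> to closes a cycle iff `to` is already an ancestor of frm."""
    stack, seen = [frm], set()
    while stack:
        v = stack.pop()
        if v == to:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return False

def hill_climb(X, allowed):
    """allowed: candidate directed edges (i, j) kept by the restrict phase."""
    d = X.shape[1]
    parents = {j: [] for j in range(d)}
    score = {j: gauss_bic(X, j, parents[j]) for j in range(d)}
    while True:
        best_gain, best_edge = 0.0, None
        for i, j in allowed:
            if i in parents[j] or creates_cycle(parents, i, j):
                continue
            gain = gauss_bic(X, j, parents[j] + [i]) - score[j]
            if gain > best_gain:
                best_gain, best_edge = gain, (i, j)
        if best_edge is None:
            return parents
        i, j = best_edge
        parents[j].append(i)
        score[j] += best_gain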
@article{Tsirlis_2018,
abstract = {We consider the problem of causal structure learning in the presence of latent confounders. We propose a hybrid method, MAG Max–Min Hill-Climbing (M3HC), that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. In a large-scale experimental evaluation we show that the proposed algorithm greatly improves on GSMAG in all comparisons, and over a set of known networks from the literature it compares positively against FCI and cFCI, as well as competitively against GFCI, three well-known constraint-based approaches for causal-network reconstruction in the presence of latent confounders.},
author = {Tsirlis, Konstantinos and Lagani, Vincenzo and Triantafillou, Sofia and Tsamardinos, Ioannis},
journal = {International Journal of Approximate Reasoning},
keywords = {mxmcausalpath},
month = {nov},
pages = {74-85},
publisher = {Elsevier BV},
title = {On scoring Maximal Ancestral Graphs with the Max–Min Hill Climbing algorithm},
volume = 102,
year = 2018
}%0 Journal Article
%1 Tsirlis_2018
%A Tsirlis, Konstantinos
%A Lagani, Vincenzo
%A Triantafillou, Sofia
%A Tsamardinos, Ioannis
%D 2018
%I Elsevier BV
%J International Journal of Approximate Reasoning
%P 74-85
%R 10.1016/j.ijar.2018.08.002
%T On scoring Maximal Ancestral Graphs with the Max–Min Hill Climbing algorithm
%U https://doi.org/10.1016/j.ijar.2018.08.002
%V 102
%X We consider the problem of causal structure learning in the presence of latent confounders. We propose a hybrid method, MAG Max–Min Hill-Climbing (M3HC), that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. In a large-scale experimental evaluation we show that the proposed algorithm greatly improves on GSMAG in all comparisons, and over a set of known networks from the literature it compares positively against FCI and cFCI, as well as competitively against GFCI, three well-known constraint-based approaches for causal-network reconstruction in the presence of latent confounders.
- Tsagris, Michail. “Bayesian Network Learning With The PC Algorithm: An Improved And Correct Variation”. Applied Artificial Intelligence 33, no. 2: 101-123. doi:10.1080/08839514.2018.1526760. PC is a prototypical constraint-based algorithm for learning Bayesian networks, a special case of directed acyclic graphs. An existing variant of it, in the R package pcalg, was developed to make the skeleton phase order independent. In return, it has notably increased execution time. In this paper, we clarify that the skeleton phase of PC is indeed order independent. The modification we propose outperforms pcalg’s variant of PC in terms of returning correct networks of better quality, as it is less prone to errors, and in some cases it is computationally much cheaper. In addition, we show that pcalg’s variant does not return valid acyclic graphs.
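Since the discussion above turns on the order independence of the skeleton phase, here is a minimal sketch of that phase in its order-independent ("stable") form: at each level the neighbour sets used for conditioning are frozen before any edge is removed, so the output does not depend on the order in which pairs are visited. The Fisher-z test and all names are illustrative, not the pcalg or proposed implementation.

# Sketch of the order-independent skeleton phase of PC ("PC-stable"),
# with a Fisher-z partial-correlation test for Gaussian data.
import itertools
import numpy as np
from scipy import stats

def fisher_z(X, i, j, S):
    n = X.shape[0]
    C = np.corrcoef(X[:, [i, j] + list(S)].T)
    P = np.linalg.pinv(C)                        # precision of the submatrix
    r = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])    # partial correlation
    z = np.arctanh(np.clip(r, -0.999999, 0.999999))
    return 2 * stats.norm.sf(abs(z) * np.sqrt(n - len(S) - 3))

def pc_stable_skeleton(X, alpha=0.05):
    d = X.shape[1]
    adj = {i: set(range(d)) - {i} for i in range(d)}
    level = 0
    while any(len(adj[i]) - 1 >= level for i in range(d)):
        frozen = {i: set(adj[i]) for i in range(d)}   # freeze neighbour sets
        for i in range(d):
            for j in list(adj[i]):
                if j not in adj[i]:                   # removed earlier this level
                    continue
                for S in itertools.combinations(frozen[i] - {j}, level):
                    if fisher_z(X, i, j, S) >= alpha: # independence accepted
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
        level += 1
    return adj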
- Tsagris, Michail. “Bayesian Network Learning With The PC Algorithm: An Improved And Correct Variation”. Applied Artificial Intelligence 33, no. 2: 101-123. doi:10.1080/08839514.2018.1526760. PC is a prototypical constraint-based algorithm for learning Bayesian networks, a special case of directed acyclic graphs. An existing variant of it, in the R package pcalg, was developed to make the skeleton phase order independent. In return, it has notably increased execution time. In this paper, we clarify that the skeleton phase of the PC algorithm is indeed order independent. The modification we propose outperforms pcalg’s variant of the PC algorithm in terms of returning correct networks of better quality, as it is less prone to errors, and in some cases it is computationally much cheaper. In addition, we show that pcalg’s variant does not return valid acyclic graphs.
@article{michail2018bayesian,
abstract = {PC is a prototypical constraint-based algorithm for learning Bayesian networks, a special case of directed acyclic graphs. An existing variant of it, in the R package pcalg, was developed to make the skeleton phase order independent. In return, it has notably increased execution time. In this paper, we clarify that the skeleton phase of the PC algorithm is indeed order independent. The modification we propose outperforms pcalg’s variant of the PC algorithm in terms of returning correct networks of better quality, as it is less prone to errors, and in some cases it is computationally much cheaper. In addition, we show that pcalg’s variant does not return valid acyclic graphs.},
author = {Tsagris, Michail},
journal = {Applied Artificial Intelligence},
keywords = {mxmcausalpath},
number = 2,
pages = {101-123},
title = {Bayesian Network Learning with the PC Algorithm: An Improved and Correct Variation},
volume = 33,
year = 2018
}%0 Journal Article
%1 michail2018bayesian
%A Tsagris, Michail
%D 2018
%J Applied Artificial Intelligence
%N 2
%P 101-123
%R 10.1080/08839514.2018.1526760
%T Bayesian Network Learning with the PC Algorithm: An Improved and Correct Variation
%U https://www.researchgate.net/profile/Michail_Tsagris/publication/327884019_Bayesian_Network_Learning_with_the_PC_Algorithm_An_Improved_and_Correct_Variation/links/5bab44c945851574f7e65688/Bayesian-Network-Learning-with-the-PC-Algorithm-An-Improved-and-Correct-Variation.pdf
%V 33
%X PC is a prototypical constraint-based algorithm for learning Bayesian networks, a special case of directed acyclic graphs. An existing variant of it, in the R package pcalg, was developed to make the skeleton phase order independent. In return, it has notably increased execution time. In this paper, we clarify that the skeleton phase of the PC algorithm is indeed order independent. The modification we propose outperforms pcalg’s variant of the PC algorithm in terms of returning correct networks of better quality, as it is less prone to errors, and in some cases it is computationally much cheaper. In addition, we show that pcalg’s variant does not return valid acyclic graphs.
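The order-independence discussion becomes concrete in the skeleton phase itself. Below is a minimal sketch of a level-wise ("stable") skeleton search: the neighbor sets used to form conditioning subsets are frozen at the start of each level, so edge removals within a level cannot affect one another. It is a generic illustration with an injected `ci_test(data, i, j, S)` returning a p-value (e.g., the Fisher-z test sketched above); it is neither the paper's nor pcalg's code.

from itertools import combinations

def pc_skeleton(data, ci_test, alpha=0.05):
    # Start from the complete undirected graph over the p variables.
    p = data.shape[1]
    adj = {i: set(range(p)) - {i} for i in range(p)}
    level = 0
    while any(len(adj[i]) - 1 >= level for i in range(p)):
        # Snapshot of the adjacency at the start of the level: all
        # conditioning sets of this level are drawn from it, which is
        # what makes the removals within a level order independent.
        frozen = {i: set(adj[i]) for i in range(p)}
        for i in range(p):
            for j in sorted(frozen[i]):
                if j not in adj[i]:
                    continue  # edge already removed within this level
                for S in combinations(sorted(frozen[i] - {j}), level):
                    if ci_test(data, i, j, S) > alpha:
                        adj[i].discard(j)
                        adj[j].discard(i)
                        break
        level += 1
    return adj

With the earlier test, pc_skeleton(X, fisher_z_pvalue) recovers the undirected skeleton; orienting the remaining edges is a separate phase.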
- Tsamardinos, Ioannis, Elissavet Greasidou, and Giorgos Borboudakis. “Bootstrapping The Out-Of-Sample Predictions For Efficient And Accurate Cross-Validation”. Machine Learning 107, no. 12: 1895-1922. doi:10.1007/s10994-018-5714-4. Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822-829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we again employ the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training models on new folds for configurations that are, with high probability, inferior. We name the resulting method Bootstrap Bias Corrected with Dropping CV (BBCD-CV); it is efficient and provides accurate performance estimates.
@article{Tsamardinos2018,
abstract = {Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822-829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we again employ the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training models on new folds for configurations that are, with high probability, inferior. We name the resulting method Bootstrap Bias Corrected with Dropping CV (BBCD-CV); it is efficient and provides accurate performance estimates.},
author = {Tsamardinos, Ioannis and Greasidou, Elissavet and Borboudakis, Giorgos},
journal = {Machine Learning},
keywords = {mxmcausalpath},
month = {dec},
number = 12,
pages = {1895--1922},
title = {Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation},
volume = 107,
year = 2018
}%0 Journal Article
%1 Tsamardinos2018
%A Tsamardinos, Ioannis
%A Greasidou, Elissavet
%A Borboudakis, Giorgos
%D 2018
%J Machine Learning
%N 12
%P 1895--1922
%R 10.1007/s10994-018-5714-4
%T Bootstrapping the out-of-sample predictions for efficient and accurate cross-validation
%U https://doi.org/10.1007/s10994-018-5714-4
%V 107
%X Cross-Validation (CV), and out-of-sample performance-estimation protocols in general, are often employed both for (a) selecting the optimal combination of algorithms and values of hyper-parameters (called a configuration) for producing the final predictive model, and (b) estimating the predictive performance of the final model. However, the cross-validated performance of the best configuration is optimistically biased. We present an efficient bootstrap method that corrects for the bias, called Bootstrap Bias Corrected CV (BBC-CV). BBC-CV's main idea is to bootstrap the whole process of selecting the best-performing configuration on the out-of-sample predictions of each configuration, without additional training of models. In comparison to the alternatives, namely the nested cross-validation (Varma and Simon in BMC Bioinform 7(1):91, 2006) and a method by Tibshirani and Tibshirani (Ann Appl Stat 822-829, 2009), BBC-CV is computationally more efficient, has smaller variance and bias, and is applicable to any metric of performance (accuracy, AUC, concordance index, mean squared error). Subsequently, we again employ the idea of bootstrapping the out-of-sample predictions to speed up the CV process. Specifically, using a bootstrap-based statistical criterion we stop training models on new folds for configurations that are, with high probability, inferior. We name the resulting method Bootstrap Bias Corrected with Dropping CV (BBCD-CV); it is efficient and provides accurate performance estimates.
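The core of BBC-CV fits in a few lines: pool every configuration's out-of-sample predictions across folds, then bootstrap the selection of the winning configuration and score it on the rows each bootstrap leaves out. The sketch below does this for accuracy; the array layout and the metric are our illustrative choices, not the paper's code.

import numpy as np

def bbc_cv_estimate(oos_pred, y, n_boot=1000, seed=0):
    # oos_pred: (n_samples, n_configs) pooled out-of-sample predictions,
    # one column per configuration; y: (n_samples,) true labels.
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = []
    for _ in range(n_boot):
        boot = rng.integers(0, n, size=n)           # bootstrap row indices
        oob = np.setdiff1d(np.arange(n), boot)      # rows left out
        if oob.size == 0:
            continue
        # Select the best configuration on the bootstrap sample...
        best = int(np.argmax((oos_pred[boot] == y[boot, None]).mean(axis=0)))
        # ...and score that configuration on the held-out rows only.
        estimates.append((oos_pred[oob, best] == y[oob]).mean())
    # Averaging removes the optimism of "pick the best, report its score".
    return float(np.mean(estimates))

No model is retrained at any point, which is where the efficiency gain over nested cross-validation comes from.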
- Tsamardinos, Ioannis, Giorgos Borboudakis, Pavlos Katsogridakis, Polyvios Pratikakis, and Vassilis Christophides. “A Greedy Feature Selection Algorithm For Big Data Of High Dimensionality”. Machine Learning 108, no. 2: 149-202. doi:10.1007/s10994-018-5748-7. We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix in terms of both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed in other greedy-type FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
@article{Tsamardinos2018a,
abstract = {We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix in terms of both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed in other greedy-type FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.},
author = {Tsamardinos, Ioannis and Borboudakis, Giorgos and Katsogridakis, Pavlos and Pratikakis, Polyvios and Christophides, Vassilis},
journal = {Machine Learning},
keywords = {mxmcausalpath},
month = {August},
number = 2,
pages = {149-202},
title = {A greedy feature selection algorithm for Big Data of high dimensionality},
volume = 108,
year = 2018
}%0 Journal Article
%1 Tsamardinos2018a
%A Tsamardinos, Ioannis
%A Borboudakis, Giorgos
%A Katsogridakis, Pavlos
%A Pratikakis, Polyvios
%A Christophides, Vassilis
%D 2018
%J Machine Learning
%N 2
%P 149-202
%R 10.1007/s10994-018-5748-7
%T A greedy feature selection algorithm for Big Data of high dimensionality
%U https://doi.org/10.1007/s10994-018-5748-7
%V 108
%X We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for feature selection (FS) for Big Data of high dimensionality. PFBP partitions the data matrix in terms of both rows and columns. By employing the concepts of p-values of conditional independence tests and meta-analysis techniques, PFBP relies only on computations local to a partition while minimizing communication costs, thus massively parallelizing computations. Similar techniques for combining local computations are also employed to create the final predictive model. PFBP employs asymptotically sound heuristics to make early, approximate decisions, such as Early Dropping of features from consideration in subsequent iterations, Early Stopping of consideration of features within the same iteration, or Early Return of the winner in each iteration. PFBP provides asymptotic guarantees of optimality for data distributions faithfully representable by a causal network (Bayesian network or maximal ancestral graph). Empirical analysis confirms a super-linear speedup of the algorithm with increasing sample size and linear scalability with respect to the number of features and processing cores. An extensive comparative evaluation also demonstrates the effectiveness of PFBP against other algorithms in its class. The heuristics presented are general and could potentially be employed in other greedy-type FS algorithms. An application on simulated Single Nucleotide Polymorphism (SNP) data with 500K samples is provided as a use case.
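PFBP's communication pattern rests on running each conditional-independence test locally on every row partition and shipping only p-values; a meta-analysis rule then yields one global p-value per feature. The sketch below shows one such rule, Fisher's combined probability test; the partitioning itself and the Early Dropping/Stopping/Return heuristics are omitted, and the function name is ours.

import numpy as np
from scipy.stats import chi2

def combine_partition_pvalues(local_pvalues):
    # Fisher's method: under the null, -2 * sum(log p_k) follows a
    # chi-squared distribution with 2k degrees of freedom, so p-values
    # from k disjoint row partitions collapse into one global p-value
    # without moving any raw data between workers.
    p = np.clip(np.asarray(local_pvalues, dtype=float), 1e-300, 1.0)
    stat = -2.0 * np.log(p).sum()
    return float(chi2.sf(stat, df=2 * len(p)))

# e.g., partition-level p-values 0.04, 0.01, 0.20 combine to roughly 0.004,
# a single decision-ready p-value for the feature.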
- Lakiotaki, Kleanthi, Nikolaos Vorniotakis, Michail Tsagris, Georgios Georgakopoulos, and Ioannis Tsamardinos. “BioDataome: A Collection Of Uniformly Preprocessed And Automatically Annotated Datasets For Data-Driven Biology”. Database 2018, no. bay011: 1-14. doi:10.1093/database/bay011. The biotechnology revolution generates a plethora of omics data at an exponential pace. Therefore, biological data mining demands automatic, ‘high quality’ curation efforts to organize biomedical knowledge into online databases. BioDataome is a database of uniformly preprocessed and disease-annotated omics data with the aim to promote and accelerate the reuse of public data. We followed the same preprocessing pipeline for each biological mart (microarray gene expression, RNA-Seq gene expression and DNA methylation) to produce datasets ready for downstream analysis, and automatically annotated them with disease-ontology terms. We also designate datasets that share common samples and automatically discover control samples in case-control studies. Currently, BioDataome includes ∼5600 datasets and ∼260 000 samples spanning ∼500 diseases, and can be easily used in large-scale experiments and meta-analyses. All datasets are publicly available for querying and downloading via the BioDataome web application. We demonstrate BioDataome’s utility by presenting exploratory data analysis examples. We have also developed the BioDataome R package, found at https://github.com/mensxmachina/BioDataome/. Database URL: http://dataome.mensxmachina.org/
@article{Lakiotaki2018,
abstract = {The biotechnology revolution generates a plethora of omics data at an exponential pace. Therefore, biological data mining demands automatic, ‘high quality’ curation efforts to organize biomedical knowledge into online databases. BioDataome is a database of uniformly preprocessed and disease-annotated omics data with the aim to promote and accelerate the reuse of public data. We followed the same preprocessing pipeline for each biological mart (microarray gene expression, RNA-Seq gene expression and DNA methylation) to produce datasets ready for downstream analysis, and automatically annotated them with disease-ontology terms. We also designate datasets that share common samples and automatically discover control samples in case-control studies. Currently, BioDataome includes ∼5600 datasets and ∼260 000 samples spanning ∼500 diseases, and can be easily used in large-scale experiments and meta-analyses. All datasets are publicly available for querying and downloading via the BioDataome web application. We demonstrate BioDataome’s utility by presenting exploratory data analysis examples. We have also developed the BioDataome R package, found at https://github.com/mensxmachina/BioDataome/. Database URL: http://dataome.mensxmachina.org/},
author = {Lakiotaki, Kleanthi and Vorniotakis, Nikolaos and Tsagris, Michail and Georgakopoulos, Georgios and Tsamardinos, Ioannis},
journal = {Database},
keywords = {mxmcausalpath},
month = {March},
number = {bay011},
pages = {1-14},
title = {BioDataome: a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology},
volume = 2018,
year = 2018
}%0 Journal Article
%1 Lakiotaki2018
%A Lakiotaki, Kleanthi
%A Vorniotakis, Nikolaos
%A Tsagris, Michail
%A Georgakopoulos, Georgios
%A Tsamardinos, Ioannis
%D 2018
%J Database
%N bay011
%P 1-14
%R 10.1093/database/bay011
%T BioDataome: a collection of uniformly preprocessed and automatically annotated datasets for data-driven biology
%V 2018
%X The biotechnology revolution generates a plethora of omics data at an exponential pace. Therefore, biological data mining demands automatic, ‘high quality’ curation efforts to organize biomedical knowledge into online databases. BioDataome is a database of uniformly preprocessed and disease-annotated omics data with the aim to promote and accelerate the reuse of public data. We followed the same preprocessing pipeline for each biological mart (microarray gene expression, RNA-Seq gene expression and DNA methylation) to produce datasets ready for downstream analysis, and automatically annotated them with disease-ontology terms. We also designate datasets that share common samples and automatically discover control samples in case-control studies. Currently, BioDataome includes ∼5600 datasets and ∼260 000 samples spanning ∼500 diseases, and can be easily used in large-scale experiments and meta-analyses. All datasets are publicly available for querying and downloading via the BioDataome web application. We demonstrate BioDataome’s utility by presenting exploratory data analysis examples. We have also developed the BioDataome R package, found at https://github.com/mensxmachina/BioDataome/. Database URL: http://dataome.mensxmachina.org/
2017
- Papoutsoglou, Georgios, Giorgos Athineou, Vincenzo Lagani, Iordanis Xanthopoulos, Angelika Schmidt, Szabolcs Éliás, Jesper Tegnér, and Ioannis Tsamardinos. “SCENERY: A Web Application For (Causal) Network Reconstruction From Cytometry Data”. Nucleic Acids Research 45: W270-W275. doi:10.1093/nar/gkx448. Flow and mass cytometry technologies can probe proteins as biological markers in thousands of individual cells simultaneously, providing unprecedented opportunities for reconstructing networks of protein interactions through machine learning algorithms. The network reconstruction (NR) problem has been well-studied by the machine learning community. However, the potential of available methods remains largely unknown to the cytometry community, mainly due to their intrinsic complexity and the lack of comprehensive, powerful and easy-to-use NR software implementations specific for cytometry data. To bridge this gap, we present the Single CEll NEtwork Reconstruction sYstem (SCENERY), a web server featuring several standard and advanced cytometry data analysis methods coupled with NR algorithms in a user-friendly, on-line environment. In SCENERY, users may upload their data and set their own study design. The server offers several data analysis options categorized into three classes of methods: data (pre)processing, statistical analysis and NR. The server also provides interactive visualization and download of results as ready-to-publish images or multimedia reports. Its core is modular and based on the widely-used and robust R platform, allowing power users to extend its functionalities by submitting their own NR methods. SCENERY is available at scenery.csd.uoc.gr or http://mensxmachina.org/en/software/.
@article{Papoutsoglou2017,
abstract = {Flow and mass cytometry technologies can probe proteins as biological markers in thousands of individual cells simultaneously, providing unprecedented opportunities for reconstructing networks of protein interactions through machine learning algorithms. The network reconstruction (NR) problem has been well-studied by the machine learning community. However, the potential of available methods remains largely unknown to the cytometry community, mainly due to their intrinsic complexity and the lack of comprehensive, powerful and easy-to-use NR software implementations specific for cytometry data. To bridge this gap, we present the Single CEll NEtwork Reconstruction sYstem (SCENERY), a web server featuring several standard and advanced cytometry data analysis methods coupled with NR algorithms in a user-friendly, on-line environment. In SCENERY, users may upload their data and set their own study design. The server offers several data analysis options categorized into three classes of methods: data (pre)processing, statistical analysis and NR. The server also provides interactive visualization and download of results as ready-to-publish images or multimedia reports. Its core is modular and based on the widely-used and robust R platform, allowing power users to extend its functionalities by submitting their own NR methods. SCENERY is available at scenery.csd.uoc.gr or http://mensxmachina.org/en/software/.},
author = {Papoutsoglou, Georgios and Athineou, Giorgos and Lagani, Vincenzo and Xanthopoulos, Iordanis and Schmidt, Angelika and Éliás, Szabolcs and Tegnér, Jesper and Tsamardinos, Ioannis},
journal = {Nucleic Acids Research},
keywords = {mxmcausalpath},
month = {July},
pages = {W270-W275},
title = {SCENERY: a web application for (causal) network reconstruction from cytometry data},
volume = 45,
year = 2017
}%0 Journal Article
%1 Papoutsoglou2017
%A Papoutsoglou, Georgios
%A Athineou, Giorgos
%A Lagani, Vincenzo
%A Xanthopoulos, Iordanis
%A Schmidt, Angelika
%A Éliás, Szabolcs
%A Tegnér, Jesper
%A Tsamardinos, Ioannis
%D 2017
%J Nucleic Acids Research
%P W270-W275
%R 10.1093/nar/gkx448
%T SCENERY: a web application for (causal) network reconstruction from cytometry data
%U https://doi.org/10.1093/nar/gkx448
%V 45
%X Flow and mass cytometry technologies can probe proteins as biological markers in thousands of individual cells simultaneously, providing unprecedented opportunities for reconstructing networks of protein interactions through machine learning algorithms. The network reconstruction (NR) problem has been well-studied by the machine learning community. However, the potential of available methods remains largely unknown to the cytometry community, mainly due to their intrinsic complexity and the lack of comprehensive, powerful and easy-to-use NR software implementations specific for cytometry data. To bridge this gap, we present the Single CEll NEtwork Reconstruction sYstem (SCENERY), a web server featuring several standard and advanced cytometry data analysis methods coupled with NR algorithms in a user-friendly, on-line environment. In SCENERY, users may upload their data and set their own study design. The server offers several data analysis options categorized into three classes of methods: data (pre)processing, statistical analysis and NR. The server also provides interactive visualization and download of results as ready-to-publish images or multimedia reports. Its core is modular and based on the widely-used and robust R platform, allowing power users to extend its functionalities by submitting their own NR methods. SCENERY is available at scenery.csd.uoc.gr or http://mensxmachina.org/en/software/.
- Triantafillou, Sofia, Vincenzo Lagani, Christina Heinze-Deml, Angelika Schmidt, Jesper Tegner, and Ioannis Tsamardinos. “Predicting Causal Relationships From Biological Data: Applying Automated Causal Discovery On Mass Cytometry Data Of Human Immune Cells”. Scientific Reports 7, no. 12724. doi:10.1038/s41598-017-08582-x. Learning the causal relationships that define a molecular system allows us to predict how the system will respond to different interventions. Distinguishing causality from mere association typically requires randomized experiments. Methods for automated causal discovery from limited experiments exist, but have so far rarely been tested in systems biology applications. In this work, we apply state-of-the-art causal discovery methods on a large collection of public mass cytometry data sets, measuring intra-cellular signaling proteins of the human immune system and their response to several perturbations. We show how different experimental conditions can be used to facilitate causal discovery, and apply two fundamental methods that produce context-specific causal predictions. Causal predictions were reproducible across independent data sets from two different studies, but often disagree with the KEGG pathway databases. Within this context, we discuss the caveats we need to overcome for automated causal discovery to become a part of the routine data analysis in systems biology.
@article{Triantafillou2017,
abstract = {Learning the causal relationships that define a molecular system allows us to predict how the system will respond to different interventions. Distinguishing causality from mere association typically requires randomized experiments. Methods for automated causal discovery from limited experiments exist, but have so far rarely been tested in systems biology applications. In this work, we apply state-of-the-art causal discovery methods on a large collection of public mass cytometry data sets, measuring intra-cellular signaling proteins of the human immune system and their response to several perturbations. We show how different experimental conditions can be used to facilitate causal discovery, and apply two fundamental methods that produce context-specific causal predictions. Causal predictions were reproducible across independent data sets from two different studies, but often disagree with the KEGG pathway databases. Within this context, we discuss the caveats we need to overcome for automated causal discovery to become a part of the routine data analysis in systems biology.},
author = {Triantafillou, Sofia and Lagani, Vincenzo and Heinze-Deml, Christina and Schmidt, Angelika and Tegner, Jesper and Tsamardinos, Ioannis},
journal = {Scientific Reports},
keywords = {mxmcausalpath},
month = {October},
number = 12724,
title = {Predicting Causal Relationships from Biological Data: Applying Automated Causal Discovery on Mass Cytometry Data of Human Immune Cells},
volume = 7,
year = 2017
}%0 Journal Article
%1 Triantafillou2017
%A Triantafillou, Sofia
%A Lagani, Vincenzo
%A Heinze-Deml, Christina
%A Schmidt, Angelika
%A Tegner, Jesper
%A Tsamardinos, Ioannis
%D 2017
%J Scientific Reports
%N 12724
%R 10.1038/s41598-017-08582-x
%T Predicting Causal Relationships from Biological Data: Applying Automated Causal Discovery on Mass Cytometry Data of Human Immune Cells
%U https://www.nature.com/articles/s41598-017-08582-x
%V 7
%X Learning the causal relationships that define a molecular system allows us to predict how the system will respond to different interventions. Distinguishing causality from mere association typically requires randomized experiments. Methods for automated causal discovery from limited experiments exist, but have so far rarely been tested in systems biology applications. In this work, we apply state-of-the-art causal discovery methods on a large collection of public mass cytometry data sets, measuring intra-cellular signaling proteins of the human immune system and their response to several perturbations. We show how different experimental conditions can be used to facilitate causal discovery, and apply two fundamental methods that produce context-specific causal predictions. Causal predictions were reproducible across independent data sets from two different studies, but often disagree with the KEGG pathway databases. Within this context, we discuss the caveats we need to overcome for automated causal discovery to become a part of the routine data analysis in systems biology.
- Orfanoudaki, Georgia, Maria Markaki, Katerina Chatzi, Ioannis Tsamardinos, and Anastassios Economou. “MatureP: Prediction Of Secreted Proteins With Exclusive Information From Their Mature Regions”. Scientific Reports 7, no. 1: 3263. doi:10.1038/s41598-017-03557-4. More than a third of the cellular proteome is non-cytoplasmic. Most secretory proteins use the Sec system for export and are targeted to membranes using signal peptides and mature domains. To specifically analyze bacterial mature domain features, we developed MatureP, a classifier that predicts secretory sequences through features exclusively computed from their mature domains. MatureP was trained using Just Add Data Bio, an automated machine learning tool. Mature domains are predicted efficiently with ~92% success, as measured by the Area Under the Receiver Operating Characteristic Curve (AUC). Predictions were validated using experimental datasets of mutated secretory proteins. The features selected by MatureP reveal prominent differences in amino acid content between secreted and cytoplasmic proteins. Amino-terminal mature domain sequences have enhanced disorder, more hydroxyl and polar residues and less hydrophobics. Cytoplasmic proteins have prominent amino-terminal hydrophobic stretches and charged regions downstream. Presumably, secretory mature domains comprise a distinct protein class. They balance properties that promote the necessary flexibility required for the maintenance of non-folded states during targeting and secretion with the ability of post-secretion folding. These findings provide novel insight into protein trafficking, sorting and folding mechanisms and may benefit protein secretion biotechnology.
@article{orfanoudaki2017maturep,
abstract = {More than a third of the cellular proteome is non-cytoplasmic. Most secretory proteins use the Sec system for export and are targeted to membranes using signal peptides and mature domains. To specifically analyze bacterial mature domain features, we developed MatureP, a classifier that predicts secretory sequences through features exclusively computed from their mature domains. MatureP was trained using Just Add Data Bio, an automated machine learning tool. Mature domains are predicted efficiently with ~92% success, as measured by the Area Under the Receiver Operating Characteristic Curve (AUC). Predictions were validated using experimental datasets of mutated secretory proteins. The features selected by MatureP reveal prominent differences in amino acid content between secreted and cytoplasmic proteins. Amino-terminal mature domain sequences have enhanced disorder, more hydroxyl and polar residues and less hydrophobics. Cytoplasmic proteins have prominent amino-terminal hydrophobic stretches and charged regions downstream. Presumably, secretory mature domains comprise a distinct protein class. They balance properties that promote the necessary flexibility required for the maintenance of non-folded states during targeting and secretion with the ability of post-secretion folding. These findings provide novel insight into protein trafficking, sorting and folding mechanisms and may benefit protein secretion biotechnology.},
author = {Orfanoudaki, Georgia and Markaki, Maria and Chatzi, Katerina and Tsamardinos, Ioannis and Economou, Anastassios},
journal = {Scientific Reports},
keywords = {mxmcausalpath},
month = {June},
number = 1,
pages = 3263,
title = {MatureP: prediction of secreted proteins with exclusive information from their mature regions},
volume = 7,
year = 2017
}%0 Journal Article
%1 orfanoudaki2017maturep
%A Orfanoudaki, Georgia
%A Markaki, Maria
%A Chatzi, Katerina
%A Tsamardinos, Ioannis
%A Economou, Anastassios
%D 2017
%J Scientific Reports
%N 1
%P 3263
%R 10.1038/s41598-017-03557-4
%T MatureP: prediction of secreted proteins with exclusive information from their mature regions
%U https://doi.org/10.1038/s41598-017-03557-4
%V 7
%X More than a third of the cellular proteome is non-cytoplasmic. Most secretory proteins use the Sec system for export and are targeted to membranes using signal peptides and mature domains. To specifically analyze bacterial mature domain features, we developed MatureP, a classifier that predicts secretory sequences through features exclusively computed from their mature domains. MatureP was trained using Just Add Data Bio, an automated machine learning tool. Mature domains are predicted efficiently with ~92% success, as measured by the Area Under the Receiver Operating Characteristic Curve (AUC). Predictions were validated using experimental datasets of mutated secretory proteins. The features selected by MatureP reveal prominent differences in amino acid content between secreted and cytoplasmic proteins. Amino-terminal mature domain sequences have enhanced disorder, more hydroxyl and polar residues and less hydrophobics. Cytoplasmic proteins have prominent amino-terminal hydrophobic stretches and charged regions downstream. Presumably, secretory mature domains comprise a distinct protein class. They balance properties that promote the necessary flexibility required for the maintenance of non-folded states during targeting and secretion with the ability of post-secretion folding. These findings provide novel insight into protein trafficking, sorting and folding mechanisms and may benefit protein secretion biotechnology.
- Lagani, Vincenzo, Giorgos Athineou, Alessio Farcomeni, Michail Tsagris, and Ioannis Tsamardinos. “Feature Selection With The R Package MXM: Discovering Statistically Equivalent Feature Subsets”. Journal Of Statistical Software 80, no. 7. doi:10.18637/jss.v080.i07. The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most of the currently available feature selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect the SES algorithm subsumes and extends previous feature selection algorithms, like the max-min parent children algorithm. The SES algorithm is implemented in a homonymous function included in the R package MXM, standing for mens ex machina, meaning 'mind from the machine' in Latin. The MXM implementation of SES handles several data analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm and its implementation, and provide examples of use of the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real world data.
@article{Lagani_2017,
abstract = {The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most of the currently available feature selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect the SES algorithm subsumes and extends previous feature selection algorithms, like the max-min parent children algorithm. The SES algorithm is implemented in a homonymous function included in the R package MXM, standing for mens ex machina, meaning 'mind from the machine' in Latin. The MXM implementation of SES handles several data analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm and its implementation, and provide examples of use of the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real world data.},
author = {Lagani, Vincenzo and Athineou, Giorgos and Farcomeni, Alessio and Tsagris, Michail and Tsamardinos, Ioannis},
journal = {Journal of Statistical Software},
keywords = {mxmcausalpath},
number = 7,
publisher = {Foundation for Open Access Statistic},
title = {Feature Selection with the R Package MXM: Discovering Statistically Equivalent Feature Subsets},
volume = 80,
year = 2017
}%0 Journal Article
%1 Lagani_2017
%A Lagani, Vincenzo
%A Athineou, Giorgos
%A Farcomeni, Alessio
%A Tsagris, Michail
%A Tsamardinos, Ioannis
%D 2017
%I Foundation for Open Access Statistic
%J Journal of Statistical Software
%N 7
%R 10.18637/jss.v080.i07
%T Feature Selection with the R Package MXM: Discovering Statistically Equivalent Feature Subsets
%U https://doi.org/10.18637/jss.v080.i07
%V 80
%X The statistically equivalent signature (SES) algorithm is a method for feature selection inspired by the principles of constraint-based learning of Bayesian networks. Most of the currently available feature selection methods return only a single subset of features, supposedly the one with the highest predictive power. We argue that in several domains multiple subsets can achieve close to maximal predictive accuracy, and that arbitrarily providing only one has several drawbacks. The SES method attempts to identify multiple, predictive feature subsets whose performances are statistically equivalent. In that respect the SES algorithm subsumes and extends previous feature selection algorithms, like the max-min parent children algorithm. The SES algorithm is implemented in a homonymous function included in the R package MXM, standing for mens ex machina, meaning 'mind from the machine' in Latin. The MXM implementation of SES handles several data analysis tasks, namely classification, regression and survival analysis. In this paper we present the SES algorithm and its implementation, and provide examples of use of the SES function in R. Furthermore, we analyze three publicly available data sets to illustrate the equivalence of the signatures retrieved by SES and to contrast SES against the state-of-the-art feature selection method LASSO. Our results provide initial evidence that the two methods perform comparably well in terms of predictive accuracy and that multiple, equally predictive signatures are actually present in real world data.
- Tsirlis, K, V Lagani, S Triantafillou, and I Tsamardinos. “On Scoring Maximal Ancestral Graphs With The Max-Min Hill Climbing Algorithm”. In 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD), 2017. http://nugget.unisa.edu.au/CD2017/papersonly/maxmin-r0.pdf. We consider the problem of causal structure learning in the presence of latent confounders. We propose a hybrid method, MAG Max-Min Hill-Climbing (M3HC), that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best-fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. We show on simulated data that the proposed algorithm greatly improves on GSMAG, and compares positively against FCI and cFCI, two well-known constraint-based approaches for causal-network reconstruction in the presence of latent confounders.
@conference{tsirlis2017scoring,
abstract = {We consider the problem of causal structure learning in the presence of latent confounders. We propose a hybrid method, MAG Max-Min Hill-Climbing (M3HC), that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best-fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. We show on simulated data that the proposed algorithm greatly improves on GSMAG, and compares positively against FCI and cFCI, two well-known constraint-based approaches for causal-network reconstruction in the presence of latent confounders.},
author = {Tsirlis, K and Lagani, V and Triantafillou, S and Tsamardinos, I},
keywords = {mxmcausalpath},
publisher = {23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD)},
title = {On Scoring Maximal Ancestral Graphs with the Max-Min Hill Climbing Algorithm},
year = 2017
}%0 Generic
%1 tsirlis2017scoring
%A Tsirlis, K
%A Lagani, V
%A Triantafillou, S
%A Tsamardinos, I
%D 2017
%I 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD)
%T On Scoring Maximal Ancestral Graphs with the Max-Min Hill Climbing Algorithm
%U http://nugget.unisa.edu.au/CD2017/papersonly/maxmin-r0.pdf
%X We consider the problem of causal structure learning in the presence of latent confounders. We propose a hybrid method, MAG Max-Min Hill-Climbing (M3HC), that takes as input a data set of continuous variables, assumed to follow a multivariate Gaussian distribution, and outputs the best-fitting maximal ancestral graph. M3HC builds upon a previously proposed method, namely GSMAG, by introducing a constraint-based first phase that greatly reduces the space of structures to investigate. We show on simulated data that the proposed algorithm greatly improves on GSMAG, and compares positively against FCI and cFCI, two well-known constraint-based approaches for causal-network reconstruction in the presence of latent confounders.
- Tsagris, M, G Borboudakis, V Lagani, and I Tsamardinos. “Constraint-Based Causal Discovery With Mixed Data”. In 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD), 2017. http://nugget.unisa.edu.au/CD2017/papersonly/constraint-based-causal-r1.pdf. We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.
@conference{noauthororeditor2017constraintbased,
abstract = {We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.},
author = {Tsagris, M and Borboudakis, G and Lagani, V and Tsamardinos, I},
keywords = {mxmcausalpath},
publisher = {23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD)},
title = {Constraint-based Causal Discovery with Mixed Data},
year = 2017
}%0 Generic
%1 noauthororeditor2017constraintbased
%A Tsagris, M
%A Borboudakis, G
%A Lagani, V
%A Tsamardinos, I
%D 2017
%I 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Workshop on Causal Discovery (KDD)
%T Constraint-based Causal Discovery with Mixed Data
%U http://nugget.unisa.edu.au/CD2017/papersonly/constraint-based-causal-r1.pdf
%X We address the problem of constraint-based causal discovery with mixed data types, such as (but not limited to) continuous, binary, multinomial and ordinal variables. We use likelihood-ratio tests based on appropriate regression models, and show how to derive symmetric conditional independence tests. Such tests can then be directly used by existing constraint-based methods with mixed data, such as the PC and FCI algorithms for learning Bayesian networks and maximal ancestral graphs respectively. In experiments on simulated Bayesian networks, we employ the PC algorithm with different conditional independence tests for mixed data, and show that the proposed approach outperforms alternatives in terms of learning accuracy.
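To make the likelihood-ratio construction concrete for the continuous case, the sketch below tests x ⊥ y | Z by comparing nested linear regressions in both directions and keeping the more conservative p-value. Taking the maximum is our simplification of a symmetric test; other variable types would swap in the appropriate regression model (logistic, multinomial, ordinal), as the paper describes.

import numpy as np
from scipy.stats import chi2

def _rss(y, X):
    # Residual sum of squares of y regressed on an intercept plus X.
    X1 = np.column_stack([np.ones(len(y)), X]) if X.size else np.ones((len(y), 1))
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return float(resid @ resid)

def lr_pvalue(y, x, Z):
    # Gaussian likelihood-ratio test for adding x to the regression of y on Z.
    full = np.column_stack([Z, x]) if Z.size else x.reshape(-1, 1)
    stat = len(y) * np.log(_rss(y, Z) / _rss(y, full))
    return chi2.sf(stat, df=1)

def symmetric_ci_pvalue(x, y, Z):
    # Run the test in both directions so the decision does not depend on
    # which variable is treated as the outcome; Z = np.empty((n, 0)) gives
    # the unconditional test.
    return max(lr_pvalue(y, x, Z), lr_pvalue(x, y, Z))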
2016
- Charonyktakis, Paulos, Maria Plakia, Ioannis Tsamardinos, and Maria Papadopouli. “On User-Centric Modular QoE Prediction For VoIP Based On Machine-Learning Algorithms”. IEEE Transactions On Mobile Computing. doi:10.1109/TMC.2015.2461216.
@article{Charonyktakis2016,
author = {Charonyktakis, Paulos and Plakia, Maria and Tsamardinos, Ioannis and Papadopouli, Maria},
journal = {IEEE Transactions on Mobile Computing},
keywords = {mxmcausalpath},
title = {On user-centric modular QoE prediction for VoIP based on machine-learning algorithms},
year = 2016
}%0 Journal Article
%1 Charonyktakis2016
%A Charonyktakis, Paulos
%A Plakia, Maria
%A Tsamardinos, Ioannis
%A Papadopouli, Maria
%D 2016
%J IEEE Transactions on Mobile Computing
%R 10.1109/TMC.2015.2461216
%T On user-centric modular QoE prediction for VoIP based on machine-learning algorithms
- Goveia, Jermaine, Andreas Pircher, Lena-Christin Conradi, Joanna Kalucka, Vincenzo Lagani, Mieke Dewerchin, Guy Eelen, Ralph J DeBerardinis, Ian D Wilson, and Peter Carmeliet. “Meta-Analysis Of Clinical Metabolic Profiling Studies In Cancer: Challenges And Opportunities”. EMBO Molecular Medicine. doi:10.15252/EMMM.201606798.
@article{Goveia2016,
author = {Goveia, Jermaine and Pircher, Andreas and Conradi, Lena-Christin and Kalucka, Joanna and Lagani, Vincenzo and Dewerchin, Mieke and Eelen, Guy and DeBerardinis, Ralph J and Wilson, Ian D and Carmeliet, Peter},
journal = {EMBO Molecular Medicine},
keywords = {mxmcausalpath},
title = {Meta-analysis of clinical metabolic profiling studies in cancer: challenges and opportunities},
year = 2016
}%0 Journal Article
%1 Goveia2016
%A Goveia, Jermaine
%A Pircher, Andreas
%A Conradi, Lena-Christin
%A Kalucka, Joanna
%A Lagani, Vincenzo
%A Dewerchin, Mieke
%A Eelen, Guy
%A DeBerardinis, Ralph J
%A Wilson, Ian D
%A Carmeliet, Peter
%D 2016
%J EMBO Molecular Medicine
%R 10.15252/EMMM.201606798
%T Meta-analysis of clinical metabolic profiling studies in cancer: challenges and opportunities
- Karathanasis, N, I Tsamardinos, and V Lagani. “omicsNPC: Applying The NonParametric Combination Methodology To The Integrative Analysis Of Heterogeneous Omics Data”. PLoS ONE. doi:10.1371/journal.pone.0165545.
@article{Karathanasis2016,
author = {Karathanasis, N and Tsamardinos, I and Lagani, V},
journal = {PLoS ONE},
keywords = {mxmcausalpath},
title = {omicsNPC: applying the NonParametric Combination methodology to the integrative analysis of heterogeneous omics data},
year = 2016
}%0 Journal Article
%1 Karathanasis2016
%A Karathanasis, N
%A Tsamardinos, I
%A Lagani, V
%D 2016
%J PLoS ONE
%R 10.1371/journal.pone.0165545
%T omicsNPC: applying the NonParametric Combination methodology to the integrative analysis of heterogeneous omics data
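The NonParametric Combination scheme behind omicsNPC can be sketched compactly: compute one statistic per omics dataset, rebuild the joint null by applying the same permutation of sample labels to every dataset (which preserves the dependence between datasets), turn each statistic into a partial p-value against its own permutation distribution, and combine the partial p-values, here with Fisher's combining function. This is our illustration of the general scheme, not the omicsNPC implementation.

import numpy as np

def npc_fisher(datasets, labels, stat_fn, n_perm=999, seed=0):
    # datasets: list of (n_samples, n_features_k) arrays over the same samples;
    # labels: (n_samples,) array; stat_fn(data, labels) -> scalar statistic,
    # larger meaning stronger signal.
    rng = np.random.default_rng(seed)
    n, K = len(labels), len(datasets)
    T = np.empty((n_perm + 1, K))
    T[0] = [stat_fn(d, labels) for d in datasets]  # observed statistics
    for b in range(1, n_perm + 1):
        perm = rng.permutation(n)  # one permutation, reused for every dataset
        T[b] = [stat_fn(d, labels[perm]) for d in datasets]
    # Partial p-values: rank every statistic within its own column.
    P = (T[None, :, :] >= T[:, None, :]).mean(axis=1)
    # Fisher combining function, then compare to its own permutation null.
    C = -2.0 * np.log(np.clip(P, 1e-12, 1.0)).sum(axis=1)
    return float((C >= C[0]).mean())  # global p-value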
- Lagani, Vincenzo, Sofia Triantafillou, Gordon Ball, Jesper Tegner, and Ioannis Tsamardinos. “Probabilistic Computational Causal Discovery For Systems Biology”. Uncertainty In Biology. http://link.springer.com/chapter/10.1007/978-3-319-21296-8_3.
@article{Lagani2016a,
author = {Lagani, Vincenzo and Triantafillou, Sofia and Ball, Gordon and Tegner, Jesper and Tsamardinos, Ioannis},
journal = {Uncertainty in Biology},
keywords = {mxmcausalpath},
title = {Probabilistic computational causal discovery for systems biology},
year = 2016
}%0 Journal Article
%1 Lagani2016a
%A Lagani, Vincenzo
%A Triantafillou, Sofia
%A Ball, Gordon
%A Tegner, Jesper
%A Tsamardinos, Ioannis
%D 2016
%J Uncertainty in Biology
%T Probabilistic computational causal discovery for systems biology
%U http://link.springer.com/chapter/10.1007/978-3-319-21296-8_3
- Athineou, G., G. Papoutsoglou, S. Triantafillou, I. Basdekis, V. Lagani, and I. Tsamardinos. “SCENERY: A Web-Based Application For Network Reconstruction And Visualization Of Cytometry Data”. Proceedings Of The 10th International Conference On Practical Applications Of Computational Biology & Bioinformatics (PACBB 2016). doi:10.1007/978-3-319-40126-3_21. Cytometry techniques make it possible to quantify morphological characteristics and protein abundances at a single-cell level. Data collected with these techniques can be used for addressing the fascinating, yet challenging problem of reconstructing the network of protein interactions forming signaling pathways and governing cell biological mechanisms. Network reconstruction is an established and well-studied problem in the machine learning and data mining fields, with several algorithms already available. In this paper, we present the first web-oriented application, SCENERY, that allows scientists to rapidly apply state-of-the-art network-reconstruction methods on cytometry data. SCENERY comes with an easy-to-use user interface, a modular architecture, and advanced visualization functions. The functionalities of the application are illustrated on data from a publicly available immunology experiment.
@article{Athineou2016,
abstract = {Cytometry techniques make it possible to quantify morphological characteristics and protein abundances at a single-cell level. Data collected with these techniques can be used for addressing the fascinating, yet challenging problem of reconstructing the network of protein interactions forming signaling pathways and governing cell biological mechanisms. Network reconstruction is an established and well-studied problem in the machine learning and data mining fields, with several algorithms already available. In this paper, we present the first web-oriented application, SCENERY, that allows scientists to rapidly apply state-of-the-art network-reconstruction methods on cytometry data. SCENERY comes with an easy-to-use user interface, a modular architecture, and advanced visualization functions. The functionalities of the application are illustrated on data from a publicly available immunology experiment.},
author = {Athineou, G. and Papoutsoglou, G. and Triantafillou, S. and Basdekis, I. and Lagani, V. and Tsamardinos, I.},
journal = {Proceedings of the 10th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2016)},
keywords = {mxmcausalpath},
title = {SCENERY: a Web-Based Application for Network Reconstruction and Visualization of Cytometry Data},
year = 2016
}%0 Journal Article
%1 Athineou2016
%A Athineou, G.
%A Papoutsoglou, G.
%A Triantafillou, S.
%A Basdekis, I
%A Lagani, V.
%A Tsamardinos, I.
%D 2016
%J Proceedings of the 10th International Conference on Practical Applications of Computational Biology & Bioinformatics (PACBB 2016)
%R 10.1007/978-3-319-40126-3_21
%T SCENERY: a Web-Based Application for Network Reconstruction and Visualization of Cytometry Data
%U https://link.springer.com/chapter/10.1007/978-3-319-40126-3_21
%X Cytometry techniques make it possible to quantify morphological characteristics and protein abundances at a single-cell level. Data collected with these techniques can be used for addressing the fascinating, yet challenging problem of reconstructing the network of protein interactions forming signaling pathways and governing cell biological mechanisms. Network reconstruction is an established and well-studied problem in the machine learning and data mining fields, with several algorithms already available. In this paper, we present the first web-oriented application, SCENERY, that allows scientists to rapidly apply state-of-the-art network-reconstruction methods on cytometry data. SCENERY comes with an easy-to-use user interface, a modular architecture, and advanced visualization functions. The functionalities of the application are illustrated on data from a publicly available immunology experiment.
- Triantafillou, Sofia, and Ioannis Tsamardinos. “Score Based Vs Constraint Based Causal Learning In The Presence Of Confounders”. 2016. http://www.its.caltech.edu/~fehardt/UAI2016WS/papers/Triantafillou.pdf. We compare score-based and constraint-based learning in the presence of latent confounders. We use a greedy search strategy to identify the best-fitting maximal ancestral graph (MAG) from continuous data, under the assumption of multivariate normality. Scoring maximal ancestral graphs is based on (a) residual iterative conditional fitting [Drton et al., 2009] for obtaining maximum likelihood estimates for the parameters of a given MAG and (b) factorization and score decomposition results for mixed causal graphs [Richardson, 2009, Nowzohour et al., 2015]. We compare the score-based approach in simulated settings with two standard constraint-based algorithms: FCI and conservative FCI. Results show a promising performance of the greedy search algorithm.
@inproceedings{Triantafillou2016,
abstract = {We compare score-based and constraint-based learning in the presence of latent confounders. We use a greedy search strategy to identify the best-fitting maximal ancestral graph (MAG) from continuous data, under the assumption of multivariate normality. Scoring maximal ancestral graphs is based on (a) residual iterative conditional fitting [Drton et al., 2009] for obtaining maximum likelihood estimates for the parameters of a given MAG and (b) factorization and score decomposition results for mixed causal graphs [Richardson, 2009, Nowzohour et al., 2015]. We compare the score-based approach in simulated settings with two standard constraint-based algorithms: FCI and conservative FCI. Results show a promising performance of the greedy search algorithm.},
author = {Triantafillou, Sofia and Tsamardinos, Ioannis},
keywords = {mxmcausalpath},
title = {Score based vs constraint based causal learning in the presence of confounders},
year = 2016
}%0 Conference Paper
%1 Triantafillou2016
%A Triantafillou, Sofia
%A Tsamardinos, Ioannis
%D 2016
%T Score based vs constraint based causal learning in the presence of confounders
%U http://www.its.caltech.edu/~fehardt/UAI2016WS/papers/Triantafillou.pdf
%X We compare score-based and constraint-based learning in the presence of latent confounders. We use a greedy search strategy to identify the best-fitting maximal ancestral graph (MAG) from continuous data, under the assumption of multivariate normality. Scoring maximal ancestral graphs is based on (a) residual iterative conditional fitting [Drton et al., 2009] for obtaining maximum likelihood estimates for the parameters of a given MAG and (b) factorization and score decomposition results for mixed causal graphs [Richardson, 2009, Nowzohour et al., 2015]. We compare the score-based approach in simulated settings with two standard constraint-based algorithms: FCI and conservative FCI. Results show a promising performance of the greedy search algorithm.
- Borboudakis, Giorgos, and Ioannis Tsamardinos. “Towards Robust And Versatile Causal Discovery For Business Applications”. 2016. https://www.kdd.org/kdd2016/papers/files/rpp1045-borboudakisA.pdf. Causal discovery algorithms can induce some of the causal relations from the data, commonly in the form of a causal network such as a causal Bayesian network. Arguably, however, all such algorithms lag far behind what is necessary for a true business application. We develop an initial version of a new, general causal discovery algorithm called ETIO with many features suitable for business applications. These include (a) the ability to accept prior causal knowledge (e.g., taking senior driving courses improves driving skills), (b) admitting the presence of latent confounding factors, (c) admitting the possibility of (a certain type of) selection bias in the data (e.g., clients sampled mostly from a given region), (d) the ability to analyze data with missing-by-design (i.e., not planned to measure) values (e.g., if two companies merge and their databases measure different attributes), and (e) the ability to analyze data from different interventions (e.g., prior and posterior to an advertisement campaign). ETIO is an instance of the logical approach to integrative causal discovery that has been relatively recently introduced, and enables the solution of complex reverse-engineering problems in causal discovery. ETIO is compared against the state-of-the-art and is shown to be more effective in terms of speed, with only a slight degradation in terms of learning accuracy, while incorporating all the features above. The code is available on the mensxmachina.org website.
@inproceedings{Borboudakis2016,
author = {Borboudakis, Giorgos and Tsamardinos, Ioannis},
keywords = {mxmcausalpath},
title = {Towards Robust and Versatile Causal Discovery for Business Applications},
url = {https://www.kdd.org/kdd2016/papers/files/rpp1045-borboudakisA.pdf},
year = 2016
}
- Roumpelaki, Anna, Giorgos Borboudakis, Sofia Triantafillou, and Ioannis Tsamardinos. “Marginal Causal Consistency in Constraint-Based Causal Learning”. UAI 2016 workshop paper. http://www.its.caltech.edu/~fehardt/UAI2016WS/papers/Roumpelaki.pdf.
Maximal Ancestral Graphs (MAGs) are probabilistic graphical models that can model the distribution and causal properties of a set of variables in the presence of latent confounders. They are closed under marginalization. Invariant pairwise features of a class of Markov-equivalent MAGs can be learnt from observational data sets using the FCI algorithm and its variations (such as conservative FCI and order-independent FCI). We investigate the consistency of causal features (causal ancestry relations) obtained by FCI in different marginals of a single data set. In principle, the causal relationships identified by FCI on a data set D measuring a set of variables V should not conflict with the output of FCI on marginal data sets including only subsets of V. In practice, however, FCI is prone to error propagation, and running FCI on different marginals results in inconsistent causal predictions. We introduce the term marginal causal consistency to denote the consistency of causal relationships when learning from marginal distributions, and investigate the marginal causal consistency of different FCI variations. Results indicate that marginal causal consistency varies for different algorithms, and is also sensitive to network density and marginal size.
@inproceedings{Roumpelaki2016,
author = {Roumpelaki, Anna and Borboudakis, Giorgos and Triantafillou, Sofia and Tsamardinos, Ioannis},
keywords = {mxmcausalpath},
title = {Marginal causal consistency in constraint-based causal learning},
url = {http://www.its.caltech.edu/~fehardt/UAI2016WS/papers/Roumpelaki.pdf},
year = 2016
}
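The marginal-consistency check described in the abstract above is easy to express in code. The following Python sketch is a hypothetical illustration, not the paper's protocol: run_fci is an assumed stand-in for any FCI variant that returns the set of definite ancestral claims (pairs (a, b) meaning "a is a causal ancestor of b"), and the data is assumed to arrive as a pandas DataFrame.

import pandas as pd  # assumption: data set is a DataFrame, one column per variable

def marginal_agreement(data: pd.DataFrame, subset, run_fci):
    """Compare the definite ancestral claims found on the full data set
    with those found on the marginal restricted to `subset`, returning
    the Jaccard agreement over pairs of shared variables."""
    full_claims = run_fci(data)                 # claims over all variables V
    marg_claims = run_fci(data[list(subset)])   # claims over the marginal
    # Only claims among the shared variables are comparable.
    shared = {(a, b) for (a, b) in full_claims if a in subset and b in subset}
    union = shared | marg_claims
    return len(shared & marg_claims) / len(union) if union else 1.0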
- Lagani, Vincenzo, Argyro D. Karozou, David Gomez-Cabrero, Gilad Silberberg, and Ioannis Tsamardinos. “A Comparative Evaluation of Data-Merging and Meta-Analysis Methods for Reconstructing Gene-Gene Interactions”. BMC Bioinformatics, no. S5. doi:10.1186/s12859-016-1038-1.
@article{Lagani2016,
author = {Lagani, Vincenzo and Karozou, Argyro D. and Gomez-Cabrero, David and Silberberg, Gilad and Tsamardinos, Ioannis},
journal = {BMC Bioinformatics},
keywords = {mxmcausalpath},
number = {S5},
title = {A comparative evaluation of data-merging and meta-analysis methods for reconstructing gene-gene interactions},
doi = {10.1186/s12859-016-1038-1},
url = {https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/s12859-016-1038-1},
year = 2016
}
2015
- Tsamardinos, Ioannis, Michail Tsagris, and Vincenzo Lagani. “Feature Selection for Longitudinal Data”. Proceedings of the 10th Conference of the Hellenic Society for Computational Biology & Bioinformatics (HSCBB15), no. 1.
@article{Tsamardinos2015a,
author = {Tsamardinos, Ioannis and Tsagris, Michail and Lagani, Vincenzo},
journal = {Proceedings of the 10th conference of the Hellenic Society for Computational Biology & Bioinformatics (HSCBB15)},
keywords = {mxmcausalpath},
number = 1,
title = {Feature selection for longitudinal data},
year = 2015
}
- Triantafillou, Sofia, and Ioannis Tsamardinos. “Constraint-Based Causal Discovery from Multiple Interventions over Overlapping Variable Sets”. Journal of Machine Learning Research. http://arxiv.org/abs/1403.2150.
Scientific practice typically involves repeatedly studying a system, each time trying to unravel a different perspective. In each study, the scientist may take measurements under different experimental conditions (interventions, manipulations, perturbations) and measure different sets of quantities (variables). The result is a collection of heterogeneous data sets coming from different data distributions. In this work, we present the algorithm COmbINE, which accepts a collection of data sets over overlapping variable sets under different experimental conditions; COmbINE then outputs a summary of all causal models indicating the invariant and variant structural characteristics of all models that simultaneously fit all of the input data sets. COmbINE converts estimated dependencies and independencies in the data into path constraints on the data-generating causal model and encodes them as a SAT instance. The algorithm is sound and complete in the sample limit. To account for conflicting constraints arising from statistical errors, we introduce a general method for sorting constraints in order of confidence, computed as a function of their corresponding p-values. In our empirical evaluation, COmbINE outperforms in terms of efficiency the only pre-existing similar algorithm; the latter additionally admits feedback cycles, but does not admit conflicting constraints, which hinders its applicability to real data. As a proof of concept, COmbINE is employed to co-analyze 4 real mass-cytometry data sets measuring phosphorylated protein concentrations of overlapping protein sets under 3 different interventions.
@article{Triantafillou2014,
author = {Triantafillou, Sofia and Tsamardinos, Ioannis},
journal = {Journal of Machine Learning Research},
keywords = {mxmcausalpath},
title = {Constraint-based Causal Discovery from Multiple Interventions over Overlapping Variable Sets},
url = {http://arxiv.org/abs/1403.2150},
year = 2015
}
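The abstract above mentions sorting constraints by a confidence computed from their p-values. The exact confidence function used by COmbINE is not reproduced here; the Python sketch below uses a generic, hypothetical heuristic in which a constraint is more trustworthy the further its p-value lies from the decision threshold on either side.

import math

ALPHA = 0.05  # assumed significance threshold for the independence tests

def rank_constraints(tests, alpha=ALPHA):
    """Order (in)dependence constraints from most to least reliable.
    `tests` is a list of (constraint, p_value) pairs; constraints whose
    p-values sit far from the threshold (clear dependence or clear
    independence) are ranked first."""
    def confidence(p):
        p = min(max(p, 1e-300), 1.0 - 1e-16)  # guard the logarithm
        return abs(math.log(p) - math.log(alpha))
    return sorted(tests, key=lambda pair: confidence(pair[1]), reverse=True)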
- Borboudakis, Giorgos, and Ioannis Tsamardinos. “Bayesian Network Learning with Discrete Case-Control Data”. Uncertainty in Artificial Intelligence (UAI). http://auai.org/uai2015/proceedings/papers/188.pdf.
We address the problem of learning Bayesian networks from discrete, unmatched case-control data using specialized conditional independence tests. These tests can also be used for learning other types of graphical models or for feature selection. We also propose a post-processing method that can be applied in conjunction with any Bayesian network learning algorithm. In simulations we show that our methods are able to deal with the selection bias of case-control data.
@article{Borboudakis2015,
author = {Borboudakis, Giorgos and Tsamardinos, Ioannis},
journal = {Uncertainty in Artificial Intelligence (UAI)},
keywords = {mxmcausalpath},
title = {Bayesian Network Learning with Discrete Case-Control Data},
url = {http://auai.org/uai2015/proceedings/papers/188.pdf},
year = 2015
}
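The specialized case-control-corrected tests proposed in the paper above are not reproduced here; as background only, the following Python sketch shows the textbook G-squared conditional independence test for discrete data, i.e., the generic building block that such specialized tests modify. Variables are assumed to be integer-coded (0..levels-1), and the conditioning set is assumed to be collapsed into a single stratum code per sample.

import numpy as np
from scipy.stats import chi2

def g_square_ci_test(x, y, z, n_levels_x, n_levels_y):
    """Standard G^2 test of X independent of Y given Z for discrete data:
    sum the G statistic over the strata of Z and return a p-value
    (small values reject conditional independence)."""
    x, y, z = np.asarray(x), np.asarray(y), np.asarray(z)
    g_stat, dof = 0.0, 0
    for stratum in np.unique(z):
        xs, ys = x[z == stratum], y[z == stratum]
        table = np.zeros((n_levels_x, n_levels_y))
        for xi, yi in zip(xs, ys):  # contingency table of X vs Y in stratum
            table[xi, yi] += 1
        total = table.sum()
        if total == 0:
            continue
        expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / total
        mask = table > 0  # observed > 0 implies expected > 0 here
        g_stat += 2.0 * np.sum(table[mask] * np.log(table[mask] / expected[mask]))
        dof += (n_levels_x - 1) * (n_levels_y - 1)
    return chi2.sf(g_stat, dof)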
2014
- Papagiannopoulou, Christina, Grigorios Tsoumakas, and Ioannis Tsamardinos. “Discovering and Exploiting Entailment Relationships in Multi-Label Learning”. ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2015 (KDD). doi:10.1145/2783258.2783302.
This work presents a probabilistic method for enforcing adherence of the marginal probabilities of a multi-label model to automatically discovered deterministic relationships among labels. In particular, we focus on discovering two kinds of relationships among the labels. The first concerns pairwise positive entailment: pairs of labels where the presence of one implies the presence of the other in all instances of a dataset. The second concerns exclusion: sets of labels that do not coexist in the same instances of the dataset. These relationships are represented as a deterministic Bayesian network. Marginal probabilities are entered as soft evidence in the network and, through probabilistic inference, become consistent with the discovered knowledge. Our approach offers robust improvements in mean average precision compared to the standard binary relevance approach across all 12 datasets involved in our experiments. The discovery process helps interesting implicit knowledge to emerge, which could be useful in itself.
@article{Papagiannopoulou2014,
author = {Papagiannopoulou, Christina and Tsoumakas, Grigorios and Tsamardinos, Ioannis},
journal = {ACM SIGKDD Conference on Knowledge Discovery and Data Mining 2015 (KDD)},
keywords = {mxmcausalpath},
title = {Discovering and Exploiting Entailment Relationships in Multi-Label Learning},
doi = {10.1145/2783258.2783302},
url = {http://arxiv.org/abs/1404.4038},
year = 2014
}
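The discovery step of the paper above has a simple core that a short Python sketch can make concrete. This is a hedged simplification, not the authors' code: it scans a binary label matrix for pairwise entailment and pairwise exclusion only, whereas the paper also considers exclusion among larger sets of labels.

import numpy as np

def discover_deterministic_relationships(Y):
    """Scan a binary instance-by-label matrix for (a) pairwise positive
    entailment -- every instance carrying label i also carries label j --
    and (b) pairwise exclusion -- labels i and j never co-occur."""
    Y = np.asarray(Y, dtype=bool)
    n_labels = Y.shape[1]
    entailments, exclusions = [], []
    for i in range(n_labels):
        for j in range(n_labels):
            if i == j:
                continue
            co_occurrences = np.count_nonzero(Y[:, i] & Y[:, j])
            if Y[:, i].any() and co_occurrences == np.count_nonzero(Y[:, i]):
                entailments.append((i, j))   # label i entails label j
            if i < j and co_occurrences == 0:
                exclusions.append((i, j))    # i and j are mutually exclusive
    return entailments, exclusions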