Comparative Evaluation of Missing Data Imputation for Omics Data
Presented by Cheng-Chang Wu
Master's Candidate in Biostatistics
Plan B Adviser: Eric Lock
Abstract: Missing data is a common challenge in the analysis of many molecular “omics” datasets (e.g., genomics, metabolomics, proteomics). The challenge is particularly significant in metabolomics and other mass-spectrometry-based technologies, primarily because of the prevalence of informative missingness, such as values missing due to the limit of detection (MLOD). This study evaluated two widely used imputation methods, SoftImpute and KNNimpute, across a variety of parameter settings, for estimating missing values in simulated datasets under diverse conditions. These conditions included varying data dimensions, signal-to-noise ratios, missing data proportions, and missingness mechanisms (missing completely at random, MCAR, and MLOD). Additionally, the methods were applied to a real metabolomics dataset from the MILK-OMICS study.
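To make the two missingness mechanisms concrete, the following is a minimal sketch in Python, not the study's actual simulation code. The function names, the Gaussian low-rank generator, the definition of the signal-to-noise ratio as a ratio of standard deviations, and the per-column quantile used as a detection limit are all illustrative assumptions.

```python
import numpy as np

def simulate_low_rank(n, p, rank, snr, rng):
    """Low-rank signal plus Gaussian noise.

    Assumption: snr is the ratio of signal to noise standard deviation.
    """
    U = rng.standard_normal((n, rank))
    V = rng.standard_normal((p, rank))
    signal = U @ V.T
    signal /= signal.std()                    # scale the signal to unit SD
    noise = rng.standard_normal((n, p)) / snr
    return signal + noise

def mask_mcar(X, prop, rng):
    """MCAR: each entry is removed independently with probability prop."""
    out = X.copy()
    out[rng.random(X.shape) < prop] = np.nan
    return out

def mask_mlod(X, prop):
    """MLOD: entries below a per-column detection limit are removed.

    Assumption: the limit is the prop-quantile of each column, so
    roughly a fraction prop of every column goes missing.
    """
    lod = np.quantile(X, prop, axis=0)        # one threshold per column
    out = X.copy()
    out[X < lod] = np.nan                     # broadcasts across rows
    return out

rng = np.random.default_rng(0)
X = simulate_low_rank(n=100, p=30, rank=3, snr=2.0, rng=rng)
X_mcar = mask_mcar(X, prop=0.2, rng=rng)
X_mlod = mask_mlod(X, prop=0.2)
```

Under MCAR the missing entries are a random sample of all values, whereas under MLOD every missing entry lies below every observed entry in its column, which is what makes the missingness informative.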
We show that the SVD-based SoftImpute consistently outperforms the neighborhood-based KNNimpute across all simulations, likely due to its ability to capture the low-rank structure inherent in the simulated data. SoftImpute’s performance is further influenced by the tuning approach and parameter settings used. Notably, nuclear norm regularization proves effective in providing stable solutions and mitigating over-fitting, especially in scenarios with weak signal strength. While fine-tuning the rank of the approximation may yield superior results, it is also unstable without nuclear norm penalization. These findings underscore the importance of parameter settings and tuning approaches for achieving good imputation accuracy with SoftImpute. Although the simulations highlight SoftImpute’s advantages, the application to real metabolomics data reveals limitations of current methods in handling missingness induced by detection limits. Nonetheless, these insights can guide the development of more effective imputation strategies suited to the challenges of omics data analysis.
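The interplay between the nuclear norm penalty and the rank of the approximation can be illustrated with a minimal sketch of the iterative soft-thresholded SVD idea behind SoftImpute. This is a simplified NumPy illustration under stated assumptions, not the implementation evaluated in the study: the function name `soft_impute`, its parameters `lam` (penalty) and `max_rank` (rank cap), and the convergence rule are hypothetical choices made here.

```python
import numpy as np

def soft_impute(X, lam, max_rank=None, n_iter=200, tol=1e-6):
    """Fill NaNs in X using a soft-thresholded SVD iteration.

    lam      : amount subtracted from each singular value (nuclear norm penalty)
    max_rank : optional hard cap on the rank of the approximation
    """
    observed = ~np.isnan(X)
    Z = np.zeros_like(X)                       # current low-rank estimate
    for _ in range(n_iter):
        # Keep observed entries, fill missing ones with the current estimate.
        filled = np.where(observed, X, Z)
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        d = np.maximum(d - lam, 0.0)           # soft-threshold singular values
        if max_rank is not None:
            d[max_rank:] = 0.0                 # optional rank truncation
        Z_new = (U * d) @ Vt
        change = np.linalg.norm(Z_new - Z) / max(np.linalg.norm(Z), 1e-12)
        Z = Z_new
        if change < tol:
            break
    return np.where(observed, X, Z)

# Small usage example on synthetic rank-2 data with 20% of entries removed at random.
rng = np.random.default_rng(0)
truth = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 20))
X_obs = truth.copy()
X_obs[rng.random(truth.shape) < 0.2] = np.nan
X_hat = soft_impute(X_obs, lam=0.1, max_rank=2)
```

In this sketch, setting `lam=0` with a fixed `max_rank` corresponds to a purely rank-constrained fit, while a positive `lam` shrinks every retained singular value, which is the kind of penalization the abstract credits with more stable solutions when the signal is weak.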
Keywords: Missing data imputation, Omics data, Metabolomics, SoftImpute, KNNimpute, Simulations, Low-rank approximation, Missing due to limit of detection, Nuclear norm regularization