- Rahibu A. Abassi 1*
- Amina S. Msengwa 2
- Rocky R. J. Akarro 2
1 Department of Natural Science, State University of Zanzibar, Zanzibar-Tanzania
2 Department of Statistics, University of Dar es Salaam, Dar es Salaam-Tanzania
*Corresponding Author: Rahibu A. Abassi, Department of Natural Science, State University of Zanzibar, Zanzibar-Tanzania.
Citation: Rahibu A. Abassi, Amina S. Msengwa, Rocky R. J. Akarro. (2022). Imputation Methods on Retrospective Breast Cancer Data in Tanzania: A Comparative Study. J. Women Health Care and Issues. 5(4): DOI: 10.31579/2642-9756/118.
Copyright: © 2022 Rahibu A. Abassi. This is an open-access article distributed under the terms of The Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Received: 18 April 2022 | Accepted: 25 May 2022 | Published: 06 June 2022
Keywords: breast cancer dataset; classification methods; imputation methods; missing data
Abstract
Background: Clinical datasets are prone to missing data for several reasons, including patients’ failure to attend clinical measurements and defects in measurement recorders. Missing data can significantly affect the analysis, and results may be doubtful because of the bias introduced by omitting incomplete records, especially when the dataset is small. This study compares several imputation methods in terms of their efficiency in filling in missing data, so as to increase prediction and classification accuracy in a breast cancer dataset.
Methodology: The imputation methods compared, namely series mean, k-nearest neighbour, hot deck, predictive mean matching, expectation-maximisation with bootstrapping, and multiple imputation by chained equations, were applied to replace the missing values in a real breast cancer dataset. The efficiency of the imputation methods was compared using the Root Mean Square Error (RMSE) and the Mean Absolute Error (MAE) in order to obtain a suitable complete dataset. Binary logistic regression and linear discriminant analysis classifiers were then applied to the imputed dataset to compare their efficacy in classification and discrimination.
Results: The evaluation of the imputation methods revealed that predictive mean matching performed better than the other imputation methods. In addition, binary logistic regression and linear discriminant analysis yielded nearly identical overall classification rates, sensitivity and specificity.
Conclusion: Predictive mean matching imputation showed higher accuracy in estimating and replacing missing values in the real breast cancer dataset under study, making it an effective approach to handling missing data. We recommend replacing missing data using predictive mean matching, since it is a plausible approach to multiple imputation for numerical variables. It improves estimation and prediction accuracy over complete-case analysis, especially when the percentage of missing data is not very small.
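The RMSE and MAE comparison described in the Methodology can be illustrated with a minimal sketch. It does not reproduce the study's procedure or data. It assumes synthetic data, missingness introduced completely at random in a single numerical variable, and errors computed on the masked cells against their known true values. Only two of the listed methods (series mean and k-nearest neighbour) are sketched, and all function and variable names are hypothetical.

```python
import numpy as np

def rmse(truth, imputed):
    """Root mean square error between true and imputed values."""
    return float(np.sqrt(np.mean((truth - imputed) ** 2)))

def mae(truth, imputed):
    """Mean absolute error between true and imputed values."""
    return float(np.mean(np.abs(truth - imputed)))

def mean_impute(X, col):
    """Series-mean imputation: fill NaNs in `col` with the observed mean."""
    filled = X.copy()
    missing = np.isnan(X[:, col])
    filled[missing, col] = np.nanmean(X[:, col])
    return filled

def knn_impute(X, col, k=5):
    """k-nearest-neighbour imputation for one column.

    Assumes every other column is fully observed. Each missing value is
    replaced by the mean of `col` over the k donor rows that are closest
    (Euclidean distance) on the remaining columns.
    """
    filled = X.copy()
    missing = np.isnan(X[:, col])
    others = np.delete(X, col, axis=1)
    donors = np.where(~missing)[0]
    for i in np.where(missing)[0]:
        dist = np.linalg.norm(others[donors] - others[i], axis=1)
        nearest = donors[np.argsort(dist)[:k]]
        filled[i, col] = X[nearest, col].mean()
    return filled

# Synthetic stand-in for a clinical dataset (an assumption, not study data):
# column 0 depends on columns 1 and 2 plus noise.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x0 = 1.5 * x1 - x2 + 0.3 * rng.normal(size=n)
complete = np.column_stack([x0, x1, x2])

# Hide roughly 20% of column 0 completely at random.
mask = rng.random(n) < 0.2
incomplete = complete.copy()
incomplete[mask, 0] = np.nan

# Score each method on the hidden cells only.
truth = complete[mask, 0]
results = {}
for name, impute in [("series mean", mean_impute),
                     ("k-nearest neighbour", knn_impute)]:
    filled = impute(incomplete, col=0)
    results[name] = (rmse(truth, filled[mask, 0]), mae(truth, filled[mask, 0]))

for name, (r, m) in results.items():
    print(f"{name}: RMSE={r:.3f}, MAE={m:.3f}")
```

Lower RMSE and MAE indicate imputed values closer to the hidden truth. In this synthetic setting the masked variable is strongly related to the other columns, so a method that uses those columns is expected to score better than the series mean. That ranking need not carry over to real clinical data.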