Decision Tree Classification and Model Evaluation for Breast Cancer Survivability: A Data Mining Approach
Chinnaiyan Ponnuraja1, Babu C Lakshmanan2, Valarmathi Srinivasan3 and Krihsna Prasanth B4
1Department of Statistics, National Institute for Research in Tuberculosis (ICMR), Chennai, India.
2Cognizant Technology Solutions, Chennai, India.
3Department of Epidemiology, The TamilNadu Dr.MGR Medical University, Chennai, India.
4Department of Oral Pathology (COCPAR), Sree Balaji Dental College and Hospital, Bharath University, Pallikaranai, Chennai-600100.
Corresponding Author E-mail: cponnuraja@nirt.res.in
Abstract: Data mining is a leading technique in the health care industry for uncovering patterns in large volumes of data. Breast cancer is one of the most prevalent cancers in the world and is well suited to study by data mining techniques. Multiple factors must be considered when making treatment decisions for breast cancer. SEER breast cancer data are analyzed to extract an accurate model of patient survival using data mining techniques such as the decision tree algorithm, classification, and pattern recognition. A decision tree algorithm can draw on multiple factors for prediction, classification, pattern recognition, and pattern completion. To improve prediction of breast cancer patients’ survivability, only seven of the available features are identified as important for the analysis. After feature identification, the data are pre-processed, for example by deleting records with insufficient or missing information, and all the identified features are then used in the decision tree algorithm. The objective is to compare the predictive results of classifying breast cancer patients (both male and female) with the decision tree algorithm using age categorization. Using this algorithm, we predict a mortality rate of 95.1% for female breast cancer patients in the 42-52 age group, in combination with other risk factors. Predictions and risk factors are obtained for male patients in the same way. The decision tree yields one path with the highest survival rate (96.4%) and one with the highest death rate (95.1%). The result is cross-validated using logistic regression. Female patients aged 42-52 and male patients aged under 42 are identified as the high-risk groups. The proposed approach gives clinicians a high-risk group reference and helps them plan patients’ treatment accordingly.
Keywords: Data Mining; Decision Tree; CHAID; SEER; Breast Cancer
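
The workflow outlined in the abstract (delete incomplete records, categorize age, then let a chi-square-based tree choose a split) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The records are made-up toy values rather than SEER data, the field names `age`, `stage`, and `status` are hypothetical, and the `>52` label is an assumption (only the 42-52 and <42 groups are named above). The sketch ranks predictors by the raw Pearson chi-square statistic, which is only the core idea behind CHAID split selection. Full CHAID also merges similar categories and compares Bonferroni-adjusted p-values, which matters when predictors have different numbers of categories.

```python
from collections import Counter

# Toy records, invented for illustration only (NOT SEER data).
# Field names are hypothetical; None marks a missing value.
RECORDS = [
    {"age": 38, "stage": "local", "status": "alive"},
    {"age": 40, "stage": "regional", "status": "dead"},
    {"age": 45, "stage": "regional", "status": "dead"},
    {"age": 47, "stage": "local", "status": "dead"},
    {"age": 50, "stage": "regional", "status": "dead"},
    {"age": 55, "stage": "local", "status": "alive"},
    {"age": 60, "stage": "local", "status": "alive"},
    {"age": 63, "stage": "regional", "status": "alive"},
    {"age": None, "stage": "local", "status": "alive"},
    {"age": 49, "stage": None, "status": "dead"},
]


def drop_incomplete(records):
    """Pre-processing step: delete records with any missing field."""
    return [r for r in records if all(v is not None for v in r.values())]


def age_group(age):
    """Categorize age; the '>52' label is an assumed third group."""
    if age < 42:
        return "<42"
    if age <= 52:
        return "42-52"
    return ">52"


def chi_square(records, predictor, outcome="status"):
    """Pearson chi-square statistic of a predictor-by-outcome table."""
    n = len(records)
    cell = Counter((r[predictor], r[outcome]) for r in records)
    row = Counter(r[predictor] for r in records)
    col = Counter(r[outcome] for r in records)
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n
            stat += (cell[(a, b)] - expected) ** 2 / expected
    return stat


# Clean the data and add the derived age category to each record.
clean = [dict(r, age_group=age_group(r["age"])) for r in drop_incomplete(RECORDS)]

# Score each candidate predictor and pick the strongest association.
scores = {p: chi_square(clean, p) for p in ("age_group", "stage")}
best = max(scores, key=scores.get)
print(best, scores)
```

On this toy data the age category shows the stronger association with survival status and would be chosen for the first split. A real tree would repeat the selection within each resulting subgroup until a stopping rule is met.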