AI Tool for Early Breast Cancer Detection

AI-Based Breast Cancer Screening Tool

Early and accurate detection of tumor malignancy in breast cancer is crucial for effective patient management. This study developed an explainable artificial intelligence (XAI)-based tool for pre-screening breast cancer malignancy. The tool is fast and requires minimal data.

Eight machine learning algorithms were compared using a Kaggle dataset with nine clinical and demographic features from 213 patients. The comparison was based on accuracy, sensitivity, specificity, F1 score, area under the ROC curve (AUC), and Matthews correlation coefficient (MCC). The RUSBoost ensemble model and a single decision tree both achieved approximately 91.7% accuracy. The decision tree was selected for its high explainability, low computational cost, and clinical practicality.

The model yields three verbal decision rules: a tumor is classified as malignant when lymph nodes are involved; malignancy is inferred regardless of tumor size when metastasis is present; and, in the absence of lymph node involvement or metastasis, a large tumor combined with advanced age indicates malignancy. SHapley Additive exPlanations (SHAP) analysis validated the model’s decision-making process. This model shows potential for integration into clinical decision support systems, offering rapid, reliable pre-screening with minimal data. Future validation studies are planned to enhance generalizability.

The Importance of Early Detection

Breast cancer is the most commonly diagnosed cancer among women and a leading cause of cancer-related mortality worldwide. In 2020, there were about 2.3 million new cases, and the disease accounted for about 15% of cancer deaths among women. The International Agency for Research on Cancer (IARC) projects that, by 2050, the annual number of new cases will reach 3.2 million, with 1.1 million deaths. Early diagnosis raises the five-year survival rate for localized breast cancer to 99%, compared with 27% for metastatic disease. It also enables less invasive treatments that improve quality of life and enhances the effectiveness of adjuvant therapies, reducing recurrence risk.

Access to early diagnosis remains limited in low- and middle-income countries, where screening methods like mammography are constrained by cost, infrastructure, and staff shortages. Low public awareness delays consultation, leading to late-stage diagnosis and higher mortality. Although AI-based decision support systems can achieve high diagnostic accuracy in breast cancer (up to 95%), their use in resource-limited settings is restricted by the need for high-quality data and complex infrastructure. A lack of model explainability and limited integration of clinical variables also reduce clinical trust and generalizability.

Explainable AI Model Proposal

Medical literature emphasizes clinical parameters such as age, menopausal status, tumor size, lymph node involvement, and breast quadrant localization, all strongly associated with breast cancer malignancy and prognosis. Tumor size and lymph node status are key for staging and treatment, and menopausal status influences hormone-positive tumors. Tumors in the upper outer quadrant pose a higher metastasis risk. These features can be gathered via clinical exam, ultrasound, or basic biopsy, even in low-resource settings.

This study proposes an explainable AI model that reduces dependency on mammography by using relevant clinical parameters. The system is designed to provide transparent and interpretable decision support reports to clinicians and could increase early detection rates, particularly in resource-constrained settings. The study aims to offer a low-cost, scalable solution, enhance the accessibility of clinical decision support systems, and reduce breast cancer-related mortality rates in resource-limited regions. The system is expected to make meaningful contributions to clinicians, healthcare policymakers, and patient communities.

Dataset and Methods

The analyses used a breast cancer dataset based on cancer records from the University of Calabar Teaching Hospital, covering January 2019 to August 2021. The dataset is publicly available on Kaggle. It includes nine clinical/demographic features from 213 patients, including tumor size, lymph node status, and metastasis. Numerical variables like age and tumor size were used in their original form, while categorical variables were encoded.

Variables with an inherent order, such as menopausal status, lymph node involvement, and presence of metastasis, were label encoded. Non-hierarchical variables like breast laterality and quadrant were one-hot encoded. The outcome was labeled as benign (0) or malignant (1). The dataset contains no personally identifiable information and was used in accordance with Kaggle's data-sharing policies and academic ethical standards. This study did not involve direct human or animal participants.
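
The two encoding schemes described above can be illustrated with a short Python sketch that uses only the standard library. It is not the study's preprocessing code (the study worked in MATLAB); the category labels, feature names, and the example record are hypothetical and do not come from the dataset.

```python
def label_encode(value, ordered_levels):
    """Map an ordered category to its integer rank (0, 1, 2, ...)."""
    return ordered_levels.index(value)


def one_hot_encode(value, categories):
    """Map an unordered category to a 0/1 indicator vector."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical category lists, chosen only for illustration.
MENOPAUSE_LEVELS = ["pre", "post"]
QUADRANTS = ["upper_outer", "upper_inner", "lower_outer", "lower_inner"]

# Hypothetical patient record, not taken from the dataset.
record = {
    "age": 52,
    "tumor_size_cm": 3.0,
    "menopause": "post",
    "quadrant": "upper_inner",
}

encoded = (
    [record["age"], record["tumor_size_cm"]]  # numeric features kept as-is
    + [label_encode(record["menopause"], MENOPAUSE_LEVELS)]
    + one_hot_encode(record["quadrant"], QUADRANTS)
)
print(encoded)  # [52, 3.0, 1, 0, 1, 0, 0]
```

Label encoding keeps a single column and preserves the assumed order, whereas one-hot encoding adds one column per category so that no artificial ranking is imposed on unordered values such as the quadrant.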

Machine Learning Algorithms

Eight machine learning algorithms were evaluated for the early and explainable prediction of breast cancer tumor malignancy. The algorithms were selected based on their classification performance, computational efficiency, and interpretability. Machine learning approaches based on explainable AI are widely used in similar datasets in the literature. The machine learning methods employed include decision trees, discriminant analysis, logistic regression, support vector machines (SVM), Naive Bayes, K-nearest neighbors (K-NN), ensemble learning, and artificial neural networks (ANN).

To ensure an objective and generalizable assessment, the dataset was split into 90% training and 10% testing sets. Tenfold cross-validation was used within the training portion to train and internally validate the models. All classification workflows, including data partitioning, model training, validation, and hyperparameter tuning, were conducted using MATLAB’s Classification Learner app. Hyperparameter optimization was performed via Bayesian optimization in MATLAB to enhance model performance while avoiding overfitting. Models were evaluated using accuracy, sensitivity, specificity, positive predictive value (PPV), F1 score, AUC, and MCC.
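
The partitioning scheme (a 10% hold-out followed by tenfold cross-validation on the remaining 90%) can be sketched in plain Python as follows. This is an illustration, not the MATLAB workflow used in the study. The source does not state whether the split was stratified by class, so stratification here is an assumption, and the class balance of the example labels is invented.

```python
import random


def stratified_holdout(labels, test_fraction=0.10, seed=0):
    """Return (train_idx, test_idx), holding out test_fraction of each class."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def k_folds(indices, k=10, seed=0):
    """Yield (fit_idx, val_idx) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = list(indices)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        val = folds[i]
        fit = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield fit, val


# Hypothetical labels: 213 records with an invented class balance.
labels = [0] * 120 + [1] * 93
train_idx, test_idx = stratified_holdout(labels)
folds = list(k_folds(train_idx, k=10))
print(len(train_idx), len(test_idx), len(folds))
```

With these invented labels the sketch holds out 21 records and leaves 192 for cross-validation. Each training record appears in exactly one validation fold, and the held-out records are never seen during model fitting or tuning.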

Two explainability approaches were adopted. First, decision tree structures were visualized. Second, SHAP (SHapley Additive exPlanations) analysis was used to quantitatively evaluate each variable’s contribution to the model output. To build a general model, a conditional feature augmentation strategy was used. This method develops a model using primary discriminative features, constructs a separate model using secondary features, and integrates these two models conditionally. This provides a transparent and interpretable model architecture, which is beneficial for clinical decision support systems.
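
To make the idea behind SHAP more concrete, the sketch below computes exact Shapley values for a toy two-feature rule by enumerating all feature coalitions. It is a simplified illustration and not the study's analysis: it uses a single baseline point instead of a background dataset, the toy model and its 5.0 cm threshold are hypothetical, and practical SHAP implementations rely on more efficient algorithms.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model at x relative to one baseline point.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    """Toy rule; the 5.0 cm threshold is hypothetical."""
    lymph_node, tumor_size = z
    return 1.0 if lymph_node == 1 or tumor_size > 5.0 else 0.0


x = [1, 6.0]         # lymph node affected, large tumor (invented case)
baseline = [0, 2.0]  # no lymph node involvement, small tumor (invented)
phi = shapley_values(toy_model, x, baseline)
print(phi)  # [0.5, 0.5]
```

The contributions sum to the difference between the model output at x and at the baseline (1.0 here), which is the additivity property that makes SHAP values readable as per-feature shares of a prediction. In this toy case either feature alone is enough to trigger the malignant output, so the credit is split equally.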

Results and Discussion

The classification performance of the eight machine learning algorithms was evaluated on a binary classification problem. The models’ performances were compared using accuracy (ACC), sensitivity (TPR), specificity (TNR), positive predictive value (PPV), F1 score, AUC, and Matthews correlation coefficient (MCC). The decision tree and the RUSBoost-based ensemble achieved the highest performance, with 91.7% accuracy, 90.1–92.8% F1 score, and 83.1% MCC. The decision tree was prioritized due to its interpretability, transparency, low computational cost, and clinical practicality. The decision tree structure illustrates the model’s classification logic. The SHAP analysis quantitatively shows the impact of variables on the model’s decisions. Affected lymph node was the most influential variable, followed by tumor size.
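
For reference, the threshold-based metrics listed above can all be derived from the four counts of a binary confusion matrix, as in the following Python sketch. The counts in the example are hypothetical and are not the study's results. AUC is omitted because it requires ranked prediction scores and cannot be computed from a single confusion matrix.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Metrics from confusion-matrix counts (malignant = positive class).

    No guard against empty rows or columns; a zero denominator raises.
    """
    sensitivity = tp / (tp + fn)  # TPR, recall
    specificity = tn / (tn + fp)  # TNR
    ppv = tp / (tp + fp)          # positive predictive value, precision
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "f1": 2 * ppv * sensitivity / (ppv + sensitivity),
        "mcc": (tp * tn - fp * fn)
        / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }


# Hypothetical counts for a 21-record test set, not the study's results.
metrics = binary_metrics(tp=9, fp=1, tn=10, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

MCC uses all four cells of the confusion matrix, which makes it more informative than accuracy alone when the classes are imbalanced.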

The use of only two features (affected lymph node and tumor size) was a deliberate choice to avoid overfitting. Table 3 shows that affected lymph node, metastasis, and tumor size have the highest discriminative power. Because affected lymph node and metastasis were highly correlated, only one of the two was retained to avoid redundancy. Since metastasis had lower information gain and Gini index values than affected lymph node, the model retained the latter. Tumor size was included for its independent discriminative contribution. Variables like breast quadrant, laterality, and family history had low discriminative power.
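
The feature ranking criteria mentioned above can be sketched in Python as impurity reductions obtained by splitting the labels on one feature. The source does not state exactly how its Gini index values were computed, so treating the criterion as a decrease in Gini impurity is an assumption, and the toy data are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a label list."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def split_gain(feature, labels, impurity):
    """Impurity reduction from partitioning labels by a categorical feature."""
    n = len(labels)
    groups = {}
    for f, y in zip(feature, labels):
        groups.setdefault(f, []).append(y)
    weighted = sum(len(g) / n * impurity(g) for g in groups.values())
    return impurity(labels) - weighted


# Invented toy data: 1 = malignant, 0 = benign.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
lymph_node = [1, 1, 1, 1, 0, 0, 0, 0]  # separates the classes perfectly
laterality = [0, 1, 0, 1, 0, 1, 0, 1]  # carries no class information

for name, feature in [("lymph_node", lymph_node), ("laterality", laterality)]:
    print(
        name,
        round(split_gain(feature, labels, entropy), 3),  # information gain
        round(split_gain(feature, labels, gini), 3),     # Gini decrease
    )
```

In this toy example the perfectly separating feature reaches the maximum information gain of 1 bit and a Gini decrease of 0.5, while the uninformative feature scores 0 on both criteria.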

A new decision tree model was developed by excluding the primary discriminative feature (affected lymph node) to assess the classification potential of previously omitted variables. The new decision tree model without the affected lymph node achieved high performance with 89.6% accuracy and 87.8–90.9% F1 score. The new model includes tumor size, metastasis, and the patient’s age. SHAP analysis indicates that metastasis, tumor size, and age contributed most to the predictions. The optimized model structure for large sample sizes comprises only affected lymph node, tumor size, metastasis, and age.

The optimized decision tree model reduces overfitting and generalizes better, as reflected in higher AUC (91.9% vs. 90.9%) and MCC (82.1% vs. 78.8%) values. This decision tree model achieves 91.15% accuracy and F1 scores ranging from 89.4% to 92.3% using only four variables. The model’s reliance on numerical thresholds may limit its universal applicability. Assignment to the malignant class can be explained through three core decision paths: (1) direct classification as malignant in the presence of affected lymph nodes; (2) favoring malignancy in patients with metastasis; and (3) in the absence of lymph node involvement or metastasis, treating the combination of large tumor size and older age as indicative of malignancy. The model thus classifies breast tumor malignancy with high accuracy using only four core clinical variables and offers a clinically interpretable decision logic.
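
The three decision paths can be written as a small rule function, sketched below in Python. This is an illustration of the described logic, not the trained tree. The source does not report the tree's tumor size or age thresholds, so both are required arguments, and the values used in the example calls are hypothetical.

```python
def classify(lymph_node_affected, metastasis, tumor_size_cm, age_years,
             size_threshold_cm, age_threshold_years):
    """Return (label, reason) following the three malignant paths in the text.

    The two thresholds must be supplied by the caller; the source gives
    no values for them.
    """
    if lymph_node_affected:
        return "malignant", "affected lymph node"
    if metastasis:
        return "malignant", "metastasis present"
    if tumor_size_cm > size_threshold_cm and age_years > age_threshold_years:
        return "malignant", "large tumor and older age"
    return "benign", "no malignant path matched"


# Hypothetical thresholds and cases, for illustration only.
HYPOTHETICAL = {"size_threshold_cm": 5.0, "age_threshold_years": 60}

print(classify(True, False, 1.0, 35, **HYPOTHETICAL))
print(classify(False, True, 1.0, 35, **HYPOTHETICAL))
print(classify(False, False, 6.0, 70, **HYPOTHETICAL))
print(classify(False, False, 6.0, 40, **HYPOTHETICAL))
```

Returning the matched path together with the label mirrors the explainability goal: each prediction comes with the rule that produced it.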

The model performs classification based on affected lymph node, metastasis, tumor size, and age. The optimized model incorporates four features using conditional feature augmentation for robust, explainable classification. The design avoids redundancy and aligns with clinical evaluation principles. The relationship between tumor size and malignancy risk is consistent with the literature. Instead of a fixed 4.5 cm cut-off, the model uses a qualitative approach, reducing overfitting and improving generalizability.

The use of machine learning approaches in breast cancer classification has increased significantly in recent years. Decision tree-based structures that use a small number of clinically meaningful variables offer advantages in interpretability and clinical usability. The decision flow of the model is structured to align with clinical reasoning. Most of the variables used in the model can be easily obtained during standard clinical evaluation, increasing the model’s applicability even in healthcare settings with limited resources.

Limitations and Future Directions

The main limitations of this study are the small size of the dataset and class imbalance. The dataset lacks molecular markers such as hormone receptor status (ER/PR), HER2 status, and BRCA mutations, which limits differentiation of malignant tumor types and prevents the model from fully reflecting biological heterogeneity. Metastasis was assessed only as present/absent, without subtypes. Together, the missing molecular and metastasis subtype data limit personalized prediction. Validation with larger, multi-center datasets including molecular and subtype data is recommended. The model is also based on static clinical data and does not account for the temporal evolution of the disease.

In future studies, the use of larger, multi-center, and balanced datasets is crucial to enhance the generalizability of the model. Furthermore, incorporating more comprehensive clinical and biological variables such as family history, personal medical history, tumor localization, molecular markers (ER/PR, HER2, BRCA), and metastasis types will significantly strengthen the capacity for personalized prediction. The three-stage customized model is suitable only for preliminary screening, as it does not include molecular subtypes. For definitive diagnosis and treatment planning, immunohistochemistry (IHC) and genetic tests are required. In field applications as a clinical decision support system, the model’s impact on physicians’ decision-making processes, decision speed, and patient confidence should be evaluated.

This study developed an explainable decision tree model for breast cancer malignancy classification based on four key clinical variables. The model achieved high discriminative performance (AUC: 91.89%), offering a practical tool that can support clinicians in rapid screening and triage processes, particularly in resource-limited healthcare settings. Its transparent logic may help patients understand decisions, reduce anxiety, and build trust. The absence of molecular markers and metastasis localization limits the model’s applicability in definitive treatment planning. Therefore, the model should be used solely for preliminary assessment before histopathological evaluation.

Future studies should focus on integrating molecular markers and metastasis types into the model, performing multi-center external validation, and prospectively evaluating the model’s impact on patient-physician decision-making processes. Although it is not intended for biopsy prioritization or urgent surgical decisions, the model is a foundational and applicable clinical tool. The dataset analyzed during the current study is publicly available on the Kaggle platform. The data are fully anonymized and comply with the data-sharing policies of the source platform.

For communicating a high-risk result to patients, a standard message is suggested: “You are in a high-risk group; this is not a diagnosis and should be confirmed.”