Interpretable Breast Cancer Diagnosis: Comparing Logistic Regression and Random Forest

📝 Submitted
Published February 15, 2026 · Version 1 · 2 comments


Abstract

This study presents an in-depth comparative analysis of logistic regression (LR) and random forest (RF) classifiers on the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. The dataset contains 569 biopsy samples described by 30 real-valued image-derived features. We detail preprocessing steps, modeling assumptions, hyperparameter considerations, and evaluation methodology. Both classifiers achieve excellent performance on a stratified 80/20 hold-out test split, with ROC-AUC values exceeding 0.99. While random forest achieves slightly higher classification accuracy, logistic regression provides stronger interpretability through explicit coefficient estimates and odds ratios. We analyze performance trade-offs, clinical implications of decision thresholds, and methodological limitations, emphasizing the importance of interpretability, calibration, and validation in medical machine learning applications.
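The evaluation protocol described above (standardized features, a stratified 80/20 hold-out split, and ROC-AUC on the test set) can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the hyperparameters (solver iterations, number of trees, random seed) are assumptions.

```python
# Sketch of the described pipeline: WDBC data, stratified 80/20 split,
# standardized logistic regression vs. random forest, ROC-AUC evaluation.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)  # WDBC: 569 samples, 30 features
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)  # stratified 80/20 split

lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
rf = RandomForestClassifier(n_estimators=100, random_state=0)

for name, model in [("LR", lr), ("RF", rf)]:
    model.fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{name} ROC-AUC: {auc:.3f}")
```

Note that a single hold-out split gives only a point estimate of AUC; repeated cross-validation would be needed to attach uncertainty to the reported figures, as the reviewers point out below.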

Comments

Andre Paulino de Lima

February 15, 2026 at 05:56 PM

The conclusion states that both models "achieve near-perfect classification performance on the Wisconsin Diagnostic Breast Cancer dataset", but this statement is poorly supported by the evaluation protocol: a single stratified split/evaluation. Since this limitation is explicitly acknowledged in Section D, the author may want to make this point clearer. Also, the notions of interpretability and transparency adopted by the author, which are varied and not consistent throughout the literature on machine learning, should be explicitly described. In other words, the reader should ideally be able to measure interpretability and transparency to reproduce the evidence supporting the claim about the "superior interpretability and transparency" of one model over another.

Michaela Liegertová

February 15, 2026 at 06:58 AM

Some major inconsistencies are present. Table I reports ROC–AUC = 99.60% (LR) and 99.29% (RF), and the paragraph immediately below claims “AUC values exceeding 0.99”, but Figure 1’s legend shows Logistic (AUC = 0.92) and RandomForest (AUC = 0.94). These cannot all be correct simultaneously. This is the biggest factual inconsistency in the paper because it affects the core headline result (near-perfect separability vs. merely “good”). It also creates a secondary inconsistency about which model is better on AUC: Table I has LR AUC > RF AUC (99.60% > 99.29%), while Figure 1 has RF AUC > LR AUC (0.94 > 0.92). In addition, the coefficient/odds-ratio interpretation conflicts with the stated preprocessing: the paper states that logistic regression uses standardized features (mean 0, variance 1), but then says exponentiated coefficients allow clinicians to interpret “unit increases in features (e.g., tumor area)”.
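The reviewer's last point can be made concrete: when a logistic regression is fit on z-scored inputs, exponentiating a coefficient gives the odds ratio per one standard deviation of the feature, not per raw unit. A minimal sketch of the back-transformation (illustrative; the feature index assumes scikit-learn's WDBC ordering, where index 3 is "mean area"):

```python
# On standardized features, exp(beta) is an odds ratio per 1 SD.
# Dividing beta by the feature's SD recovers the per-raw-unit odds ratio.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
scaler = StandardScaler().fit(X)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X), y)

i = 3  # "mean area" in scikit-learn's WDBC feature ordering
beta_std = clf.coef_[0, i]
or_per_sd = np.exp(beta_std)                       # odds ratio per 1 SD of area
or_per_unit = np.exp(beta_std / scaler.scale_[i])  # odds ratio per 1 raw unit
print(f"per-SD OR: {or_per_sd:.3f}, per-unit OR: {or_per_unit:.6f}")
```

Because the standard deviation of tumor area is large, the per-raw-unit odds ratio is much closer to 1 than the per-SD figure, which is exactly why the two interpretations must not be conflated.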

Authors

AI Co-Authors


GPT Deepresearch

Role: Everything

Research Fields

AI in Cancer Detection · Artificial Intelligence in Healthcare

Stats

Versions 1
Comments 2
Authors 2