Breast Cancer Classification: Multi-Algorithm Comparison

This project implemented and compared six machine learning classification algorithms for predicting breast cancer diagnosis (malignant vs. benign) from cellular characteristics. I built a classification pipeline that trains each algorithm on the same preprocessed data to identify the most effective approach for cancer detection and diagnosis support.

💻 Tech Stack:

Python for machine learning model development and comparison
Scikit-learn for multiple classification algorithms, preprocessing, and evaluation metrics
Pandas for dataset loading and initial data exploration
Matplotlib for data visualisations and model performance analysis
NumPy for numerical operations and grid generation

🧪 Data Pipeline:

Load & inspect data: Loaded the breast cancer dataset using pd.read_csv() and separated cellular features (X) from diagnosis labels (y) using iloc[:, :-1] and iloc[:, -1] respectively, which treats the last column as the diagnosis label.
Train-test split: Applied train_test_split() with a 75-25 split (test_size=0.25) and a fixed random state (random_state=0) for reproducible model evaluation. The call does not pass stratify, so class proportions in the training and test sets are not guaranteed to match.
Feature standardization: Implemented StandardScaler() using fit_transform() on training data and transform() on test data to rescale cellular measurements to zero mean and unit variance. The scaler's statistics come only from the training set, which avoids data leakage.
Logistic Regression: Built a LogisticRegression(random_state=0) model as the statistical baseline for binary medical classification, providing interpretable probability outputs for clinical decision-making.
Support Vector Machine (Linear): Implemented SVC(kernel='linear') to find optimal linear decision boundaries for separating malignant from benign cases using maximum margin principles.
Decision Tree Classification: Applied DecisionTreeClassifier(criterion='entropy') to create interpretable rule-based diagnostic pathways that clinicians can follow and understand.
K-Nearest Neighbors: Used KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2) to classify cases based on similarity to neighboring data points, leveraging local patterns in cellular characteristics.
Support Vector Machine (RBF): Implemented SVC(kernel='rbf') with radial basis function kernel to capture complex non-linear relationships in cellular feature space.
Naive Bayes: Applied GaussianNB() assuming feature independence to provide probabilistic classification based on Bayesian statistics, suitable for medical diagnostic scenarios.
Performance evaluation: Generated predictions using classifier.predict(X_test) and evaluated each model using confusion_matrix() and accuracy_score() to assess diagnostic accuracy and error patterns.
Medical model validation: Created confusion matrices to analyze true positives, false positives, true negatives, and false negatives. These counts are critical for medical diagnostic applications, where false negatives (missed cancers) are particularly concerning.
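The steps above can be sketched as a single loop over the six classifiers. This is a minimal sketch rather than the project's exact code: it assumes scikit-learn's bundled Wisconsin breast cancer dataset (load_breast_cancer) as a stand-in for the project's CSV file, and the `models` and `results` dictionaries are illustrative names. The classifier settings follow the ones listed in the pipeline.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Assumption: the bundled dataset stands in for 'Breast cancer data.csv'.
X, y = load_breast_cancer(return_X_y=True)

# Same split settings as the pipeline: 75-25, fixed random state.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit the scaler on the training set only, then reuse it on the test set.
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# The six classifiers described in the pipeline.
models = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "SVM (linear)": SVC(kernel="linear", random_state=0),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
    "SVM (RBF)": SVC(kernel="rbf", random_state=0),
    "Naive Bayes": GaussianNB(),
}

# Train each model and keep its confusion matrix and accuracy.
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results[name] = {
        "cm": confusion_matrix(y_test, y_pred),
        "accuracy": accuracy_score(y_test, y_pred),
    }
    print(f"{name}: accuracy = {results[name]['accuracy']:.3f}")
    print(results[name]["cm"])
```

Keeping every model in one dictionary means they all see the same split and the same scaled features, so differences in the confusion matrices come from the algorithms rather than from the data preparation.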

📊 Code Snippets & Visualisations:

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('Breast cancer data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Training the Decision Tree model on the Training set
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Evaluating using confusion matrix
from sklearn.metrics import confusion_matrix, accuracy_score
y_pred = classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))

**Figure 1** Logistic Regression

**Figure 2** Support Vector Machine

**Figure 3** Kernel SVM

**Figure 4** K-Nearest Neighbors

**Figure 5** Naïve Bayes

**Figure 6** Random Forest

**Figure 7** Decision Tree

🌟 Key Insights:

Multiple algorithms provided different approaches to cancer classification, each with unique strengths for medical diagnosis
Feature standardization proved crucial for distance-based algorithms (SVM, KNN) due to varying scales of cellular measurements
Confusion matrix analysis revealed the trade-offs between sensitivity (detecting cancer) and specificity (avoiding false alarms)
Model comparison enabled selection of the most reliable algorithm for medical diagnostic support
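One way to check the effect of standardization on a distance-based model is to score K-Nearest Neighbors with and without a scaler. The sketch below is illustrative only: it assumes scikit-learn's bundled breast cancer dataset as a stand-in for the project's CSV, and it uses 5-fold cross-validation, which is not part of the original pipeline. Wrapping the scaler in a pipeline means it is refitted on each training fold, so no test-fold statistics leak into training.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumption: the bundled dataset stands in for the project's CSV file.
X, y = load_breast_cancer(return_X_y=True)

# KNN on raw features: large-scale features dominate the Euclidean distance.
knn_raw = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2)

# KNN on standardized features: the scaler is fitted inside each training fold.
knn_scaled = make_pipeline(
    StandardScaler(),
    KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
)

# Assumption: 5-fold cross-validation (the original project uses a single split).
raw_scores = cross_val_score(knn_raw, X, y, cv=5)
scaled_scores = cross_val_score(knn_scaled, X, y, cv=5)

print(f"KNN without scaling: mean accuracy = {raw_scores.mean():.3f}")
print(f"KNN with scaling:    mean accuracy = {scaled_scores.mean():.3f}")
```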

🧗🏾 Challenge Faced:

Medical diagnostic data raises a class imbalance concern that calls for evaluation metrics beyond simple accuracy. Accuracy gives an overall performance measure, but it can mislead in medical contexts, where false negatives (missed cancer cases) have far more severe consequences than false positives (benign cases flagged as suspicious). The challenge was ensuring that model evaluation reflected the clinical importance of sensitivity (recall) relative to specificity: on an imbalanced dataset, a model with 95% accuracy can still miss 20% of actual cancer cases. I addressed this by examining each model's confusion matrix and reading true positives, false positives, true negatives, and false negatives separately, which shows how well each model avoids the most clinically dangerous errors while remaining reliable overall.
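The gap between accuracy and sensitivity can be made concrete with a small worked example. The counts below are hypothetical, chosen only to reproduce the 95% accuracy and 20% miss rate mentioned above; they are not results from this project, and `diagnostic_rates` is an illustrative helper name.

```python
def diagnostic_rates(tp, fn, fp, tn):
    """Return accuracy, sensitivity (recall), and specificity from the four counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,     # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),     # share of actual cancers detected
        "specificity": tn / (tn + fp),     # share of benign cases correctly cleared
    }


# Hypothetical imbalanced test set: 100 malignant and 900 benign cases.
# The model detects 80 of the 100 cancers and clears 870 of the 900 benign cases.
rates = diagnostic_rates(tp=80, fn=20, fp=30, tn=870)

for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

With these counts the accuracy is 950/1000 = 0.95 while the sensitivity is only 80/100 = 0.80, so one cancer in five is missed. For a binary scikit-learn confusion matrix, the four counts can be unpacked with `tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()`, assuming the malignant class is encoded as the larger of the two label values.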
