Customer Purchase Prediction – Model Comparison & Optimization

This project involved comparing multiple classification algorithms to predict whether users would purchase a product based on their age and estimated salary from social network advertisement data. I tested various models including Logistic Regression, SVM, Kernel SVM, Naive Bayes, K-NN, Random Forest, and Decision Tree. The Decision Tree classifier yielded the best results, which is why I've included its implementation in my portfolio.

💻 Tech Stack:

Python for machine learning model development and comparison

Scikit-learn for machine learning model implementation and evaluation

Pandas for dataset loading and initial data exploration

Matplotlib for creating decision boundary visualizations and scatter plots

NumPy for numerical operations and grid generation

🧪 Data Pipeline:

Model Comparison & Selection: Tested multiple classification algorithms (Logistic Regression, SVM, Kernel SVM, Naive Bayes, K-NN, Random Forest, and Decision Tree) to identify the best-performing model for this dataset.

Data Import & Preparation: Loaded the Social Network Ads dataset using pandas and separated features (age, salary) from the target variable (purchase decision).
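A minimal sketch of the feature/target separation, using a small in-memory table in place of the CSV. The column names and rows here are made up for illustration; only the `iloc` slicing mirrors the step described above.

```python
import pandas as pd

# Hypothetical stand-in for Social_Network_Ads.csv: the column names and
# rows below are illustrative assumptions, not the real data.
dataset = pd.DataFrame({
    "Age": [19, 35, 26, 47, 52],
    "EstimatedSalary": [19000, 20000, 43000, 25000, 90000],
    "Purchased": [0, 0, 0, 1, 1],
})

# Every column except the last is a feature; the last column is the target.
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

print(X.shape)  # (5, 2)
print(y.shape)  # (5,)
```

Slicing by position rather than by column name keeps the step independent of how the CSV header is spelled, at the cost of assuming the target is always the last column.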

Data Splitting: Used train_test_split() to divide the dataset into 75% training and 25% testing sets with a fixed random state for reproducibility.
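To illustrate what the 75/25 split and the fixed random state provide, here is a small sketch on synthetic data. The 400-row size and the value ranges are arbitrary assumptions, not properties of the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 400 rows of (age, salary) with random labels.
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.integers(18, 61, 400),
    rng.integers(15000, 150001, 400),
])
y = rng.integers(0, 2, 400)

# 25% of the rows are held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Repeating the call with the same random_state gives the same split.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.25, random_state=0)

print(X_train.shape, X_test.shape)             # (300, 2) (100, 2)
print(np.array_equal(X_train, X_train_again))  # True
```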

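Feature Scaling: Applied StandardScaler to standardize both age and salary to zero mean and unit variance, fitting it on the training set only and reusing those statistics for the test set. This keeps salary, whose values are much larger than age values, from dominating the model.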
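The scaling step can be sketched as follows. StandardScaler replaces each value x with (x - mean) / std, where the mean and standard deviation are computed per column. The data below is synthetic and the ranges are assumptions.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with very different column magnitudes.
rng = np.random.default_rng(0)
X_train = np.column_stack([rng.uniform(18, 60, 300), rng.uniform(15000, 150000, 300)])
X_test = np.column_stack([rng.uniform(18, 60, 100), rng.uniform(15000, 150000, 100)])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = sc.transform(X_test)        # reuse the training statistics

# Training columns end up with mean ~0 and standard deviation ~1.
print(X_train_scaled.mean(axis=0))
print(X_train_scaled.std(axis=0))

# Test columns are close to, but not exactly, mean 0: they were scaled
# with the training statistics, which is the intended behaviour.
print(X_test_scaled.mean(axis=0))
```

Fitting the scaler on the training rows alone avoids leaking information about the test set into the model.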

Model Training: Implemented and trained seven different classifiers on the scaled training data: LogisticRegression, SVC (linear and RBF kernel), GaussianNB, KNeighborsClassifier, RandomForestClassifier, and DecisionTreeClassifier with entropy criterion.
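A comparison loop along these lines is one way to train the seven classifiers side by side. This is a sketch on synthetic data: the labelling rule is hypothetical, and every hyperparameter other than the Decision Tree's entropy criterion is an assumption, so the scores it prints say nothing about the project's actual results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a hypothetical labelling rule.
rng = np.random.default_rng(0)
age = rng.uniform(18, 60, 400)
salary = rng.uniform(15000, 150000, 400)
X = np.column_stack([age, salary])
y = ((age > 42) | (salary > 90000)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Hyperparameters here are assumptions, apart from criterion="entropy"
# for the decision tree, which the write-up states.
models = {
    "Logistic Regression": LogisticRegression(random_state=0),
    "SVM (linear)": SVC(kernel="linear", random_state=0),
    "Kernel SVM (RBF)": SVC(kernel="rbf", random_state=0),
    "Naive Bayes": GaussianNB(),
    "K-NN": KNeighborsClassifier(n_neighbors=5),
    "Random Forest": RandomForestClassifier(
        n_estimators=10, criterion="entropy", random_state=0
    ),
    "Decision Tree": DecisionTreeClassifier(criterion="entropy", random_state=0),
}

# Fit each model on the same split and record its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {acc:.3f}")
```

Keeping the split and the scaler identical across models means any difference in accuracy comes from the classifier rather than from the data it saw.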

Model Evaluation: Generated predictions on the test set, then built a confusion matrix and computed the accuracy score to assess classification performance.
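As a worked example of how accuracy relates to the confusion matrix, the sketch below uses ten made-up labels. They are for illustration only and are not results from this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels for illustration only.
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()
print(cm)
# [[5 1]
#  [1 3]]

# Accuracy is the share of predictions on the diagonal.
manual_accuracy = (tn + tp) / cm.sum()
print(manual_accuracy)                 # 0.8
print(accuracy_score(y_test, y_pred))  # 0.8
```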

Decision Boundary Visualization: Created contour plots showing decision boundaries for both the training and test sets, with red and green regions representing the two predicted classes.
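The core of the decision-boundary plot is predicting a class for every point on a grid. The sketch below shows only that step, on synthetic data with a hypothetical labelling rule. The grid step sizes (0.5 years, 500 salary units) are assumptions chosen to keep the grid small.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with a hypothetical labelling rule.
rng = np.random.default_rng(0)
age = rng.uniform(18, 60, 300)
salary = rng.uniform(15000, 150000, 300)
X = np.column_stack([age, salary])
y = ((age > 42) | (salary > 90000)).astype(int)

# The model is trained on scaled features.
sc = StandardScaler()
classifier = DecisionTreeClassifier(criterion="entropy", random_state=0)
classifier.fit(sc.fit_transform(X), y)

# Build the grid in original units so the axes stay readable.
ages = np.arange(X[:, 0].min() - 10, X[:, 0].max() + 10, 0.5)
salaries = np.arange(X[:, 1].min() - 1000, X[:, 1].max() + 1000, 500.0)
X1, X2 = np.meshgrid(ages, salaries)

# Scale the grid points before predicting, then reshape back to the grid.
grid_points = np.column_stack([X1.ravel(), X2.ravel()])
Z = classifier.predict(sc.transform(grid_points)).reshape(X1.shape)

print(X1.shape, Z.shape)
print(np.unique(Z))
```

Passing `X1`, `X2`, and `Z` to `plt.contourf` then colours each region by its predicted class.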

📊 Code Snippets & Visualisations:

# Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing Dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Splitting dataset into Training & Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
print(X_train)
print(X_test)
print(y_train)
print(y_test)

# Feature Scaling (fit on the training set only, then reuse for the test set)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
print(X_train)
print(X_test)

# Training Decision Tree classification model
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting Test Results (predicted and actual labels side by side)
y_pred = classifier.predict(X_test)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

# Making Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(cm)
print(accuracy_score(y_test, y_pred))

# Visualising Training set results
# Note: step = 0.25 on the salary axis produces a very large grid, so this can be slow
from matplotlib.colors import ListedColormap
X_set, y_set = sc.inverse_transform(X_train), y_train
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

# Visualising Test set results
X_set, y_set = sc.inverse_transform(X_test), y_test
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.25),
                     np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.25))
plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Decision Tree Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Logistic Regression visualization — **Figure 1** Logistic Regression

Logistic Regression performance metrics — **Table 1** Logistic Regression Metrics

Support Vector Machine visualization — **Figure 2** Support Vector Machine

SVM performance metrics — **Table 2** SVM Metrics

Kernel SVM visualization — **Figure 3** Kernel SVM

Kernel SVM performance metrics — **Table 3** Kernel SVM Metrics

Naive Bayes visualization — **Figure 4** Naive Bayes

Naive Bayes performance metrics — **Table 4** Naive Bayes Metrics

K-Nearest Neighbors visualization — **Figure 5** K-Nearest Neighbors

KNN performance metrics — **Table 5** KNN Metrics

Random Forest visualization — **Figure 6** Random Forest

Random Forest performance metrics — **Table 6** Random Forest Metrics

Decision Tree visualization — **Figure 7** Decision Tree

Decision Tree performance metrics — **Table 7** Decision Tree Metrics

🌟 Key Insights:

Decision Tree outperformed the six other classification models (Logistic Regression, linear SVM, Kernel SVM, Naive Bayes, K-NN, and Random Forest), achieving the highest test accuracy on this age-and-salary prediction task.

Feature scaling kept salary values from dominating the models simply because their magnitude is much larger than that of age values. This matters most for distance- and margin-based algorithms such as K-NN and SVM; tree-based models split on one feature at a time and are largely insensitive to scaling.
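How much a given model depends on scaling can be checked directly by fitting it on raw and on standardized features. The sketch below does this for K-NN and a decision tree. The data is synthetic, and the labelling rule (class depends on age only) is a deliberately extreme assumption chosen to make the effect visible.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; the label depends on age only (hypothetical rule).
rng = np.random.default_rng(0)
age = rng.uniform(18, 60, 400)
salary = rng.uniform(15000, 150000, 400)
X = np.column_stack([age, salary])
y = (age > 40).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
sc = StandardScaler().fit(X_train)
X_train_s, X_test_s = sc.transform(X_train), sc.transform(X_test)


def raw_vs_scaled(make_model):
    """Return test accuracy of a fresh model on raw and on scaled features."""
    raw = make_model().fit(X_train, y_train).score(X_test, y_test)
    scaled = make_model().fit(X_train_s, y_train).score(X_test_s, y_test)
    return raw, scaled


knn_raw, knn_scaled = raw_vs_scaled(lambda: KNeighborsClassifier(n_neighbors=5))
tree_raw, tree_scaled = raw_vs_scaled(
    lambda: DecisionTreeClassifier(criterion="entropy", random_state=0)
)

print(f"K-NN          raw={knn_raw:.3f}  scaled={knn_scaled:.3f}")
print(f"Decision Tree raw={tree_raw:.3f}  scaled={tree_scaled:.3f}")
```

On raw features, K-NN distances are driven almost entirely by salary, so the age signal is drowned out until the features are standardized. The tree's threshold splits are not affected in the same way.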

Visual analysis revealed clear classification patterns: younger, lower-salary individuals and older, higher-salary individuals showed different purchasing behaviors, and the Decision Tree model captured these relationships most effectively.

🧗🏾 Challenge Faced:

At first, the visualisations were hard to understand because the data had been scaled. The age and salary values didn’t look realistic in the plots. I solved this by converting the data back to its original scale before plotting. This made the decision areas easier to read and relate to real-life values.
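The fix described here relies on `StandardScaler.inverse_transform`, which maps scaled values back to their original units. A minimal sketch with made-up rows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up (age, salary) rows for illustration only.
X = np.array([
    [19, 19000.0],
    [35, 20000.0],
    [47, 25000.0],
    [52, 90000.0],
])

sc = StandardScaler()
X_scaled = sc.fit_transform(X)               # unitless values, hard to read on a plot
X_restored = sc.inverse_transform(X_scaled)  # back to years and salary units

print(X_scaled.round(2))
print(np.allclose(X_restored, X))  # True
```

The model still sees scaled inputs; only the coordinates used for plotting are converted back.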
