Analyzed bank customer data to segment customers using K-Means clustering, with dimensionality reduction achieved through PCA. This approach produced 7 customer segments based on key financial behaviors, improving the bank's ability to market tailored products and services to its customers.
Loaded the data with pd.read_csv(), checked for nulls, and reviewed data types using .info() and .describe(). Filled missing values in MINIMUM_PAYMENTS and CREDIT_LIMIT with the column mean. Used pair plots and distribution plots to understand feature distributions and detect outliers. Scaled the features with StandardScaler for optimal clustering. Applied KMeans to group customers.
# Importing the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
# Load the data
credit_card_df = pd.read_csv('/content/4.+Marketing_data.csv')
credit_card_df.info()
# Display descriptive statistics (Table 1)
credit_card_df.describe()
# Comments from analysis:
# - Mean balance is $1564
# - Balance is updated frequently (BALANCE_FREQUENCY averages ~0.9); average purchases are ~$1000 and average one-off purchases are ~$600
# - Average PURCHASES_FREQUENCY is around 0.5; ONEOFF_PURCHASES_FREQUENCY, PURCHASES_INSTALLMENTS_FREQUENCY, and CASH_ADVANCE_FREQUENCY are generally low on average
# - Average credit limit is ~$4500, PRC_FULL_PAYMENT averages 15%, and average tenure is 11 years
# Check for missing data (Table 2)
credit_card_df.isnull().sum()
# Replace missing 'MINIMUM_PAYMENTS' values with the column mean
# (assigning the result back avoids chained in-place fillna, which newer pandas versions warn about)
credit_card_df['MINIMUM_PAYMENTS'] = credit_card_df['MINIMUM_PAYMENTS'].fillna(credit_card_df['MINIMUM_PAYMENTS'].mean())
# Replace missing 'CREDIT_LIMIT' values with the column mean
credit_card_df['CREDIT_LIMIT'] = credit_card_df['CREDIT_LIMIT'].fillna(credit_card_df['CREDIT_LIMIT'].mean())
# Plot to confirm that no missing data remains (Figure 1)
sns.heatmap(credit_card_df.isnull(), yticklabels=False, cbar=False, cmap='Reds')
# Check for duplicate entries
credit_card_df.duplicated().sum()
# Remove Customer ID
credit_card_df.drop('CUST_ID', axis=1, inplace=True)
# Define function to create subplots of distplots with KDE for all columns
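def dist_plots(dataframe, ncols=2):
    # Size the grid from the column count so that no feature is silently skipped
    nrows = int(np.ceil(dataframe.shape[1] / ncols))
    fig, ax = plt.subplots(nrows=nrows, ncols=ncols, figsize=(15, 4 * nrows), squeeze=False)
    for index, axis in enumerate(ax.flatten()):
        if index < dataframe.shape[1]:
            # Note: sns.distplot is deprecated since seaborn 0.11;
            # sns.histplot(..., kde=True) is its replacement
            sns.distplot(dataframe.iloc[:, index], ax=axis,
                         kde_kws={'color': 'blue', 'lw': 3, 'label': 'KDE'},
                         hist_kws={'histtype': 'step', 'lw': 3, 'color': 'green'})
        else:
            axis.set_visible(False)  # hide unused cells in the grid
    plt.tight_layout()
    plt.show()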
# Visualise distplots (Figure 2)
dist_plots(credit_card_df)
# Analysis comments:
# - 'BALANCE_FREQUENCY' is ~1 for most customers, i.e. balances are updated frequently; for 'PURCHASES_FREQUENCY' there are two distinct groups of customers
# - For 'ONEOFF_PURCHASES_FREQUENCY' and 'PURCHASES_INSTALLMENTS_FREQUENCY', most users make one-off or installment purchases infrequently; very few customers pay their balance in full ('PRC_FULL_PAYMENT' ~0)
# - Mean balance is ~$1500, average credit limit is around $4500, and most customers have ~11 years of tenure
# Heatmap to visualise correlations (Figure 3)
correlations = credit_card_df.corr()
plt.figure(figsize=(20, 20))
sns.heatmap(correlations, annot=True)
# Analysis comments:
# - 'PURCHASES' is highly correlated with one-off purchases, installment purchases, purchase transactions, credit limit, and payments
# - Strong positive correlation between 'PURCHASES_FREQUENCY' and 'PURCHASES_INSTALLMENTS_FREQUENCY'
# Note: The following section appears to be for classification, but X_train, X_test, y_train, y_test are not defined
# You may need to add train_test_split and define your features and target variable
# Training Model on the Training set
# classifier = DecisionTreeClassifier(criterion='entropy', random_state=0)
# classifier.fit(X_train, y_train)
# Evaluating using confusion matrix
# y_pred = classifier.predict(X_test)
# cm = confusion_matrix(y_test, y_pred)
# print(cm)
# accuracy_score(y_test, y_pred)
# Display first few rows
credit_card_df.head()
# Apply Feature scaling
scaler = StandardScaler()
credit_card_df_scaled = scaler.fit_transform(credit_card_df)
# Display scaled data
print(credit_card_df_scaled)
# Use Elbow Method to find optimal number of clusters (Figure 4)
wcss = []
for i in range(1, 20):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(credit_card_df_scaled)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 20), wcss, 'bx-')
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('Score (WCSS)')
plt.show()
# Analysis comment:
# The elbow of the curve appears to form at around 4 clusters. However, the WCSS does not settle into a roughly linear decline until about 8 clusters, so we choose 7 clusters.
# Train data using K-Means method with 7 clusters
kmeans = KMeans(n_clusters=7, init='k-means++', random_state=42)
kmeans.fit(credit_card_df_scaled)
labels = kmeans.labels_
# Use Principal Component Analysis to reduce dimensionality
pca = PCA(n_components=2)
principalComp = pca.fit_transform(credit_card_df_scaled)
print(principalComp)
# Create a dataframe with the two components
pca_df = pd.DataFrame(data=principalComp, columns=['PCA1', 'PCA2'])
print(pca_df)
# Concatenate the clusters labels to the dataframe (Table 3)
pca_df = pd.concat([pca_df, pd.DataFrame({'cluster': labels})], axis=1)
pca_df.head()
# Visualise Clusters (Figure 5)
plt.figure(figsize=(10, 10))
ax = sns.scatterplot(x='PCA1', y='PCA2', hue='cluster', data=pca_df, palette='tab10')
plt.title('K-Means clusters projected onto the first two principal components')
plt.show()
Initial visualisations of K-Means clusters were ambiguous due to the high dimensionality of features. Reducing dimensions with PCA made it easier to see meaningful separation, but it required balancing between retaining variance and simplifying complexity. I resolved this by examining explained variance ratios and adjusting the number of components accordingly.
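
The explained-variance check described above can be sketched as follows. This is a minimal illustration rather than the exact procedure used in the project: the helper name `components_for_variance`, the 0.90 threshold, and the synthetic data (standing in for the scaled customer features, which are not reproduced here) are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_for_variance(scaled_data, threshold=0.90):
    """Hypothetical helper: return the smallest number of principal components
    whose cumulative explained variance reaches `threshold`, together with the
    full cumulative explained-variance curve."""
    pca = PCA()  # keep every component so the whole variance profile is visible
    pca.fit(scaled_data)
    cumulative = np.cumsum(pca.explained_variance_ratio_)
    # searchsorted returns the first index where cumulative >= threshold
    n_components = int(np.searchsorted(cumulative, threshold)) + 1
    # Clamp in case floating-point error keeps the total just below the threshold
    return min(n_components, len(cumulative)), cumulative


# Synthetic stand-in data (assumption): 10 features driven by 3 latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
synthetic = latent @ mixing + 0.1 * rng.normal(size=(500, 10))
scaled = StandardScaler().fit_transform(synthetic)

n_components, cumulative = components_for_variance(scaled, threshold=0.90)
print(f"Components needed for 90% variance: {n_components}")
print(np.round(cumulative, 3))
```

With the real data, passing `credit_card_df_scaled` in place of `scaled` would show how much variance the two plotted components retain. A low cumulative ratio at two components would suggest that the 2-D scatter plot is best read as a visual aid rather than a faithful picture of cluster separation.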