This project implemented and compared five different regression algorithms to predict employee salaries based on position levels within an organization. I built a comprehensive machine learning pipeline comparing linear regression, polynomial regression, support vector regression (SVR), decision tree regression, and random forest regression to identify the optimal model for HR compensation analysis.
pd.read_csv() and extracted features using iloc[:, 1:-1] (position levels) and target variable using iloc[:, -1] (salaries), strategically excluding the first column containing position titles. LinearRegression() model using .fit(X, y) to establish a baseline for salary prediction based on position level with a straight-line relationship. PolynomialFeatures(degree=4) to transform the single position level feature into polynomial terms (x, x², x³, x⁴), creating a richer feature space to capture non-linear salary progression patterns.StandardScaler()for both X and y variables, then trained an SVR model with RBF kernel to handle non-linear relationships while managing the different scales between position levels and salary amounts.DecisionTreeRegressor() model that creates hierarchical decision rules to predict salaries, capturing complex non-linear patterns without requiring feature scaling.RandomForestRegressor() with multiple decision trees to reduce overfitting and improve prediction stability through ensemble learning.PolynomialFeatures(degree=4) to transform the single position level feature into polynomial terms (x, x², x³, x⁴), creating a richer feature space to capture non-linear salary progression patterns.RandomForestRegressor() with multiple decision trees to reduce overfitting and improve prediction stability through ensemble learning.sc_X.inverse_transform() and sc_y.inverse_transform() to convert scaled predictions back to original salary units, with proper reshaping using .reshape(-1, 1) for visualization.plt.scatter() for actual data points and plt.plot() for the linear regression line, showing the limitation of straight-line salary prediction. Generated similar visualizations for the polynomial model using lin_reg_2.predict(poly_reg.fit_transform(X)) to display the curved relationship between position levels and salaries.lin_reg.predict([[6.5]]) and lin_reg_2.predict(poly_reg.fit_transform([[6.5]])) to compare model outputs for intermediate position levels.# Importing Libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing dataset
dataset = pd.read_csv('Position_Salaries.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
# Training the Decision Tree Regression
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)
# Predicting New Result
regressor.predict([[6.5]])
# Visualising Results (Figure 1)
X_grid = np.arange(min(X), max(X), 0.01)
# 0.01 was adjusted from 0.1 to increase the resolution
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('HR Salary Predictions (Decision Tree Regression)')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
# Evaluating the Model Performance
from sklearn.metrics import r2_score
# Since the model was trained on the whole dataset, we evaluate on the whole dataset
y_pred = regressor.predict(X)
r2_score(y, y_pred)
The SVR model visualization presented scaling complications because support vector regression requires feature scaling for optimal performance, but the visualization needed to display results in original salary units. The challenge was handling the forward and inverse transformations correctly. This was resolved by implementing a multi-step process: using sc_X.transform(X_grid) to scale the grid for SVR prediction, then applying sc_y.inverse_transform() to convert predictions back to actual salary values, with careful attention to array reshaping using .reshape(-1, 1) to maintain proper dimensionality throughout the scaling pipeline. This approach ensured accurate model performance while maintaining interpretable visualizations in original salary units.