This project built a multiple linear regression model to predict startup profitability based on their R&D spending, administration costs, marketing expenditure, and location. I implemented a complete machine learning pipeline using scikit-learn to analyze which factors most strongly influence startup success and revenue generation.
pd.read_csv() and strategically separated features (X) from the target variable (y) using .iloc[:, -1] for all columns except the last, and .iloc[:, -1] for the dependent variable (profit). ColumnTransformer and OneHotEncoder() to convert the categorical 'State' variable (column index [3]) into numerical dummy variables, while keeping other numerical features intact using remainder='passthrough'. np.array(ct.fit_transform(X)) to convert the transformed data back into a NumPy array format suitable for machine learning algorithms. Implemented train_test_split() with an 80-20 split (test_size=0.2) and fixed random state (random_state=0) to ensure reproducible results and proper model validation. Instantiated and trained a LinearRegression() model using .fit(X_train, y_train) to learn the relationships between startup characteristics and profitability. Generated predictions on the test set using regressor.predict(X_test) to evaluate model performance on unseen data.np.set_printoptions(precision=2) for clean output formatting and np.concatenate() with reshape() to create side-by-side comparison of predicted vs. actual values for easy performance assessment.# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# OneHotEncoder(), [3] - the 3 is the column you want to encode
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
# Splitting Train and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting Results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
# Evaluating the Model Performance
from sklearn.metrics import r2_score
r2_score(y_test, y_pred)
The array reshaping and concatenation for results display presented a significant hurdle because the predicted and actual values were 1D arrays that couldn't be directly concatenated horizontally. The error occurred when trying to use np.concatenate() without proper dimensionality. This was solved by using reshape(len(y_pred),1) to convert both arrays into column vectors (2D arrays with one column), then applying horizontal concatenation with the parameter 1 to stack them side-by-side. This approach created a clean comparison matrix showing predicted values next to actual values, making model performance evaluation much more intuitive.