Prediction using Supervised Machine Learning (ML) with Python Implementation

Linear Regression

Tejas Satish Navalkhe
Dec 30, 2023

Introduction

Linear regression is a statistical method used for modelling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. The simplest form of linear regression involves two variables: a dependent variable (often denoted as y) and an independent variable (often denoted as x).

Linear Regression — The independent variable is represented on the X-axis, while the dependent variable is depicted on the Y-axis. The regression line serves as the optimal fit for the model, and our primary goal within this algorithm is to determine and establish this best-fit line.
Image Courtesy: Analytics Vidhya

The linear regression equation is represented as:

y = mx + b

where:
- y is the dependent variable,
- x is the independent variable,
- m is the slope of the line (representing the change in y concerning a unit change in x),
- b is the y-intercept (the value of y when x is 0).
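
As a quick numeric check of the equation (with m, b, and x chosen purely for illustration):

# A minimal check of y = mx + b (illustrative values only)
m, b = 2, 1    # slope and intercept
x = 3          # independent variable
y = m * x + b  # dependent variable: 2*3 + 1 = 7
print(y)       # 7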

In multiple linear regression, there is more than one independent variable, and the equation becomes:

y = m₁x₁ + m₂x₂ + … + b
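
A minimal sketch of multiple linear regression (toy numbers constructed so that y = 2x₁ + 3x₂ holds exactly, purely for illustration):

# Multiple linear regression on toy data (illustrative values only)
import numpy as np
from sklearn.linear_model import LinearRegression

X_toy = np.array([[1, 2], [2, 1], [3, 4], [4, 3], [5, 5]])  # two independent variables x1, x2
y_toy = np.array([8, 7, 18, 17, 25])                        # y = 2*x1 + 3*x2
model = LinearRegression().fit(X_toy, y_toy)
print(model.coef_, model.intercept_)  # approximately [2. 3.] and 0.0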

The goal of linear regression is to find the best-fitting line that minimizes the sum of the squared differences between the observed values and the values predicted by the linear equation. This process is often referred to as “fitting the line to the data” or “regression line fitting.”

Linear regression is commonly used for predicting the value of the dependent variable based on the values of one or more independent variables. Notably, linear regression assumes a linear relationship between the variables, and the method may not be appropriate if the relationship is non-linear.

There are different variations of linear regression, including simple linear regression (with one independent variable) and multiple linear regression (with two or more independent variables). The coefficients (slope and intercept) are estimated using statistical techniques, such as the method of least squares, to find the line that best fits the data.
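
To make least-squares fitting concrete, here is a minimal sketch using NumPy's polyfit on toy data (not the article's dataset):

# Least-squares fit of a straight line to toy data (illustrative values only)
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
m, b = np.polyfit(x, y, deg=1)  # minimizes the sum of squared residuals
print(f"slope m = {m:.3f}, intercept b = {b:.3f}")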

Python Implementation

In total, there are ten steps to address this linear regression machine learning problem, preceded by the problem statement (Step 0), outlined as follows.

Step 0. Problem Statement:

Predict the percentage of marks a student is likely to achieve, given the number of hours devoted to studying.

Step 1. Import required libraries:

Importing all the essential libraries at the beginning of the project is an important first step.

# Importing the required libraries
import numpy as np                                     # working with arrays
import pandas as pd                                    # data analysis
import matplotlib.pyplot as plt                        # plotting library
from sklearn.model_selection import train_test_split  # split data into train and test sets
from sklearn.linear_model import LinearRegression     # Linear Regression model
from sklearn import metrics                            # evaluation of the model

Step 2. Load the data

# Reading data from a remote link
url = r"https://raw.githubusercontent.com/AdiPersonalWorks/Random/master/student_scores%20-%20student_scores.csv"
df = pd.read_csv(url)  # load the CSV into a pandas DataFrame
print("Data:")
df.head(10)  # first 10 rows of the data

Step 3. Explore the data

We always examine the data immediately after loading it, before conducting further analysis or building machine learning models.

df.shape # Get the number of rows and columns

Upon running the code above, it becomes apparent that there are 25 rows and 2 columns.

df.info()    # Get the information of the dataframe (data).

Based on the provided output, the ‘Hours’ column is of type float64, the ‘Scores’ column is of type int64, and the total memory required to load this data was 528.0 bytes.

df.describe()  # generates descriptive statistics of the DataFrame

The presented output displays the count, mean, standard deviation (std), minimum, maximum, and quantiles of the data.

Step 4. Data Visualization

Data visualization helps us gain a visual understanding of our data. The first and most important plot here is a scatter plot of the scores against the hours studied.

# Plotting the scores against hours studied
df.plot(x='Hours', y='Scores', style='o')
plt.title('Hours vs Percentage')
plt.xlabel('Hours Studied')
plt.ylabel('Percentage Score')
plt.show()

Based on the graph, we can infer a positive linear correlation between hours studied and percentage score: as the number of hours studied increases, the percentage score rises.
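
To back this visual impression with a number, one option (an addition to the original walkthrough) is the Pearson correlation coefficient between the two columns:

# Quantifying the linear relationship; values near +1 indicate a strong positive linear correlation
print(round(df['Hours'].corr(df['Scores']), 3))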

Step 5. Data Preprocessing

In this stage, we partition our data into attributes and labels, a crucial step for training the model.

X = df.iloc[:, :-1].values  # attributes (input)
y = df.iloc[:, 1].values    # labels (output)
print(X[:3], '\n\n', y)     # print only the first 3 rows of X, followed by y

Upon running this snippet, we obtain a numpy.ndarray of attributes: the first three rows of X are printed, followed by all of y.

Step 6. Model Training

Model training begins by splitting the data into training and testing sets, after which the training algorithm is applied.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)  # split data into training and testing sets
regressor = LinearRegression()
regressor.fit(X_train, y_train)  # X_train is already 2-D, so no reshape is needed

print("Training complete.")
print(f"Intercept : {round(regressor.intercept_, 3)}")
print(f"Coefficients : {round(regressor.coef_[0], 3)}")

Step 7. Plotting the Line of Regression

Now that our model is trained, it's time to visualize the line of regression (LOR), i.e. the best-fit line.

# Plotting the regression line
line = regressor.coef_ * X + regressor.intercept_  # predicted y for every x

# Plotting over the full dataset
plt.scatter(X, y, color='blue')
plt.plot(X, line, color='red', linewidth=2)
plt.show()
Line Of Regression (LOR)

Step 8. Predictions

Having trained our algorithm, it is now time to assess the model through predictions. To achieve this, we will utilize our testing dataset.

# Testing data
print(X_test)
# Model Prediction
y_pred = regressor.predict(X_test)

Step 9. Comparing Actual Outcomes with Predicted Results

# Comparing actual vs predicted values
dff = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(dff)

# Estimating the training and test scores (R² of the fit)
print("Training Score:", round(regressor.score(X_train, y_train), 3))
print("Test Score:", round(regressor.score(X_test, y_test), 3))

Creating a bar graph to visually depict the difference between the actual and predicted values.

dff.plot(kind='bar', figsize=(10, 10))
plt.grid(which='major', linewidth=0.5, color='red')
plt.grid(which='minor', linewidth=0.5, color='blue')
plt.show()

# Testing the model with our own data
hours = 9.25
test = np.array([hours]).reshape(-1, 1)  # the model expects a 2-D input
own_pred = regressor.predict(test)
print(f"No. of Hours = {hours}")
print(f"Predicted Score = {round(own_pred[0], 3)}")
print(f"If you study {hours} hrs/day, your predicted score is {round(own_pred[0], 3)}")

Step 10. Evaluation of our Model

The concluding phase is assessing the algorithm's performance, which is important when comparing how well different algorithms work on a particular dataset. Several error metrics are computed in this stage to gauge the model's performance.

print('The Mean Absolute Error:', round(metrics.mean_absolute_error(y_test, y_pred), 3)) 
print('The Mean Squared Error:', round(metrics.mean_squared_error(y_test, y_pred), 3))
print('The Root Mean Squared Error:', round(np.sqrt(metrics.mean_squared_error(y_test, y_pred)), 3))
print('R² Score:', round(metrics.r2_score(y_test, y_pred), 3))
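
For intuition, the same metrics can be computed by hand from the residuals; the following short sketch is equivalent to the metrics calls above:

# Computing the same metrics manually from the residuals
errors = y_test - y_pred
print('MAE :', round(np.mean(np.abs(errors)), 3))        # mean absolute error
print('MSE :', round(np.mean(errors ** 2), 3))           # mean squared error
print('RMSE:', round(np.sqrt(np.mean(errors ** 2)), 3))  # root mean squared error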

Thank you…

Tejas Satish Navalkhe

LinkedIn — @tejasnavalkhe
