Linear regression in Python


26 December 2018, by Alvaro

In this article I will talk about linear regression, probably the simplest supervised learning algorithm within the machine learning paradigm. As we discussed in a previous post, there are four types of learning: supervised, unsupervised, semi-supervised and reinforcement learning.

Linear regression belongs to the supervised algorithms and is a useful tool for predicting a quantitative response.

Many later algorithms can be seen as generalizations or extensions of linear regression, so it is important to understand this algorithm well, even though it may seem simple.

In a broad sense, what linear regression does is obtain the relationship between some independent variables (\( X \)) and a dependent variable (\( Y \)). That is, given a set of predictor variables, it obtains their relationship with a quantitative variable to be predicted. Linear regression explains the variable \( Y \) with the variables \( X \), finding the linear function that best fits or explains this relationship.

Simple linear regression

Introduction to the model


Simple linear regression starts from a single predictor variable, i.e. \(X = x_1\) with \(X \in \mathbb{R}\), and assumes that there is an approximately linear relationship between \(X\) and \(Y\). This linear relationship can be written as:
$$ Y = \underbrace {\beta_0+\beta_{1}X}_{model}+\underbrace{\epsilon}_{error}$$
That is, \(Y\) is equal to \(X\) multiplied by a coefficient, plus an intercept coefficient and an error term. The part \(\beta_0+\beta_{1}X\) is the linear regression model, where \(\beta_0\) and \(\beta_{1}\) are the coefficients of the regression and \(\epsilon\) is the error made by the model.

The terms \(\beta_0\) and \(\beta_1\) respectively represent the intercept and the slope of the linear model. They are what are called the coefficients or parameters of the linear regression model. These two coefficients are constants obtained through what is called "training" the regression model. This training is done with data for which we know the real values of the pair \((X, y)\). Once the estimates of these two coefficients are obtained, the variable \(Y\) can be predicted from a known \(X\) using the following formula:
$$\hat{y}=\hat{\beta_0}+\hat{\beta_1}x$$
Here \(\hat{y}\) is the prediction we get, and \(\hat{\beta_0}\) and \(\hat{\beta_1}\) are the estimates obtained with training.

If the error term \(\epsilon\) is low for this sample, our prediction will be approximately equal to the real value: $$\hat{y}\approx y$$

Therefore, training and fitting the simple regression model consists of obtaining the values of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) that best fit the model, that is, that best predict \(Y\) as a function of \(X\).
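
As a trivial numeric illustration of the prediction formula above (with made-up coefficient values, not taken from any dataset in this article):

# Hypothetical trained coefficients, only to illustrate y_hat = b0 + b1 * x
beta_0_hat = 2.0
beta_1_hat = 10.0
x_new = 5.0
y_hat = beta_0_hat + beta_1_hat * x_new
print(y_hat)  # 52.0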

Estimation of the coefficients


As we said before, the coefficients are estimated using known real values of the pair \( (X, y)\). Suppose we have \( n \) records for which both the independent and the dependent variable are known:

$$(x_1,y_1),(x_2,y_2),…,(x_n,y_n)$$

With these n observations, the objective is to obtain the estimates of the coefficients \(\hat{\beta_0}\) and \(\hat{\beta_1}\) that make the linear model fit the data as well as possible. That is, that \(y_i\approx\hat{\beta_0}+\hat{\beta_1}x_i\) for all \(x_i\), or equivalently, to find the slope and intercept of the line that passes closest to the points \((x_i, y_i)\).

This closeness between the line and the points can be measured in different ways, the most common being the least squares criterion.

Estimation by least squares

With \(\hat{y_i} = \hat{\beta_0} + \hat{\beta_1} x_i \) being the prediction of \(y_i\), the residual, or difference between the prediction and the real value, is:
$$r_i=y_i-\hat{y_i}$$

The least squares method will obtain the coefficients that minimize the sum of squared residuals, that is, that minimize:

$$RSS=\sum_{i=1}^{n}{r_i}^2$$

Taking partial derivatives and setting them equal to 0, we obtain:
$$\left\{\begin{array}{l}\frac{\partial{RSS}}{\partial{\hat{\beta_1}}} = 0 \longrightarrow \hat{\beta_1} = \frac{\sum^{n}_{i=1}(x_i - \bar{x})(y_i - \bar{y})}{\sum^{n}_{i=1}(x_i - \bar{x})^2}\\\\
\frac{\partial{RSS}}{\partial{\hat{\beta_0}}} = 0 \longrightarrow \hat{\beta_0} = \bar{y} - \hat{\beta_1}\bar{x}\end{array}\right.$$

Here \(\bar{x}\) and \(\bar{y}\) are the arithmetic means of the samples.
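
These two formulas translate directly into NumPy. The following is a minimal sketch with made-up data (not the dataset generated in the example below), computing \(\hat{\beta_0}\) and \(\hat{\beta_1}\) from the closed-form expressions above:

import numpy as np

# Hypothetical (x, y) sample, only for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

x_mean, y_mean = x.mean(), y.mean()
# Closed-form least squares estimates
beta_1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta_0 = y_mean - beta_1 * x_mean
print(beta_0, beta_1)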

Example of simple linear regression using Python


First I create a test dataset, to which I will fit a linear regression model.


import numpy as np
import random
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
%matplotlib inline

# Generator of the data distribution for simple linear regression
def simple_data_generator(beta, samples, deviation):
    # I generate n (samples) random x values between 0 and 100
    x = np.random.random(samples) * 100
    # I generate a random Gaussian error with standard deviation (deviation)
    e = np.random.randn(samples) * deviation
    # I get the real y as x * beta + error
    y = x * beta + e
    return x.reshape((samples, 1)), y.reshape((samples, 1))

# Parameters of the distribution
deviation = 200
beta = 10
n = 50
x, y = simple_data_generator(beta, n, deviation)

# I plot the generated data
plt.scatter(x, y)
plt.show()


Generation of random data


# I create a linear regression model
model = linear_model.LinearRegression()

# I train the model with the data (X, Y)
model.fit(x, y)
# Now I can get the coefficient b_1
print('Coefficient beta1:', model.coef_[0])

# We can predict using the model
y_pred = model.predict(x)

# Finally, we compute the mean squared error and the R^2 statistic
print('Mean squared error: %.2f' % mean_squared_error(y, y_pred))
print('R^2 statistic: %.2f' % r2_score(y, y_pred))

Coefficient beta1: [9.89500401]
Mean squared error: 22140.95
R^2 statistic: 0.79

# We plot the fit (red) and the line Y = beta * x (green)
plt.scatter(x, y)
plt.plot(x, y_pred, color='red')
x_real = np.array([0, 100])
y_real = x_real * beta
plt.plot(x_real, y_real, color='green')
plt.show()


Adjustment of the regression lines

Multiple linear regression

Introduction to the model


In practice, we usually have n predictor variables, not just 1. We could do n simple linear regressions, but each of them would ignore the other n-1 variables and we would not take advantage of the relationships between variables.

Simple linear regression can be extended to the case of n variables as follows:

$$Y=\underbrace{\beta_0+\beta_{1}X_1+\beta_{2}X_2+…+\beta_{n}X_n}_{model}+\underbrace{\epsilon}_{error}$$

In this case, we assign to each predictor variable a coefficient that marks its slope. These coefficients \(\beta_i\) describe the linear relationship between \( Y \) and the variable \( X_i \); each one can be interpreted as the average effect on \(Y\) of a one-unit increase in \( X_i \), keeping the rest of the predictor variables fixed.

Estimation of the coefficients


Analogously to simple linear regression, we can obtain estimates \(\hat{\beta_i}\) of the coefficients \(\beta_i\) by training, and then make predictions of Y using the formula:

$$\hat{y}= \hat{\beta_0}+\hat{\beta_1}x_1+\hat{\beta_2}x_2+…+\hat{\beta_n}x_n$$

These parameters are calculated using a measure of closeness such as least squares: we compute the \(\hat{\beta_i}\) that minimize the sum of squared residuals:

$$RSS=\sum_{i=1}^{n}{r_i}^2 = \sum_{i=1}^{n}({y_i-\hat{y_i}})^2 = \sum_{i=1}^{n}\left(y_i-\hat{\beta_0}-\hat{\beta_1}x_{i1}-\hat{\beta_2}x_{i2}-…-\hat{\beta_n}x_{in}\right)^2$$
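
In practice, these estimates are usually obtained by solving the corresponding least squares problem rather than deriving each coefficient by hand. A minimal sketch with made-up data (not the dataset generated below), using NumPy's np.linalg.lstsq, which minimizes exactly this RSS:

import numpy as np

# Hypothetical data: 100 samples, 2 predictors (only for illustration)
X = np.random.random_sample((100, 2)) * 100
y = 10 * X[:, 0] + 5 * X[:, 1] + np.random.randn(100) * 50

# Add a column of ones so that the intercept beta_0 is also estimated
X_design = np.column_stack([np.ones(len(X)), X])
# Least squares solution: minimizes the sum of squared residuals (RSS)
beta_hat, rss, rank, _ = np.linalg.lstsq(X_design, y, rcond=None)
print(beta_hat)  # [beta_0_hat, beta_1_hat, beta_2_hat]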


import numpy as np
import random
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D

# Generator of the data distribution for multiple linear regression
def multiple_data_generator(coefficients, samples, std_dev):
    # We compute the number of predictors and create a matrix
    # of coefficients with p rows and 1 column for
    # matrix multiplication
    n_coefficients = len(coefficients)
    coef_matrix = np.array(coefficients).reshape(n_coefficients, 1)
    # Same as in the case of simple linear regression
    x = np.random.random_sample((samples, n_coefficients)) * 100
    epsilon = np.random.randn(samples) * std_dev
    # Since x is a samples x n_coefficients matrix and
    # coef_matrix is n_coefficients x 1, we can use matrix
    # multiplication to obtain y given x1, x2, ..., xn.
    # We take the transpose to get a 1 x samples array
    # instead of samples x 1 for use in the regression
    y = np.matmul(x, coef_matrix).transpose() + epsilon
    return x, y

# I generate the data that I will fit with the regression
coefficients_real = [10, 5]
samples = 200
std_dev = 100

X, Y = multiple_data_generator(coefficients_real, samples, std_dev)

# I create a linear regression model
model = linear_model.LinearRegression()

# I train the model with the data (X, Y)
model.fit(X, Y.transpose())
# Now I can get the coefficients b_1, b_2
print('Coefficients beta1, beta2:', model.coef_[0])

# We can predict using the model
y_pred = model.predict(X)

# Finally, we compute the mean squared error and the R^2 statistic
print('Mean squared error: %.2f'
      % mean_squared_error(Y.transpose(), y_pred))
print('R^2 statistic: %.2f' % r2_score(Y.transpose(), y_pred))

Coefficients beta1, beta2: [9.73631401 4.9061911]
Mean squared error: 8628.75
R^2 statistic: 0.92

f, [p1, p2] = plt.subplots(1, 2)
f.set_figwidth(15)

# I plot the points of variable X1 against Y
p1.scatter(X[:, 0], Y)
p1.set_title('Representation of Y as a function of X1')
p1.set_xlabel('X1')
p1.set_ylabel('Y')

# I plot the points of variable X2 against Y
p2.scatter(X[:, 1], Y)
p2.set_title('Representation of Y as a function of X2')
p2.set_xlabel('X2')
p2.set_ylabel('Y')

Generation of random data with 2 independent variables and 1 dependent variable

# I plot the surface that best fits the data
p3 = plt.figure(figsize=(15, 10)).add_subplot(111, projection='3d')
x1, x2 = np.meshgrid(range(100), range(100))
# Surface obtained with multiple linear regression
z_model = model.coef_[0][0] * x1 + model.coef_[0][1] * x2
# Real surface of the data
z_real = coefficients_real[0] * x1 + coefficients_real[1] * x2
# I plot both surfaces
p3.plot_surface(x1, x2, z_model, alpha=0.3, color='green')
p3.plot_surface(x1, x2, z_real, alpha=0.3, color='yellow')
# I also plot the data to see the fit
p3.scatter(X[:, 0], X[:, 1], Y)
p3.set_title('Linear regression using two variables (X1, X2) to predict Y')
p3.set_xlabel('X1 axis')
p3.set_ylabel('X2 axis')
p3.set_zlabel('Y axis')
p3.view_init(10)
plt.show()

Adjustment of the two planes to the data

Evaluation of model accuracy

The quality of the linear regression fit is typically evaluated using two measures: the RSE (residual standard error) and the R squared statistic.

RSE


The RSE is an estimate of the standard deviation of the error \(\epsilon\). In other words, it is the average amount by which the response deviates from the real regression line. It is obtained with the following formula:
$$RSE = \sqrt{\frac{RSS}{(n - 2)}} = \sqrt{\frac{\sum_{i=1}^{n}{(y_i-\hat{y_i})}^2}{(n - 2)}}$$

Whether its value is good or bad depends on the context, since it is not dimensionless. If the values of Y are very large, we can obtain huge RSE values that are nevertheless relatively good, since the RSE depends on the scale of the Y's. In any case, lower RSE values indicate a better fit.
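
As an illustration, the RSE of the simple regression example above could be computed directly from the residuals; a sketch, assuming y and y_pred are the arrays from that example (n = 50):

import numpy as np

# RSE = sqrt(RSS / (n - 2)), with RSS the sum of squared residuals
rss = np.sum((y - y_pred) ** 2)
n = len(y)
rse = np.sqrt(rss / (n - 2))
print('RSE: %.2f' % rse)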

Coefficient of determination (R squared)


Since the RSE is an absolute measure expressed in the units of Y, it is not always clear what constitutes a good RSE. The \(R^2\), or coefficient of determination, however, takes values between 0 and 1, represents the proportion of variance explained, and is independent of the scale of Y.

To calculate the \(R^2\) the following formula is used:

$$R^2 = \frac{TSS-RSS}{TSS} = 1 - \frac{RSS}{TSS}$$

Here TSS (total sum of squares) is the total sum of squared deviations from the mean. It measures the total variance of the response Y, and is calculated as:

$$TSS = \sum_{i=1}^{n}({y_{i} - \bar{y}})^2$$

RSS measures the variability that remains unexplained by the model, while TSS measures the inherent variability of Y before modeling. That is, \(R^2\) measures the proportion of the variability of Y that is explained by the model. Put another way, it measures how much our model improves, in terms of squared error, on a model that simply predicts the mean for every observation, \(\hat{y_i}=\bar{y}\).
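
A minimal sketch of this calculation, again assuming y and y_pred are the arrays from the simple regression example; the result should match the value returned by r2_score(y, y_pred):

import numpy as np

# R^2 = 1 - RSS / TSS
rss = np.sum((y - y_pred) ** 2)
tss = np.sum((y - np.mean(y)) ** 2)
r2 = 1 - rss / tss
print('R^2: %.2f' % r2)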

A value of \(R^2\) close to 1 indicates that a large proportion of the variability of Y is explained by the model. However, as with the RSE, what makes an \(R^2\) good or not depends on the application, since there are problems where the residual error is inherently small and others where it is not.

The statistic \(R^2\) is a measure of the linear relationship between X and Y. So is the correlation, whose formula is:

$$r = Corr(X,Y) = \frac{\sum_{i=1}^{n}({x_{i} - \bar{x}})({y_{i} - \bar{y}})}{\sqrt{\sum_{i=1}^{n}({x_{i} - \bar{x}})^2}\sqrt{\sum_{i=1}^{n}({y_{i} - \bar{y}})^2}}$$

In fact, in the case of simple linear regression, it is true that \(r^2 = R^2\). This does not happen in the case of multiple linear regression, since the correlation relates only to a pair of variables. However, the coefficient \(R^2\) is still valid for multivariable cases.
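
This relationship can be checked numerically; a sketch, assuming x, y and y_pred are the arrays from the simple regression example above:

import numpy as np

# Pearson correlation between x and y (np.corrcoef returns the correlation matrix)
r = np.corrcoef(x.ravel(), y.ravel())[0, 1]
# R^2 of the simple regression fit
r2 = 1 - np.sum((y - y_pred) ** 2) / np.sum((y - np.mean(y)) ** 2)
print(r ** 2, r2)  # the two values should match up to floating-point error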
