Linear Regression

Today we’ll talk about linear regression. We use linear regression when we want to predict future outcomes.

Linear regression basically assumes that we have data $(x_{ i }, y_{ i })$ for $i=1,\dots,n$. Furthermore $x_{ i }$ is not random and is called the predictor variable. $y_{ i }$, on the other hand, is a function of $x_{ i }$ plus some random noise and is called the response variable.

Using least squares to fit a line

Suppose we have data $(x_{ i }, y_{ i })$ and we want to find the best-fitting line $y=\beta_{ 0 }+\beta_{ 1 }x$.

We model the line with $y_{ i }=\beta_{ 0 }+\beta_{ 1 }x_{ i }+\epsilon_{ i }$, where $\epsilon_{ i }$ is the random noise. The noise term is then $\epsilon_{ i }=y_{ i }-\beta_{ 0 }-\beta_{ 1 }x_{ i }$. We want the best-fitting line, so we want to minimise our error. Because the error terms can be negative or positive, we square them. Least squares therefore finds the estimates $\hat{ \beta }_{ 0 }, \hat{ \beta }_{ 1 }$ of $\beta_{ 0 }, \beta_{ 1 }$ that minimise the sum of squared errors.

$S(\beta_{ 0 }, \beta_{ 1 })=\sum{ \epsilon_{ i }^{ 2 } }=\sum{ (y_{ i }-\beta_{ 0 }-\beta_{ 1 }x_{ i })^{ 2 } }$

Doing some calculus we get: $\hat{ \beta }_{ 1 }=\frac{ s_{ xy } }{ s_{ xx } }$ and $\hat{ \beta }_{ 0 }=\overline{ y }-\hat{ \beta }_{ 1 }\overline{ x }$

where:

• $\overline{ x }=\frac{ 1 }{ n }\sum{ x_{ i } }$
• $\overline{ y } =\frac{ 1 }{ n }\sum{ y_{ i } }$
• $s_{ xx }=\frac{ 1 }{ n-1 }\sum{ (x_{ i }-\overline{ x })^{ 2 } }$; The sample variance of x.
• $s_{ xy }=\frac{ 1 }{ n-1 }\sum{ (x_{ i }-\overline{ x })(y_{ i }-\overline{ y }) }$; The sample covariance of x and y.
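For completeness, here is a sketch of the omitted calculus: we set the partial derivatives of $S(\beta_{ 0 }, \beta_{ 1 })$ to zero.

$\frac{ \partial S }{ \partial \beta_{ 0 } }=-2\sum{ (y_{ i }-\beta_{ 0 }-\beta_{ 1 }x_{ i }) }=0$ and $\frac{ \partial S }{ \partial \beta_{ 1 } }=-2\sum{ x_{ i }(y_{ i }-\beta_{ 0 }-\beta_{ 1 }x_{ i }) }=0$

The first equation gives $\beta_{ 0 }=\overline{ y }-\beta_{ 1 }\overline{ x }$. Substituting this into the second and rearranging yields $\beta_{ 1 }=\frac{ \sum{ (x_{ i }-\overline{ x })(y_{ i }-\overline{ y }) } }{ \sum{ (x_{ i }-\overline{ x })^{ 2 } } }=\frac{ s_{ xy } }{ s_{ xx } }$, since the factors $\frac{ 1 }{ n-1 }$ cancel.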

We call the fitting of a line to given data simple linear regression. Furthermore we call the estimated error $\hat{ \epsilon }_{ i }=y_{ i }-\hat{ \beta }_{ 0 }-\hat{ \beta }_{ 1 }x_{ i }$ the residual.

Homoscedasticity means that $\epsilon_{ i }$ has the same variance for all $i$.

Heteroscedasticity means that the variance of $\epsilon_{ i }$ is not constant, for example increasing as $x$ increases. If that is the case we have to transform the data.
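As a small sketch with made-up data: if the noise is multiplicative, the spread of $y$ grows with $x$, and taking logarithms turns it into additive noise with constant variance.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical heteroscedastic data: multiplicative noise,
# so the spread of y grows with x.
xs = np.linspace(1, 50, 100)
ys = 2.0 * xs * np.exp(rng.normal(0.0, 0.1, size=xs.size))

# A log transform turns the multiplicative noise into additive,
# constant-variance noise: log y = log 2 + log x + eps.
slope, intercept = np.polyfit(np.log(xs), np.log(ys), 1)
print(slope, intercept)  # slope close to 1, intercept close to log 2
```

After the transform, ordinary least squares on the log scale is appropriate again.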

We can visualise simple linear regression with a few lines of Python code. Note that the error here is not normally distributed, but the regression still works.

```python
import random

import matplotlib.pyplot as plt
import numpy as np
from matplotlib import style

style.use('ggplot')
fig = plt.figure()
ax1 = plt.subplot2grid((1, 1), (0, 0))

def simple_regression(xs, ys):
    """Return the least-squares estimates (alpha, beta) for y = alpha + beta*x."""
    if len(xs) != len(ys):
        raise ValueError("x and y are not of the same length")

    x_bar = sum(xs) / float(len(xs))
    y_bar = sum(ys) / float(len(ys))

    # Sample variance of x and sample covariance of x and y
    sxx = sum((x - x_bar)**2 for x in xs) / (len(xs) - 1)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

    beta = sxy / sxx
    alpha = y_bar - beta * x_bar

    return alpha, beta

xs = np.arange(0.1, 5, 0.3)
# Uniform (not normal) noise around the line y = 1.5x + 0.5
ys = [1.5 * x + 0.5 + random.uniform(-1, 1) for x in xs]

alpha, beta = simple_regression(xs, ys)
print('Alpha: {}\nBeta: {}'.format(alpha, beta))

ax1.scatter(xs, ys, color='blue', label='Data')
xs = np.arange(0, 6, 0.01)
ys = [beta * x + alpha for x in xs]
ax1.plot(xs, ys, color='red', label='Regression Line')
plt.legend()
plt.show()
```



The Python code above prints the estimates and produces a scatter plot of the data together with the fitted regression line.

Linear regression for polynomial data

Linear in linear regression doesn’t mean that the data has to be linear. Indeed, the data can have almost any form; the only restriction is that the model is linear in the parameters, i.e. in $\beta_{ 0 }$ and $\beta_{ 1 }$.

Example Parabola:

We can for example fit a parabola to the data with $y=\beta_{ 0 }+\beta_{ 1 }x+\beta_{ 2 }x^{ 2 }$. Our data would then follow $y_{ i }=\beta_{ 0 }+\beta_{ 1 }x_{ i }+\beta_{ 2 }x_{ i }^{ 2 }+\epsilon_{ i }$. The squared error is therefore:

$S(\beta_{ 0 },\beta_{ 1 },\beta_{ 2 })=\sum{ (y_{ i }-(\beta_{ 0 }+\beta_{ 1 }x_{ i }+\beta_{ 2 }x_{ i }^{ 2 }))^{ 2 } }$

To find the best-fitting curve we substitute the values for $x_{ i }$ and $y_{ i }$ and find the values for $\beta_{ 0 },\beta_{ 1 },\beta_{ 2 }$ that minimise the squared error.
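As a sketch of this, we can build a design matrix with columns $1, x_{ i }, x_{ i }^{ 2 }$ and solve the least-squares problem with NumPy; the true coefficients below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parabola data: y = 0.5 + 1.0*x + 2.0*x^2 + noise
xs = np.linspace(-3, 3, 200)
ys = 0.5 + 1.0 * xs + 2.0 * xs**2 + rng.normal(0.0, 0.5, size=xs.size)

# Design matrix [1, x, x^2]: the model is linear in the betas
# even though it is quadratic in x.
X = np.column_stack([np.ones_like(xs), xs, xs**2])
betas, *_ = np.linalg.lstsq(X, ys, rcond=None)
print(betas)  # estimates of (beta_0, beta_1, beta_2)
```

The same pattern works for any model that is linear in its parameters: only the columns of the design matrix change.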

Measuring the fit

Linear regression is not just about fitting a line to the data; it is also about measuring how well that line fits. A line that minimises the squared error can still describe the data poorly. We call the measure of the fit the coefficient of determination, or $R^{ 2 }$. It tells us the ratio of the “explained” part to the total sum of squares. Or in mathematical terms:

$R^{ 2 }=\frac{ TSS-RSS }{ TSS }$ where $TSS=\sum{ (y_{ i }-\overline{ y })^{ 2 } }$ and $RSS=\sum{ (y_{ i }-\hat{ \beta }_{ 0 }-\hat{ \beta }_{ 1 }x_{ i })^{ 2 } }$ in the case of simple linear regression.

The fit of the line is perfect when $R^{ 2 }=1$ and poor when $R^{ 2 }=0$.
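As a small sketch (the helper name r_squared is my own), $R^{ 2 }$ can be computed directly from TSS and RSS:

```python
import numpy as np

def r_squared(xs, ys, beta0, beta1):
    """Coefficient of determination for a fitted simple regression line."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    residuals = ys - (beta0 + beta1 * xs)
    rss = np.sum(residuals**2)           # residual sum of squares
    tss = np.sum((ys - ys.mean())**2)    # total sum of squares
    return (tss - rss) / tss

# Perfectly linear data with the matching line gives R^2 = 1
xs = np.array([1.0, 2.0, 3.0, 4.0])
ys = 2.0 * xs + 1.0
print(r_squared(xs, ys, beta0=1.0, beta1=2.0))  # 1.0
```

For noisy data the value drops below 1, shrinking towards 0 as the line explains less and less of the variation in $y$.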