Linear Regression

Today we’ll talk about linear regression. We use linear regression when we want to predict future outcomes from observed data.

Linear regression assumes that we have data (x_{ i }, y_{ i }) for i=1,…,n. Furthermore, x_{ i } is not random and is called the predictor variable. y_{ i }, on the other hand, is a function of x_{ i } plus some random noise and is called the response variable.

Using least squares to fit a line

Suppose we have data (x_{ i }, y_{ i }) and we want to find the best fitting line y=\beta_{ 0 }+\beta_{ 1 }x .

We model the line with y_{ i }=\beta_{ 0 }+\beta_{ 1 }x_{ i }+\epsilon_{ i } where \epsilon_{ i } is the random noise. The noise term is then \epsilon_{ i }=y_{ i }-\beta_{ 0 }-\beta_{ 1 }x_{ i } . We want the best fitting line, so we want to minimise this error. Because the error terms can be negative or positive, we square them. Least squares therefore finds the estimates \hat{ \beta }_{ 0 }, \hat{ \beta }_{ 1 } of \beta_{ 0 }, \beta_{ 1 } that minimise the sum of squared errors.

S(\beta_{ 0 }, \beta_{ 1 })=\sum{ \epsilon_{ i }^{ 2 } }=\sum{ (y_{ i }-\beta_{ 0 }-\beta_{ 1 }x_{ i })^{ 2 } }

Doing some calculus (sketched after the definitions below) we get: \hat{ \beta }_{ 1 }=\frac{ s_{ xy } }{ s_{ xx } } and \hat{ \beta }_{ 0 }=\overline{ y }-\hat{ \beta }_{ 1 }\overline{ x }

where:

  • \overline{ x }=\frac{ 1 }{ n }\sum{ x_{ i } }
  • \overline{ y } =\frac{ 1 }{ n }\sum{ y_{ i } }
  • s_{ xx }=\frac{ 1 }{ n-1 }\sum{ (x_{ i }-\overline{ x })^{ 2 } } ; The sample variance of x.
  • s_{ xy }=\frac{ 1 }{ n-1 }\sum{ (x_{ i }-\overline{ x })(y_{ i }-\overline{ y }) } ; The sample covariance of x and y.
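
Here is a sketch of that calculus, assuming the setup above: set the partial derivatives of S to zero and solve the resulting normal equations.

\frac{ \partial S }{ \partial \beta_{ 0 } }=-2\sum{ (y_{ i }-\beta_{ 0 }-\beta_{ 1 }x_{ i }) }=0

\frac{ \partial S }{ \partial \beta_{ 1 } }=-2\sum{ x_{ i }(y_{ i }-\beta_{ 0 }-\beta_{ 1 }x_{ i }) }=0

The first equation gives \hat{ \beta }_{ 0 }=\overline{ y }-\hat{ \beta }_{ 1 }\overline{ x } ; substituting this into the second and rearranging gives \hat{ \beta }_{ 1 }=\frac{ s_{ xy } }{ s_{ xx } } .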

We call the fitting of a line to given data simple linear regression. Furthermore, the estimated errors \hat{ \epsilon }_{ i }=y_{ i }-\hat{ \beta }_{ 0 }-\hat{ \beta }_{ 1 }x_{ i } are called the residuals.

Homoscedasticity means that \epsilon_{ i } has the same variance for all i.

Heteroscedasticity means that the variance of \epsilon_{ i } changes with x, for example increasing as x increases. If that is the case we may have to transform the data.
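
A quick way to spot heteroscedasticity is to plot the residuals against x. The sketch below uses hypothetical data (the line 1.5x + 0.5 and a noise term that grows with x are made up for illustration); the residual plot fans out as x increases.

import random

import numpy as np
import matplotlib.pyplot as plt

# hypothetical heteroscedastic data: the noise grows with x
xs = np.arange(0.1, 5, 0.1)
ys = np.array([1.5*x + 0.5 + random.uniform(-x, x) for x in xs])

# least squares fit using the formulas above
x_bar, y_bar = xs.mean(), ys.mean()
beta = ((xs - x_bar)*(ys - y_bar)).sum() / ((xs - x_bar)**2).sum()
alpha = y_bar - beta*x_bar

# residual plot: the spread widens as x grows -> heteroscedasticity
residuals = ys - (alpha + beta*xs)
plt.scatter(xs, residuals, color='blue')
plt.axhline(0, color='red')
plt.xlabel('x')
plt.ylabel('residual')
plt.show()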

We can visualise simple linear regression with a few lines of Python code. Note that the noise here is uniformly rather than normally distributed, but the regression still works.

import random

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')
fig = plt.figure()
ax1 = plt.subplot2grid((1,1), (0,0))

def simple_regression(xs, ys):
    # least squares estimates for y = alpha + beta*x
    if len(xs) != len(ys):
        raise ValueError("x and y are not of the same length")

    x_bar = sum(xs)/float(len(xs))
    y_bar = sum(ys)/float(len(ys))

    # sample variance of x and sample covariance of x and y
    sxx = sum((x-x_bar)**2 for x in xs)/(len(xs)-1)
    sxy = sum((x-x_bar)*(y-y_bar) for x, y in zip(xs, ys))/(len(xs)-1)

    beta = sxy/sxx
    alpha = y_bar - beta*x_bar

    return alpha, beta

# data: a line with slope 1.5 and intercept 0.5 plus uniform noise
xs = np.arange(0.1, 5, 0.3)
ys = [1.5*x + 0.5 + random.uniform(-1, 1) for x in xs]

alpha, beta = simple_regression(xs, ys)
print('Alpha: {}\nBeta: {}'.format(alpha, beta))

ax1.scatter(xs, ys, color='blue', label='Data')

# plot the fitted line over a slightly wider range than the data
line_xs = np.arange(0, 6, 0.01)
line_ys = [beta*x + alpha for x in line_xs]
ax1.plot(line_xs, line_ys, color='red', label='Regression Line')
plt.legend()
plt.show()

The Python code above produces the following plot:

[Figure: scatter plot of the data with the fitted regression line]

Linear regression for polynomial data

“Linear” in linear regression doesn’t mean that the data has to be linear. Indeed, the data can have almost any form; the only restriction is that the model is linear in the parameters, i.e. in \beta_{ 0 }, \beta_{ 1 } and so on.

Example Parabola:

We can, for example, fit a parabola with y=\beta_{ 0 }+\beta_{ 1 }x+\beta_{ 2 }x^{ 2 } . Our data would then follow y_{ i }=\beta_{ 0 }+\beta_{ 1 }x_{ i }+\beta_{ 2 }x_{ i }^{ 2 }+\epsilon_{ i } . The squared error is therefore:

S(\beta_{ 0 },\beta_{ 1 },\beta_{ 2 })=\sum{ (y_{ i }-(\beta_{ 0 }+\beta_{ 1 }x_{ i }+\beta_{ 2 }x_{ i }^{ 2 }))^{ 2 } }

To find the best fitting parabola we substitute the values of x_{ i } and y_{ i } and find the values of \beta_{ 0 },\beta_{ 1 },\beta_{ 2 } that minimise the squared error.
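
As a sketch of how this looks in code (the parabola coefficients 2, 0.5, 1.5 and the noise are made-up values for illustration), we can build a design matrix with columns 1, x and x^{ 2 } and let numpy’s least squares routine minimise S:

import random

import numpy as np

# hypothetical parabola: y = 2 + 0.5*x + 1.5*x^2 plus uniform noise
xs = np.arange(-3, 3, 0.2)
ys = np.array([2 + 0.5*x + 1.5*x**2 + random.uniform(-1, 1) for x in xs])

# design matrix with columns 1, x, x^2 -- the model is linear in the betas
X = np.column_stack([np.ones_like(xs), xs, xs**2])

# least squares solution minimising S(beta_0, beta_1, beta_2)
betas, *_ = np.linalg.lstsq(X, ys, rcond=None)
print('Estimates of beta_0, beta_1, beta_2: {}'.format(betas))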

Measuring the fit

Linear regression is not just about fitting a line to the data; it is also about measuring how well that line fits. A best fitting line is of little use if the fit itself is poor. We call this measure of fit the coefficient of determination, or R^{ 2 } . It tells us the ratio of the “explained” part to the total sum of squares. In mathematical terms:

R^{ 2 }=\frac{ TSS-RSS }{ TSS } where TSS=\sum{ (y_{ i }-\overline{ y })^{ 2 } } and RSS=\sum{ (y_{ i }-\hat{ \beta }_{ 0 }-\hat{ \beta }_{ 1 }x_{ i })^{ 2 } } in the case of simple linear regression.

The fit of the line is perfect when R^{ 2 }=1 and poor when R^{ 2 } is close to 0.
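
A minimal sketch of the computation in Python, reusing the same hypothetical linear data as in the plotting example above:

import random

import numpy as np

# hypothetical data: the same linear model as in the plotting example
xs = np.arange(0.1, 5, 0.3)
ys = np.array([1.5*x + 0.5 + random.uniform(-1, 1) for x in xs])

# least squares estimates
x_bar, y_bar = xs.mean(), ys.mean()
beta = ((xs - x_bar)*(ys - y_bar)).sum() / ((xs - x_bar)**2).sum()
alpha = y_bar - beta*x_bar

# total and residual sums of squares
tss = ((ys - y_bar)**2).sum()
rss = ((ys - (alpha + beta*xs))**2).sum()

r_squared = (tss - rss)/tss
print('R^2: {}'.format(r_squared))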

