Today we’ll talk about linear regression. We use linear regression when we want to predict future outcomes.

Linear regression basically assumes that we have data $(x_i, y_i)$ for $i = 1, \dots, n$. Furthermore, $x_i$ is not random and is called the predictor variable. $y_i$, on the other hand, is a function of $x_i$ with some random noise and is called the response variable.

## Using least squares to fit a line

Suppose we have data $(x_i, y_i)$ for $i = 1, \dots, n$ and we want to find the best fitting line $y = \alpha + \beta x$.

We model the line with $y_i = \alpha + \beta x_i + e_i$, where $e_i$ is the random noise. The random noise is then described by $e_i = y_i - \alpha - \beta x_i$. We want to find the best fitting line, so we want to minimise our error. Because the error term can be negative or positive, we square the error. Least squares therefore finds the estimates $\hat{\alpha}$ and $\hat{\beta}$ of $\alpha$ and $\beta$ that minimise the sum of the squared errors $\sum_{i=1}^n e_i^2 = \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2$.

Doing some calculus (sketched below) we get:

$$\hat{\beta} = \frac{s_{xy}}{s_{xx}} \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}$$

where:

- $s_{xx} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$; the sample variance of $x$.
- $s_{xy} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})$; the sample covariance of $x$ and $y$.
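To make the calculus step explicit, here is a short sketch of the derivation: we set the partial derivatives of the sum of squared errors $S(\alpha, \beta)$ with respect to $\alpha$ and $\beta$ to zero and solve the resulting equations.

$$
\begin{aligned}
S(\alpha, \beta) &= \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2, \\
\frac{\partial S}{\partial \alpha} = -2\sum_{i=1}^n (y_i - \alpha - \beta x_i) = 0 \quad &\Rightarrow \quad \hat{\alpha} = \bar{y} - \hat{\beta}\,\bar{x}, \\
\frac{\partial S}{\partial \beta} = -2\sum_{i=1}^n x_i (y_i - \alpha - \beta x_i) = 0 \quad &\Rightarrow \quad \hat{\beta} = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n (x_i - \bar{x})^2} = \frac{s_{xy}}{s_{xx}}.
\end{aligned}
$$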

We call the fitting of a line to given data simple linear regression. Furthermore we call $\hat{e}_i = y_i - \hat{\alpha} - \hat{\beta} x_i$ the residual.

**Homoscedasticity** is when $e_i$ has the same variance for all $i$.

**Heteroscedasticity** is when the variance of $e_i$ increases as $x$ increases. If that is the case, we have to transform the data.
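As a minimal sketch of such a transformation (the data-generating process here is made up for illustration and assumes multiplicative noise), one common option is to regress $\log y$ on $\log x$, which stabilises the variance:

```python
import math
import random

# Hypothetical heteroscedastic data: y is roughly 2x, but the noise is
# multiplicative, so the spread of y grows as x grows.
xs = [float(x) for x in range(1, 50)]
ys = [2.0 * x * math.exp(random.uniform(-0.5, 0.5)) for x in xs]

# Transform: take logs of both sides, which makes the noise additive
# with (roughly) constant variance.
log_xs = [math.log(x) for x in xs]
log_ys = [math.log(y) for y in ys]

# Ordinary least squares on the transformed data (same formulas as above).
x_bar = sum(log_xs) / len(log_xs)
y_bar = sum(log_ys) / len(log_ys)
sxx = sum((x - x_bar) ** 2 for x in log_xs) / (len(log_xs) - 1)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(log_xs, log_ys)) / (len(log_xs) - 1)
beta = sxy / sxx
alpha = y_bar - beta * x_bar
print('log-log fit: alpha = {:.2f}, beta = {:.2f}'.format(alpha, beta))
# Expect beta close to 1 and alpha close to log(2) ~ 0.69 for this toy data.
```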

We can visualise simple linear regression with a few lines of Python code. Note that the error here is not normally distributed (it is uniform), but the regression still works.

```python
import random
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')
fig = plt.figure()
ax1 = plt.subplot2grid((1, 1), (0, 0))

def simple_regression(xs, ys):
    if len(xs) == len(ys):
        x_bar = sum(xs) / float(len(xs))
        y_bar = sum(ys) / float(len(ys))
        # Sample variance of x and sample covariance of x and y
        sxx = sum((x - x_bar)**2 for x in xs) / (len(xs) - 1)
        sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
        # Least-squares estimates
        beta = sxy / sxx
        alpha = y_bar - beta * x_bar
        return alpha, beta
    else:
        print("Error: x and y are not of the same length")

# Generate noisy data around the line y = 1.5x + 0.5 (uniform noise)
xs = np.arange(0.1, 5, 0.3)
ys = [(1.5 * x + 0.5 + random.uniform(-1, 1)) for x in xs]

alpha, beta = simple_regression(xs, ys)
print('Alpha: {}\nBeta: {}'.format(alpha, beta))

ax1.scatter(xs, ys, color='blue', label='Data')

# Plot the fitted regression line
xs = np.arange(0, 6, 0.01)
ys = [(beta * x + alpha) for x in xs]
ax1.plot(xs, ys, color='red', label='Regression Line')

plt.legend()
plt.show()
```

The Python code from above produces the following output:

## Linear regression for polynomial data

“Linear” in linear regression doesn’t mean that the data has to be linear. Indeed, the data can have any possible form; the only restriction is that the parameters enter linearly, e.g. $y = \alpha + \beta x^2$ or $y = \alpha + \beta \log(x)$.

**Example Parabola:**

We can for example fit a parabola to the data with $y = \alpha + \beta x^2$. Our data would then follow $y_i = \alpha + \beta x_i^2 + e_i$. The squared error is therefore:

$$\sum_{i=1}^n e_i^2 = \sum_{i=1}^n \left(y_i - \alpha - \beta x_i^2\right)^2$$

To find the best fit we have to substitute the values for $x_i$ and $y_i$ and find the values for $\alpha$ and $\beta$ that minimise the squared error.
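As a minimal sketch of this idea (the data-generating parabola $y = 0.5 + 2x^2$ is made up for illustration), we can fit the parabola by running the same least-squares computation with $x_i^2$ as the predictor:

```python
import random

# Hypothetical parabolic data: y = 0.5 + 2*x^2 plus uniform noise.
xs = [0.1 * i for i in range(1, 50)]
ys = [0.5 + 2.0 * x**2 + random.uniform(-1, 1) for x in xs]

# The trick: use x^2 as the predictor, then fit exactly as before.
zs = [x**2 for x in xs]
z_bar = sum(zs) / len(zs)
y_bar = sum(ys) / len(ys)
szz = sum((z - z_bar) ** 2 for z in zs) / (len(zs) - 1)
szy = sum((z - z_bar) * (y - y_bar) for z, y in zip(zs, ys)) / (len(zs) - 1)
beta = szy / szz           # estimate of the coefficient on x^2
alpha = y_bar - beta * z_bar
print('Alpha: {:.2f}, Beta: {:.2f}'.format(alpha, beta))  # should be close to 0.5 and 2.0
```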

## Measuring the fit

Linear regression is not just about fitting a line to the data, it is also about measuring how well that line fits the data. Even the best fitting line is of little use if the fit itself is so bad that the line tells us nothing about the data. We call the measure of the fit the coefficient of determination or $R^2$. It tells us the ratio of the “explained” part to the total sum of squares. Or in mathematical terms:

$$R^2 = \frac{\sum_{i=1}^n (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$

where $\hat{y}_i = \hat{\alpha} + \hat{\beta} x_i$ and $R^2 = \frac{s_{xy}^2}{s_{xx} s_{yy}}$ (with $s_{yy}$ the sample variance of $y$) in case of simple linear regression.

The fit of the line is perfect when $R^2 = 1$ and bad when $R^2$ is close to $0$.
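As a minimal sketch (reusing the least-squares formulas and the synthetic line $y = 1.5x + 0.5$ from the earlier Python example), $R^2$ can be computed directly from its definition:

```python
import random

# Synthetic data around y = 1.5x + 0.5, as in the earlier example.
xs = [0.1 + 0.3 * i for i in range(17)]
ys = [1.5 * x + 0.5 + random.uniform(-1, 1) for x in xs]

# Fit the line with the least-squares formulas from above.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
sxx = sum((x - x_bar) ** 2 for x in xs) / (len(xs) - 1)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)
beta = sxy / sxx
alpha = y_bar - beta * x_bar

# Coefficient of determination: explained sum of squares over total sum of squares.
y_hat = [alpha + beta * x for x in xs]
ss_explained = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_total = sum((y - y_bar) ** 2 for y in ys)
r_squared = ss_explained / ss_total
print('R^2: {:.3f}'.format(r_squared))  # close to 1 means a good fit
```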