# A Probabilistic View of Linear Regression

Regression analysis is one of the most widely used techniques for analyzing data. Its broad appeal and usefulness result from the conceptually logical process of using an equation to express the relationship between a variable of interest (the response) and a set of related predictor variables.

# 1 Assumptions in Linear Regression

Linear regression rests on five key assumptions:

- **Linear relationship**: linear regression requires the relationship between the independent and dependent variables to be linear.
- **The data is homoskedastic**: the variance of the residuals (the differences between the actual and predicted values) is more or less constant.
- **The residuals are independent**: the residuals are distributed randomly and are not influenced by the residuals of previous observations. If the residuals are not independent of each other, they are said to be autocorrelated.
- **The residuals are normally distributed**: the probability density function of the residuals is normally distributed at each *x* value.
- **No or little multicollinearity**: two variables are collinear if they have a mutual linear dependency. Collinearity makes it a tough task to figure out the true relationship between the predictors and the response variable, or to find out which variable is actually contributing to predicting the response.

Multicollinearity also causes the standard errors to increase. With large standard errors, the confidence intervals become wider, leading to less precise estimates of the coefficients.
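Two of these assumptions, homoskedasticity and independence of residuals, can be checked with simple summary statistics. The sketch below uses synthetic data (the coefficients, sample size, and thresholds are illustrative assumptions, not prescribed by the text): it fits a line by least squares, then compares residual variance across the range of *x* and computes the lag-1 autocorrelation of the residuals.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data from an assumed model y = 1 + 2x + noise, for illustration.
x = rng.uniform(0, 10, 500)
y = 1.0 + 2.0 * x + rng.normal(0, 1.0, 500)

# Fit by least squares; residuals = actual - predicted values.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# Homoskedasticity check: residual variance in the lower vs upper half of x
# should be roughly equal.
lo, hi = residuals[x < 5].var(), residuals[x >= 5].var()

# Independence check: lag-1 autocorrelation of the residuals should be near 0.
autocorr = np.corrcoef(residuals[:-1], residuals[1:])[0, 1]
```

For well-behaved data the two variances come out close and the autocorrelation is near zero; large gaps would hint at heteroskedasticity or autocorrelated residuals.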

# 2 Background

The basic idea behind a regression is that you want to model the relationship between an outcome variable *y* (a.k.a. the **dependent variable**) and a vector of explanatory variables *x* (a.k.a. the **independent variables**). A linear regression relates *y* to a linear predictor function of *x*. For a given data point *i*, the linear function has the form:
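With the conventional coefficient notation (assumed here: *β₀* an intercept and *β₁, …, β_p* slopes for *p* explanatory variables), the linear predictor for data point *i* can be written as:

$$
f(x_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip}
$$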

There are usually two main reasons to use a regression model:

- **Predicting** a future value of *y* given its corresponding explanatory variables.
- **Quantifying** the strength of the relationship between *y* and its explanatory variables.

The simplest form of linear regression model equates the outcome variable with the linear predictor function (ordinary linear regression), adding an **error term** (*ε*) to model the noise that appears when fitting the model. The error term is added because *y* can almost never be exactly determined by *x*; there is always some noise or uncertainty in the relationship which we want to model.
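In the conventional notation (assumed here), ordinary linear regression for data point *i* then reads:

$$
y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i
$$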

# 3 Model the Outcome as a Normal Distribution

Instead of starting off with both *y* and *x* variables, we’ll start by describing the probability distribution of just *y* and then introduce the relationship to the explanatory variables.

## 3.1 A Constant Mean Model

First, model *y* as a standard normal distribution with a zero (i.e. known) mean and unit variance. Note that this does not depend on any explanatory variables:
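In symbols, this standard normal model is:

$$
y \sim \mathcal{N}(0, 1)
$$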

In this model for *y*, we have nothing to estimate: all the normal parameters are already set (mean *μ = 0*, variance *σ² = 1*). In the context of linear regression, this model would be represented as *y = 0 + ε* with no dependence on any *x* values and *ε* being a standard normal distribution.

Next, let’s make it a little more interesting by assuming a fixed but unknown mean *μ* and variance *σ²*, corresponding to the regression model *y = μ + ε*:
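That is, *y* is now drawn from a normal distribution with unknown parameters:

$$
y \sim \mathcal{N}(\mu, \sigma^2)
$$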

We are still not modeling the relationship between *y* and *x* (we’ll get there soon). With *μ* and *σ²* now unknown, we need to estimate them from the data; one way to find these estimates is to maximize the likelihood function.

## 3.2 Maximizing Likelihood (1)

Consider that we have *n* points, each drawn in an independent and identically distributed (i.i.d.) way from the normal distribution in Equation 4. For a given *μ* and *σ²*, the probability of observing those *n* points defines the likelihood function, which is just the product of *n* normal probability density functions.
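Written out, the likelihood of the observed points *y₁, …, y_n* is:

$$
\mathcal{L}(\mu, \sigma^2 \mid y_1, \dots, y_n) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \mu)^2}{2\sigma^2}\right)
$$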

Once we have a likelihood function, a good way to estimate the parameters (i.e. *μ*, *σ²*) is to find the combination of parameters that maximizes this function for the given data points. Here we derive the maximum likelihood estimate for *μ*:

To find the actual value of the optimum point we can take the partial derivative of Equation 6 with respect to *μ* and set it to zero:
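Taking the logarithm first (which preserves the location of the maximum), the derivation goes:

$$
\log \mathcal{L} = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2
$$

$$
\frac{\partial \log \mathcal{L}}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0 \quad\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i
$$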

This is precisely the mean of *y* values as expected. Even though we knew the answer ahead of time, this work will be useful once we complicate the situation by introducing the explanatory variables.

Finally, the expected value of *y* is just the expected value of a normal distribution, which is just equal to its mean:
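That is:

$$
\mathbb{E}[y] = \mu
$$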

## 3.3 Modeling Explanatory Variables

Now that we understand that *y* is a random variable, let’s add in some explanatory variables. We can model the expected value of *y* as a linear function of *p* explanatory variables, similar to Equation 2:
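In the conventional notation (assumed here), this linear model of the mean is:

$$
\mathbb{E}[y_i] = \mu_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
$$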

Combining Equation 8 with Equation 9, the mean of *y* is now just this linear function. Thus, *y* is a normal variable with mean as a linear function of *x* and a fixed standard deviation:
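With the conventional *β* notation (assumed here), the distribution of each observation becomes:

$$
y_i \sim \mathcal{N}\left(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip},\; \sigma^2\right)
$$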

This notation makes it clear that *y* is still a random normal variable with an expected value corresponding to the linear function of *x*.

## 3.4 Maximizing Likelihood (2)

To get point estimates for the *β* parameters we can again use a maximum likelihood estimate. From Equation 6 we can substitute the linear equation from Equation 9 in for *μ* and find the vector of *β* values that maximizes the likelihood:
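Making that substitution in the log of the likelihood (notation assumed as above), the function to maximize over *β* is:

$$
\log \mathcal{L}(\beta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\right)^2
$$

Since the first term does not involve *β*, maximizing this is equivalent to minimizing the sum of squared residuals, i.e. ordinary least squares.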

## 3.5 Prediction

Once we have the coefficients for our linear regression from Equation 11 we can predict new values. Given a vector of explanatory variables *x*, predicting *y* is a simple computation:
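The fit-then-predict workflow can be sketched in a few lines of numpy. The data here is synthetic (the true coefficients 2 and 3, the sample size, and the new point are illustrative assumptions): the maximum likelihood estimate of *β* is obtained as the least-squares solution, and prediction is a dot product with the new explanatory vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from an assumed model y = 2 + 3x + noise.
n = 200
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])   # design matrix with an intercept column
y = 2.0 + 3.0 * x + rng.normal(0, 1.0, size=n)

# Maximum likelihood estimate of beta = the least-squares solution.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predicting y for a new point is just a dot product with [1, x_new].
x_new = np.array([1.0, 5.0])
y_pred = x_new @ beta_hat
```

With enough data the recovered `beta_hat` lands close to the true coefficients, and `y_pred` close to the noiseless value at the new point.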