“Guys in front of a regression chart” using unDraw cutouts | Image by author

The Linear Regression Equation in a Nutshell

Tumin Sharma
Towards Data Science
7 min read · Oct 24, 2020


What really is regression? Regression analysis is about predicting the value of one variable based on some other variables. And simple linear regression is when there is only one variable you want to predict, based on a single other variable.

The Definition

Actually, the formal definition of simple linear regression that I learned at school is: by regression of a variable y on another variable x, we mean the dependence of y on x, on the average. That is more of an answer you can write on your exam sheet. The concept of regression is actually a lot simpler than that.

Regression is actually about finding the trend of a dataset plotted on a graph. If you draw a regression line ( buzzword alert! ) through a scatter plot, then just by looking at the line you can tell whether the trend is increasing or decreasing with respect to something!

A graph containing a scatter plot
A matplotlib scatterplot | Image by author

You should know that regression analysis is the process of calculating and formulating the equation of the line ( do not worry, we will get to it ), while the regression line is the line itself. And the equation of simple linear regression is just the equation of a line:

Y = mX + b
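In code, this line is just a one-line function. A minimal sketch ( the slope and intercept values here are made up purely for illustration ):

```python
# Hypothetical slope and intercept, made up for illustration
m = 2.0  # slope: how much Y changes for each unit increase in X
b = 1.0  # intercept: the value of Y when X is 0

def predict(x):
    """Evaluate the line Y = mX + b at a given x."""
    return m * x + b

print(predict(3.0))  # 2.0 * 3.0 + 1.0 = 7.0
```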

Intuition

While going around the internet you will find two intuitive approaches to linear regression. One is where people tell you that regression is the way you can predict the value of a variable, say y, from an input x which you may already have, and that is all right! ( That is the typical ML approach. )

The other approach is what I gave in the definition. Think about it: you can draw thousands of lines through a scatterplot, but the line which fits the best, and is unique among all the other lines, is the regression line. How we tell which one it is comes in the next paragraphs!

Representation of the various lines we can draw through a scatter plot.
Various lines in a matplotlib scatterplot | Image by author

The prediction approach is true all right, but the predictions are not 100% guaranteed, and that is why my school book writes "on the average" at the end of the definition. To know why this is true, you and I have to go in-depth with the mean squared error ( ooo, another buzzword! ). For now, think of it as a function that tells how well the line is working: the lower the value of the mean squared error, the better the line.

Mean Squared Error

As I said earlier, the mean squared error is the function that tells how much error is present in the line with respect to the plotted data: the better the line, the lower the value of the mse.

Now let us learn how to find the mse so that you can calculate it on your own. In the figure below, assume that the blue line is the regression line; we will find the vertical distance of the line from the point marked in red at (xᵢ, yᵢ) on the graph.

That is, we subtract the height of the line at that point xᵢ, which is Yᵢ, from the real height of the point, yᵢ, and call the value eᵢ:

eᵢ = yᵢ - Yᵢ

finding the distance from the regression line to the scatter points.
Distance from line and points in matplotlib | Image by author

Now, proceeding the same way for every point, we add up all the values of eᵢ and get,

∑eᵢ = ∑(yᵢ − Yᵢ)

Now think about it yourself: some points could lie on the line, some above it, and some below it, so the positive and negative errors can cancel out. If more points fall below the line, the total sum could even turn out negative. But by convention, the mse should be a positive value for better understanding.

Now, if we used the modulus to make the values positive, the sum would still keep growing as more data points are added to the plot. So the great minds decided to square each value before adding, and then recommended dividing the sum by the total number of points in the dataset. Thus it came to be known as the mean squared error.

MSE = ( 1/n ) ∑ ( yᵢ − Yᵢ )²

Equation of MSE | Equation by author
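The whole calculation above can be sketched in a few lines of plain Python ( the data points and the candidate line here are made up for illustration ):

```python
def mse(xs, ys, m, b):
    """Mean squared error of the candidate line Y = m*x + b over the data."""
    squared_errors = [(y - (m * x + b)) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / len(squared_errors)  # divide by n

# Made-up data points and a candidate line
xs = [1, 2, 3, 4]
ys = [2.1, 3.9, 6.2, 7.8]
print(mse(xs, ys, m=2.0, b=0.0))  # roughly 0.025
```

Try a few different values of m and b on the same data: the pair that gives the smallest mse is the best-fit line.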

Linear Regression Equation

As I said to you in the intuition part, the lower the value of the mean squared error (mse), the better the line fits. Since the mse itself usually cannot be 0 ( a perfect fit is practically impossible for most cases, though not all ), we have to find the values of m and b that minimize the mse by simplifying its equation.

After simplifying, the equation for the mse would look something like this,

MSE = ( msₓ − rsᵧ )² + ( ȳ − b − mx̄ )² + sᵧ² ( 1 − r² )

Expanding the equation of MSE | Equation by author

Here, n is the total number of data points present, r is the correlation value of the dataset, sₓ and sᵧ are the standard deviations of the x values and y values respectively, and x̄ and ȳ are the means of the x and y values respectively.

Now, since r always lies between -1 and 1, the term sᵧ²(1 − r²) is never negative, and it does not depend on m or b at all, so it cannot be made smaller by choosing a better line. It does not really come in handy for formulating the linear regression equation. The other two terms, however, are squares, so the best we can do is make each of them exactly 0.

Now, if (msₓ − rsᵧ)² = 0, then m = r(sᵧ/sₓ), where m can also be written as mᵧₓ ( read as "m, y on x" ).
Now, if (ȳ − b − mx̄)² = 0, then ȳ = b + mx̄, or b = ȳ − mᵧₓx̄, where b can also be written as bᵧₓ ( read as "b, y on x" ).

Now if you put the values of bᵧₓ and mᵧₓ in the main linear equation we talked about in the definition, it would look something like this,

Y = r ( sᵧ/sₓ ) ( xᵢ − x̄ ) + ȳ

Deriving the Regression Equation | Equation by author

Yeah, I know it looks pretty ugly for an equation, but you do not have to memorize it. Just remember that the slope is simply the correlation times the ratio of the standard deviations, and that the intercept comes from the old linear equation with x and y replaced by their respective means. We write x as xᵢ because Y will be the unknown that the line outputs for whatever input x we put in.
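A minimal sketch of that recipe, computing r, the standard deviations, and the means directly ( the data is made up; the standard deviations divide by n to match the derivation above ):

```python
import math

def regression_y_on_x(xs, ys):
    """Return slope m_yx = r * (s_y / s_x) and intercept b = y_bar - m * x_bar."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n * s_x * s_y)
    m = r * (s_y / s_x)    # slope: correlation times the ratio of std devs
    b = y_bar - m * x_bar  # intercept: the line passes through the means
    return m, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
m, b = regression_y_on_x(xs, ys)
print(m, b)  # 0.6 2.2 (up to floating point)
```

Note that the line always passes through the point (x̄, ȳ), which is exactly what b = ȳ − mx̄ says.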

graph after calculating the best fit line
The Best Fit Line in matplotlib | Image by author

Here is a visual representation of how the regression line will finally look. Now, if you put in a value of x that was not used in the calculation, you will get a predicted value of y, which will lie on the line like the red dot.

Important Note

This is the point where most regression newbies make a mistake. Look carefully: I used mᵧₓ and bᵧₓ, that is, y on x instead of x on y, because this implies we will be predicting the values of y with the values of x as the function input.

It is only for that case that I stated the error for the mse as yᵢ − Yᵢ. But if you someday want to predict the value of x based on a y as input, you have to calculate the mse with xᵢ − Xᵢ, because, if you think carefully, the horizontal distance of a data point from the line is quite different from its vertical distance.

Then the slope will become mₓᵧ = r(sₓ/sᵧ) ( look carefully: the ratio of standard deviations is the reciprocal of the one in mᵧₓ, and it now reads as "x on y" ). The intercept will become bₓᵧ = x̄ − mₓᵧȳ, and finally the equation will look like X = mₓᵧyᵢ + bₓᵧ.

comparison of the two line made with the two equations
Regression Equation of y on x and x on y together in matplotlib | Image by author

The blue line is from the equation where you want to predict values of y based on x, and the green line is from the equation where you want to predict x based on y. This shows that neither regression line is the inverse function of the other.
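You can check this numerically. Here is a sketch, with made-up data, that computes both slopes using the equivalent sums-of-products form of the slope formulas:

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

m_yx = s_xy / s_xx  # slope of y on x, same value as r * (s_y / s_x)
m_xy = s_xy / s_yy  # slope of x on y, same value as r * (s_x / s_y)

# If the x-on-y line were the inverse function of the y-on-x line,
# m_yx would equal 1 / m_xy; that happens only when r^2 = 1.
print(m_yx, 1 / m_xy)  # 0.6 1.0 -- clearly not equal
```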

Conclusion

Linear regression is commonly used in the field of data science, as a machine learning model and, in some fields of analytics, as a statistical tool. If you have read carefully, then however long this article is, the math behind linear regression is quite easy. You have to second me on that!
