Least squares means you're minimizing #sum (Y_"i" - Y_"hat")^2#
Note: #Y_"hat"# is often written as Y with a ^ written above it
You can think of regression as fitting a straight line to a bunch of points on a graph. (This is the simple case; it can get more complicated, but there's no need to complicate it here.) The question then is how we should fit the line through the points. Put another way: of all the different lines we could draw through the points, which one do we choose?
Although most people only ever learn about least squares regression, there are other types of regression that use different methods for choosing the line. Least median of squares regression, for example, is another type.
In least squares regression we choose the line where the total (#sum#) of the squared vertical distances from the line to each point, #(Y_"i" - Y_"hat")^2#, is smallest.
Let's break this down further. First, what is #Y_"i" - Y_"hat"#, and why are we using it?
#Y_"i"# - #Y_"hat"# is just the distance vertically from a point (the ith point) to the line. This is also called the residual. Hopefully it will make sense that we are trying to find the best line by choosing the line with the smallest distance between the line itself and the points. But at this point you may wonder why are we squaring this difference?
Well, squaring the differences does a number of useful things. First, it makes all the differences positive before we sum them and find the minimum. This is a good thing: distance is always positive. But wait, you may ask, couldn't we use absolute value instead? Isn't that how we normally make distances positive? And you'd be right to ask.
That is a valid approach, and it gives a different type of regression (least absolute deviations regression). But by using the squares of the differences we get an additional benefit: big differences become huge in the calculation (a big difference times a big difference is a really big number), so we will prefer lines that shrink the really big distances over lines that pay less attention to them.
Finally, we sum these squared differences because we need to account for the distance from the line to every point, not just one.
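Putting it all together, here is a minimal from-scratch sketch of least squares for a line. It uses the standard closed-form formulas for the best-fit slope and intercept (slope = #sum (x - barx)(y - bary)# divided by #sum (x - barx)^2#, intercept = #bary - "slope" * barx#); the data points are made up for illustration.

```python
# Hypothetical data points -- made up for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Closed-form least squares solution for a straight line.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
sxx = sum((x - mean_x) ** 2 for x in xs)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# The quantity this line minimizes: the sum of squared residuals.
sse = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
print(slope, intercept, sse)
```

Any other line you try on the same points will give a larger sum of squared residuals, which is what "least squares" means.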