# Thread: Definition of standard deviation

1. ## Definition of standard deviation

Math help please. My teenager is learning about SD in math and I was wondering how/why the formula was developed.

Anyone got an opinion as to my hypothesis that the definition of SD relates to the mathematical hack/trick/fact that the square of a difference from the mean (positive or negative) will always be positive? You average those up, take the square root and voila! (As long as you ignore the negative root.)

Is there any other reason why square roots of squares are involved?


2. I don't have a background in math theory, but my assumption has always been that the reason squaring is involved is that small deviations should have a small influence on the error margins (suppression) whereas larger deviations should have a large influence (magnification).

3. It's a convention for the bell-shaped curve you get when you measure many kinds of things. The distribution of values falls into a pattern which is remarkably similar for many different things, with a clear central value. Squaring all the differences does get rid of the minus signs, as you say; you then average them and take the square root as a standard measure of how much variation there is. Then you can use one, two, or three standard deviations to express, in a standardised way, the proportion of your population that is close to the mean. You will see that squaring the differences accentuates the extreme low and high values, as well as providing a mathematical framework that allows many different populations to be compared. In other uses of the root mean square, such as electrical power, you get values which are more useful in calculation than a plain average would be. But to calculate the average or mean you just use the plus and minus values without squaring them.

4. Originally Posted by plant: Is there any other reason why square roots of squares are involved?
It has to do with the origin of the concept of "variance" (standard deviation squared), in dealing with errors in position estimates (for stars, I think). Given a cluster of observations, the most likely real location of the star was calculated by minimizing the squares of the distances of all observations from the estimated real point. (The same technique is used in fitting lines to data - the method of least squares, which minimizes the squares of the residuals.)
So that mathematical basis spilled over into using variance, the mean of the squared residuals, as a way of describing data scatter. Taking the square root to produce standard deviation simply restores the units of measurement to the same dimensions as the original data.
So the squared residuals have underlying mathematical usefulness that (say) absolute values do not.
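The least-squares idea above can be sketched numerically (illustrative data, plain Python): scanning candidate "true" values shows the sum of squared distances is smallest at the mean, which is why minimising squared residuals and averaging via the mean go together.

```python
# Sketch: the sample mean minimises the sum of squared deviations.
# The data values here are made up for illustration.
data = [2.0, 3.5, 5.0, 4.1, 2.9]
mean = sum(data) / len(data)  # 3.5

def sum_sq(c, xs):
    """Sum of squared distances of the observations xs from a candidate centre c."""
    return sum((x - c) ** 2 for x in xs)

# Scan candidate centres around the mean; the minimum lands at the mean itself.
candidates = [mean + d / 100 for d in range(-200, 201)]
best = min(candidates, key=lambda c: sum_sq(c, data))
print(best, mean)  # the best candidate coincides with the mean
```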

Grant Hutchison

5. Originally Posted by plant: Is there any other reason why square roots of squares are involved?
The squaring (i.e. standard deviation) method is more "efficient", or perhaps more effective. But simply using absolute values and not squaring (i.e. mean deviation) can be the better method if the data set is more empirical (more prone to error). Eddington argued for MD over SD, given his empirical datasets. But for ideal datasets (e.g. Gaussian) SD is, apparently, superior.

More here.

6. Originally Posted by plant Anyone got an opinion as to my hypothesis that the definition of SD relates to the mathematical hack / trick/fact that a square of a difference from the mean (positive or negative) will always be positive.
I think that is very likely a reason for the common use of standard deviation and variance (the square of standard deviation).

George has pointed out that there are alternate methods, such as absolute deviation, which is the expected value of the absolute value of the deviation.

Originally Posted by plant: Is there any other reason why square roots of squares are involved?
You could just not take the square root, in which case you have the variance. However, variance is in different units than the original quantity. For example, if you are trying to estimate the average height of a population in inches, then the estimated height will be in units of inches, but the estimated variance will be in units of inches squared. If you switch to centimetres, since one inch is 2.54 centimetres, the estimated average is just multiplied by 2.54, but the estimated variance is multiplied by 6.4516.

Knowing the variance is the same thing as knowing the standard deviation (if you have one, just square it or take its square root to find the other), but if you use standard deviation, the units are the same as the original measurement - e.g., inches or centimetres in the above example, rather than inches squared or centimetres squared for the variance. So it is just more intuitive and convenient, easier to interpret.
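The units argument above can be checked with a short script (made-up heights, plain Python): converting inches to centimetres multiplies the standard deviation by 2.54 but the variance by 2.54 squared, i.e. 6.4516.

```python
# Sketch: unit conversion scales SD linearly but variance quadratically.
# The height values are made up for illustration.
heights_in = [60.0, 65.0, 70.0, 62.0, 68.0]  # inches

def variance(xs):
    """Population variance: mean of squared deviations from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

heights_cm = [2.54 * h for h in heights_in]  # same data in centimetres
var_in, var_cm = variance(heights_in), variance(heights_cm)

print(var_cm / var_in)                    # 6.4516, i.e. 2.54 squared
print(var_cm ** 0.5 / var_in ** 0.5)      # 2.54, the plain conversion factor
```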

If you use absolute deviation instead, there is no need to take a square root for the sake of getting the units right, because the units are already the same as those of the original measurement.

Originally Posted by George: But for ideal datasets (e.g. Gaussian) SD is, apparently, superior.
For a Gaussian distribution, this has entirely to do with estimation. If we know the standard deviation, we can find the mean absolute deviation (just multiply standard deviation by the square root of (two over pi)). So they are essentially the same.

However, if we have to estimate these quantities from a finite dataset, and the data are drawn from a Gaussian distribution, it is indeed more "efficient" (which has a very precise statistical meaning) to estimate the standard deviation by first estimating the mean, squaring all the deviations from the mean, averaging them, and taking the square root. (Note - when doing this, it is "efficient" to average by dividing by the number of observations, not the number of observations minus one, as is frequently done in beginning statistics.) If you want the mean absolute deviation, it is actually more efficient (this refers to the accuracy of the result, not the computational difficulty of finding it) to estimate the standard deviation first, then multiply by the square root of (two over pi).
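The sqrt(2/pi) relationship mentioned above can be verified with a quick simulation (a sketch, using Python's standard library random number generator): for Gaussian data, the mean absolute deviation comes out at about 0.798 times the standard deviation.

```python
# Sketch: for Gaussian data, mean absolute deviation ≈ sqrt(2/pi) × SD.
import math
import random

random.seed(1)
n = 200_000
data = [random.gauss(0.0, 1.0) for _ in range(n)]  # true SD = 1

mean = sum(data) / n
mad = sum(abs(x - mean) for x in data) / n                 # mean absolute deviation
sd = math.sqrt(sum((x - mean) ** 2 for x in data) / n)     # divide by n, as in the post

print(mad / sd)                  # close to sqrt(2/pi)
print(math.sqrt(2 / math.pi))    # ≈ 0.7979
```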

In many cases, you won't really know the true distribution of your data - whether it is Gaussian or something else, for example. In this case, the choice of standard deviation or mean absolute deviation (or some other so-called "robust" estimator) ideally depends on how bad it is to get your estimates wrong. By squaring deviations, large deviations have a very strong effect on estimation of the "average" - large numbers squared are really large numbers. So "robust" estimation methods may try to minimise estimated mean absolute deviation rather than estimated standard deviation.

If you want to estimate a quantity and minimise mean squared deviation, the ideal estimate is the mean of your data set. If you want to minimise mean absolute deviation, the best estimate is the median of your data set.
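The mean/median claim above can be checked numerically (a sketch with a made-up data set; the outlier 10.0 pulls the mean but not the median):

```python
# Sketch: the mean minimises mean squared deviation about a centre,
# while the median minimises mean absolute deviation.
import statistics

data = [1.0, 2.0, 2.0, 3.0, 10.0]  # illustrative, with one outlier

def msd(c):  # mean squared deviation about candidate centre c
    return sum((x - c) ** 2 for x in data) / len(data)

def mad(c):  # mean absolute deviation about candidate centre c
    return sum(abs(x - c) for x in data) / len(data)

grid = [i / 100 for i in range(0, 1101)]  # candidate centres 0.00 .. 11.00
best_msd = min(grid, key=msd)
best_mad = min(grid, key=mad)

print(best_msd, statistics.mean(data))    # both 3.6
print(best_mad, statistics.median(data))  # both 2.0
```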

7. Originally Posted by plant: Is there any other reason why square roots of squares are involved?
Yes. To fully appreciate the utility of the standard deviation requires understanding probability distributions and functions of random variables. If your teenager has not studied these things, you can't explain the utility of the "standard deviation". You can't even distinguish among the distinct concepts: standard deviation of a distribution, standard deviation of a sample, and estimator of the standard deviation of a population. All those distinct concepts are called "standard deviation".

Your proposed explanation doesn't make sense. There are lots of algorithms that take a set of numbers and produce a positive result. You haven't given any explanation of why one particular algorithm is to be preferred over another. It's better not to give an explanation if the alternative is giving a wrong explanation.

8. Originally Posted by profloater You will see that squaring the differences does accentuate the extreme low and high values
I guessed that.. however my first thought was that once you average it, you then take the square root, which would negate it.. but it would seem those weightings are still 'baked in' to the final result.

So... why are distances weighted by the square of the distance from the mean? I suppose it gives you an expanded 'scale' to show up smaller differences in distribution between data sets.... but why not a different power?

Is this something to do with the 'bell curve' / 'normal' distribution?

Thanks again!
35 years since math class!

9. Originally Posted by tashirosgt You haven't any explanation why one particular algorithm is to be preferred over another. It's better not to give an explanation if the alternative is giving a wrong explanation.
..and have you made any effort to give an explanation to the original post? I don't see any here. Yet other posters here have managed to provide GREAT and/or thought-provoking answers!

Why are the differences squared rather than cubed? What is 'standard' about standard deviation? Are these not interesting questions? Presumably they lead toward other interesting questions: why are many things distributed in a bell-curve fashion? Does this say something interesting about the world (or its mind-dependent models), or something interesting about our math?

I think it is 'better' to have an enquiring mind as to why things are done in certain ways rather than to accept 'this is how you do standard deviation'. I certainly wasn't taught WHY in high-school and it seems not much has changed since then.

I'm sure if one has a degree in science/math one would be much more knowledgeable.. however perhaps the mark of true understanding is the ability to explain it to others.

10. Originally Posted by plant: So... why are distances weighted by the square of the distance from the mean? I suppose it gives you an expanded 'scale' to show up smaller differences in distribution between data sets.... but why not a different power?

Is this something to do with the 'bell curve' / 'normal' distribution?
Using other powers gives different moments. For the normal distribution these are related to the shape of the probability density curve. The second moment (the variance, SD squared) is related to the width. The third moment (skewness) is related to the symmetry of the distribution tails. The fourth moment (kurtosis) is related to the comparative strength of the tails.
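The moments mentioned above can be computed directly (a sketch with made-up data; these are the plain sample versions, without the bias corrections a statistics package might apply):

```python
# Sketch: central moments of a sample. Second -> width (variance/SD),
# third -> skewness (asymmetry), fourth -> kurtosis (tail weight).
def central_moment(xs, k):
    """Average of the k-th power of deviations from the mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

# Illustrative data with one large value, giving a long right tail.
data = [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0, 9.0]

sd = central_moment(data, 2) ** 0.5
skewness = central_moment(data, 3) / sd ** 3   # dimensionless
kurtosis = central_moment(data, 4) / sd ** 4   # dimensionless

print(sd, skewness, kurtosis)  # skewness comes out positive: long right tail
```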

11. Originally Posted by plant I guessed that.. however my 1st thought once you average it, you then take the square root which would negate it.. but it would seem those weightings are still 'baked in' to the final result.
The trick is, you square the individual results, but you take the square root of the sum.

Example - suppose you have three observations, which are 0, 6, and -6. The estimated mean (by the usual method) is zero. The estimated mean absolute deviation (also by the usual method) is four.

Now suppose you have a different data set, with 3, -6, and 3. The estimated mean and mean absolute deviation are zero and four, just like before.

However, the standard deviations are not the same - they are (by one common method of estimation) 4.898979486 and 4.242640687 in the two data sets. (By another common method, they are 6 and 5.196152423.)
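The figures quoted for the two data sets can be verified with a short script (a sketch; the `ddof` switch between dividing by n and by n−1 is my naming, borrowed from common statistics-library convention):

```python
# Checking the example: two data sets with the same mean absolute
# deviation but different standard deviations.
import math

def mean(xs):
    return sum(xs) / len(xs)

def mad(xs):
    """Mean absolute deviation from the mean."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)

def sd(xs, ddof=0):
    """Standard deviation; ddof=0 divides by n, ddof=1 by n-1."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - ddof))

a, b = [0, 6, -6], [3, -6, 3]
print(mad(a), mad(b))        # 4.0 and 4.0 - identical
print(sd(a), sd(b))          # ~4.899 and ~4.243 (dividing by n)
print(sd(a, 1), sd(b, 1))    # 6.0 and ~5.196 (dividing by n-1)
```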

The first data set has two observations with a large deviation, and one with no deviation at all; the second data set has one observation with a large deviation, and two observations with smaller deviations. By mean absolute deviation, they are the same, but the squaring gives more influence to the more extreme deviations in the first data set. Having a "6" and a "0" increases standard deviation more than having two "3"s.

Originally Posted by plant: So... why are distances weighted by the square of the distance from the mean? I suppose it gives you an expanded 'scale' to show up smaller differences in distribution between data sets.... but why not a different power?
In a lot of situations, there isn't really a strong theoretical justification to go one way or the other, and "custom" seems to be the best explanation for use of standard deviation. For example, there is a whole field called "robust" estimation, which typically argues that squaring deviations gives too much influence to extreme observations, and using absolute value or other methods is better.

There are some cases where there is a theoretical justification for going one way or another, though. For example:

Originally Posted by plant: Is this something to do with the 'bell curve' / 'normal' distribution?
If you know the mean and standard deviation of a normal (or Gaussian) distribution, you know its entire shape. The distribution is entirely characterised by these two numbers. However, there is nothing special about standard deviation in this sense - the normal distribution is completely characterised by its mean and mean absolute deviation as well. If you know those two numbers, you know the entire shape of the distribution.

However, there is a way in which the standard deviation is "special" here - I'll provide a rather sketchy and vague explanation, but if you want more details, just ask. If you want to estimate the shape of the distribution, estimating standard deviation is a more accurate method than estimating other quantities, such as mean absolute deviation. So that's a case where there is a theoretical justification for why standard deviation is "better" than other measures of dispersion.

I don't do robust estimation much, but it wouldn't surprise me if there were distributions where mean absolute deviation turns out to be optimal, or at least better than standard deviation.

12. Thanks all.

13. Originally Posted by plant: ..and have you made any effort to give an explanation to the original post?
I don't see any here. Yet other posters here have managed to provided GREAT and/or thought provoking answers!
If that's the case, why would I need to add anything? I evaluate the situation by reading your replies and questions to the other posters. My evaluation is that you can't explain the formula for "standard deviation" to your teenager. You can certainly raise thought-provoking questions with your teenager, and that may be even more important than providing an explanation.

Why are the differences squared rather than cubed? What is 'standard' about standard deviation? Are these not interesting questions? Presumably they lead toward other interesting questions: why are many things distributed in a bell-curve fashion? Does this say something interesting about the world (or its mind-dependent models), or something interesting about our math?
Yes, these are interesting questions. Assuming other posters have answered them, can you answer those questions for your teenager?

I can imagine two distinct situations:
1) The material your teenager is studying is a comprehensive introduction to statistics and it includes the standard concepts presented in such an introduction.
2) The material your teenager is studying is not a comprehensive introduction to statistics. Perhaps the formulas for computing the sample mean and sample standard deviation are being presented in an isolated, "cookbook" manner in a course that is about some topic other than statistics.

In situation 1), it would be important and possible to give a technically correct motivation for why the formula for computing the standard deviation of a sample is what it is.

In situation 2), giving a correct answer requires introducing much of the material from situation 1). Unless you are going to introduce all that material, the best you can do is encourage your teenager to worry about the formula and offer some speculations why it takes a particular form.

Posters who have provided extensive answers have experience from situation 1). They have not explained all that background knowledge in their posts. So my guess is that you can't translate their answers to technically correct answers for your teenager. I might be wrong.

It is interesting to discuss the material studied in situation 1), but the concepts are extensive and sophisticated. Before composing a series of internet posts on those topics, I want to be sure my reader has a strong interest and endurance, not just a casual curiosity.

14. Originally Posted by plant: So... why are distances weighted by the square of the distance from the mean? I suppose it gives you an expanded 'scale' to show up smaller differences in distribution between data sets.... but why not a different power?
Is it too late to give an engineering example? Suppose you make a widget and its diameter is important. You make many, and when measured there are variations. You sample your production and you want to know how many will fall outside your limits, but you do not measure every one. The standard deviation (or the variance) estimated from a small sample, together with the Gaussian distribution, tells you the probability that a part falls within your limits: 90%, 99%, 99.99% and so on. If you do take the square root you maintain the units of measurement. It's quite amazing how often the Gaussian distribution crops up in measurements and probability!
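The widget example can be sketched in a few lines, assuming the diameters are roughly Gaussian (all the numbers below - the 10 mm nominal diameter, 0.02 mm SD, and ±0.05 mm tolerance - are made up for illustration; the Gaussian CDF is built from the standard-library error function):

```python
# Sketch: predicting the fraction of parts inside tolerance limits
# from a sample mean and SD, assuming a Gaussian distribution.
import math

def normal_fraction_within(mean, sd, lo, hi):
    """Fraction of a Gaussian population lying between lo and hi."""
    cdf = lambda x: 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))
    return cdf(hi) - cdf(lo)

mean_d, sd_d = 10.00, 0.02   # mm, estimated from a production sample (made up)
lo, hi = 9.95, 10.05         # tolerance limits, here ±2.5 SD from the mean

frac = normal_fraction_within(mean_d, sd_d, lo, hi)
print(frac)  # ≈ 0.988, so roughly 1.2% of parts would fall outside the limits
```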
