# Thread: Difference between sample and population standard deviation

1. Newbie
Join Date
Feb 2021
Posts
2

## Difference between sample and population standard deviation

Hi,

I am new to this forum. I didn't find my answer from an existing forum post.
That's why I am asking it here.

I have 2 questions:
1. what's the difference between sample and population standard deviation?
2. How to find the sample and population variance, mean, and SD?

This sounds like a "homework problem," and the answers to those are not typically supplied here at Cosmoquest; the preference is that you learn the details yourself from a text or online publication. Or maybe you'll get a further response, you never know.

But the idea of sampling a small percentage of the population and getting a good estimate for the entire population is a pretty useful tool, and the math will tell you whether your estimate is accurate to plus or minus 3% or 5% or whatever. I always add that the sampling must be random over the entire population, a restriction I sometimes wonder about with the various polling outfits....

3. Order of Kilopi
Join Date
Mar 2010
Location
Australia
Posts
7,395
If you take a look at the Wikipedia page https://en.wikipedia.org/wiki/Standa...ion#Estimation it shows the population and sample standard deviations. The section also covers the variance, because it is just the square of the SD. Hopefully you can get what you need from this page. If you have questions from it, feel free to ask them - I think people find it easier to clarify something than to try to teach the subject!

The means are simpler - you calculate them in exactly the same way, you just have to be very clear which you are calculating and what you can do with it.
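To illustrate the point above, here is a minimal Python sketch (made-up numbers, standard library only): the mean is computed the same way in both cases, while the two standard deviations differ only in the divisor.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# The mean is the same calculation either way: add up and divide by N.
mean = statistics.mean(data)      # 40 / 8 = 5

# The population SD divides the summed squared deviations (32) by N = 8...
pop_sd = statistics.pstdev(data)  # sqrt(32/8) = 2.0
# ...while the sample SD divides by N-1 = 7 (Bessel's correction).
samp_sd = statistics.stdev(data)  # sqrt(32/7), about 2.14
```

The sample SD always comes out a little larger than the population SD on the same numbers, and the gap shrinks as N grows.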

4. Originally Posted by Shaula
The means are simpler - you calculate them in exactly the same way, you just have to be very clear which you are calculating and what you can do with it.
I'm not sure in what sense you calculate them in exactly the same way.

A population moment can be calculated when you know the probability distribution. For the mean, it is the sum of the probabilities of each outcome times the value of the outcome, if the distribution is discrete. If it is a continuous distribution, it is the integral of the probability density function times the outcome of the variable. More generally, it is the integral of the outcome with respect to the probability measure (instead of Lebesgue measure) which includes both discrete and continuous random variables as a special case, but also covers probability distributions that are neither continuous nor discrete.

A sample mean is an estimate of the population mean (usually - but you can calculate a sample mean even when the population mean is not defined, as is the case for a Cauchy distribution), and you just add up all the observations of the random variable, and divide by the number of observations.

If you have a uniform distribution (so each outcome has the same probability) and you have an enumeration of all possible outcomes, I guess you could say you calculate the population mean the same way you calculate the sample mean, but that's a pretty special case.

5. Order of Kilopi
Join Date
Mar 2010
Location
Australia
Posts
7,395
Originally Posted by 21st Century Schizoid Man
I'm not sure in what sense you calculate them in exactly the same way.
What I meant was that you add up all the things in your dataset and divide by the number of things in your dataset. Same calculation. Whereas for the population vs sample standard deviation the second is calculated using a different formula (applying Bessel's factor). Sorry if that was poorly explained.

In either case it is very important to know which you are dealing with and to bear in mind that sample statistics are at best an estimator of the population statistics.

6. Originally Posted by Shaula
What I meant was that you add up all the things in your dataset and divide by the number of things in your dataset. Same calculation.
If we have a "dataset", it sounds like we are talking about a sample. Unless the dataset includes all possible outcomes, in which case we are already restricted to a finite population.

The population moment is based on the probability distribution, and the calculation then is to multiply each possible outcome by the probability of that outcome occurring, and then adding them all up. It's only the same calculation if the probability of each outcome is the same.

Suppose you throw two of the usual six-sided dice and add up the numbers, and that is your random variable. The outcomes run from 2 to 12, but they are not equally likely. You can't just add up the numbers from 2 to 12 and divide by 11.

I suppose you could say your population is {2,3,3,4,4,4,5,5,5,5,6,6,6,6,6,7,7,7,7,7,7,8,8,8,8,8,9,9,9,9,10,10,10,11,11,12}, counting 3+4 as a distinct outcome from 4+3, then add them all up and divide by 36, which gives the population mean of seven. But this only works when you can enumerate a list of equally probable outcomes. If your outcomes are things like "the rocket lifts off the launchpad" and "the rocket explodes on the launchpad", we're going to need probabilities, which are implicit in the dice example (when you divide by 36, you are multiplying by the probability of an outcome, 1/36).
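The dice example can be checked directly in a few lines of Python (standard library only): the enumerate-and-divide shortcut and the general value-times-probability calculation agree, precisely because the 36 ordered outcomes are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 36 equally likely ordered outcomes (3+4 counted separately from 4+3).
totals = [a + b for a, b in product(range(1, 7), repeat=2)]

# "Add them all up and divide by 36" works here only because each
# enumerated outcome has the same probability, 1/36.
mean_by_enumeration = Fraction(sum(totals), len(totals))  # 7

# The general population-mean calculation: each value times its probability.
mean_by_probability = sum(Fraction(v) * Fraction(c, 36)
                          for v, c in Counter(totals).items())

assert mean_by_enumeration == mean_by_probability == 7
```

For a distribution without such an enumeration, only the second calculation is available.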

Originally Posted by Shaula
Whereas for the population vs sample standard deviation the second is calculated using a different formula (applying Bessel's factor).
The same issue as above would apply to the standard deviation. The use of "divide by N-1" instead of "divide by N" makes the sample variance an unbiased estimator of the population variance. (No such luck for the standard deviation - taking the square root reintroduces a bias.) But this is really a choice - dividing by N to estimate the variance of a random variable from a sample is not wrong, it's just a different method with different statistical properties. If you are estimating the variance from a sample of a random variable that is believed to be normally distributed, dividing by N gives the maximum likelihood estimate. It's biased, but it has lower mean squared error than the N-1 method. It's not wrong, it's just different. (If your sample is very large, the difference between the two methods becomes very small.)
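The bias point is easy to see by simulation. A quick Python sketch (made-up parameters, standard library only): averaged over many small samples from a distribution with true variance 1, the N-1 divisor lands near 1, while the N divisor lands near (N-1)/N.

```python
import random

random.seed(0)

n, trials = 5, 100_000
avg_div_n = 0.0   # running average of the "divide by N" estimates
avg_div_n1 = 0.0  # running average of the "divide by N-1" estimates
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]  # true variance = 1
    m = sum(xs) / n
    ss = sum((x - m) ** 2 for x in xs)
    avg_div_n += (ss / n) / trials
    avg_div_n1 += (ss / (n - 1)) / trials

# Dividing by N-1 averages out near the true variance 1.0 (unbiased);
# dividing by N averages out near (n-1)/n = 0.8 (biased low).
print(avg_div_n1, avg_div_n)
```

Note this shows unbiasedness of the variance estimator; the square roots of these estimates would both be biased estimates of the SD.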

To take an example, suppose some item has a probability p of failure during each unit of time, if it hasn't already failed. What is the mean time to failure? (Pretty much the same problem if we ask the mean time until an atom of a radioactive element decays.) Well, you can take a sample of the items, measure how long it is until each one fails, add up the times, and divide by the size of your sample, and that gives you a sample mean time to failure, which is a method of estimating the population mean. But to calculate the population mean (which is possible with the information given) is a rather different calculation - it is 1*p + 2*(1-p)*p + 3*(1-p)^2*p + <keep going forever>, which works out to 1/p. I'm having trouble seeing how we can just add up some group of numbers and divide by the number of observations we have. Maybe there is a way to do it, but it seems very contrived to me. For radioactive decay, I suppose we could take our sample to be every atom of the radioactive element in the universe, but then we're going to need to wait a long time to measure how long it is until each one decays.
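The failure example can be put side by side in code. A Python sketch with an illustrative p (standard library only): the population mean is the closed-form 1/p, while the sample mean requires simulating actual observations and averaging them - two genuinely different calculations.

```python
import random

random.seed(1)
p = 0.2  # chance of failing in each time unit, given survival so far

# Population mean time to failure for this model: 1/p (here 5.0).
population_mean = 1 / p

def time_to_failure() -> int:
    # Simulate one item: count time units until the first failure.
    t = 1
    while random.random() >= p:  # survives this unit with probability 1-p
        t += 1
    return t

# Sample mean: observe many items and average their failure times.
n = 100_000
sample_mean = sum(time_to_failure() for _ in range(n)) / n
print(sample_mean)  # close to 5.0, but only an estimate
```

No finite enumeration of equally likely outcomes reproduces the population calculation here, since the possible failure times are unbounded.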

If you have a finite population of equally probable outcomes, I agree, you can take your "sample" to be the entire population, and calculate the mean the same way. But this is a pretty restricted special case.

7. Newbie
Join Date
Feb 2021
Posts
2
Hello guys,

I have read your replies, but I found them a bit confusing to follow.
I got my answer in detail on this website: https://standard-deviation-calculator.net

8. If events are rare, like some failures, and not related to each other, i.e. random, the Poisson distribution applies and the number of events by time is the same as the variance. The expected probability integer number in a given time, or number of trials, is then found by a different formula incorporating the Euler number.

9. Originally Posted by profloater
If events are rare, like some failures, and not related to each other, i.e. random,
I would not characterise "not related to each other" as "random" - random variables can be dependent. But it now seems to be common usage in everyday talk - if something is "random", people think it must have a uniform distribution, with each outcome having equal probability. If some outcomes have more probability than others, then they say that this is not "random". But a more precise way to characterise "not related to each other" would be "independent".

Originally Posted by profloater
the Poisson distribution applies and the number of events by time is the same as the variance.
The word "expected" is missing before "number of events". For a Poisson distribution, the expected value is the same as the variance. But the "number of events" is a non-constant random variable, and the "variance" is a number, so those two cannot be equal. At least not in all possible outcomes.

Originally Posted by profloater
The expected probability integer number in a given time, or number of trials, is then found by a different formula incorporating the Euler number.
I am not sure what "number of trials" in the above is supposed to mean. Is it supposed to be the number of occurrences? The meaning of "expected probability integer number" is also not clear - is this the expected number of occurrences in a unit of time?

If the Euler number is e (about 2.718), then I am not sure how that comes into it. There is not a single Poisson distribution, but an entire family of distributions, normally parameterised by "lambda", which turns out to be both the expected value and the variance. The probability mass function of a Poisson does have an "e" in it. Maybe there is some other way of parameterising the distribution such that an "e" shows up in the mean and variance. But even in that case, the mean would be some function that has an "e" in it, and also an unknown parameter, and the problem of finding the population mean has simply been transformed into the problem of finding the unknown parameter.

But if parameterised by lambda, then if you know the value of lambda, you know the population mean already (and the population variance, which is the same). If you don't know lambda, then it needs to be estimated in some way, which could be done using a sample average - the same way you normally find the sample average of any random variable. (The fact that it is a Poisson distribution would be completely irrelevant in the calculation of the sample average.)
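To make the estimation point concrete, here is a Python sketch (illustrative lambda, standard library only; the Poisson draws use Knuth's multiply-uniforms method since the standard library has no Poisson generator). The sample average is computed exactly as for any random variable, and it estimates lambda.

```python
import math
import random

random.seed(2)
lam = 3.5  # pretend this parameter is unknown to us

def poisson_draw(lam: float) -> int:
    # Knuth's method: multiply uniforms until the product drops below e^(-lam);
    # the number of multiplications needed is Poisson(lam) distributed.
    threshold = math.exp(-lam)
    k, prod = 0, random.random()
    while prod > threshold:
        k += 1
        prod *= random.random()
    return k

# The ordinary sample average estimates lambda - and for a Poisson,
# lambda is both the population mean and the population variance.
n = 50_000
estimate = sum(poisson_draw(lam) for _ in range(n)) / n
print(estimate)  # close to 3.5
```

Nothing in the averaging step uses the fact that the data are Poisson; that only matters for interpreting the estimate.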

However, if you want to find the population average, that brings us back to the point that I commented on - the way you do this is not to take some dataset, add up all the numbers, and divide by the number of observations. That is a sample average, and if you could choose your "sample" to be the entire population, then it would work. But here, the population has infinitely many possible outcomes (very high numbers of occurrences are unlikely, but possible), and the outcomes do not all have the same probability (since there are infinitely many, they can't all have the same probability, because the probabilities have to add up to one). This is a case where you clearly cannot find a population average the same way you find a sample average. That only works with a finite population, with each outcome having equal probability.

For what it's worth, if we have the usual assumptions that cause the number of events per time period to have a Poisson distribution, then the time between events has an exponential distribution. I posed the problem with the failures as an expected time until failure, with the underlying distribution being exponential. However, this is equivalent to asking the expected number of failures per unit of time. ("Equivalent" in the sense that if you know one, you can find the other - not that they are the same number.)

But either distribution will serve the purpose here. Suppose we have either an exponential distribution (the way I posed the failure problem, the random variable being time until next failure) or a Poisson distribution (the random variable is the number of failures per unit of time). How to find the population mean? We cannot find it the same way we find the sample mean.
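The exponential/Poisson equivalence mentioned above can be sketched in Python (illustrative rate, standard library only): generate exponential gaps with rate lambda, and the counts of events per whole unit of time come out with mean lambda, while the gaps themselves have mean 1/lambda.

```python
import itertools
import random

random.seed(3)
lam = 2.0  # event rate: on average 2 events per unit of time

# Exponential view: gaps between events, population mean 1/lam = 0.5.
n = 200_000
gaps = [random.expovariate(lam) for _ in range(n)]
mean_gap = sum(gaps) / n

# Poisson view: count how many events fall in each whole unit of time.
event_times = list(itertools.accumulate(gaps))
whole_units = int(event_times[-1])
counts = [0] * (whole_units + 1)
for t in event_times:
    counts[int(t)] += 1
mean_count = sum(counts[:whole_units]) / whole_units

print(mean_gap)    # close to 1/lam = 0.5
print(mean_count)  # close to lam = 2.0
```

Knowing either mean determines the other (they are reciprocals), which is the sense of "equivalent" used above.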

10. Originally Posted by 21st Century Schizoid Man
I would not characterise "not related to each other" as "random" - random variables can be dependent. But it now seems to be common usage in everyday talk - if something is "random", people think it must have a uniform distribution, with each outcome having equal probability. If some outcomes have more probability than others, then they say that this is not "random". But a more precise way to characterise "not related to each other" would be "independent".

The word "expected" is missing before "number of events". For a Poisson distribution, the expected value is the same as the variance. But the "number of events" is a non-constant random variable, and the "variance" is a number, so those two cannot be equal. At least not in all possible outcomes.

I am not sure what "number of trials" in the above is supposed to mean. Is it supposed to be the number of occurrences? The meaning of "expected probability integer number" is also not clear - is this the expected number of occurrences in a unit of time?

If the Euler number is e (about 2.718), then I am not sure how that comes into it. There is not a single Poisson distribution, but an entire family of distributions, normally parameterised by "lambda", which turns out to be both the expected value and the variance. The probability mass function of a Poisson does have an "e" in it. Maybe there is some other way of parameterising the distribution such that an "e" shows up in the mean and variance. But even in that case, the mean would be some function that has an "e" in it, and also an unknown parameter, and the problem of finding the population mean has simply been transformed into the problem of finding the unknown parameter.

But if parameterised by lambda, then if you know the value of lambda, you know the population mean already (and the population variance, which is the same). If you don't know lambda, then it needs to be estimated in some way, which could be done using a sample average - the same way you normally find the sample average of any random variable. (The fact that it is a Poisson distribution would be completely irrelevant in the calculation of the sample average.)

However, if you want to find the population average, that brings us back to the point that I commented on - the way you do this is not to take some dataset, add up all the numbers, and divide by the number of observations. That is a sample average, and if you could choose your "sample" to be the entire population, then it would work. But here, the population has infinitely many possible outcomes (very high numbers of occurrences are unlikely, but possible), and the outcomes do not all have the same probability (since there are infinitely many, the can't have the same probability, since the probabilities have to add up to one). This is a case where you clearly cannot find a population average the same way you find a sample average. That only works with a finite population, with each outcome having equal probability.

For what it's worth, if we have the usual assumptions that cause the number of events per time period to have a Poisson distribution, then the time between events has an exponential distribution. I posed the problem with the failures as an expected time until failure, with the underlying distribution being exponential. However, this is equivalent to asking the expected number of failures per unit of time. ("Equivalent" in the sense that if you know one, you can find the other - not that they are the same number.)

But either distribution will serve the purpose here. Suppose we have either an exponential distribution (the way I posed the failure problem, the random variable being time until next failure) or a Poisson distribution (the random variable is the number of failures per unit of time). How to find the population mean? We cannot find it the same way we find the sample mean.
https://en.m.wikipedia.org/wiki/Poisson_distribution

Well, when I studied statistics I was content with an expected number of events of Np, and by "trials" I mean, for example, attempts to start an engine with a low probability of failure. Unrelated rare events, like deaths from horse kicks in the cavalry, sounds pretty much like rare and random, but as with binomials we assume events are not related. In life, a second engine failure may well be causally linked to a first engine failure and thus a new probability must be calculated. I guess fatal horse kicks could come in clusters too, with various hypothetical links. That is often the problem with real-life trials: they are not simple like fair six-sided dice. If our OP is indeed answering a homework question, we have added confusion.

In a case of real sampling in production, I discovered the samples were taken when the operator saw or felt there was a change! Or they would sample at the beginning and end of a run, or between tool changes or shifts! There is a quoted case of recorded heights in men with an upward kink at 6 feet: that being seen as a desirable height, a man was recorded as 6’ when actually 5’ 11 3/4 or 6’ 0 1/4. That would mess up both sample and complete data sets!


11. Originally Posted by profloater
Unrelated rare events, like deaths from horse kicks in the cavalry, sounds pretty much like rare and random, but as with binomials we assume events are not related. In life, a second engine failure may well be causally linked to a first engine failure and thus a new probability must be calculated.
I don't have any problem with the assumption that the events are unrelated in the particular context. My issue is the use of the word "random" to mean the same thing as "independent". (If I am interpreting the quoted statement above correctly, it was just done again.)

The way I've always seen it (except in everyday talk), "random" just means that there is uncertainty. If I have a six-sided die that is weighted so some numbers are more likely than others, so I can cheat people at gambling games, that doesn't mean it's not "random" - it just means the distribution of the outcome is not the uniform distribution, since some numbers occur with higher probability than others. And yet I have a very distinct recollection of someone at this very board declaring with great confidence that some event was not random, because there was a higher probability of it going one way or the other. This idea that something is only random if it is 50/50 (and that if it is 55/45, then it is not random) is a completely different definition of "random" than any I've seen. And then people also use it to mean independent. If the value of X is 1 if I catch COVID and 0 if I don't, and the value of Y is 1 if my friend catches COVID and 0 if my friend doesn't, those events are dependent. We meet sometimes, and there is a chance one of us will catch it from the other. But that doesn't mean that X and Y are not random. There wouldn't be much point in regression analysis if "random" were the same thing as "independent".

But on the point that caused me to enter this thread, I remain of the opinion that the idea that the sample mean and population mean are calculated the same way, is only correct (or even meaningful) if there is a finite population of outcomes with each outcome having equal probability. A Poisson distribution is an example where you absolutely cannot calculate the population mean the same way you calculate the sample mean. So is an exponential distribution. Any continuous distribution has infinitely many possible outcomes, or an infinite "population". (A Poisson distribution is not a continuous distribution, but it still has infinitely many outcomes.) But even when there is a finite population, the probability of each outcome doesn't have to be the same. The idea that sample and population mean are calculated the same way is only true in a very specific context.

Originally Posted by profloater
If our OP is indeed answering a homework question, we have added confusion.
If the OP wants very specific answers, rather than a general discussion, I think there needs to be a lot more context. In particular, it would be helpful to know if there are any implicit assumptions here about what the "population" is.

12. Originally Posted by 21st Century Schizoid Man
I don't have any problem with the assumption that the events are unrelated in the particular context. My issue is the use of the word "random" to mean the same thing as "independent". (If I am interpreting the quoted statement above correctly, it was just done again.)

[...]
I was quoting from the WP article I found to refresh my memory of Poisson, which I learned about a long time ago and used in various engineering sampling projects. I agree that there should be a difference between "random" and "unrelated", although with Poisson, and the subsequent academic work, there is a difference of kind. We know that random events can cluster, as can events in a Poisson assumption. Examples from opinion surveys are a different kettle of fish; polling has become much more sophisticated, while often still unsatisfactory.

In election polling, for example, it is assumed that non-voters will fall into the same ratios as voters. This is clearly a questionable assumption. It is also clear that geography matters, for various reasons. My experience is in the questionable area of predicting engineering failure, which highlights the problems of taking historical statistics and projecting them into the future.

13. Originally Posted by profloater
We know that random events can cluster, as can events in a Poisson assumption.
Under the typical Poisson assumptions, occurrence of events is independent, and they still cluster. The clustering is just due to random variation - sometimes there are more occurrences, sometimes there are fewer. That's something different than dependence, where the probability distribution of one event depends on the outcome for another event.
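A quick Python illustration of clustering under full independence (made-up rate, standard library only): with exponential gaps between events, a sizeable fraction of gaps are very short (events bunching up) and a sizeable fraction are very long (apparent droughts), with no dependence anywhere.

```python
import random

random.seed(4)

# Fully independent events at rate 1 per unit time: exponential gaps.
gaps = [random.expovariate(1.0) for _ in range(10_000)]

# Independence does not prevent clustering; it's ordinary random variation.
very_short = sum(g < 0.1 for g in gaps)  # expect about 1 - e^-0.1, ~9.5%
very_long = sum(g > 3.0 for g in gaps)   # expect about e^-3, ~5%
print(very_short, very_long)
```

Dependence would mean the distribution of one gap shifts based on another's outcome, which is a different thing from this kind of bunching.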

14. Established Member
Join Date
Jun 2009
Posts
1,920
Originally Posted by 21st Century Schizoid Man
I'm not sure in what sense you calculate them in exactly the same way.

A population moment can be calculated when you know the probability distribution. For the mean, it is the sum of the probabilities of each outcome times the value of the outcome, if the distribution is discrete. If it is a continuous distribution, it is the integral of the probability density function times the outcome of the variable. More generally, it is the integral of the outcome with respect to the probability measure (instead of Lebesgue measure) which includes both discrete and continuous random variables as a special case, but also covers probability distributions that are neither continuous nor discrete.

A sample mean is an estimate of the population mean (usually - but you can calculate a sample mean even when the population mean is not defined, as is the case for a Cauchy distribution), and you just add up all the observations of the random variable, and divide by the number of observations.
That is an accurate answer. To sort out additional ambiguity, we can consider 3 distinct meanings of "mean". These are:

1. Population mean: the mean of a random variable. This is defined (i.e. "calculated") in terms of the distribution of the random variable.

2. Sample mean as a single number (e.g. 32.9): The mean of a specific set of samples of a random variable calculated by applying the usual formula to the set of values.

3. Sample mean as a random variable. The formula for calculating the mean of samples defines a random variable whose value depends on the samples, since the samples can themselves be considered random variables. This "sample mean" also has a probability distribution, and, as a random variable, it has a mean and a variance of its own. Its mean and variance are "population" parameters, defined ("calculated") in terms of the probability distribution of the sample mean.
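Meaning 3 can be demonstrated by simulation. A Python sketch (made-up mu, sigma, and n, standard library only): realise the sample-mean random variable many times; its own mean comes out near mu and its own variance near sigma^2/n.

```python
import random
import statistics

random.seed(5)

mu, sigma, n = 10.0, 2.0, 16  # population mean/SD of the draws, sample size

def sample_mean() -> float:
    # One realisation of the "sample mean" random variable (meaning 2 above).
    return sum(random.gauss(mu, sigma) for _ in range(n)) / n

# Many realisations reveal its distribution (meaning 3): it has its own
# population mean (mu) and its own population variance (sigma^2 / n).
means = [sample_mean() for _ in range(20_000)]
print(statistics.mean(means))      # close to mu = 10
print(statistics.variance(means))  # close to sigma^2 / n = 0.25
```

The shrinking variance sigma^2/n is exactly why larger samples give tighter estimates of the population mean.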