- Cover the basics of hypothesis testing.
- Explain its dependence on the Central Limit Theorem.
What you are expected to know already:
- Basics of probability theory
- Central Limit Theorem (CLT)
- Gaussian Distribution
If you are a student of Statistics, Business Research, Data Analytics or
Business Analytics, you have probably heard a lot about the term ‘Hypothesis
Testing’. Chances are that you are already applying it without a clear picture
of what is going on. You may be unsure what a null hypothesis is, how it is
defined, how and when it is rejected, or what its rejection actually means. If
you have had any of these doubts and have not been able to clear them up until
now, you are at the right place. This article will try to clear the cloud
around hypothesis testing.
Before getting technical, let us start with a simple question to create some
curiosity. Someone gives you a big candy jar holding tens of thousands of
candies and claims that, on average, each candy weighs 10 grams. You have to
verify this claim. One way would be to weigh the whole contents of the jar,
count the candies in it and then calculate the average. But this seems
impractical, as counting tens of thousands of candies would take you days. So
you come up with another idea. You take just one hundred candies out of the
jar, weigh them and calculate their average. The average of these hundred
candies turns out to be just 8 grams. The question in front of you now is
whether to accept or reject the claim. Before you conclude anything, keep in
mind that the claim was about the whole candy jar and you have checked only
100 candies.
To answer such questions, a statistical technique called Hypothesis Testing
comes to your rescue. The idea goes like this:
Let’s suppose the jar really is filled with candies having an average weight
of 10 grams and, for the sake of understanding, let’s further suppose that you
took 1000 samples of 100 candies each from the jar, just as you took the first
one, and got the results depicted in the table below:
The above table says that out of 1000 samples, 80 samples had an average
weight of 7 grams, 120 samples had an average weight of 8 grams, 300 had an
average weight of 9 grams, and so on. We have simply grouped the samples
together on the basis of common average weight.
Now, since your particular sample gave you an average weight of 8 grams, you
can say that only 120 out of a thousand randomly selected samples look like
yours. In other words, the probability of your sample is 120/1000, i.e. 12%.
If 12% is a large enough number for you, your sample is consistent with a jar
whose average weight is 10 grams. Note that this reasoning works only because
we supposed at the start that the jar is as claimed, and that the table of
average weights came from that jar.
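The 12% figure is just a relative frequency. A minimal sketch of the arithmetic in Python, using only the three counts quoted above (the rest of the table is not reproduced here, and the variable names are illustrative):

```python
# Counts of samples by average weight (grams), as quoted in the text.
# Only the three groups mentioned above are listed; the remaining
# samples fall into other groups.
known_counts = {7: 80, 8: 120, 9: 300}
total_samples = 1000

observed_average = 8  # average weight of your own sample, in grams

# Probability of a sample like yours = its share of all samples.
probability = known_counts[observed_average] / total_samples
print(f"P(average = {observed_average} g) = {probability:.0%}")
```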
By now you might have a rough idea of where we are heading. The above
assumptions have cleared a bit of the cloud around the topic, but there are
still a lot of questions to be answered, like:
- Why only 1000 samples, when infinite random samples of size 100 are possible from the jar?
- How are the assumption that the jar has an average weight of 10 grams and the table of samples connected?
I will try to connect the dots, but before doing that it is time to understand
the Central Limit Theorem. You are expected to know the theorem already to
follow this article clearly, but let’s discuss it a little here too.
Central Limit Theorem (CLT)
The Central Limit Theorem states that if you have a population with mean μ and
standard deviation σ and you take sufficiently large random samples from the
population with replacement, then the sample means will be approximately
normally distributed. This holds regardless of whether the source population
is normal or not. Further, the mean of this sampling distribution will be
equal to the mean of the population, and its variance will be equal to the
variance of the population divided by the sample size. The larger the sample
size, the closer the sampling distribution comes to normal and the more
tightly it concentrates around the population mean.
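The theorem can be checked with a small simulation. The sketch below is only an illustration under assumed conditions: the population is a hypothetical, deliberately non-normal one (exponentially distributed weights with mean 10 grams, so its standard deviation is also 10 grams), and the sample count of 2000 is arbitrary. If the theorem holds, the sample means should have a mean near 10 and a variance near 10²/100 = 1.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

POP_MEAN = 10.0      # grams; the claimed average weight
SAMPLE_SIZE = 100    # candies per sample
NUM_SAMPLES = 2000   # how many samples we draw (arbitrary choice)

def draw_sample_mean():
    # Hypothetical, non-normal population: exponential weights with
    # mean 10 g, which also makes the standard deviation 10 g.
    weights = [random.expovariate(1 / POP_MEAN) for _ in range(SAMPLE_SIZE)]
    return statistics.fmean(weights)

sample_means = [draw_sample_mean() for _ in range(NUM_SAMPLES)]

mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.variance(sample_means)

# CLT prediction: mean = 10, variance = 10**2 / 100 = 1
print(f"mean of sample means:     {mean_of_means:.2f}")
print(f"variance of sample means: {var_of_means:.2f}")
```

The exponential population was picked only because it is clearly skewed; any population with a finite variance could be swapped in.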
I will try to fit the above statements into the current scenario to make them
easy to understand. Suppose you begin to draw random samples of size 100 from
the candy jar. Every time you draw a sample, you calculate its average weight,
put the candies back, draw another sample and repeat the process. You will get
all the possible values of the average weight, and when you group them on the
basis of common average weight you will get a table similar to the one shown
above. There would be one small change, though: the numbers you get won’t
necessarily be integers, they can be decimals like 8.3 grams. Grouping every
distinct decimal value won’t be feasible, hence the frequencies of the average
weights are recorded against class intervals.
For example, the above table
would be represented as:
This frequency table tells us that there were 80 samples whose average weight
came out between 7 and 8 grams, 120 samples whose average weight came out
between 8 and 9 grams, and so on. Each average-weight class has a frequency
against it, i.e. the number of times it occurred. To draw a distribution
graph, the frequencies are converted to relative frequencies, or probability
values. If you add up all the relative frequencies, you get 1.
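As a rough sketch of this grouping step, the Python below bins simulated sample means into 1-gram class intervals and converts the counts to relative frequencies. The jar is replaced by a hypothetical population (exponential weights with mean 10 grams), so the counts it prints will not match the table in this article.

```python
import math
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical stand-in for the jar: weights with a mean of 10 g.
def sample_mean(size=100, pop_mean=10.0):
    return sum(random.expovariate(1 / pop_mean) for _ in range(size)) / size

means = [sample_mean() for _ in range(1000)]

# Class interval [k, k+1) grams is labelled by its lower edge k.
frequencies = Counter(math.floor(m) for m in means)

# Relative frequency = frequency / total number of samples.
relative = {k: count / len(means) for k, count in sorted(frequencies.items())}

for lower, share in relative.items():
    print(f"{lower:>2}-{lower + 1:<2} g: {share:.3f}")

print("sum of relative frequencies:", round(sum(relative.values()), 6))
```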
If you make the class intervals very small and repeat the sampling an infinite number of times, the Central Limit Theorem suggests that the plot of average weight classes against the probability values will be as shown below:
Highlights of the above plot:
- The shape of the plot is normal.
- The mean of the plot is 10 grams (= mean of the population, as per our assumption).
- The variance of the plot is the variance of the population divided by 100 (sample size = 100).
This type of distribution plot, which we get after repeated sampling, is
called the sampling distribution.
So now, the CLT has given us a tool to inspect all the infinite random samples
that are possible from the candy jar, and not just 1000.
Back to the testing
Let’s now recall the original problem, where we took a single sample of 100
candies and got an average weight of 8 grams. Since the claimed average of 10
grams is higher than what we observed, we will calculate the probability of
getting an average weight of 8 grams or less when a random sample is drawn
from a jar with an overall average weight of 10 grams. If the probability is
high enough, our sample does not contradict the assumption; in other words,
the claim that the average weight is 10 grams cannot be nullified. If the
probability is too small, it means the chance of getting this sample from a
population with an average weight of 10 grams is very low, and since we still
got this sample, the claim that the population average is 10 grams is
doubtful. By now, the concept of validating a sample through the sampling
distribution should be clear. We will now proceed to the mathematics and the
formulation of the hypothesis.
Hypothesis Testing
Let us recall the previous assumptions. The average weight of candies in the
jar is 10 grams and, as per the CLT, the sampling distribution for the jar is
normal with its peak at 10 grams, as shown below (I know I keep repeating this
:p). As you are already aware, for a normal probability distribution the
probability between two points is given by the area under the curve between
those two points, as shown in the graph below (probability between 8 and 10):
If we want to calculate the
probability value analytically, we have to use the Gaussian equation:
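For reference, the standard form of the normal (Gaussian) density with mean μ and standard deviation σ, and the probability between two points a and b (labels chosen here for illustration), can be written as:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
P(a \le X \le b) = \int_{a}^{b} f(x)\,dx
```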
Applying this to the present context, we need to find the probability of
getting an average weight less than or equal to 8 grams, which would be given
by:
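Written out with the Gaussian density, this probability takes the following form. Here X̄ is a label introduced for the sample average, and μ and σ stand for the mean and standard deviation of the sampling distribution:

```latex
P(\bar{X} \le 8) = \int_{-\infty}^{8} \frac{1}{\sigma\sqrt{2\pi}}
\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) dx
```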
Let us inspect the various parameters in the above equation:
𝜇 = Mean of the sampling distribution
= Mean of the population
= 10 grams
𝜎 = Standard deviation of the sampling distribution
= standard deviation of the population/√n
= standard deviation of the population/√100
Changing Normal to Standard Normal
You might be aware that we can convert any normal distribution integral to a
standard normal integral by setting (x - mean)/standard deviation = Z
(Z would be the new variable in the standard normal equation).
You can read more
about normal distribution and standard normal distribution here
The standard normal distribution is the one with mean 0 and standard
deviation 1.
Now, in the above case the mean would be the population mean 𝜇 and the
standard deviation would be 𝜎/√n, where 𝜎 is now the population standard
deviation, so we have:
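With x̄ standing for the sample mean (a symbol introduced here for convenience) and σ for the population standard deviation, the substitution can be written as:

```latex
Z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
```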
The above quantity is called the Z-statistic (Zee statistic) and is directly
linked to hypothesis testing.
The advantage of calculating the Z-statistic from the mean and standard
deviation is that we can easily use an already tabulated Z-table to look up
the probability value. For example, the one available here.
Also, in most cases, as in the present one, we don’t know the standard
deviation of the population. In that scenario, we calculate the standard
deviation of the sample and use it as an estimate of the population standard
deviation.
Steps in Hypothesis Testing
- Assume the claim about the population to be true (null hypothesis).
- Take a sample and calculate its mean and standard deviation.
- Calculate the Z-statistic using the formula Z = (sample mean - population mean) / (standard deviation/√n).
- Use the Z-statistic to look up the probability value (called the p-value) in the Z-table.
- Reject or don’t reject the null hypothesis based on the p-value.
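The steps above can be sketched in Python as follows. This is an illustration rather than a definitive implementation: the function names and the list of weights are made up, the Z-table lookup is replaced by a direct calculation of the standard normal probability, and only the lower tail is tested because the question in this article is whether the average is below the claim.

```python
import math
import statistics

def normal_cdf(z):
    # P(Z <= z) for a standard normal variable; this replaces the
    # Z-table lookup. erfc stays accurate far out in the tail.
    return 0.5 * math.erfc(-z / math.sqrt(2))

def z_test_lower_tail(sample, claimed_mean, alpha=0.05):
    n = len(sample)
    sample_mean = statistics.fmean(sample)
    # Sample standard deviation used as the estimate of the population's.
    sample_sd = statistics.stdev(sample)
    z = (sample_mean - claimed_mean) / (sample_sd / math.sqrt(n))
    p_value = normal_cdf(z)
    # Reject the null hypothesis when the p-value is below alpha.
    return z, p_value, p_value < alpha

# Made-up weights (grams) purely for illustration, not real data;
# the ten values are repeated to give a sample of 40.
weights = [9.1, 10.4, 8.7, 9.9, 10.8, 9.3, 8.9, 10.1, 9.6, 9.2] * 4

z, p_value, reject = z_test_lower_tail(weights, claimed_mean=10.0)
print(f"Z = {z:.2f}, p-value = {p_value:.4f}, reject null: {reject}")
```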
Regarding p-value
You may ask what value of probability is low enough to reject the null
hypothesis. The answer is that it varies from case to case and is chosen by
whoever tests the hypothesis. The probability value used as the threshold is
called the significance level. A 5% significance level is used in most cases.
Completing the case
Coming back to the candy jar, the things that we know so far are:
Sample mean = 8 grams
Population mean = 10 grams
Population standard deviation = 2 grams (let’s assume some value for it here;
in an actual scenario you can calculate it from the sample you got)
Sample size (n) = 100
Significance level = 5% (let’s settle on 5%)
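Plugging these values into Z = (sample mean - population mean) / (standard
deviation/√n) gives Z = (8 - 10) / (2/√100) = -10. This Z-statistic can now be
used to calculate the p-value (probability) and find how significant our
sample is. We will use the Z-table available here.
If you look up the Z-table, you will find that the p-value is extremely low
(much lower than our significance level); in fact Z = -10 lies beyond the
range most tables print. Hence, we can safely reject the null hypothesis that
the average weight in the candy jar is 10 grams.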
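The same calculation can be done without a table. A minimal sketch in Python, using the values listed above (the 2-gram standard deviation is the assumed value from this article, and the variable names are illustrative):

```python
import math

sample_mean = 8.0          # grams
population_mean = 10.0     # grams, the claim under test
population_sd = 2.0        # grams, assumed value from the text
n = 100                    # sample size
significance_level = 0.05  # 5%

# Standard deviation of the sampling distribution: 2 / sqrt(100) = 0.2
standard_error = population_sd / math.sqrt(n)
z = (sample_mean - population_mean) / standard_error

# Lower-tail probability P(Z <= z) for a standard normal variable.
p_value = 0.5 * math.erfc(-z / math.sqrt(2))

print(f"Z = {z:.1f}")
print(f"p-value = {p_value:.2e}")
print("reject null hypothesis:", p_value < significance_level)
```

The complementary error function is used instead of a plain table value because the probability this far into the tail is too small for a printed Z-table to show.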
Further Reading
If you find the above concepts interesting,
you can further read the following topics to know more about this field of
statistics.
Thanks for reading this
Have a good time 😊