Significance Tests / Hypothesis Testing
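Suppose someone suggests the hypothesis that a certain population mean is 0. Recalling the convoluted way in which statistics works, one way to test it would be to construct a 95% confidence interval for the population mean and see whether the interval contains 0.
We fail to reject the hypothesis if
$$\bar{x} - 1.96\,\mathrm{SEM} \le 0 \le \bar{x} + 1.96\,\mathrm{SEM},$$
which can be rewritten
$$-1.96 \le \frac{\bar{x}}{\mathrm{SEM}} \le +1.96.$$
On the other hand, we reject the hypothesis if
$$\frac{\bar{x}}{\mathrm{SEM}} < -1.96 \quad\text{or}\quad \frac{\bar{x}}{\mathrm{SEM}} > 1.96.$$
The statistic $\bar{x}/\mathrm{SEM}$ is denoted by the symbol t. The test can be summarized as: Reject the hypothesis that the population mean is 0 if and only if the absolute value of t is greater than 1.96.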
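The rule just stated can be sketched in a few lines of Python. This is only an illustration: the function names `one_sample_t` and `reject_at_05` and the measurements are hypothetical, and the 1.96 cutoff assumes a sample large enough for the normal approximation (the ten invented values are used for brevity).

```python
import math
import statistics

def one_sample_t(data, hypothesized_mean=0.0):
    """Return t = (sample mean - hypothesized mean) / SEM."""
    n = len(data)
    sem = statistics.stdev(data) / math.sqrt(n)  # standard error of the mean
    return (statistics.mean(data) - hypothesized_mean) / sem

def reject_at_05(t, critical_value=1.96):
    """Reject the hypothesis if and only if |t| exceeds the critical value."""
    return abs(t) > critical_value

# Hypothetical measurements, invented purely for illustration.
sample = [0.8, -0.3, 1.9, 0.4, 1.1, -0.6, 1.5, 0.7, 0.2, 1.3]
t = one_sample_t(sample)
print(round(t, 2), reject_at_05(t))
```

Here the sample mean is 0.7 and t is about 2.8, so the hypothesis that the population mean is 0 would be rejected.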
There is a 5% chance of obtaining a 95% CI that excludes 0 when it is in fact the population mean. For this reason, we say that this test has been performed at the 0.05 level of significance. Had a 99% CI been used, we would say that the test had been performed at the 0.01 level of significance. That is, the significance level (or simply the level) of the test is the probability of rejecting a hypothesis when it is true.
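The meaning of "level" can be checked by simulation. The sketch below is an illustration under stated assumptions, not part of the original discussion: it draws repeated samples of size 100 from a normal population whose mean really is 0 and counts how often |t| exceeds 1.96. The function name, sample size, number of trials, and seed are all hypothetical choices.

```python
import math
import random

def type_i_error_rate(n_trials=2000, n=100, critical_value=1.96, seed=1):
    """Fraction of samples from a mean-0 population whose |t| exceeds the cutoff."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        variance = sum((v - mean) ** 2 for v in sample) / (n - 1)
        sem = math.sqrt(variance / n)  # standard error of the mean
        if abs(mean / sem) > critical_value:
            rejections += 1  # a true hypothesis was rejected
    return rejections / n_trials

rate = type_i_error_rate()
print(rate)  # expected to land near 0.05
```

The observed rejection rate should be close to, though not exactly, 0.05, because the hypothesis being tested is true in every trial.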
Statistical theory says that in many situations where a population value is estimated by drawing random samples, the sample and population values will be within two standard errors of each other 95% of the time. That is, 95% of the time,
$$-1.96\,\mathrm{SE} \le \text{population value} - \text{sample value} \le 1.96\,\mathrm{SE} \qquad [*]$$
This is the case for means, differences between means, proportions, differences between proportions, and regression coefficients. After an appropriate transformation, this is the case for odds ratios and even correlation coefficients.
We have used this fact to construct 95% confidence intervals by restating the result as
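$$\text{sample value} - 1.96\,\mathrm{SE} \le \text{population quantity} \le \text{sample value} + 1.96\,\mathrm{SE}.$$
For example, a 95% CI for the difference between two population means, $\mu_x - \mu_y$, is given by
$$(\bar{x} - \bar{y}) \pm 1.96\,\mathrm{SE}(\bar{x} - \bar{y}).$$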
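As a worked illustration of a 95% CI for a difference between two means, the sketch below computes the interval from summary statistics. It assumes large samples and an unpooled standard error, sqrt(sd_x^2/n_x + sd_y^2/n_y); the function name `diff_ci_95` and the summary values are hypothetical.

```python
import math

def diff_ci_95(mean_x, mean_y, sd_x, sd_y, n_x, n_y, multiplier=1.96):
    """Return (lower, upper) limits for the difference in population means."""
    # Assumed form of the standard error of the difference (unpooled).
    se = math.sqrt(sd_x ** 2 / n_x + sd_y ** 2 / n_y)
    diff = mean_x - mean_y
    return diff - multiplier * se, diff + multiplier * se

# Hypothetical summary statistics, invented for illustration.
low, high = diff_ci_95(mean_x=52.0, mean_y=47.0, sd_x=12.0, sd_y=12.0,
                       n_x=100, n_y=100)
print(round(low, 2), round(high, 2))
```

With these inputs the interval runs from roughly 1.67 to 8.33.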
When we perform significance tests, we reexpress [*] by noting that 95% of the time
$$-1.96 \le \frac{\text{sample value} - \text{population quantity}}{\mathrm{SE}} \le 1.96.$$
Suppose you wanted to test whether a population quantity were equal to 0. You could calculate the value of
$$t = \frac{\text{sample value} - 0}{\mathrm{SE}} = \frac{\text{sample value}}{\mathrm{SE}},$$
which we get by inserting the hypothesized value of the population mean difference (0) for the population_quantity. If t<-1.96 or t>1.96 (that is, |t|>1.96), we say the data are not consistent with a population mean difference of 0 (because t does not have the sort of value we expect to see when the population value is 0) or "we reject the hypothesis that the population mean difference is 0". If t were -3.7 or 2.6, we would reject the hypothesis that the population mean difference is 0 because we've observed a value of t that is unusual if the hypothesis were true.
If $-1.96 \le t \le 1.96$ (that is, $|t| \le 1.96$), we say the data are consistent with a population mean difference of 0 (because t has the sort of value we expect to see when the population value is 0) or "we fail to reject the hypothesis that the population mean difference is 0". For example, if t were 0.76, we would fail to reject the hypothesis that the population mean difference is 0 because we've observed a value of t that is unremarkable if the hypothesis were true.
This is called "fixed level testing" (at the 0.05 level).
For example, if H0: $\mu_x = \mu_y$ (which can be rewritten H0: $\mu_x - \mu_y = 0$), the test statistic is
$$t = \frac{\bar{x} - \bar{y}}{\mathrm{SE}(\bar{x} - \bar{y})}.$$
If |t|>1.96, reject H0: $\mu_x = \mu_y$ at the 0.05 level of significance.
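The fixed level test and the confidence interval always agree: |t| exceeds 1.96 exactly when the 95% CI for the difference excludes 0. The sketch below checks this on two hypothetical sets of summary statistics. The helper names and the unpooled large-sample standard error are assumptions made for illustration.

```python
import math

def two_sample_t(mean_x, mean_y, sd_x, sd_y, n_x, n_y):
    """Return (t, se) for testing that two population means are equal."""
    se = math.sqrt(sd_x ** 2 / n_x + sd_y ** 2 / n_y)  # assumed unpooled SE
    return (mean_x - mean_y) / se, se

def ci_excludes_zero(diff, se, multiplier=1.96):
    """True when 0 lies outside diff +/- multiplier * se."""
    return not (diff - multiplier * se <= 0.0 <= diff + multiplier * se)

# Two hypothetical pairs of sample means; sd = 12 and n = 100 in each group.
cases = [(52.0, 47.0), (49.0, 47.0)]
results = []
for mean_x, mean_y in cases:
    t, se = two_sample_t(mean_x, mean_y, 12.0, 12.0, 100, 100)
    results.append((round(t, 2), abs(t) > 1.96,
                    ci_excludes_zero(mean_x - mean_y, se)))
print(results)
```

In the first case t is about 2.95, so H0 is rejected and the interval excludes 0. In the second t is about 1.18, so H0 is not rejected and the interval contains 0.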
When we were constructing confidence intervals, it mattered whether the data were drawn from normally distributed populations, whether the population standard deviations were equal, and whether the sample sizes were large or small. The answers to these questions helped us determine the proper multiplier for the standard error. The same considerations apply to significance tests. The answers determine the critical value of t for a result to be declared statistically significant.
When populations are normally distributed with unequal standard deviations and the sample size is small, the multiplier used to construct CIs is based on the t distribution with noninteger degrees of freedom. The same noninteger degrees of freedom appear when performing significance tests. Many ways to calculate the degrees of freedom have been proposed. The statistical program package SPSS, for example, uses the Satterthwaite formula
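$$\mathrm{df} = \frac{(w_x + w_y)^2}{\dfrac{w_x^2}{n_x - 1} + \dfrac{w_y^2}{n_y - 1}},$$
where $w_x = s_x^2/n_x$ and $w_y = s_y^2/n_y$, with $s_x$ and $s_y$ the sample standard deviations and $n_x$ and $n_y$ the sample sizes.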
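The Satterthwaite approximation is easy to compute directly. The sketch below uses one common way of writing it, df = (w_x + w_y)^2 / (w_x^2/(n_x - 1) + w_y^2/(n_y - 1)) with w = s^2/n for each group. The function name and the input values are hypothetical, and other software may present the formula in an algebraically equivalent form.

```python
def satterthwaite_df(sd_x, n_x, sd_y, n_y):
    """Approximate degrees of freedom for the unequal-variance t test."""
    w_x = sd_x ** 2 / n_x  # squared standard error of the first sample mean
    w_y = sd_y ** 2 / n_y  # squared standard error of the second sample mean
    return (w_x + w_y) ** 2 / (w_x ** 2 / (n_x - 1) + w_y ** 2 / (n_y - 1))

# Equal standard deviations and equal sample sizes: df equals n_x + n_y - 2.
df_equal = satterthwaite_df(12.0, 10, 12.0, 10)

# Unequal standard deviations and sizes: df is generally not an integer.
df_unequal = satterthwaite_df(5.0, 10, 20.0, 15)
print(df_equal, round(df_unequal, 2))
```

The second call gives roughly 16.5 degrees of freedom, which lies between the smaller of the two values n - 1 (here 9) and n_x + n_y - 2 (here 23).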
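When the critical region (the set of values of t for which the hypothesis is rejected) is {t: t < -1.96, t > 1.96}, the critical values are ±1.96 as in the picture above.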
The level of the test is denoted α (alpha). Rejecting the null hypothesis, H0, when it is true is called a type I error. Therefore, if the null hypothesis is true, α, the level of the test, is the probability of a type I error. α is also the power of the test when the null hypothesis, H0, is true. In the picture above, α is the proportion of the distribution colored in red. The choice of α determines the critical values. The tails of the distribution of t are colored in until the proportion filled in is α, which determines the critical values.
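How a choice of level fixes the critical values can be sketched with the standard normal distribution from Python's standard library. This assumes the large-sample case in which t is approximately standard normal; the function name is hypothetical.

```python
from statistics import NormalDist

def two_sided_critical_value(alpha):
    """Cutoff c such that the two tails beyond -c and +c together hold alpha."""
    # Each tail receives alpha / 2 of the distribution.
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

for alpha in (0.10, 0.05, 0.01):
    print(alpha, round(two_sided_critical_value(alpha), 3))
```

A level of 0.05 gives the familiar critical values of ±1.96, while 0.01 gives about ±2.576.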
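The probability of a type II error (failing to reject the null hypothesis when it is false) is denoted β (beta). By definition, power = 1 - β when the null hypothesis is false.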
The difference between type I & type II errors is illustrated by the following legal analogy. Under United States law, defendants are presumed innocent until proven guilty. The purpose of a trial is to see whether a null hypothesis of innocence is rejected by the weight of the data (evidence). A type I error (rejecting the null hypothesis when it is true) is "convicting the innocent." A type II error (failing to reject the null hypothesis when it is false) is "letting the guilty go free."
A common mistake is to confuse a type I or II error with its probability. α is not a type I error. It is the probability of a type I error. Similarly, β is not a type II error. It is the probability of a type II error.
There's a trade-off between α and β. Both are probabilities of making an error. With a fixed sample size, the only way to reduce the probability of making one type of error is to increase the other. For the problem of comparing population means, consider the rejection region whose critical values are ±∞. This excludes every possible difference in sample means. H0 will never be rejected. Since the null hypothesis will never be rejected, the probability of rejecting the null hypothesis when it is true is 0. So, α=0. However, since the null hypothesis will never be rejected, the probability of failing to reject the null hypothesis when it is false is 1, that is, β=1.
Now consider the opposite extreme: a rejection region whose critical values are 0,0. The rejection region includes every possible difference in sample means. This test always rejects H0. Since the null hypothesis is always rejected, the probability of rejecting H0 when it is true is 1, that is, α=1. On the other hand, since the null hypothesis is always rejected, the probability of failing to reject it when it is false is 0, that is, β=0.
To recap, the test with a critical region bounded by ±∞ has α=0 and β=1, while the test with a critical region bounded by 0,0 has α=1 and β=0. Now consider tests with intermediate critical regions bounded by ±k. As k increases from 0 to ∞, α decreases from 1 to 0 while β increases from 0 to 1.
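The trade-off can be traced numerically. The sketch below assumes t is approximately standard normal under the null hypothesis and is shifted by a fixed amount under one particular alternative. The shift of 2.5, the list of cutoffs k, and the function names are hypothetical values chosen only for illustration.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def alpha_for(k):
    """P(|t| > k) when the null hypothesis is true."""
    return 2.0 * (1.0 - Z.cdf(k))

def beta_for(k, shift):
    """P(|t| <= k) when t is centered at `shift` rather than at 0."""
    return Z.cdf(k - shift) - Z.cdf(-k - shift)

shift = 2.5  # hypothetical alternative, in standard error units
cutoffs = (0.0, 1.0, 1.96, 3.0, 6.0)
table = [(k, alpha_for(k), beta_for(k, shift)) for k in cutoffs]
for k, a, b in table:
    print(k, round(a, 3), round(b, 3))
```

As k grows, the first probability falls from 1 toward 0 while the second rises from 0 toward 1, matching the two extremes described above.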
Every statistics textbook contains discussions of α, β, type I error, type II error, and power. Analysts should be familiar with all of them. However, α is the only one that is encountered regularly in reports and published papers. That's because standard statistical practice is to carry out significance tests at the 0.05 level. As we've just seen, choosing a particular value for α determines the value of β.
The one place where β figures prominently in statistical practice is in determining sample size. When a study is being planned, it is possible to choose the sample size to set the power to any desired value for some particular alternative to the null hypothesis. To illustrate this, suppose we are testing the hypothesis that two population means are equal at the 0.05 level of significance by selecting equal sample sizes from the two populations. Suppose the common population standard deviation is 12. Then, if the population mean difference is 10, a sample of 24 subjects per group gives an 81% chance of rejecting the null hypothesis of no difference (power=0.81, β=0.19). A sample of 32 subjects per group gives a 91% chance of rejecting the null hypothesis of no difference (power=0.91, β=0.09). This is discussed in detail in the section on sample size determination.
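The power figures in this example can be roughly reproduced with a normal approximation. This is a sketch under that assumption, not the exact calculation: it treats the standard deviation as known, so its results run slightly above values computed from the t distribution. The function name `approx_power` is hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(n_per_group, mean_diff, sd, alpha=0.05):
    """Approximate power of a two-sided, two-sample test with equal group sizes."""
    z = NormalDist()
    se = sd * math.sqrt(2.0 / n_per_group)  # SE of the difference in means
    crit = z.inv_cdf(1.0 - alpha / 2.0)     # 1.96 when alpha = 0.05
    shift = mean_diff / se                  # center of t under the alternative
    # Probability that t lands in either tail of the rejection region.
    return (1.0 - z.cdf(crit - shift)) + z.cdf(-crit - shift)

for n in (24, 32):
    power = approx_power(n, mean_diff=10.0, sd=12.0)
    print(n, round(power, 2), round(1.0 - power, 2))
```

The approximation gives about 0.82 for 24 per group and about 0.92 for 32 per group, close to the 0.81 and 0.91 quoted in the text.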