Student's t test for independent samples is equivalent to a linear regression of the response variable on the grouping variable, where the grouping variable is recoded to have numerical values, if necessary.
Here's an example involving glucose levels in two strains of rats, A and B. First, the data are displayed in a dot plot. Then, Glucose is plotted against A0B1, where A0B1 is created by setting it equal to 0 for strain A and 1 for strain B.
Student's t test for independent samples yields

Variable: GLU

STRAIN    N          Mean       Std Dev     Std Error
-----------------------------------------------------
A        10   80.40000000   29.20502240    9.23543899
B        12   99.66666667   19.95601223    5.76080452

Variances         T      DF   Prob>|T|
---------------------------------------
Unequal     -1.7700    15.5     0.0965
Equal       -1.8327    20.0     0.0818
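The two t statistics in this output can be reproduced from the group summaries alone. The following is a minimal sketch in Python that uses only the N, Mean, and Std Dev values printed above. The variable names are illustrative, and the P values are left out because they require a t distribution function that the Python standard library does not provide.

```python
import math

# Summary statistics copied from the t test output above.
n_a, mean_a, sd_a = 10, 80.40000000, 29.20502240   # strain A
n_b, mean_b, sd_b = 12, 99.66666667, 19.95601223   # strain B

# The output's negative t values correspond to the difference A minus B.
diff = mean_a - mean_b

# Equal-variances version: pool the two sample variances.
df_pooled = n_a + n_b - 2
var_pooled = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df_pooled
t_pooled = diff / math.sqrt(var_pooled * (1 / n_a + 1 / n_b))

# Unequal-variances version with Satterthwaite's approximate degrees of freedom.
va, vb = sd_a**2 / n_a, sd_b**2 / n_b
t_unequal = diff / math.sqrt(va + vb)
df_unequal = (va + vb) ** 2 / (va**2 / (n_a - 1) + vb**2 / (n_b - 1))

print(f"Equal:   t = {t_pooled:.4f}, df = {df_pooled}")
print(f"Unequal: t = {t_unequal:.4f}, df = {df_unequal:.1f}")
```

Only the Equal row matters for the comparison with regression; the Unequal row is computed here just to show where the second line of the output comes from.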
The linear regression of glucose on A0B1 gives the equation $\mathrm{GLU} = b_0 + b_{\mathrm{A0B1}} \cdot \mathrm{A0B1}$.
Dependent Variable: GLU

Parameter Estimates

                  Parameter      Standard    T for H0:
Variable    DF     Estimate         Error  Parameter=0   Prob > |T|
INTERCEPT    1    80.400000    7.76436303       10.355       0.0001
A0B1         1    19.266667   10.51299725        1.833       0.0818
The P value for the Equal Variances version of the t test is equal to the P value for the regression coefficient of the grouping variable A0B1 (P = 0.0818). The corresponding t statistics are equal in magnitude (|t| = 1.833). This is not a coincidence: statistical theory says the two P values must be equal and the t statistics must be equal in magnitude. The signs will be the same if the t statistic is calculated by subtracting the mean of group 0 from the mean of group 1. Here the t test output reports the difference as A minus B, which is why its t statistic is negative (-1.8327) while the regression t statistic is positive (1.833).
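The regression estimates can also be recovered from the group summaries, which makes the link to the t test concrete. The sketch below assumes the 0/1 coding used here and relies on the standard closed-form results for least squares with a two-valued predictor. The variable names are illustrative and are not part of the original output.

```python
import math

# Summary statistics from the t test output (strain A is coded 0, strain B is coded 1).
n_a, mean_a, sd_a = 10, 80.40000000, 29.20502240
n_b, mean_b, sd_b = 12, 99.66666667, 19.95601223

# With a 0/1 predictor the fitted line passes through both group means.
b0 = mean_a               # intercept: predicted value at A0B1 = 0
b1 = mean_b - mean_a      # slope: change in prediction from A0B1 = 0 to A0B1 = 1

# The residual variance equals the pooled within-group variance.
df_resid = n_a + n_b - 2
s2 = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / df_resid

# Standard errors of the coefficients under 0/1 coding.
se_b0 = math.sqrt(s2 / n_a)
se_b1 = math.sqrt(s2 * (1 / n_a + 1 / n_b))

t_b0 = b0 / se_b0
t_b1 = b1 / se_b1   # same magnitude as the equal-variances t statistic

print(f"INTERCEPT  {b0:.6f}  {se_b0:.8f}  {t_b0:.3f}")
print(f"A0B1       {b1:.6f}  {se_b1:.8f}  {t_b1:.3f}")
```

The standard error of the slope is exactly the denominator of the equal-variances t statistic, so the two tests cannot disagree.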
The equal variances version of Student's t test is used to test the hypothesis of the equality of $\mu_A$ and $\mu_B$, the means of two normally distributed populations with equal population variances. The hypothesis can be written H0: $\mu_A = \mu_B$. The population means can be reexpressed as $\mu_A = \mu$ and $\mu_B = \mu + \delta$, where $\delta = \mu_B - \mu_A$ (that is, data from strain A are normally distributed with mean $\mu$ and standard deviation $\sigma$ while data from strain B are normally distributed with mean $\mu + \delta$ and standard deviation $\sigma$) and the hypothesis can be rewritten as H0: $\delta = 0$.
The linear regression model says data are normally distributed about the regression line with constant standard deviation $\sigma$. The predictor variable A0B1 (the grouping variable) takes on only two values. Therefore, there are only two locations along the regression line where there are data (see the display). "Homoscedastic (constant spread about the regression line) normally distributed values about the regression line" is equivalent to "two normally distributed populations with equal variances".
$\mu_A$ is equal to $\beta_0$, $\mu_B$ is equal to $\beta_0 + \beta_{\mathrm{A0B1}}$, and $\beta_{\mathrm{A0B1}}$ is equal to $\delta$. Thus, the hypothesis of equal means (H0: $\delta = 0$) is equivalent to the hypothesis that the regression coefficient of A0B1 is 0 (H0: $\beta_{\mathrm{A0B1}} = 0$). The population means are equal if and only if the regression line is horizontal. Since the probability structure is the same for the two problems (homoscedastic, normally distributed data), test statistics and P values will be the same, too.
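The equality of the test statistics can also be checked algebraically. The following derivation is a sketch for 0/1 coding. The symbols $n_A$, $n_B$ (group sizes), $\bar{y}_A$, $\bar{y}_B$ (sample means), $s_A$, $s_B$ (sample standard deviations), $S_{xx}$, and $S_{xy}$ are notation introduced here for illustration and do not appear in the output above.

```latex
\begin{align*}
S_{xx} &= \sum_i (x_i - \bar{x})^2 = \frac{n_A n_B}{n_A + n_B},
  \qquad
S_{xy} = \sum_i (x_i - \bar{x})\, y_i = \frac{n_A n_B}{n_A + n_B}\,(\bar{y}_B - \bar{y}_A), \\[4pt]
b_{\mathrm{A0B1}} &= \frac{S_{xy}}{S_{xx}} = \bar{y}_B - \bar{y}_A,
  \qquad
b_0 = \bar{y}_A, \\[4pt]
s^2 &= \frac{(n_A - 1)\, s_A^2 + (n_B - 1)\, s_B^2}{n_A + n_B - 2}
  \quad \text{(residual variance = pooled variance)}, \\[4pt]
\mathrm{SE}(b_{\mathrm{A0B1}}) &= \frac{s}{\sqrt{S_{xx}}}
  = s \sqrt{\frac{1}{n_A} + \frac{1}{n_B}}, \\[4pt]
t &= \frac{b_{\mathrm{A0B1}}}{\mathrm{SE}(b_{\mathrm{A0B1}})}
  = \frac{\bar{y}_B - \bar{y}_A}{s \sqrt{1/n_A + 1/n_B}} .
\end{align*}
```

The last expression is the equal-variances t statistic computed as group 1 minus group 0, with the same $n_A + n_B - 2$ degrees of freedom.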
The numbers confirm this. For strain A, the predicted value $b_0 + b_{\mathrm{A0B1}} \cdot 0$ is 80.400000 + 19.266667*0 = 80.40, the mean of strain A. For strain B, $b_0 + b_{\mathrm{A0B1}} \cdot 1$ is 80.400000 + 19.266667*1 = 99.67, the mean of strain B.

Had the numerical codes for strain been different from 0 and 1, the intercept and regression coefficient would change so that the two predicted values would continue to be the sample means. The P value for the regression coefficient would not change, and neither would the magnitude of its t statistic (the sign flips only if the coding reverses the order of the two strains).

The reason is that the best fitting line passes through the two points whose X-values are equal to the coded Strain values and whose Y-values are equal to the corresponding sample means. This minimizes the sum of squared differences between observed and predicted Y-values. Since this involves only two points and two points determine a straight line, the linear regression equation will always have the slope and intercept necessary to make the line pass through the two points. To put it another way, the two points define the regression line: the Y-values are the sample means, the X-values are determined by the coding scheme, and whatever the X-values, the slope and intercept are those of the line that passes through the two points.
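The effect of changing the coding scheme can be illustrated from the same summary statistics. The sketch below is an assumption-laden illustration, not output from the original analysis: the helper `fit_two_group_line` and the alternative codes (1/2, -1/+1, and the reversed 1/0) are hypothetical choices made here to show the pattern.

```python
import math

# Summary statistics from the t test output.
n_a, mean_a, sd_a = 10, 80.40000000, 29.20502240   # strain A
n_b, mean_b, sd_b = 12, 99.66666667, 19.95601223   # strain B


def fit_two_group_line(x_a, x_b):
    """Least-squares line for GLU when strain A is coded x_a and strain B is coded x_b.

    Returns (intercept, slope, t statistic for the slope).
    """
    n = n_a + n_b
    # The line through (x_a, mean_a) and (x_b, mean_b).
    slope = (mean_b - mean_a) / (x_b - x_a)
    intercept = mean_a - slope * x_a
    # Residual variance is the pooled within-group variance, whatever the codes.
    s2 = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n - 2)
    # Sum of squared deviations of the coded predictor about its mean.
    sxx = n_a * n_b / n * (x_b - x_a) ** 2
    t_slope = slope / math.sqrt(s2 / sxx)
    return intercept, slope, t_slope


for x_a, x_b in [(0, 1), (1, 2), (-1, 1), (1, 0)]:
    intercept, slope, t_slope = fit_two_group_line(x_a, x_b)
    pred_a = intercept + slope * x_a   # always the strain A mean
    pred_b = intercept + slope * x_b   # always the strain B mean
    print(f"codes ({x_a:2d},{x_b:2d}): intercept={intercept:9.4f} slope={slope:9.4f} "
          f"pred_A={pred_a:.2f} pred_B={pred_b:.2f} t={t_slope:.3f}")
```

The intercept and slope differ from one coding to the next, but the predicted values stay at the two sample means and the t statistic keeps its magnitude. Only the reversed coding changes its sign.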