Wilcoxon Signed-Rank Test and How to Read the Output

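Observations before and after treatment (small sample, N=11)

Subject	Treatment A	Treatment B	Difference (B - A)	Rank of |difference|	Signed rank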
1	25	27	2	2	2
2	25	29	4	3.5	3.5
3	27	37	10	6	6
4	44	56	12	7	7
5	30	46	16	10	10
6	67	82	15	8.5	8.5
7	53	57	4	3.5	3.5
8	53	80	27	11	11
9	52	61	9	5	5
10	60	59	-1	1	-1
11	28	43	15	8.5	8.5
					W=64

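The Wilcoxon signed-rank test ranks the absolute values of the paired differences, ignoring their signs, and then attaches to each rank the sign of its difference. Summing the signed ranks by sign gives two rank sums, one positive (65) and one negative (-1); their total is W = 64, the value shown in the table. The test is then based on the rank sum with the larger absolute value.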
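The ranking and signing steps can be sketched in Python using only the standard library. This is a minimal illustration on the N=11 data from the table above; the helper name `average_ranks` and the variable names are made up for this sketch, and it assumes no zero differences (true for these data).

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

a = [25, 25, 27, 44, 30, 67, 53, 53, 52, 60, 28]   # treatment A
b = [27, 29, 37, 56, 46, 82, 57, 80, 61, 59, 43]   # treatment B

diffs = [y - x for x, y in zip(a, b)]               # B minus A
ranks = average_ranks([abs(d) for d in diffs])      # ranks of |difference|
signed = [r if d > 0 else -r for r, d in zip(ranks, diffs)]

pos_sum = sum(s for s in signed if s > 0)           # sum of positive signed ranks
neg_sum = sum(s for s in signed if s < 0)           # sum of negative signed ranks
w = pos_sum + neg_sum                               # net signed-rank sum

print(signed)
print(pos_sum, neg_sum, w)
```

The positive ranks sum to 65 and the single negative rank contributes -1, so the net signed-rank sum is the W = 64 shown in the table.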

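The reasoning is that if the null hypothesis holds (that is, the treatment does not change the observed values), W should be close to 0: the rank sum of the positive changes and the rank sum of the negative changes should be about equal.

W is then tested. Here the differences are taken as B minus A, so a positive W = 64 together with a significant result (p < 0.05) leads to the conclusion that treatment B has a significant effect on the observed values.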

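How should the SPSS output (the two tables below) be read? SPSS usually bases the test on the smaller of the two rank sums (1890.50 in Table 2). In this example Z = -2.321 and p < 0.05, so the A and B scores differ. But which one is higher?

Just as one compares means after a significant t-test, one option is to read the direction from the mean ranks. By mean rank, 54.01 > 49.40, so the positive ranks exceed the negative ranks; footnote b of Table 2 applies, that is, A > B.

Others judge the direction from the counts when the test is significant. Here 66 > 35, so footnote a of Table 2 applies, that is, A < B.

A third possible reading uses the footnote of Table 3, "Based on □ ranks". In this example the test is based on positive ranks, and in Table 2 positive ranks correspond to footnote b, A > B. Because Z is negative, this hypothesis should be rejected: a significantly negative Z rejects "positive ranks, Table 2 footnote b, A > B", which therefore means that A < B holds.

Which of these readings is correct deserves closer study.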

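Consider the principle behind the Wilcoxon signed-rank statistic: if the two samples come from the same population, the statistic should be about 0. A significant result therefore usually supports only the statement that the distributions of A and B differ. Because the statistic blends the number of differences with their magnitudes, it tests only whether there is an overall difference and gives no clear evidence about which of A and B is larger. It looks at whether the two sides are balanced, not at which one is greater.

When users extend the interpretation themselves, all three approaches above seem to have a reasonable argument, so they have to be weighed case by case. Sometimes the number of positive (or negative) differences matters more, and sometimes the total size of the positive (or negative) differences matters more; the researcher should settle this when designing the study. For example:

Method A gets 10 people to quit smoking for 1 week; method B gets 1 person to quit smoking for 3 months.

Which is better? That is for the researcher to decide, based on the purpose and nature of the study. Statistical analysis supplies numbers that help clarify what was observed; drawing the conclusion remains the researcher's job.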

Table 2

		N	Mean Rank	Sum of Ranks
A - B	Negative ranks	66(a)	49.40	3260.50
	Positive ranks	35(b)	54.01	1890.50
	Ties	0(c)
	Total	101

a A < B

b A > B

c A = B

Table 3

	A - B
Z	-2.321(a)
Asymp. Sig. (2-tailed)	.020

a Based on positive ranks.

b Wilcoxon signed-rank test
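As a rough check, the Z and two-tailed significance in Table 3 can be approximately reproduced from the counts and rank sums in Table 2 with the large-sample normal approximation. The sketch below is an illustration only: it assumes no correction for tied ranks, so the result may differ slightly from what the software prints, and the variable names are invented here.

```python
import math

# Counts and rank sums read from Table 2 (there are no ties with zero difference).
n_neg, n_pos = 66, 35
sum_neg, sum_pos = 3260.5, 1890.5

n = n_neg + n_pos
# The two rank sums must add up to n(n+1)/2.
assert sum_neg + sum_pos == n * (n + 1) / 2

t = sum_pos                                        # rank sum of the positive ranks
mean_t = n * (n + 1) / 4                           # expected rank sum under H0
sd_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)   # assumption: no tie correction

z = (t - mean_t) / sd_t
p_two_sided = math.erfc(abs(z) / math.sqrt(2))     # equals 2 * P(Z >= |z|)

print(round(z, 3), round(p_two_sided, 3))
```

Under these assumptions z comes out at about -2.32 with a two-tailed p of about .020, in line with Table 3. The sign of z only reflects which rank sum was entered; it does not by itself settle the question of direction discussed above.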

--------------------------------------------------------------------------------------------------------------------

【SPSS】Wilcoxon signed-rank test procedure

【Interpreting the output】

Interpret SPSS Output: The statistics for the test are in the following table.

For Two-sided test: The two-sided test p-value for Asymptotic 2-tailed test is .002 and for the Exact 2-tailed is 0.001.

For one-sided test: The first table below shows that the negative mean rank is less than the positive mean rank. This suggests that the pulse rate measure from after is likely higher than the measure from before the treatment was applied. For asymptotic test, the p-value would be half of the p-value from two-tailed test and would be 0.001 in supporting that the pulse rate after treatment is higher than the pulse rate before treatment. For the exact test result, the p-value would be 0.0005.
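The halving rule used above can be written as a small helper. This is only a sketch: the function name `one_sided_p` is hypothetical, it assumes a null distribution that is symmetric about zero, and the "wrong direction" branch is only approximate for a discrete exact distribution.

```python
def one_sided_p(p_two_sided, direction_matches):
    """Convert a two-sided p-value to a one-sided one (symmetric null distribution)."""
    half = p_two_sided / 2
    # If the observed effect points away from the predicted direction,
    # the one-sided p-value is large rather than small.
    return half if direction_matches else 1 - half

p_asymptotic = one_sided_p(0.002, True)   # asymptotic two-tailed p from the text
p_exact = one_sided_p(0.001, True)        # exact two-tailed p from the text
print(p_asymptotic, p_exact)
```

Halving is appropriate only when the observed direction (here, a higher positive mean rank) agrees with the direction stated in the one-sided hypothesis.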

--------------------------------------------------------------------------------------------------------------------

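【Wilcoxon signed-rank test】 The test uses R+ and R- as its test statistics, where R+ is the sum of the ranks of the positive d_i and R- is the sum of the ranks of the negative d_i.
If the two paired populations have the same distribution, R+ and R- should be equal or very close.
If R+ and R- differ markedly, the two paired populations do not have the same distribution.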

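Procedure: first compute the difference for each pair, d_i = x_Ai - x_Bi, take the absolute values |d_i| = |x_Ai - x_Bi|, sort them from smallest to largest, and assign ranks. Tied values, |d_i| = |d_j|, receive the average of the ranks they span. Pairs whose two observations are equal are dropped: they are neither ranked nor counted. Compared with the simple sign test, which only counts the positive (+) and negative (-) signs of the differences, the Wilcoxon signed-rank test also takes the ranks of the magnitudes of the differences into account.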

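The sign test and the Wilcoxon signed-rank test are used to test whether the medians of the populations behind two related samples differ significantly.
Assuming the difference D within each pair arises at random, the |D| values are ranked from smallest to largest, and the ranks carrying a negative sign are then summed and averaged.
Under the null hypothesis D is equally likely to be positive or negative, each with probability 1/2; if positive or negative signs occur far too often or far too rarely, the two related samples differ significantly in central tendency.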
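The "probability 1/2 for each sign" idea is what the plain sign test uses directly. A minimal sketch of an exact two-sided sign test follows, applied to the N=11 example at the top of these notes (10 positive differences, 1 negative). The function name is hypothetical, and doubling the smaller tail is one common convention, assumed here.

```python
from math import comb

def sign_test_two_sided(n_pos, n_neg):
    """Exact two-sided sign test: signs are Binomial(n, 1/2) under H0."""
    n = n_pos + n_neg
    k = min(n_pos, n_neg)
    # Probability of seeing k or fewer of the rarer sign, then doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p_small = sign_test_two_sided(10, 1)
print(p_small)
```

Only the counts of signs enter this calculation; the sizes of the differences are ignored, which is exactly the information the signed-rank test adds.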

When testing whether two variables differ significantly, the t-test is based on mean scores, the Mann-Whitney U test on mean ranks, and the Wilcoxon signed-rank test on signed ranks.

The Wilcoxon signed-rank test is similar in procedure to the Mann-Whitney U test (also known as the Wilcoxon rank-sum test). It also shares the basic principle of the dependent-samples t-test: like that test, the Wilcoxon signed-rank test works on the differences between paired observations.

However, the Wilcoxon signed rank test pools all differences, ranks them and applies a negative sign to all the ranks where the difference between the two observations is negative. This is called the signed rank.

Unlike the t-test, which assumes a known sampling distribution, the Wilcoxon signed-rank test is non-parametric. Whereas the dependent-samples t-test tests whether the average difference between two observations is 0, the Wilcoxon test tests whether the difference between two observations has a mean signed rank of 0. Thus it is much more robust against outliers and heavy-tailed distributions. Because the Wilcoxon signed-rank test is a non-parametric test for continuous-level data, it does not require a particular distribution of the dependent variable. It is therefore a common choice for comparing paired scores when the dependent variable is not normally distributed and is at least ordinal.

- See more at: http://www.statisticssolutions.com/academic-solutions/resources/directory-of-statistical-analyses/wilcoxon-sign-test/

We now turn to consider a somewhat analogous alternative to the t-test for correlated samples. The correlated-samples t-test makes certain assumptions and can be meaningfully applied only insofar as these assumptions are met. Namely,

that the scale of measurement for X_A and X_B has the properties of an equal-interval scale;
that the differences between the paired values of X_A and X_B have been randomly drawn from the source population; and
that the source population from which these differences have been drawn can be reasonably supposed to have a normal distribution.

Here again, it is not simply a question of good manners or good taste. If there is one or more of these assumptions that we cannot reasonably suppose to be satisfied, then the t-test for correlated samples cannot be legitimately applied.

Of all the correlated-samples situations that run afoul of these assumptions, I expect the most common are those in which the scale of measurement for X_A and X_B cannot be assumed to have the properties of an equal-interval scale. The most obvious example would be the case in which the measures for X_A and X_B derive from some sort of rating scale. In any event, when the data within two correlated samples fail to meet one or another of the assumptions of the t-test, an appropriate non-parametric alternative can often be found in the Wilcoxon Signed-Rank Test.

To illustrate, suppose that 16 students in an introductory statistics course are presented with a number of questions (of the sort you encountered in Chapters 5 and 6) concerning basic probabilities. In each instance, the question takes the form "What is the probability of such-and-such?" However, the students are not allowed to perform calculations. Their answers must be immediate, based only on their raw intuitions. They are instructed to frame each answer in terms of a zero to 100 percent rating scale, with 0% corresponding to P=0.0, 27% corresponding to P=.27, and so forth. They are also told that they can give non-integer answers if they wish to make really fine-grained distinctions; for example, 49.0635...%. (As it turns out, none do.)

The instructor of the course is particularly interested in students' responses to two of the questions, which we will designate as question A and question B. He reasons that if students have developed a good, solid understanding of the basic concepts, they will tend to give higher probability ratings for question A than for question B; whereas, if they were sleeping through that portion of the course, their answers will be mere shots in the dark and there will be no overall tendency one way or the other. The instructor's hypothesis is of course directional: he expects his students have mastered the concepts well enough to sense, if only intuitively, that the event described in question A has the higher probability. The following table shows the probability ratings of the 16 subjects for each of the two questions.

Subj.	X_A	X_B	X_A - X_B
1	78	78	0
2	24	24	0
3	64	62	+2
4	45	48	-3
5	64	68	-4
6	52	56	-4
7	30	25	+5
8	50	44	+6
9	64	56	+8
10	50	40	+10
11	78	68	+10
12	22	36	-14
13	84	68	+16
14	40	20	+20
15	90	58	+32
16	72	32	+40
mean difference = +7.75

The observed results are consistent with the hypothesis. The probability ratings do on average end up higher for question A than for question B. Now to determine whether the degree of the observed difference reflects anything more than some lucky guessing.

Mechanics
The Wilcoxon test begins by transforming each instance of X_A - X_B into its absolute value, which is accomplished simply by removing all the positive and negative signs. Thus the entries in column 4 of the table below become those of column 5. In most applications of the Wilcoxon procedure, the cases in which there is zero difference between X_A and X_B are at this point eliminated from consideration, since they provide no useful information, and the remaining absolute differences are then ranked from lowest to highest, with tied ranks included where appropriate.

1	2	3	4	5	6	7
Subj.	X_A	X_B	original X_A - X_B	absolute X_A - X_B	rank of absolute X_A - X_B	signed rank
1	78	78	0	0	---	---
2	24	24	0	0	---	---
3	64	62	+2	2	1	+1
4	45	48	-3	3	2	-2
5	64	68	-4	4	3.5	-3.5
6	52	56	-4	4	3.5	-3.5
7	30	25	+5	5	5	+5
8	50	44	+6	6	6	+6
9	64	56	+8	8	7	+7
10	50	40	+10	10	8.5	+8.5
11	78	68	+10	10	8.5	+8.5
12	22	36	-14	14	10	-10
13	84	68	+16	16	11	+11
14	40	20	+20	20	12	+12
15	90	58	+32	32	13	+13
16	72	32	+40	40	14	+14
W = 67.0, N = 14

The result of this step is shown in column 6. The entries in column 7 will then give you the clue to why the Wilcoxon procedure is known as the signed-rank test. Here you see the same entries as in column 6, except now we have re-attached to each rank the positive or negative sign that was removed from the X_A - X_B difference in the transition from column 4 to column 5.

The sum of the signed ranks in column 7 is a quantity symbolized as W, which for the present example is equal to 67. Two of the original 16 subjects were removed from consideration because of the zero difference they produced in columns 4 and 5, so our observed value of W is based on a sample of size N=14.
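The mechanics just described (drop zero differences, rank the absolute differences with tied ranks averaged, re-attach the signs, sum) can be sketched as follows. This is an illustrative standard-library sketch on the 16 subjects above; `average_ranks` and the variable names are hypothetical.

```python
def average_ranks(values):
    """Rank from lowest to highest; tied values get the mean of the ranks they span."""
    sorted_vals = sorted(values)
    # index() gives the 0-based first position of v; a run of `count` ties
    # occupies ranks first+1 .. first+count, whose mean is first + (count+1)/2.
    return [sorted_vals.index(v) + (sorted_vals.count(v) + 1) / 2 for v in values]

x_a = [78, 24, 64, 45, 64, 52, 30, 50, 64, 50, 78, 22, 84, 40, 90, 72]
x_b = [78, 24, 62, 48, 68, 56, 25, 44, 56, 40, 68, 36, 68, 20, 58, 32]

diffs = [a - b for a, b in zip(x_a, x_b)]
nonzero = [d for d in diffs if d != 0]            # zero differences are eliminated
ranks = average_ranks([abs(d) for d in nonzero])  # column 6
signed = [r if d > 0 else -r for r, d in zip(ranks, nonzero)]  # column 7

w = sum(signed)
n = len(nonzero)
print(w, n)
```

With these data the sketch gives W = 67.0 on N = 14, matching the table.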

Logic & Procedure
Here again, as with the Mann-Whitney test, the effect of replacing the original measures with ranks is two-fold. The first is that it brings us to focus only on the ordinal relationships among the measures—"greater than," "less than," and "equal to"—with no illusion that these measures have the properties of an equal-interval scale. And the second is that it transforms the data array into a kind of closed system whose properties can then be known by dint of sheer logic.

For openers, we know that the sum of the N unsigned ranks in column 6 will be equal to

$$\text{sum} = \frac{N(N+1)}{2} = \frac{14(14+1)}{2} = 105$$

Thus the maximum possible positive value of W (in the case where all signs are positive) is W=+105, and the maximum possible negative value (in the case where all signs are negative) is W=-105. For the present example, a preponderance of positive signs among the signed ranks would suggest that subjects tend to rate the probability higher for question A than for question B. A preponderance of negative signs would suggest the opposite. The null hypothesis is that there is no tendency in either direction, hence that the numbers of positive and negative signs will be approximately equal. In that event, we would expect the value of W to approximate zero, within the limits of random variability.

For fairly small values of N, the properties of the sampling distribution of W can be figured out through simple (if tedious) enumeration of all the possibilities. Suppose, for example, that we had only N=3 subjects, whose absolute (unsigned) X_A - X_B differences produced the untied ranks 1, 2, and 3. The following table shows the possible combinations of plus and minus signs that could be distributed among these ranks, along with the value of W that each combination would produce.

Ranks
1	2	3	W
+	+	+	+6
-	+	+	+4
+	-	+	+2
+	+	-	0
-	-	+	0
-	+	-	-2
+	-	-	-4
-	-	-	-6

There is a total of 8 equally probable mere-chance combinations, of which exactly one would yield a positive value of W as large as +6, exactly two would yield a positive value as large as +4, and so on. And similarly at the other end of the distribution: exactly one combination yields a negative value of W as large as -6, exactly two yield negative values of W as large as -4, and so on. Hence the probability of ending up with a positive value of W as large as +4 is 2/8=.25; the probability of obtaining a negative value of W as large as -4 is 2/8=.25; and the "two-tailed" probability of finding a value of ±W as large as ±4 (in either direction) is (2/8)+(2/8)=.5.

The first of the following graphs shows the sampling distribution of this N=3 situation in pictorial form, and the other two show the corresponding distributions for the situations where N=4 and N=5. Note that for any such situation, the number of possible combinations of plus and minus signs is equal to 2^N. Thus for N=3, 2³=8; for N=4, 2⁴=16; for N=5, 2⁵=32, and so on.
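The enumeration of sign combinations is easy to reproduce by brute force. The sketch below assumes untied ranks 1..N, as in the N=3 illustration; the function name `w_distribution` is hypothetical.

```python
from collections import Counter
from itertools import product

def w_distribution(n):
    """Count how many of the 2**n sign patterns produce each value of W."""
    counts = Counter()
    for signs in product((+1, -1), repeat=n):
        w = sum(s * r for s, r in zip(signs, range(1, n + 1)))
        counts[w] += 1
    return counts

dist3 = w_distribution(3)
total3 = sum(dist3.values())                      # 2**3 sign patterns

# One-tailed and two-tailed probabilities for |W| as large as 4.
p_upper = sum(c for w, c in dist3.items() if w >= 4) / total3
p_two = sum(c for w, c in dist3.items() if abs(w) >= 4) / total3

for w in sorted(dist3, reverse=True):
    print(w, dist3[w])
print(p_upper, p_two)
```

Calling `w_distribution(4)` or `w_distribution(5)` gives the counts behind the other two distributions mentioned above; the run time doubles with each extra subject, so this approach is practical only for small N.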

Examine the shapes of these distributions and you will surely see where things are heading. As the size of N increases, the sampling distribution of W comes closer and closer to the outlines of the normal distribution. With a sample of size N=10 or greater, the approximation is close enough to allow for the calculation of a z-ratio, which can then be referred to the unit normal distribution. (When N is smaller than 10, the observed value of W must be referred to an exact sampling distribution of the sort shown above for N=3, N=4, and N=5. A table of critical values of W for small sample sizes will be provided toward the end of this subchapter.)

We noted earlier that on the null hypothesis we would expect the value of W to approximate zero, within the limits of random variability. This is tantamount to saying that any particular observed value of W belongs to a sampling distribution whose mean is equal to zero. Hence

$$\mu_W = 0$$

Considerably less obvious is the standard deviation of the distribution. As it would be a distraction to try to make it obvious, I will resort to another of those "it can be shown" assertions and say simply: For any particular value of N, it can be shown that the standard deviation of the sampling distribution of W is equal to

$$\sigma_W = \sqrt{\frac{N(N+1)(2N+1)}{6}}$$

which for the present example, with N=14, works out as

$$\sigma_W = \sqrt{\frac{14(14+1)(28+1)}{6}} = \pm 31.86$$

When considering the Mann-Whitney test in Subchapter 11a we noted that the z-ratio must include a "±.5" correction for continuity. The same is true for the Wilcoxon test, and for the same sort of reason. The measure designated as W can assume decimal values only as an artifact of the process of assigning tied ranks. Intrinsically, the absolute ranks—1, 2, 3, 4, etc.—on which W is based are all integers. Thus, the structure of the z-ratio for the Wilcoxon test is

$$z = \frac{(W - \mu_W) \pm .5}{\sigma_W}$$

The correction for continuity is "-.5" when W is greater than μ_W and "+.5" when W is less than μ_W. Since μ_W is in all instances equal to zero, the simpler computational formula is

$$z = \frac{W - .5}{\sigma_W}$$

For the present example, with N=14, W=67, and σ_W=±31.86, the result is

$$z = \frac{67 - .5}{31.86} = +2.09$$

From the following table of critical values of z, you can see that the observed value of z=+2.09 is significant just a shade beyond the .025 level for a directional test, which is the form of test called for by our investigator's directional hypothesis. For a two-tailed non-directional test, it would be significant just beyond the .05 level.
Critical Values of ±z

Level of Significance
Directional Test	.05	.025	.01	.005	.0005
Non-Directional Test	--	.05	.02	.01	.001
z_critical	1.645	1.960	2.326	2.576	3.291
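The z-ratio for the worked example can be checked numerically. The sketch below uses the standard deviation formula and the continuity correction as described in the text, and obtains tail probabilities from the normal distribution via `math.erfc`; the function name `wilcoxon_z` is hypothetical, and no adjustment for tied ranks is made.

```python
import math

def wilcoxon_z(w, n):
    """z-ratio for the signed-rank sum W with a +/-.5 continuity correction."""
    sigma_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 6)
    # The mean of W under H0 is 0, so the correction pulls W toward 0.
    if w > 0:
        correction = -0.5
    elif w < 0:
        correction = 0.5
    else:
        correction = 0.0
    return (w + correction) / sigma_w, sigma_w

z, sigma_w = wilcoxon_z(67, 14)
p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2))    # P(Z >= z), directional
p_two_tailed = math.erfc(abs(z) / math.sqrt(2))     # non-directional

print(round(sigma_w, 2), round(z, 2))
print(round(p_one_tailed, 4), round(p_two_tailed, 4))
```

The one-tailed probability lands a little below .025 and the two-tailed probability a little below .05, which is what the comparison against the critical values 1.960 and 2.326 implies.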

When N is smaller than 10, the observed value of W must be referred to an exact sampling distribution of the sort described earlier. The following table shows the critical values of W for N=5 through N=9. For sample sizes smaller than N=5 there are no possible values of W that would be significant at or beyond the baseline .05 level.

Critical Values of ±W for Small Samples:

	Level of Significance for a
	Directional Test
	.05	.025	.01	.005
	Non-Directional Test
N	--	.05	.02	.01
5	15	--	--	--
6	17	21	--	--
7	22	24	28	--
8	26	30	34	36
9	29	35	39	43
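The small-sample critical values can be recovered by exhaustive enumeration of the 2^N sign patterns. The sketch below is one way to do that under the assumption of untied ranks 1..N; `critical_w` is a hypothetical helper, and it returns `None` where the table shows "--".

```python
from itertools import product

def critical_w(n, alpha):
    """Smallest W with upper-tail probability <= alpha under H0, or None."""
    ranks = range(1, n + 1)
    values = [
        sum(s * r for s, r in zip(signs, ranks))
        for signs in product((1, -1), repeat=n)
    ]
    total = len(values)
    best = None
    # Walk down from the largest W; the tail probability only grows.
    for w in sorted(set(values), reverse=True):
        tail = sum(1 for v in values if v >= w) / total
        if tail <= alpha:
            best = w
        else:
            break
    return best

for n in range(5, 10):
    print(n, [critical_w(n, a) for a in (0.05, 0.025, 0.01, 0.005)])
```

The directional-test levels are passed in directly; by symmetry of the distribution, the non-directional level for the same critical value is twice as large, as in the table's second header row.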

The assumptions of the Wilcoxon test are:

that the paired values of X_A and X_B are randomly and independently drawn (i.e., each pair is drawn independently of all other pairs);
that the dependent variable (e.g., a subject's probability estimate) is intrinsically continuous, capable in principle, if not in practice, of producing measures carried out to the n^th decimal place; and
that the measures of X_A and X_B have the properties of at least an ordinal scale of measurement, so that it is meaningful to speak of "greater than," "less than," and "equal to."