The usual path through statistics begins with measures of central tendency and then moves on to concepts such as correlation and regression. Among these basics, correlation itself is not hard to understand. However, there is still plenty of confusion worth untangling between correlation and other statistical concepts such as causation, regression, distributions, and the Pearson correlation coefficient. First, you should be clear about the difference between correlation and causation; see "difference between correlation and causation" and 有「相關」,不代表是「因果」!
Clearing up the concepts behind the correlation coefficient
- Are correlation and dependency the same concept? In other words, if the correlation coefficient between two events is zero, does that mean the two events are not dependent? And is the converse also correct?
- If two variables are each highly correlated with a third variable, must the two also be highly correlated with each other? If A and B are both positively correlated with another variable C, is it still possible for A and B to be negatively correlated?
- Can a single outlier drastically decrease or increase a correlation? Is the Pearson correlation coefficient very sensitive to outliers?
- Does the existence of a causal relationship imply correlation?
- What is the difference between correlation and simple linear regression?
- How should one choose between the Pearson and Spearman correlation coefficients?
- How should one interpret the difference between correlation and covariance?
The most commonly used correlation coefficient is the Pearson correlation coefficient (Pearson Coefficient): the covariance of the two variables divided by the product of their standard deviations,

C(x,y) = Sum( (x - mean(x)) * (y - mean(y)) ) / Square Root ( Sum( (x - mean(x))^2 ) * Sum( (y - mean(y))^2 ) )

Its value lies between +1 and -1; a value at either extreme indicates a strong linear correlation.
A value of zero indicates no correlation, but not non-dependence.
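As a quick sanity check on this formula, here is a minimal pure-Python sketch (the helper name `pearson` and the sample lists are ours, not from the original article):

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance normalized by both standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly linear: about +1.0
print(pearson([1, 2, 3, 4], [8, 6, 4, 2]))   # perfectly inverse: about -1.0
```

Perfectly linear data hits the +1/-1 extremes, as the definition promises.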
1: Correlation vs. Dependency
Non-dependency (independence) between two variables implies a zero correlation. However, the inverse is not true: a pair of variables with zero correlation can still be perfectly dependent. For example:

In this scenario, y (the dependent variable) is the square of x. To the right of the y-axis the relationship is increasing, hence positively correlated; to the left it is decreasing, hence negatively correlated. So what will the Pearson correlation coefficient be?
If you compute the correlation between x and y over a range symmetric about zero, the result is zero. What does that mean?
A pair of variables that are perfectly dependent on each other can still give you a zero correlation.
Key point to remember: correlation quantifies the linear dependence of two variables. It cannot capture a non-linear relationship between two variables.
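The y = x² scenario can be verified directly. A small sketch (sample range chosen symmetric about zero; the `pearson` helper just implements the standard formula):

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [x / 10 for x in range(-20, 21)]  # symmetric range around 0
ys = [x * x for x in xs]               # y = x^2: perfectly dependent on x
print(pearson(xs, ys))                 # effectively 0 (up to floating-point noise)
```

Perfect dependence, yet zero linear correlation.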
2: Is Correlation Transitive?
Suppose X, Y, and Z are random variables, and X and Y are each positively correlated with Z. Must X and Y then be positively correlated?
As the example below shows, the answer is: no.
We can, however, prove that if the two correlations with Z are sufficiently close to 1, then X and Y must be positively correlated.
Let’s assume C(x,y) is the correlation coefficient between x and y. Likewise we have C(x,z) and C(y,z). Solving the correlation equations mathematically gives a lower bound on C(x,y) in terms of the other two:
C(x,y) >= C(y,z) * C(z,x) - Square Root ( (1 - C(y,z)^2 ) * (1 - C(z,x)^2 ) )
Now if we want to guarantee that C(x,y) is more than zero, we need this lower bound to be positive. Hence, you need to solve for:
C(y,z) * C(z,x) > Square Root ( (1 - C(y,z)^2 ) * (1 - C(z,x)^2 ) )
Since the right-hand side is non-negative, this requires C(y,z) * C(z,x) > 0; squaring both sides and simplifying then shows the guarantee holds exactly when, in addition:
C(y,z) ^ 2 + C(z,x) ^ 2 > 1
Wow, this is the equation of a circle. Hence the following plot will explain everything:

If the two known correlations fall in zone A, the third correlation will be positive. If they fall in zone B, the third correlation will be negative. Inside the circle, we cannot say anything about the relationship. A very interesting insight here is that even if C(y,z) and C(z,x) are both 0.5, C(x,y) can actually be negative.
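A concrete counterexample along these lines: three hand-picked mean-zero sequences (the data are ours, chosen for illustration) where x and y are each correlated +0.5 with z, yet correlated -0.5 with each other.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return cov / (sx * sy)

# Three mean-zero sequences, chosen by hand:
x = [1, 0, -1]
y = [0, -1, 1]
z = [1, -1, 0]

print(pearson(x, z))  # about +0.5
print(pearson(y, z))  # about +0.5
print(pearson(x, y))  # about -0.5
# C(x,z)^2 + C(y,z)^2 = 0.5 < 1: inside the circle,
# so the sign of C(x,y) is not pinned down, and here it is negative.
```

This matches the circle condition exactly: 0.5² + 0.5² = 0.5 falls inside the unit circle, so transitivity fails.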
3: Is the Pearson coefficient sensitive to outliers?
The answer is: yes.
Even a single outlier can change the direction of the coefficient. Here are a few cases, all of which have the same correlation coefficient of 0.81:

Consider the last two graphs (X3Y3 and X4Y4). X3Y3 is clearly a case of near-perfect correlation where a single outlier brings the coefficient down significantly. The last graph is the complete opposite: the correlation coefficient becomes a high positive number because of a single outlier. Conclusively, this turns out to be the biggest concern with the correlation coefficient: it is highly influenced by outliers.
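A quick numeric sketch of this sensitivity (hypothetical data, ours): ten points on a perfectly correlated line, plus one extreme outlier that flips the sign of the coefficient.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Ten points on a perfect line, then one extreme outlier appended.
xs = list(range(10)) + [100]
ys = list(range(10)) + [-100]

print(pearson(xs[:10], ys[:10]))  # about +1.0 without the outlier
print(pearson(xs, ys))            # strongly negative once the outlier is added
```

One point out of eleven is enough to drag a perfect +1 correlation deep into negative territory.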
4: Does causation imply correlation?
If you have read our three answers above, I am sure you will be able to answer this one. The answer is no, because causation can also produce a non-linear relationship. Let’s understand how!
Below is the graph showing the density of water from 0 to 12 degrees Celsius. We know that density is an effect of changing temperature, but density reaches its maximum value at 4 degrees Celsius. Therefore, it is not linearly correlated with temperature.
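To mimic this with numbers, here is a toy quadratic stand-in for the density curve (our own illustration, not real water-density data), evaluated over a range symmetric about the 4 °C peak so the linear correlation vanishes despite perfect causation:

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

temps = list(range(0, 9))                 # 0..8 degrees C, symmetric about 4
density = [-(t - 4) ** 2 for t in temps]  # toy model: peaks at 4 degrees C
r = pearson(temps, density)
print(r)                                  # effectively 0 despite full causation
```

Temperature fully determines the (toy) density, yet the linear correlation is zero.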

5: Difference between Correlation and Simple Linear Regression
These two are really close. So let’s start with a few things which are common for both.
- The square of Pearson’s correlation coefficient is the same as R², the coefficient of determination, in simple linear regression
- Neither simple linear regression nor correlation answers questions of causality directly. This point is important, because I’ve met people who think that simple regression can magically allow an inference that X causes Y. That is a preposterous belief.
What’s the difference between correlation and simple linear regression?
Now let’s think of a few differences between the two. Simple linear regression gives much more information about the relationship than the Pearson correlation. Here are a few things which regression will give you but the correlation coefficient will not.
- The slope in a linear regression gives the marginal change in the output/target variable per unit change in the independent variable. Correlation has no slope.
- The intercept in a linear regression gives the value of the target variable when the input/independent variable is set to zero. Correlation carries no such information.
- Linear regression can produce a prediction given the input variables. Correlation analysis does not predict anything.
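Both the extra information and the shared R² can be checked numerically. A sketch with made-up, roughly linear data (the helper names `linreg` and `pearson` are ours): regression yields a slope and intercept, and the fit’s R² matches the squared Pearson coefficient.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def linreg(xs, ys):
    """Least-squares fit y = a + b*x; returns (intercept, slope)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]          # made-up, roughly linear data
a, b = linreg(xs, ys)
r = pearson(xs, ys)

# Regression gives a slope and intercept; correlation gives neither.
print(a, b)

# R^2 of the fit equals the square of the Pearson coefficient.
my = sum(ys) / len(ys)
sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
sst = sum((y - my) ** 2 for y in ys)
print(1 - sse / sst, r ** 2)             # the two numbers agree
```

The correlation coefficient alone could not have told you the slope, the intercept, or a prediction for a new x.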
6: Pearson vs. Spearman
The simplest answer here is
Pearson captures how linearly dependent the two variables are, whereas Spearman captures the monotonic behavior of the relation between the variables.
For instance, consider the following relationship: y = exp ( x )
Here the Pearson coefficient can come out as low as 0.25 (the exact value depends on the range of x), but the Spearman coefficient will be 1. As a rule of thumb, you should begin with Spearman only when you have an initial hypothesis that the relation is non-linear. Otherwise, we generally try Pearson first and, if that is low, try Spearman. This way you know whether the variables are linearly related or merely have a monotonic behavior.
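A sketch of this contrast, implementing Spearman the standard way as Pearson computed on ranks (the helper names are ours; the rank function assumes no ties, a simplification):

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(vals):
    """1-based ranks; assumes no ties for simplicity."""
    order = sorted(range(len(vals)), key=lambda i: vals[i])
    r = [0] * len(vals)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    # Spearman = Pearson computed on the ranks of the data
    return pearson(ranks(xs), ranks(ys))

xs = list(range(1, 11))
ys = [math.exp(x) for x in xs]  # perfectly monotonic, far from linear

print(pearson(xs, ys))   # well below 1: the relation is not linear
print(spearman(xs, ys))  # about 1.0: the relation is perfectly monotonic
```

Spearman sees the perfect monotonic ordering and returns 1; Pearson is dragged down by the curvature.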
7: Correlation vs. Covariance
If you skipped the mathematical formula of correlation at the start of this article, now is the time to revisit the same.
Correlation is simply the covariance normalized by the standard deviations of the two variables. This is done to ensure we get a number between +1 and -1. Covariance is very difficult to compare across pairs of variables, as it depends on their units: it might turn out that a student’s marks have a larger covariance with the length of his toenail in millimeters than with his attendance rate.
This happens purely because of the difference in units of the second variable. Hence we need to normalize the covariance by a measure of spread to make sure we compare apples with apples. This normalized number is known as the correlation.
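The normalization argument can be seen directly in code. With hypothetical data (ours), rescaling one variable’s units inflates the covariance a thousand-fold while the correlation is unchanged:

```python
import math

def cov(xs, ys):
    """Population covariance."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n

def std(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

heights_m = [1.5, 1.6, 1.7, 1.8]            # hypothetical data in meters
scores = [52.0, 58.0, 61.0, 70.0]
heights_mm = [h * 1000 for h in heights_m]  # same data in millimeters

# Covariance depends on units: it grows 1000x with the unit change.
print(cov(heights_m, scores), cov(heights_mm, scores))

# Correlation is covariance normalized by both standard deviations,
# so it is unit-free and always lands in [-1, +1].
r_m = cov(heights_m, scores) / (std(heights_m) * std(scores))
r_mm = cov(heights_mm, scores) / (std(heights_mm) * std(scores))
print(r_m, r_mm)                            # identical
```

Apples compared with apples: the normalized number is the same no matter which units the raw data arrived in.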
End Notes
Questions on correlation are very common in interviews.
The key is to know that correlation is an estimate of linear dependence of the two variables.
Correlation is transitive for a limited range of correlation pairs.
Correlation is also highly influenced by outliers.
We learnt that correlation does not imply causation, nor does causation imply correlation.
Further reading: 有「相關」,不代表是「因果」!