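How is the covariance of a matrix computed?
[Reference]
http://en.wikipedia.org/wiki/Covariance_matrix
The definition quoted below makes it easiest to see: what is actually computed is the covariance of every pair of columns.
For example, entry [2,3] of the result is the covariance of the second column and the third column.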
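Definition
Throughout this article, boldfaced unsubscripted $\mathbf{X}$ and $\mathbf{Y}$ are used to refer to random vectors, and unboldfaced subscripted $X_i$ and $Y_i$ are used to refer to random scalars.
If the entries in the column vector
$$\mathbf{X} = (X_1, X_2, \ldots, X_n)^{\mathrm{T}}$$
are random variables, each with finite variance, then the covariance matrix $\Sigma$ is the matrix whose $(i, j)$ entry is the covariance
$$\Sigma_{ij} = \operatorname{cov}(X_i, X_j) = \operatorname{E}[(X_i - \mu_i)(X_j - \mu_j)]$$
where
$$\mu_i = \operatorname{E}(X_i)$$
is the expected value of the $i$th entry in the vector $\mathbf{X}$. In other words,
$$\Sigma = \operatorname{E}\left[(\mathbf{X} - \operatorname{E}[\mathbf{X}])(\mathbf{X} - \operatorname{E}[\mathbf{X}])^{\mathrm{T}}\right]$$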
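To connect the definition to what R computes, here is a minimal sketch that builds the matrix form by hand. The data are made up for illustration, and the sketch assumes the sample estimate with an n - 1 denominator, which is what R's cov uses in place of the true expectation.

```r
# Hypothetical data: 4 observations (rows) of 3 variables (columns).
# matrix() fills column by column.
x <- matrix(c(2, 4, 6, 8,
              1, 3, 2, 5,
              9, 7, 8, 4), nrow = 4)
n <- nrow(x)

# Subtract each column's mean; this is the sample analogue of X - E[X].
centered <- sweep(x, 2, colMeans(x))

# Sample covariance matrix: t(centered) %*% centered / (n - 1).
sigma <- crossprod(centered) / (n - 1)

# Compare with the built-in estimate.
print(all.equal(sigma, cov(x)))
```

The result is 3 by 3 because there are three columns; the four rows only play the role of observations.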
Let's verify this in R.
Use rpois to generate some discrete values and array to arrange them into a matrix.
> x=array(rpois(25,lambda=10), dim=c(5,5))
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    6   10   12    8    8
[2,]   10    9   10   12   10
[3,]   10    5    8    8    7
[4,]   13    7   12    9    8
[5,]   14    4   11    7   12
Compute the covariance of this matrix:
> y=cov(x)
> y
      [,1]  [,2]  [,3]  [,4]  [,5]
[1,]  9.80 -6.00  0.05 -0.85  3.25
[2,] -6.00  6.50  1.75  2.75 -1.50
[3,]  0.05  1.75  2.80 -0.35  0.75
[4,] -0.85  2.75 -0.35  3.70  0.00
[5,]  3.25 -1.50  0.75  0.00  4.00
y[1,1] is the covariance of the first column with itself, that is, the variance of the first column.
> y[1,1]
[1] 9.8
Verify it with the cov(vector, vector) form of the function:
> x[,1]
[1] 6 10 10 13 14
> cov(x[,1], x[,1])
[1] 9.8
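The same 9.8 can be reproduced without calling cov at all. This is a small sketch, assuming the n - 1 (sample) denominator, applied to the first column printed above:

```r
# First column of the 5x5 matrix shown in the session above.
a <- c(6, 10, 10, 13, 14)

# Covariance of a vector with itself: sum of squared deviations
# from the mean, divided by n - 1.
by_hand <- sum((a - mean(a))^2) / (length(a) - 1)

print(by_hand)  # 9.8
print(var(a))   # the built-in variance gives the same quantity
```

So each diagonal entry of cov(x) is just the variance of the corresponding column.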
y[1,2] and y[2,1] are equal: both are the covariance of the first and second columns. Verify:
> cov(x[,1], x[,2])
[1] -6
> cov(x[,2], x[,1])
[1] -6
The remaining entries follow the same pattern.
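Rather than checking entries one at a time, the whole matrix can be rebuilt from pairwise calls. The sketch below re-enters the 5x5 matrix from the session above (needed because rpois would give different numbers on each run) and fills a matrix with cov(x[, i], x[, j]) for every pair of columns; the name pairwise is only illustrative.

```r
# The 5x5 matrix from the session above, entered column by column.
x <- matrix(c( 6, 10, 10, 13, 14,
              10,  9,  5,  7,  4,
              12, 10,  8, 12, 11,
               8, 12,  8,  9,  7,
               8, 10,  7,  8, 12), nrow = 5)

p <- ncol(x)
pairwise <- matrix(NA_real_, nrow = p, ncol = p)

# Entry [i, j] is the covariance of column i and column j.
for (i in seq_len(p)) {
  for (j in seq_len(p)) {
    pairwise[i, j] <- cov(x[, i], x[, j])
  }
}

print(all.equal(pairwise, cov(x)))  # same as the one-shot call
print(isSymmetric(pairwise))        # [i, j] equals [j, i]
```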
This raises a question: what happens when the input array is not square, for example 5 rows by 7 columns, or 7 rows by 5 columns?
Let's test it:
> x=array(rpois(35,lambda=10), dim=c(5,7))
> x
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]    9   12   12   11    7   14   12
[2,]   11    8    8   11   14    8    7
[3,]   12    9    9    5   13    6    9
[4,]    9    7   12   11    9   14   10
[5,]    6   13   11    8   11    9   14
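Because the covariance of a matrix is computed between columns, the shape of the result depends only on the number of columns. So although the array above has 5 rows and 7 columns, the result is 7 by 7.
As shown here: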
> y=cov(x)
> y
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
[1,]  5.30 -3.90 -2.70 -1.35  3.10 -3.35 -5.45
[2,] -3.90  6.70  1.60 -1.20 -2.55  0.30  5.85
[3,] -2.70  1.60  3.30  1.65 -4.90  5.65  3.55
[4,] -1.35 -1.20  1.65  7.20 -3.45  7.20 -0.60
[5,]  3.10 -2.55 -4.90 -3.45  8.20 -9.45 -4.65
[6,] -3.35  0.30  5.65  7.20 -9.45 13.20  3.40
[7,] -5.45  5.85  3.55 -0.60 -4.65  3.40  7.30
Here y[7,6] is the covariance of the seventh and sixth columns. Verify:
> y[7,6]
[1] 3.4
> cov(x[,7], x[,6])
[1] 3.4
> x[,7]
[1] 12 7 9 10 14
> x[,6]
[1] 14 8 6 14 9
An input with more rows than columns works just as well; the shape of the result still depends only on the columns.
> x=array(rpois(35,lambda=10), dim=c(7,5))
> x
     [,1] [,2] [,3] [,4] [,5]
[1,]    6   11    8   10    5
[2,]    7    9   12   11    9
[3,]   12   15   11   11   10
[4,]   12    7   10   14   11
[5,]   14   14   18   11   10
[6,]   16   13   14    8   13
[7,]   14   11   12    9   10
> cov(x)
[,1] [,2] [,3] [,4] [,5]
[1,] 13.952381 4.2142857 7.404762 -1.8809524 7.6904762
[2,] 4.214286 7.9523810 4.261905 -2.7857143 0.8095238
[3,] 7.404762 4.2619048 10.142857 -1.2619048 4.0476190
[4,] -1.880952 -2.7857143 -1.261905 3.6190476 -0.3095238
[5,] 7.690476 0.8095238 4.047619 -0.3095238 5.9047619
> cov(x[,5], x[,4])
[1] -0.3095238
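The shape rule from the two experiments above can be checked in a few lines. This is only a sketch: the seed is arbitrary (added so the run is repeatable), and wide and tall are illustrative names.

```r
set.seed(1)  # arbitrary seed, only so the sketch is repeatable

wide <- array(rpois(35, lambda = 10), dim = c(5, 7))  # 5 rows, 7 columns
tall <- array(rpois(35, lambda = 10), dim = c(7, 5))  # 7 rows, 5 columns

# cov() always returns an ncol-by-ncol matrix.
print(dim(cov(wide)))     # 7 7
print(dim(cov(tall)))     # 5 5

# To compare rows instead of columns, transpose first.
print(dim(cov(t(wide))))  # 5 5
```

If the variables happen to be stored in rows, transposing with t() before calling cov is one way to get the row-by-row version.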
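In practice, just think of each column as one set of sampled data.
For example, the first column could be daily downloads, the second daily revenue, the third daily click-through rate, and so on.
Computing their correlation works in much the same way as the covariance above.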
> cor(x[,5], x[,4])
[1] -0.06695696
> y= cor(x)
> y
           [,1]       [,2]       [,3]        [,4]        [,5]
[1,]  1.0000000  0.4000840  0.6224533 -0.26470155  0.84728179
[2,]  0.4000840  1.0000000  0.4745424 -0.51926713  0.11813534
[3,]  0.6224533  0.4745424  1.0000000 -0.20828082  0.52301998
[4,] -0.2647016 -0.5192671 -0.2082808  1.00000000 -0.06695696
[5,]  0.8472818  0.1181353  0.5230200 -0.06695696  1.00000000
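Note that the diagonal is all 1s, because each diagonal entry is the correlation of a column with itself, which is necessarily 1.
For example, y[1,1] is the correlation of the first column with itself, and y[2,2] is the correlation of the second column with itself. Both are 1, of course.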
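The link between the two matrices is that each correlation is the covariance divided by the two columns' standard deviations, which is also why the diagonal comes out as 1. Here is a sketch of that relation, re-entering the 7x5 matrix from the session above; by_hand and sds are illustrative names.

```r
# The 7x5 matrix from the session above, entered column by column.
x <- matrix(c( 6,  7, 12, 12, 14, 16, 14,
              11,  9, 15,  7, 14, 13, 11,
               8, 12, 11, 10, 18, 14, 12,
              10, 11, 11, 14, 11,  8,  9,
               5,  9, 10, 11, 10, 13, 10), nrow = 7)

s   <- cov(x)
sds <- sqrt(diag(s))  # standard deviation of each column

# cor[i, j] = cov[i, j] / (sd[i] * sd[j])
by_hand <- s / outer(sds, sds)

print(all.equal(by_hand, cor(x)))     # matches the built-in cor()
print(all.equal(cov2cor(s), cor(x)))  # cov2cor() does the same rescaling
print(diag(by_hand))                  # variance / variance on the diagonal
```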
[Other]