Statistical Measures with R

Refer to R Tutorial and Exercise Solution

Mean

The mean of an observation variable is a numerical measure of the central location of the data values. It is the sum of its data values divided by the data count.

Hence, for a data sample of size n with values x_1, ..., x_n, its sample mean is defined as follows:

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

> duration = faithful$eruptions     # the eruption durations  
> mean(duration)                    # apply the mean function  
[1] 3.4878
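
As a quick check of the definition, the sketch below computes the mean by hand (sum divided by count) and compares it with mean(). The name manual_mean is illustrative only and does not come from the tutorial.

```r
duration <- faithful$eruptions                    # the eruption durations
manual_mean <- sum(duration) / length(duration)   # sum of the values divided by the count
all.equal(manual_mean, mean(duration))            # compares the two up to rounding error
```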

 

Median

The median of an observation variable is the value at the middle when the data is sorted in ascending order. It is an ordinal measure of the central location of the data values.

> duration = faithful$eruptions     # the eruption durations  
> median(duration)                  # apply the median function  
[1] 4
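
To make the "value at the middle" concrete, here is a minimal sketch that sorts the data and picks the middle value, averaging the two middle values when the count is even. The names sorted and manual_median are illustrative assumptions.

```r
duration <- faithful$eruptions    # the eruption durations
sorted <- sort(duration)          # ascending order
n <- length(sorted)

# For an even count, average the two middle values; for an odd count, take the middle one.
manual_median <- if (n %% 2 == 0) {
  mean(sorted[c(n / 2, n / 2 + 1)])
} else {
  sorted[(n + 1) / 2]
}

all.equal(manual_median, median(duration))   # compares with the built-in function
```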

 

 

Quartile (the median is the second quartile)

There are several quartiles of an observation variable.

The first quartile, or lower quartile, is the value that cuts off the first 25% of the data when it is sorted in ascending order.

The second quartile, or median, is the value that cuts off the first 50%.

The third quartile, or upper quartile, is the value that cuts off the first 75%.

> duration = faithful$eruptions     # the eruption durations  
> quantile(duration)                # apply the quantile function  
    0%    25%    50%    75%   100%  
1.6000 2.1627 4.0000 4.4543 5.1000
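
If only the three quartiles are needed, the probabilities can be passed explicitly. The sketch below also checks that the second quartile agrees with median(). Note that quantile() interpolates between data points by default (its type argument selects the method), so other software may report slightly different quartiles.

```r
duration <- faithful$eruptions                                 # the eruption durations
quartiles <- quantile(duration, probs = c(0.25, 0.50, 0.75))   # first, second, third quartile
quartiles

# The second quartile should coincide with the median.
all.equal(unname(quartiles["50%"]), median(duration))
```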

 

Percentile

The nth percentile of an observation variable is the value that cuts off the first n percent of the data values when they are sorted in ascending order.

Find the 32nd, 57th and 98th percentiles of the eruption durations:

> duration = faithful$eruptions     # the eruption durations  
> quantile(duration, c(.32, .57, .98))  
   32%    57%    98%  
2.3952 4.1330 4.9330
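
One way to sanity-check a percentile is to count how much of the data lies at or below it. The sketch below does this for the 32nd percentile; the names p32 and frac_below are illustrative. Because the sample is finite, the share is only approximately 32%.

```r
duration <- faithful$eruptions         # the eruption durations
p32 <- quantile(duration, 0.32)        # the 32nd percentile
frac_below <- mean(duration <= p32)    # share of observations at or below it
frac_below
```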

 

Range

The range of an observation variable is the difference of its largest and smallest data values. It is a measure of how far apart the entire data spreads in value.

> duration = faithful$eruptions     # the eruption durations  
> max(duration) - min(duration)     # apply the max and min functions  
[1] 3.5
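
An equivalent formulation, shown here only as an alternative sketch, uses range(), which returns the minimum and maximum together, and diff(), which subtracts the first from the second.

```r
duration <- faithful$eruptions   # the eruption durations
limits <- range(duration)        # vector of c(minimum, maximum)
spread <- diff(limits)           # maximum minus minimum
spread
```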

 

Interquartile Range

The interquartile range of an observation variable is the difference of its upper and lower quartiles. It is a measure of how far apart the middle portion of data spreads in value.

 

> duration = faithful$eruptions     # the eruption durations  
> IQR(duration)                     # apply the IQR function  
[1] 2.2915
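
The definition can be checked directly by subtracting the lower quartile from the upper quartile. This is a minimal sketch; manual_iqr is an illustrative name.

```r
duration <- faithful$eruptions            # the eruption durations
q <- quantile(duration, c(0.25, 0.75))    # lower and upper quartiles
manual_iqr <- unname(q[2] - q[1])         # upper quartile minus lower quartile
all.equal(manual_iqr, IQR(duration))      # compares with the built-in function
```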

 

Box Plot

The box plot of an observation variable is a graphical representation based on its quartiles, as well as its smallest and largest values. It attempts to provide a visual shape of the data distribution.

> duration = faithful$eruptions       # the eruption durations  
> boxplot(duration, horizontal=TRUE)  # horizontal box plot

The box plot of the eruption duration is:

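The plot is a graphical representation of the quartiles. The three lines of the box mark the first, second and third quartiles; the thickest line is the second quartile, that is, the median.

    0%    25%    50%    75%   100%  
1.6000 2.1627 4.0000 4.4543 5.1000

The shape of the data distribution can be read from this plot...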
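
The numbers behind the plot can be inspected without drawing it. The sketch below uses boxplot.stats(); the name bp is illustrative. R draws the box from "hinges", which are close to the quartiles reported by quantile() but are computed slightly differently, so small discrepancies are possible.

```r
duration <- faithful$eruptions   # the eruption durations
bp <- boxplot.stats(duration)    # the statistics used to draw the box plot
bp$stats                         # lower whisker, lower hinge, median, upper hinge, upper whisker
bp$out                           # points that would be drawn individually as outliers, if any
```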

 

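Variance

The variance is a numerical measure of how the data values are dispersed around the mean. In particular, the sample variance of a data sample of size n with sample mean \bar{x} is defined as:

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$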

 

> duration = faithful$eruptions    # the eruption durations  
> var(duration)                    # apply the var function  
[1] 1.3027
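
As an illustration of the sample variance, the sketch below computes it by hand as the sum of squared deviations from the mean divided by n - 1, which is the denominator var() uses. The name manual_var is illustrative.

```r
duration <- faithful$eruptions                                # the eruption durations
n <- length(duration)
manual_var <- sum((duration - mean(duration))^2) / (n - 1)    # squared deviations over n - 1
all.equal(manual_var, var(duration))                          # compares with the built-in function
```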

 

Standard Deviation

The standard deviation of an observation variable is the square root of its variance.

> duration = faithful$eruptions    # the eruption durations  
> sd(duration)                     # apply the sd function  
[1] 1.1414
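
The relationship to the variance can be verified in one line. A minimal sketch, with manual_sd as an illustrative name:

```r
duration <- faithful$eruptions        # the eruption durations
manual_sd <- sqrt(var(duration))      # square root of the sample variance
all.equal(manual_sd, sd(duration))    # compares with the built-in function
```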

 

Covariance

The covariance of two variables x and y in a data sample measures how the two are linearly related. A positive covariance indicates a positive linear relationship between the variables, and a negative covariance indicates the opposite.

The sample covariance is defined in terms of the sample means \bar{x} and \bar{y} as:

$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

> duration = faithful$eruptions   # the eruption durations  
> waiting = faithful$waiting      # the waiting period  
> cov(duration, waiting)          # apply the cov function  
[1] 13.978
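
To show how the sample covariance is built from the sample means, the sketch below multiplies the paired deviations, sums them and divides by n - 1. The name manual_cov is illustrative.

```r
duration <- faithful$eruptions   # the eruption durations
waiting <- faithful$waiting      # the waiting period
n <- length(duration)

# Sum of products of paired deviations from the means, divided by n - 1.
manual_cov <- sum((duration - mean(duration)) * (waiting - mean(waiting))) / (n - 1)
all.equal(manual_cov, cov(duration, waiting))   # compares with the built-in function
```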

 

Correlation Coefficient

The correlation coefficient of two variables in a data sample is their covariance divided by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related.

Formally, the sample correlation coefficient is defined by the following formula, where s_x and s_y are the sample standard deviations, and s_{xy} is the sample covariance.

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

If the correlation coefficient is close to 1, it indicates that the variables are positively linearly related and the scatter plot falls almost along a straight line with positive slope.

If it is close to -1, it indicates that the variables are negatively linearly related and the scatter plot falls almost along a straight line with negative slope.

If it is close to zero, it indicates a weak linear relationship between the variables.

> duration = faithful$eruptions   # the eruption durations  
> waiting = faithful$waiting      # the waiting period  
> cor(duration, waiting)          # apply the cor function  
[1] 0.90081

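This shows that the eruption duration and the waiting time are positively related: the longer the wait, the longer the eruption...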
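
The normalization can be reproduced by dividing the covariance by the product of the two standard deviations. A minimal sketch, with manual_cor as an illustrative name:

```r
duration <- faithful$eruptions   # the eruption durations
waiting <- faithful$waiting      # the waiting period

# Covariance divided by the product of the individual standard deviations.
manual_cor <- cov(duration, waiting) / (sd(duration) * sd(waiting))
all.equal(manual_cor, cor(duration, waiting))   # compares with the built-in function
```

Unlike the covariance, the result does not change if either variable is rescaled, for example by converting minutes to seconds.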

 

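Covariance and Correlation Coefficient

1. Covariance is a statistical measure of the risk of one particular investment in a portfolio relative to another. Put simply, it describes how the returns of two items in the portfolio move together. A positive value means that when the return of one item rises, the return of the other rises too, so the returns move in the same direction. A negative value means that one rises while the other falls, so the returns move in opposite directions. For given standard deviations, the larger the absolute value of the covariance, the closer the relationship between the returns of the two assets; the smaller the absolute value, the weaker the relationship.
2. Because covariance is hard to interpret, it is divided by the product of the standard deviations of the two investments' returns. The result has the same properties as the covariance but carries no units. This number is the correlation coefficient: correlation coefficient = covariance / product of the two items' standard deviations.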

 

Central Moment

The kth central moment (or moment about the mean) of a data sample of size n with sample mean \bar{x} is:

$$m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k$$

For example, the second central moment of a population is its variance.

> library(moments)                  # load the moments package  
> duration = faithful$eruptions     # the eruption durations  
> moment(duration, order=3, central=TRUE)  
[1] -0.6149
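
For readers without the moments package, the definition can be sketched in base R as the average of the kth powers of the deviations from the mean. The helper central_moment is hypothetical, written for this illustration, and is not part of any package.

```r
# Hypothetical helper: kth central moment with a 1/n denominator.
central_moment <- function(x, k) {
  mean((x - mean(x))^k)
}

duration <- faithful$eruptions   # the eruption durations
central_moment(duration, 3)      # third central moment
```

With k = 2 this gives the variance with an n denominator, which is slightly smaller than the value returned by var() (n - 1 denominator).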

 

Skewness

The skewness of a data population is defined by the following formula, where μ2 and μ3 are the second and third central moments.

$$\gamma_1 = \frac{\mu_3}{\mu_2^{3/2}}$$

Intuitively, the skewness is a measure of symmetry.

Negative skewness indicates that the mean of the data values is less than the median, and the data distribution is left-skewed.

Positive skewness indicates that the mean of the data values is larger than the median, and the data distribution is right-skewed. Of course, this rule applies only to unimodal distributions whose histograms have a single peak.

> library(moments)                  # load the moments package  
> duration = faithful$eruptions     # the eruption durations  
> skewness(duration)                # apply the skewness function  
[1] -0.41584
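
The formula can be reproduced in base R from the second and third central moments. This is a sketch under the assumption that the moments use a 1/n denominator; the names m2, m3 and manual_skew are illustrative.

```r
duration <- faithful$eruptions           # the eruption durations
centered <- duration - mean(duration)    # deviations from the mean
m2 <- mean(centered^2)                   # second central moment
m3 <- mean(centered^3)                   # third central moment
manual_skew <- m3 / m2^(3 / 2)           # skewness
manual_skew

# A negative value should go together with a mean below the median.
mean(duration) < median(duration)
```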

 

Kurtosis

The kurtosis of a univariate population is defined by the following formula, where μ2 and μ4 are the second and fourth central moments.

$$\gamma_2 = \frac{\mu_4}{\mu_2^2} - 3$$

Intuitively, the kurtosis is a measure of the peakedness of the data distribution.

Negative kurtosis indicates a flat distribution, which is said to be platykurtic.

Positive kurtosis indicates a peaked distribution, which is said to be leptokurtic.

Finally, the normal distribution has zero kurtosis, and is said to be mesokurtic.

> library(moments)                  # load the moments package  
> duration = faithful$eruptions     # the eruption durations  
> kurtosis(duration) - 3            # apply the kurtosis function, then subtract 3  
[1] -1.5006
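
The same quantity can be sketched in base R from the second and fourth central moments, again assuming a 1/n denominator. Subtracting 3 is what makes the normal distribution come out at zero; the names m2, m4 and excess_kurtosis are illustrative.

```r
duration <- faithful$eruptions           # the eruption durations
centered <- duration - mean(duration)    # deviations from the mean
m2 <- mean(centered^2)                   # second central moment
m4 <- mean(centered^4)                   # fourth central moment
excess_kurtosis <- m4 / m2^2 - 3         # kurtosis relative to the normal distribution
excess_kurtosis
```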

This article is excerpted from Cnblogs (博客园). Original publication date: 2012-02-15

Time: 2024-10-03 02:25:59
