Mining of Massive Datasets – Data Mining

1 What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data.

 

1.1 Statistical Modeling

Statisticians were the first to use the term “data mining.”

Today, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution (for example, a Gaussian distribution) from which the visible data is drawn.
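As a minimal sketch of what "constructing a statistical model" can mean, the snippet below assumes the data really is Gaussian and estimates the two parameters of that distribution. The function name fit_gaussian and the synthetic data are illustrative assumptions, not something from the book.

```python
import random
import statistics

def fit_gaussian(samples):
    """Estimate the mean and standard deviation of a Gaussian from samples."""
    mu = statistics.fmean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    return mu, sigma

# Synthetic data drawn from a Gaussian with mean 10 and standard deviation 2.
random.seed(0)
data = [random.gauss(10.0, 2.0) for _ in range(10_000)]

mu, sigma = fit_gaussian(data)
print(f"estimated mean={mu:.2f}, std={sigma:.2f}")
```

The pair (mu, sigma) is the whole model: two numbers stand in for ten thousand data points.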

 

1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning.

Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data.

For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike them.

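Machine learning suits cases like this, where it is unclear what kind of rules can be mined: you simply feed the data to an ML algorithm and it makes the judgments for you, without your having to care about the details of the process.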

 

1.3 Computational Approaches to Modeling

The previous two sections described two kinds of models. What methods are used to discover a model?

There are many different approaches to modeling data. Two are introduced here:

1. Summarizing the data succinctly and approximately

2. Extracting the most prominent features of the data and ignoring the rest

The next two sections look at each of these approaches.

 

1.4 Summarization

One of the most interesting forms of summarization is the PageRank idea, which made Google successful. The entire complex structure of the Web is summarized by a single number for each page.
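To make the "single number for each page" idea concrete, here is a rough sketch of PageRank computed by repeated iteration on a toy link graph. The damping factor beta=0.85, the dead-end handling, and the four-page graph are assumptions chosen for illustration; this is not Google's actual implementation.

```python
def pagerank(links, beta=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to. Every link
    target is assumed to also appear as a key of links.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (random jump).
        new_rank = {p: (1.0 - beta) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's rank evenly among the pages it links to.
                share = beta * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dead end: spread its rank evenly over all pages.
                share = beta * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical four-page web.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C, which everything links to, ends up with the highest score, and page D, which nothing links to, ends up with the lowest.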

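Another important form of summary is clustering. The book gives the example of plotting cholera cases on a map of London: a simple hand-drawn plot showed that the cases clustered around certain road intersections (the sites of contaminated wells), which revealed that people living near those intersections were more likely to fall ill.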

 

1.5 Feature Extraction

A complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections.

 

2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data.

However, data mining techniques are not always effective. Bonferroni’s Principle, introduced below, helps avoid misusing them.

 

2.1 Total Information Awareness

In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity.

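The Bush plan was eventually killed by Congress over privacy concerns, but it serves here purely as an example for discussing whether data mining techniques are effective.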

 

2.2 Bonferroni’s Principle

Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.

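Consider a situation like searching for terrorists, where we expect that there are few terrorists operating at any one time. If a data mining technique flags a huge number of suspected terrorist events every day, say millions of them, then the technique is useless, even if a handful of real events are among them.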
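The calculation that Bonferroni’s Principle asks for can be sketched in a few lines. The scenario and all the numbers below are illustrative assumptions that do not appear in these notes: a billion people, 1000 days of records, each person spending 1% of days in a hotel, 100,000 hotels, and a "suspicious event" defined as two people being at the same hotel on two different days.

```python
from math import comb

def expected_coincidences(people, days, p_hotel, hotels):
    """Expected number of (pair of people, pair of days) combinations in which
    both people are at the same hotel on each of the two days, assuming
    everyone picks hotel days and hotels independently at random."""
    # Chance that two given people are both in a hotel on a given day,
    # and that it is the same hotel.
    p_same_hotel_one_day = p_hotel * p_hotel / hotels
    # Chance that this coincidence happens on both of two given days.
    p_two_days = p_same_hotel_one_day ** 2
    return comb(people, 2) * comb(days, 2) * p_two_days

expected = expected_coincidences(people=10**9, days=1000, p_hotel=0.01, hotels=10**5)
print(f"expected purely random 'suspicious' events: {expected:,.0f}")
```

Under these assumptions random behavior alone produces roughly a quarter of a million such events. If only a handful of real plotters exist, almost everything the search turns up is bogus, which is exactly the situation the principle warns about.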

3 Things Useful to Know

If you study data mining, the following basic concepts are important:

1. The TF.IDF measure of word importance. 
2. Hash functions and their use. 
3. Secondary storage (disk) and its effect on running time of algorithms. 
4. The base e of natural logarithms and identities involving that constant. 
5. Power laws.

 

3.1 Importance of Words in Documents

In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic.

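This is a very typical data mining problem: topic keyword extraction.

The most basic technique for it is TF.IDF (Term Frequency times Inverse Document Frequency).

Term frequency (TF) is the number of times a given word appears in a document. This count is usually normalized to keep it from favoring long documents.

Inverse document frequency (IDF) measures how important a word is in general. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing that word, and then taking the logarithm of the quotient.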
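The two definitions above can be sketched as follows. Two details are assumptions of this sketch: TF is normalized by the count of the most frequent term in the same document (one common choice among several), and the logarithm is taken base 2. The function name tf_idf and the toy documents are also illustrative.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        max_count = max(counts.values()) if counts else 1
        scores.append({
            # TF (normalized count) times IDF (log of total docs / docs with term).
            term: (count / max_count) * math.log2(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
scores = tf_idf(docs)
for doc_scores in scores:
    best = max(doc_scores, key=doc_scores.get)
    print(best, round(doc_scores[best], 3))
```

A word such as "the" that occurs in every document gets an IDF of log2(1) = 0, so it never characterizes a topic no matter how often it appears.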

 

3.5 The Base of Natural Logarithms

The constant e = 2.7182818 · · · has a number of useful special properties. In particular, e is the limit of (1 + 1/x)^x as x goes to infinity.

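e is extremely useful. I am not a mathematician and do not fully understand it, but here e lets us simplify calculations through approximation: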

1. Consider (1 + a)^b, where a is small. We can approximate (1 + a)^b as e^(ab).

2. e^x = 1 + x + x^2/2 + x^3/6 + x^4/24 + · · ·
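Both identities are easy to check numerically. The sketch below compares (1 + a)^b with e^(ab) for a small a, and sums the first few terms of the series for e^x. The helper names and the sample values a = 0.01, b = 500, x = 0.5 are arbitrary choices for illustration.

```python
import math

def approx_power(a, b):
    """Approximate (1 + a)**b by e**(a*b); reasonable when a is small."""
    return math.exp(a * b)

def exp_series(x, terms=10):
    """Sum the first `terms` terms of 1 + x + x^2/2 + x^3/6 + ..."""
    total, term = 0.0, 1.0
    for k in range(terms):
        total += term
        term *= x / (k + 1)  # turns x^k/k! into x^(k+1)/(k+1)!
    return total

a, b = 0.01, 500
exact = (1 + a) ** b
approx = approx_power(a, b)
print(f"(1+a)^b = {exact:.2f}, e^(ab) = {approx:.2f}")

x = 0.5
print(f"series = {exp_series(x):.8f}, math.exp = {math.exp(x):.8f}")

# The limit definition: (1 + 1/n)^n approaches e as n grows.
limit_estimate = (1 + 1 / 1_000_000) ** 1_000_000
print(f"(1 + 1/n)^n for n = 10^6: {limit_estimate:.6f}")
```

With these values the approximation in item 1 is off by a few percent; it tightens as a gets smaller.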

 

3.6 Power Laws

There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables.
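Because a power law y = c * x^a becomes the straight line log y = log c + a * log x after taking logarithms, its exponent can be recovered with an ordinary least-squares line fit on the logs. The sketch below does this on synthetic data; the function name fit_power_law and the sample values (c = 1000, a = -2) are assumptions for illustration.

```python
import math

def fit_power_law(xs, ys):
    """Fit y = c * x**a by least squares on (log x, log y). Returns (a, c)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxy = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    sxx = sum((lx - mean_x) ** 2 for lx in log_x)
    a = sxy / sxx                         # slope of the line in log-log space
    c = math.exp(mean_y - a * mean_x)     # intercept, converted back from logs
    return a, c

# Synthetic data that follows y = 1000 * x^-2 exactly.
xs = [1, 2, 4, 8, 16, 32]
ys = [1000 * x ** -2 for x in xs]

a, c = fit_power_law(xs, ys)
print(f"exponent a = {a:.3f}, constant c = {c:.1f}")
```

On real data the points will not fall exactly on a line, but a roughly straight log-log plot is the usual sign that a power law is at work.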

 

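This chapter is an overview.

It covers what data mining is, the common approaches to it, the limits of mining techniques, and some frequently used basic concepts.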

This article is excerpted from 博客园 (cnblogs); originally published 2011-07-06.
