Mining of Massive Datasets – Data Mining

1 What is Data Mining?

The most commonly accepted definition of “data mining” is the discovery of “models” for data.

1.1 Statistical Modeling

Statisticians were the first to use the term “data mining.”

Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution(ex. Gaussian distribution) from which the visible data is drawn.

1.2 Machine Learning

There are some who regard data mining as synonymous with machine learning. There is no question that some data mining appropriately uses algorithms from machine learning.

Machine-learning practitioners use the data as a training set, to train an algorithm of one of the many types used by machine-learning practitioners, such as Bayes nets, support-vector machines, decision trees, hidden Markov models, and many others.

The typical case where machine learning is a good approach is when we have little idea of what we are looking for in the data.

For example, it is rather unclear what it is about movies that makes certain movie-goers like or dislike it.

机器学习适用于这种搞不清什么样的规则能被mining, 所以只需要把数据feed给ML算法, 它就可以替你做出判断, 而你不用关心这个具体的过程.

1.3 Computational Approaches to Modeling

前面谈了两种模型, 用什么方法来discovery模型?

There are many different approaches to modeling data, 这里介绍两种,

1. Summarizing the data succinctly and approximately

2. Extracting the most prominent features of the data and ignoring the rest

下面就来具体谈谈这两种方法.

1.4 Summarization

One of the most interesting forms of summarization is the PageRank idea, which made Google successful. The entire complex structure of the Web is summarized by a single number for each page.

Another important form of summary – Clustering. 书中举了个'Plotting cholera cases on a map of London’的例子, 通过简单的手工的plotting, 对点经行聚类建模, 挖掘出了靠近路口更容易得病的规则.

1.5 Feature Extraction

A complex relationship between objects is represented by finding the strongest statistical dependencies among these objects and using only those in representing all statistical connections.

2 Statistical Limits on Data Mining

A common sort of data-mining problem involves discovering unusual events hidden within massive amounts of data.

但是数据挖掘技术也不是总是有效的, 下面介绍Bonferroni’s Principle来避免滥用这种技术.

2.1 Total Information Awareness

In 2002, the Bush administration put forward a plan to mine all the data it could find, including credit-card receipts, hotel records, travel data, and many other kinds of information in order to track terrorist activity.

当然bush这个计划由于隐私问题最终被议会否决了, 但是这儿只是作为个例子来讨论数据挖掘技术是否有效.

2.2 Bonferroni’s Principle

Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus.

In a situation like searching for terrorists, where we expect that there are few terrorists operating at any one time.
如果我们通过数据挖掘技术, 每天都挖掘出大量的, 上百万起的恐怖事件, 那么这样的技术就是无效的, 哪怕其中确实有若干恐怖事件...

3 Things Useful to Know

如果你学习数据挖掘, 下面这些基本概念很重要,

1. The TF.IDF measure of word importance.
2. Hash functions and their use.
3. Secondary storage (disk) and its effect on running time of algorithms.
4. The base e of natural logarithms and identities involving that constant.
5. Power laws.

3.1 Importance of Words in Documents

In several applications of data mining, we shall be faced with the problem of categorizing documents (sequences of words) by their topic. Typically, topics are identified by finding the special words that characterize documents about that topic.

这是个相当典型的数据挖掘问题...topic keywords extraction

而最基本的技术就是, TF.IDF (Term Frequency times Inverse Document Frequency).

词频（term frequency，TF）指的是某一个给定的词语在该文件中出现的次数。这个数字通常会被正规化，以防止它偏向长的文件

逆向文件频率（inverse document frequency，IDF）是一个词语普遍重要性的度量。某一特定词语的IDF，可以由总文件数目除以包含该词语之文件的数目，再将得到的商取对数得到

3.5 The Base of Natural Logarithms (自然对数)

The constant e = 2.7182818 · · · has a number of useful special properties. In particular, e is the limit of (1 + 1/x)^xas x goes to infinity.

e的用处很大, 我不学数学不明白, 这儿通过e可以用近似来简化计算,

1. Consider (1+a)^b, where a is small, We can thus approximate (1 + a)^b as e^ab.

2. e^x = 1 + x + x²/2 + x³/6 + x⁴/24 + · · ·

3.6 Power Laws (幂法则)

There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables.

这章就是综述性质的,

谈了什么是数据挖掘, 常用的方法思路是什么, 挖掘技术的局限是什么, 常用的基本概念.

本文章摘自博客园，原文发布日期：2011-07-06

时间： 2024-10-30 20:41:29

Mining of Massive Datasets – Data Mining

1 What is Data Mining?

1.1 Statistical Modeling

1.2 Machine Learning

Mining of Massive Datasets – Data Mining的相关文章

Mining of Massive Datasets – Mining Data Streams

Mining of Massive Datasets – Finding similar items

Mining of Massive Datasets – Link Analysis

做Data Mining，其实大部分时间都花在清洗数据

Alibaba Cloud and UK Met Office to Co-organise Tianchi Data Mining Contest

大数据的那些事儿

机器学习经典书籍介绍

书单推荐 | 数据挖掘和统计科学自学十大必备读物

史上最全“大数据”学习资源整理