Mining of Massive Datasets – Mining Data Streams

Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it.

In this chapter, we shall make another assumption: data arrives in a stream or streams, and if it is not processed immediately or stored, then it is lost forever. Moreover, we shall assume that the data arrives so rapidly that it is not feasible to store it all in active storage (i.e., in a conventional database), and then interact with it at the time of our choosing.

In big-data processing, especially for sources such as Twitter, mining streaming data is one of the main problems to be faced.

This chapter presents methods and strategies for processing stream data in its most general form.

 

1 The Stream Data Model

1.1 A Data-Stream-Management System

Any number of streams can enter the system. Each stream can provide elements at its own schedule; they need not have the same 
data rates or data types, and the time between elements of one stream need not be uniform. The fact that the rate of arrival of stream elements is not under the control of the system distinguishes stream processing from the processing of data that goes on within a database-management system.

 

1.3 Stream Queries

By a query we mean this: based on the data a stream provides, a user wants certain information, so he poses a query, and the system analyzes the stream, using data-mining techniques, to answer it.

There are two ways that queries get asked about streams.

Standing queries are fixed queries defined in advance, and they cover most of the cases we need to consider. For example, for a stream of temperature readings from a weather station, we can write a program ahead of time that reports each day's maximum temperature. The problem is that such a system answers only that question; if I then ask for the average temperature, it has no answer...

Ad-hoc queries: a question asked once about the current state of a stream or streams. These one-off questions, not prepared for in advance, are harder to handle. We cannot know the specific query beforehand, and we cannot store entire streams and compute answers on demand.

A common approach is to store a sliding window of each stream in the working store.

A sliding window can be the most recent n elements of a stream, for some n, or it can be all the elements that arrived within the last t time units, e.g., one day.
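Both kinds of window are easy to maintain with a deque. This is my own illustrative sketch, not code from the book:

```python
from collections import deque

class CountWindow:
    """Count-based sliding window: keep the most recent n elements."""
    def __init__(self, n):
        self.window = deque(maxlen=n)  # old elements fall off automatically

    def add(self, elem):
        self.window.append(elem)

class TimeWindow:
    """Time-based sliding window: keep elements from the last t time units."""
    def __init__(self, t):
        self.t = t
        self.window = deque()  # (timestamp, element), oldest first

    def add(self, ts, elem):
        self.window.append((ts, elem))
        # Evict everything that arrived more than t time units ago.
        while self.window and self.window[0][0] <= ts - self.t:
            self.window.popleft()
```

The count-based form answers "the last n elements"; the time-based form answers "everything in the last t units, e.g., one day."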

 

1.4 Issues in Stream Processing

Streams often deliver elements very rapidly. We must process elements in real time, without recourse to archival storage, or we lose the opportunity to process them at all.

It is often much more efficient to get an approximate answer to our problem than an exact solution.

 

2 Sampling Data in a Stream

As our first example of managing streaming data, we shall look at extracting reliable samples from a stream.

Why sample? Because streams are often very large; for acceptable performance we cannot process all the data, so sampling to produce approximate answers is a common strategy.

In particular, sampling is also a way to answer ad-hoc queries.

 

2.3 The General Sampling Problem

Our stream consists of tuples with n components. A subset of the components are the key components, on which the selection of the sample will be based.

To take a sample of size a/b, we hash the key value for each tuple to b buckets, and accept the tuple for the sample if the hash value is less than a.
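A sketch of this hash-based selection (the hash function here is SHA-256, my own choice for illustration; any hash that distributes keys uniformly over the buckets works):

```python
import hashlib

def in_sample(key, a, b):
    """Hash the tuple's key components to one of b buckets and accept
    the tuple iff the bucket number is less than a, giving an a/b sample."""
    bucket = int(hashlib.sha256(str(key).encode()).hexdigest(), 16) % b
    return bucket < a
```

Because the decision depends only on the key, every tuple with the same key is either entirely in or entirely out of the sample, which is exactly what key-based sampling requires.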

 

3 Filtering Streams

Another common process on streams is selection, or filtering.

In this section, we shall discuss the technique known as “Bloom filtering” as a way to eliminate most of the tuples that do not meet the criterion.

This step is very important in big-data processing: if we can filter a big dataset down to the data we actually need (often not big at all), processing efficiency improves greatly.
It is also a fresh way of thinking about big-data mining. The usual refrain is to add servers and use cloud computing to increase processing power; but looked at from the other side, do you really need all of the big data? Perhaps not...

 

3.1 A Motivating Example

Suppose we have a set S of one billion allowed email addresses – those that we will allow through because we believe them not to be spam. Faced with a massive stream of email addresses, we need to filter out every address not in S.

Since the typical email address is 20 bytes or more, it is not reasonable to store S in main memory.

A Bloom filter can be used here. In the technique known as Bloom filtering, we use main memory as a bit array.

 

3.2 The Bloom Filter

A Bloom filter consists of: 
1. An array of n bits, initially all 0’s. 
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values to n buckets, corresponding to the n bits of the bit-array. 
3. A set S of m key values.

The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while rejecting most of the stream elements whose keys are not in S.

 

To initialize the bit array, begin with all bits 0.

Take each key value in S and hash it using each of the k hash functions. Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.

To test a key K that arrives in the stream, check that all of h1(K), h2(K), . . . , hk(K) are 1’s in the bit-array.

If all are 1’s, then let the stream element through.

If one or more of these bits are 0, then K could not be in S, so reject the stream element.
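The steps above can be sketched as follows. This is my own illustration; deriving the k hash functions from SHA-256 with integer salts is an assumption, not the book's prescription:

```python
import hashlib

class BloomFilter:
    def __init__(self, n, k):
        self.n, self.k = n, k
        self.bits = [0] * n       # the n-bit array, initially all 0

    def _positions(self, key):
        # k hash functions h_1..h_k, each mapping keys to one of n buckets
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.n

    def add(self, key):
        # Initialization: set bit h_i(K) to 1 for every hash function h_i
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # All k bits 1 -> let the element through (possibly a false positive);
        # any bit 0 -> the key is definitely not in S.
        return all(self.bits[pos] for pos in self._positions(key))
```

Note the asymmetry: a negative answer is certain, while a positive answer may be wrong.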

 

3.3 Analysis of Bloom Filtering

Bloom filtering has two problems: 'false positives', and the inability to delete entries.

The deletion problem can be addressed by keeping counts, but a counter takes several bits per bucket instead of one, which costs more memory, so there is a trade-off.

 

If a key value is in S, then the element will surely pass through the Bloom filter. However, if the key value is not in S, it might still pass; such an element is called a 'false positive'.

The model to use is throwing darts at targets. Suppose we have x targets and y darts.

Each of the y darts lands on one of the x targets at random; after all the darts are thrown, we ask which targets were hit.

The probability that a given dart will not hit a given target is (x − 1)/x.

The probability that none of the y darts will hit a given target is ((x − 1)/x)^y.
We can write this expression as (1 − 1/x)^{x(y/x)}.

By the natural-logarithm approximation from Chapter 1, when x is large, (1 − 1/x)^x ≈ 1/e.

So the probability that none of the y darts will hit a given target is e^{−y/x}.

We can apply the rule to the more general situation, where set S has m members, the array has n bits, and there are k hash functions.

The number of targets is x = n, and the number of darts is y = km. Thus, the probability that a bit remains 0 is e^{−km/n}.

 

In general, the probability of a false positive is the probability of a 1 bit, which is 1 − e^{−km/n}, raised to the kth power: (1 − e^{−km/n})^k.

From this we can see that two quantities govern the false-positive rate: km/n and k.

The smaller km/n is (i.e., the larger n is), the lower the false-positive rate; but increasing n costs more storage.

Increasing k also lowers the false-positive rate, at the cost of computing more hash functions, though only up to the optimum k ≈ (n/m) ln 2; beyond that, extra hash functions set too many bits and the rate rises again.

So a balance is needed: higher accuracy demands more resources...
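The trade-off is easy to explore numerically with the formula above. A small sketch; taking n = 8·10^9 bits and m = 10^9 keys assumes one gigabyte of memory for the billion-address example:

```python
import math

def false_positive_rate(m, n, k):
    """(1 - e^{-km/n})^k: probability that all k tested bits are 1
    for a key not in S."""
    return (1 - math.exp(-k * m / n)) ** k

m, n = 10**9, 8 * 10**9          # one billion keys, one gigabyte of bits
for k in (1, 2, 6, 12):
    print(k, false_positive_rate(m, n, k))

# The rate falls as k grows, but only up to the optimum k = (n/m) ln 2,
# about 5.5 here; with k = 12 the rate is already worse than with k = 6.
```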

 

4 Counting Distinct Elements in a Stream

4.1 The Count-Distinct Problem

Suppose stream elements are chosen from some universal set. We would like to know how many different elements have appeared in the stream, counting either from the beginning of the stream or from some known time in the past.

 

The obvious way to solve the problem is to keep in main memory a list of all the elements seen so far in the stream. Keep them in an efficient search structure such as a hash table or search tree, so one can quickly add new elements and check whether or not the element that just arrived on the stream was already seen.

But if the stream's data volume is too large, this approach is no longer feasible.

 

4.2 The Flajolet-Martin Algorithm

http://www.pittsburgh.intel-research.net/people/gibbons/papers/distinct-values-chapter.pdf

The idea behind the Flajolet-Martin Algorithm is that the more different elements we see in the stream, the more different hash-values we shall see.

In short, the idea is that when the data volume is large, counting elements one by one is inefficient; a better approach is to compute an approximate count from some feature of the data.

If we hash the data in the stream, then the more distinct values there are, the more varied the resulting hash values will be, so we can estimate the number of distinct items from the variability of the hash values.

Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in some number of 0’s, possibly none. Call this number the tail length for a and h.

Let R be the maximum tail length of any a seen so far in the stream.

Then we shall use the estimate 2^R for the number of distinct elements seen in the stream.
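A single-hash sketch of the idea (my own illustration; a practical implementation combines many hash functions, e.g., by grouping and averaging estimates, because a single 2^R has very high variance):

```python
import hashlib

def tail_length(value):
    """Number of trailing 0 bits in the binary representation of value."""
    if value == 0:
        return 0
    n = 0
    while value % 2 == 0:
        n += 1
        value //= 2
    return n

def flajolet_martin(stream):
    """Estimate the number of distinct elements as 2^R, where R is the
    maximum tail length of h(a) over all elements a seen so far."""
    R = 0
    for a in stream:
        h = int(hashlib.sha256(str(a).encode()).hexdigest(), 16)
        R = max(R, tail_length(h))
    return 2 ** R
```

Duplicates cannot raise R, since the same element always hashes to the same value; only genuinely new elements can produce a longer tail.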

 

I do not fully understand why this algorithm works; the book's explanation is also unclear. See the paper linked above for the details.

 

5 Estimating Moments

In this section we consider a generalization of the problem of counting distinct elements in a stream. The problem, called computing “moments,” involves the distribution of frequencies of different elements in the stream.

5.1 Definition of Moments

Suppose a stream consists of elements chosen from a universal set. Assume the universal set is ordered so we can speak of the ith element for any i. Let m_i be the number of occurrences of the ith element for any i. Then the kth-order moment (or just kth moment) of the stream is the sum over all i of (m_i)^k.
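A direct computation of the definition, feasible only when the counts fit in memory (for large streams the point of this section is to estimate moments without storing all the counts):

```python
from collections import Counter

def kth_moment(stream, k):
    """Sum over distinct elements i of (m_i)^k, where m_i is the number
    of occurrences of element i in the stream."""
    counts = Counter(stream)
    return sum(m ** k for m in counts.values())
```

Two special cases give intuition: the 0th moment is the number of distinct elements, and the 1st moment is the length of the stream.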

It is not yet clear to me what this kth moment is used for... I will study it further when I need it.

 

6 Counting Ones in a Window

We now turn our attention to counting problems for streams. Suppose we have a window of length N on a binary stream. We want at all times to be able to answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N.

This is a very common problem: counting item occurrences in a stream over some period of time (a window). For example, for a tweet stream, count how many tweets a certain user posted in one day, or how many tweets discussed a certain topic, and so on. The simple approach is to store the stream data in a database first and run a SELECT to aggregate when needed; that takes no special technique and is good enough for ordinary applications. But what we want to discuss is the non-ordinary case, where a more sophisticated approach is needed.

If you cannot store all of the stream data, how do you compute this count?
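The book's answer is the DGIM algorithm (Datar-Gionis-Indyk-Motwani): store O(log² N) bits of bucket summaries instead of the N window bits themselves, at the cost of an estimate that can be off by up to 50%. A simplified sketch of my own, not tuned for performance:

```python
from collections import deque

class DGIM:
    """Approximate count of 1s in the last N bits of a binary stream.
    Buckets are (timestamp of rightmost 1, size); sizes are powers of 2,
    with at most two buckets of each size."""
    def __init__(self, N):
        self.N = N
        self.t = 0
        self.buckets = deque()   # newest bucket first

    def add(self, bit):
        self.t += 1
        # Drop buckets that have slid entirely out of the window.
        while self.buckets and self.buckets[-1][0] <= self.t - self.N:
            self.buckets.pop()
        if bit != 1:
            return
        self.buckets.appendleft((self.t, 1))
        # Restore the invariant: at most two buckets of any one size,
        # merging the two oldest of a size into one of double the size.
        size = 1
        while True:
            idx = [i for i, (_, s) in enumerate(self.buckets) if s == size]
            if len(idx) <= 2:
                break
            j = idx[-2]                 # newer of the two oldest
            ts = self.buckets[j][0]     # the merged bucket keeps its timestamp
            del self.buckets[idx[-1]]
            del self.buckets[j]
            self.buckets.insert(j, (ts, size * 2))
            size *= 2

    def count(self, k):
        """Estimate the number of 1s among the last k bits (k <= N)."""
        cutoff = self.t - k
        total = oldest = 0
        for ts, size in self.buckets:   # newest first
            if ts > cutoff:
                total += size
                oldest = size           # ends as the oldest included bucket
        return total - oldest // 2      # count only half of the oldest bucket
```

Only the oldest bucket straddling the window boundary is uncertain, which is why the error is bounded by half that bucket's size.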

I did not fully follow this part... I will come back to it when I need it.

Personally, I feel this book does not analyze the problems thoroughly enough; it cites algorithms rather superficially, without much in-depth analysis...

 

7 Decaying Windows

7.1 The Problem of Most-Common Elements

Suppose we have a stream whose elements are the movie tickets purchased all over the world, with the name of the movie as part of the element. We want to keep a summary of the stream that is the most popular movies “currently.”

The problem is that "currently" is usually ill-defined and hard to measure: how recent should count as current?

For example, a movie that sold n tickets in each of the last 10 weeks is probably more popular than a movie that sold 2n tickets last week but nothing in previous weeks.

The solution is to replace a sharp window boundary with a fuzzy one...

We use a decaying time window: the closer an element is to the current time, the higher its weight, and the weights decay steadily going back in time.
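A sketch of the exponentially decaying window the book describes: on each arrival, every score is multiplied by (1 − c) and the arriving element's score gains 1. The decay constant c and the pruning threshold here are my illustrative choices:

```python
class DecayingCounts:
    """Exponentially decaying counts over a stream of element names."""
    def __init__(self, c=1e-6, threshold=0.5):
        self.c = c                  # decay constant (tiny in practice)
        self.threshold = threshold  # drop scores once they decay below this
        self.scores = {}

    def add(self, elem):
        # Decay every existing score by a factor of (1 - c) ...
        for key in list(self.scores):
            self.scores[key] *= 1.0 - self.c
            if self.scores[key] < self.threshold:
                del self.scores[key]
        # ... then credit the arriving element with weight 1.
        self.scores[elem] = self.scores.get(elem, 0.0) + 1.0

    def most_popular(self):
        return max(self.scores, key=self.scores.get)
```

A movie selling steadily keeps refreshing its score, while a one-week spike decays away; and since all scores sum to at most 1/c, pruning at 1/2 keeps at most 2/c scores in memory.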

This article is excerpted from 博客园 (cnblogs); original publication date: 2011-08-30.

