Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it.
In this chapter, we shall make another assumption: data arrives in a stream or streams, and if it is not processed immediately or stored, then it is lost forever. Moreover, we shall assume that the data arrives so rapidly that it is not feasible to store it all in active storage (i.e., in a conventional database), and then interact with it at the time of our choosing.
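When processing big data, especially data from sources such as Twitter, mining streaming data is one of the main problems we have to face.
This chapter presents methods and strategies for processing streams in the most general setting.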
1 The Stream Data Model
1.1 A Data-Stream-Management System
Any number of streams can enter the system. Each stream can provide elements at its own schedule; they need not have the same data rates or data types, and the time between elements of one stream need not be uniform.
The fact that the rate of arrival of stream elements is not under the control of the system distinguishes stream processing from the processing of data that goes on within a database-management system.
1.3 Stream Queries
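A query is a request for information that a user expects to derive from the data in a stream. The user poses the query, and the system analyzes the stream with data-mining techniques to answer it.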
There are two ways that queries get asked about streams.
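Standing queries: fixed queries defined in advance. These cover most of the cases we need to consider. For example, given a stream of temperature readings collected from a weather station, we can write a program ahead of time that reports the maximum temperature for each day. The drawback is that the system can answer only that one question; if we then ask for the average temperature, it has no answer.
Ad-hoc queries: a question asked once about the current state of a stream or streams. These are posed on the spot with no preparation, which makes them harder to handle. We cannot know the query in advance, and we cannot store every stream in full and compute the answer on demand.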
A common approach is to store a sliding window of each stream in the working store.
A sliding window can be the most recent n elements of a stream, for some n, or it can be all the elements that arrived within the last t time units, e.g., one day.
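The two kinds of sliding window described above can be sketched as follows. This is a minimal illustration, not a production design; the class names `CountWindow` and `TimeWindow` are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import deque


class CountWindow:
    """Keeps the most recent n elements of a stream."""

    def __init__(self, n):
        # A deque with maxlen drops the oldest item automatically.
        self.items = deque(maxlen=n)

    def add(self, element):
        self.items.append(element)


class TimeWindow:
    """Keeps every element that arrived within the last t time units."""

    def __init__(self, t):
        self.t = t
        self.items = deque()  # (timestamp, element) pairs, oldest first

    def add(self, timestamp, element):
        self.items.append((timestamp, element))
        # Evict everything that is now t or more time units old.
        while self.items and self.items[0][0] <= timestamp - self.t:
            self.items.popleft()
```

An ad-hoc query can then be answered from `items` alone, at the cost of seeing only the window rather than the whole stream.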
1.4 Issues in Stream Processing
Streams often deliver elements very rapidly. We must process elements in real time, without accessing the archival storage, or we lose the opportunity to process them at all.
It is often much more efficient to get an approximate answer to our problem than an exact solution.
2 Sampling Data in a Stream
As our first example of managing streaming data, we shall look at extracting reliable samples from a stream.
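Why sample? Streams are often very large, and processing every element is not possible at acceptable performance, so sampling to produce an approximate result is a common approach.
Ad-hoc queries in particular can also be answered from a sample.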
2.3 The General Sampling Problem
Our stream consists of tuples with n components. A subset of the components are the key components, on which the selection of the sample will be based.
To take a sample of size a/b, we hash the key value for each tuple to b buckets, and accept the tuple for the sample if the hash value is less than a.
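The hashing rule above can be sketched as follows. This is an illustration under assumptions: the function names `bucket` and `in_sample` are hypothetical, and MD5 is used only as a convenient stable hash, not because the book prescribes it.

```python
import hashlib


def bucket(key, b):
    """Map a key to one of b buckets with a stable hash (MD5, as an assumption)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % b


def in_sample(record, key_indexes, a, b):
    """Accept the tuple if its key components hash to a bucket below a."""
    key = tuple(record[i] for i in key_indexes)
    return bucket(key, b) < a
```

For example, `in_sample(("user1", "some query", 1234), [0], 1, 10)` keeps roughly one key in ten. Because the decision depends only on the key, every tuple with the same key is either always kept or always dropped.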
3 Filtering Streams
Another common process on streams is selection, or filtering.
In this section, we shall discuss the technique known as “Bloom filtering” as a way to eliminate most of the tuples that do not meet the criterion.
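This step matters a great deal in big data processing: if we can filter out and select only the data we really need (which is often not big at all), processing becomes far more efficient.
This is also a newer way to think about big data mining. The earlier reflex was always to add servers and use cloud computing for more processing power. Looked at from the other side: do you actually need all of that data? Perhaps not...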
3.1 A Motivating Example
Suppose we have a set S of one billion allowed email addresses: those that we will allow through because we believe them not to be spam. Faced with a huge stream of email addresses, we need to filter out every address that is not in S.
Since the typical email address is 20 bytes or more, it is not reasonable to store S in main memory.
A Bloom filter can be used here. In the technique known as Bloom filtering, we use the available main memory as a bit array.
3.2 The Bloom Filter
A Bloom filter consists of:
1. An array of n bits, initially all 0’s.
2. A collection of hash functions h1, h2, . . . , hk. Each hash function maps “key” values to n buckets, corresponding to the n bits of the bit-array.
3. A set S of m key values.
The purpose of the Bloom filter is to allow through all stream elements whose keys are in S, while rejecting most of the stream elements whose keys are not in S.
To initialize the bit array, begin with all bits 0.
Take each key value in S and hash it using each of the k hash functions. Set to 1 each bit that is hi(K) for some hash function hi and some key value K in S.
To test a key K that arrives in the stream, check that all of h1(K), h2(K), . . . , hk(K) are 1’s in the bit-array.
If all are 1’s, then let the stream element through.
If one or more of these bits are 0, then K could not be in S, so reject the stream element.
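The initialization and test steps above can be sketched as follows. This is a minimal sketch, not a tuned implementation: the class name `BloomFilter` and its methods are hypothetical, and the k hash functions are simulated by salting SHA-256 with the index i, which is an assumption made for illustration.

```python
import hashlib


class BloomFilter:
    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = bytearray((n_bits + 7) // 8)  # all bits start at 0

    def _positions(self, key):
        # Simulate h1..hk by salting one hash with the index i (an assumption).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def add(self, key):
        """Set to 1 every bit that some hash function maps this key to."""
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):
        """True if all k bits are 1; False means the key is definitely not in S."""
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))
```

Calling `add` for every key in S builds the filter; `might_contain` is then the per-element test applied to the stream.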
3.3 Analysis of Bloom Filtering
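Bloom filtering has two problems: false positives, and the inability to delete entries.
The deletion problem can be solved by keeping counts, but then each bucket needs several bits instead of one, which costs more memory, so there is a trade-off to balance.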
If a key value is in S, then the element will surely pass through the Bloom filter. However, if the key value is not in S, it might still pass; this is called a 'false positive'.
The model to use is throwing darts at targets. Suppose we have x targets and y darts.
Each of the y darts is equally likely to hit any of the targets. After all of them have been thrown...
The probability that a given dart will not hit a given target is $(x-1)/x$.
The probability that none of the y darts will hit a given target is $\left(\frac{x-1}{x}\right)^{y}$.
We can write this expression as $\left(1-\frac{1}{x}\right)^{x(y/x)}$. By the natural-logarithm approximation from Chapter 1, when x is large, $\left(1-\frac{1}{x}\right)^{x} \approx 1/e$.
So the probability that none of the y darts will hit a given target is approximately $e^{-y/x}$.
We can apply the rule to the more general situation, where set S has m members, the array has n bits, and there are k hash functions.
The number of targets is x = n, and the number of darts is y = km. Thus, the probability that a bit remains 0 is $e^{-km/n}$.
In general, the probability of a false positive is the probability of a 1 bit, which is $1-e^{-km/n}$, raised to the kth power: $(1-e^{-km/n})^{k}$.
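From this we can see that two factors determine the false-positive rate: km/n and k.
For a fixed k, the smaller km/n is, that is, the larger n is, the lower the false-positive rate. Increasing n costs more storage.
Increasing k raises the exponent, which lowers the false-positive rate at first, but it also raises km/n, because each key then sets more bits. The rate is lowest near $k = (n/m)\ln 2$; beyond that point, adding hash functions makes it worse, and every extra hash function also adds computation.
So there is a balance to strike: higher accuracy requires more resources...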
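To make the trade-off concrete, the formula $(1-e^{-km/n})^{k}$ can be evaluated directly. The function names `false_positive_rate` and `best_k` below are hypothetical, and the numbers in the usage note assume one billion keys and a bit array of eight billion bits.

```python
import math


def false_positive_rate(n, m, k):
    """Approximate false-positive probability for n bits, m keys, k hash functions."""
    return (1 - math.exp(-k * m / n)) ** k


def best_k(n, m):
    """Real-valued k that minimizes the approximation above: (n/m) * ln 2."""
    return (n / m) * math.log(2)
```

With n = 8e9 and m = 1e9, `false_positive_rate` gives roughly 0.118 for k = 1 and roughly 0.049 for k = 2, and `best_k` is about 5.5. A very large k such as 16 does worse than k = 6, because the array fills up with 1's.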
4 Counting Distinct Elements in a Stream
4.1 The Count-Distinct Problem
Suppose stream elements are chosen from some universal set. We would like to know how many different elements have appeared in the stream, counting either from the beginning of the stream or from some known time in the past.
The obvious way to solve the problem is to keep in main memory a list of all the elements seen so far in the stream. Keep them in an efficient search structure such as a hash table or search tree, so one can quickly add new elements and check whether or not the element that just arrived on the stream was already seen.
But if the stream holds too much data to keep in main memory, this approach no longer works.
4.2 The Flajolet-Martin Algorithm
http://www.pittsburgh.intel-research.net/people/gibbons/papers/distinct-values-chapter.pdf
The idea behind the Flajolet-Martin Algorithm is that the more different elements we see in the stream, the more different hash-values we shall see.
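In short, when the volume of data is large, counting elements one by one is inefficient; a better approach is to compute an approximate count from some property of the data.
If we hash the data in the stream, then the more distinct items there are, the more varied the hash values become, so we can estimate the number of distinct items from the variety of the hash values.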
Whenever we apply a hash function h to a stream element a, the bit string h(a) will end in some number of 0’s, possibly none. Call this number the tail length for a and h.
Let R be the maximum tail length of any a seen so far in the stream.
Then we shall use the estimate $2^{R}$ for the number of distinct elements seen in the stream.
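I do not fully understand why this algorithm works, and the book does not explain it clearly either; for details, see the paper linked above.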
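One way to see the intuition: about half of all hash values end in at least one 0, a quarter end in at least two 0's, and so on, so a tail of length r usually shows up only after about $2^{r}$ distinct elements have been hashed. Repeated elements hash to the same value and therefore never change R. The sketch below uses a single hash function, which gives only a coarse estimate that is always a power of 2; the names `tail_length` and `fm_estimate`, the salted SHA-256 hash, and the 32-bit width are all assumptions made for illustration.

```python
import hashlib


def tail_length(value):
    """Number of trailing 0 bits in value (0 is treated as tail length 0 here)."""
    if value == 0:
        return 0
    # value & -value isolates the lowest set bit.
    return (value & -value).bit_length() - 1


def fm_estimate(stream, salt="h0"):
    """Estimate the number of distinct elements as 2**R for one hash function."""
    max_tail = 0
    for a in stream:
        digest = hashlib.sha256(f"{salt}:{a}".encode("utf-8")).digest()
        h = int.from_bytes(digest[:4], "big")  # 32-bit hash value
        max_tail = max(max_tail, tail_length(h))
    return 2 ** max_tail
```

Only `max_tail` is kept between elements, so the memory used does not grow with the stream.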
5 Estimating Moments
In this section we consider a generalization of the problem of counting distinct elements in a stream. The problem, called computing “moments,” involves the distribution of frequencies of different elements in the stream.
5.1 Definition of Moments
Suppose a stream consists of elements chosen from a universal set. Assume the universal set is ordered so we can speak of the ith element for any i. Let $m_i$ be the number of occurrences of the ith element for any i. Then the kth-order moment (or just kth moment) of the stream is the sum over all i of $(m_i)^{k}$.
I am not sure what the kth moment is used for... so I will look into it when I actually need it.
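A small exact computation may make the definition more concrete. This sketch stores every count, so it is not a streaming algorithm, only an illustration of the definition; the function name `moment` is hypothetical. Only elements that actually appear are counted, so the 0th moment is the number of distinct elements, the 1st moment is the length of the stream, and the 2nd moment grows as the frequencies become more uneven.

```python
from collections import Counter


def moment(stream, k):
    """Exact kth moment: the sum of (m_i)**k over the elements that appear."""
    counts = Counter(stream)
    return sum(m_i ** k for m_i in counts.values())
```

For the stream `["a", "b", "a", "c", "a", "b"]` the counts are 3, 2 and 1, so the 0th, 1st and 2nd moments are 3, 6 and 14.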
6 Counting Ones in a Window
We now turn our attention to counting problems for streams. Suppose we have a window of length N on a binary stream. We want at all times to be able to answer queries of the form “how many 1’s are there in the last k bits?” for any k ≤ N.
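This is a very common problem: counting how often an item occurs in a stream within some period of time (the window). For a tweet stream, for example, we might count the tweets a given user posted in one day, or the tweets discussing a given topic. The simple approach is to store the stream in a database first and run a SELECT whenever a count is needed. There is nothing sophisticated about that, and it is enough for ordinary applications, but what we want to discuss here is the unusual case, which calls for something more sophisticated.
If you cannot store the whole stream, how do you compute this count?
I did not follow this part... I will come back to it when I need it.
Personally, I feel this book does not analyze the problems thoroughly enough; it cites algorithms in broad strokes without much in-depth analysis...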
7 Decaying Windows
7.1 The Problem of Most-Common Elements
Suppose we have a stream whose elements are the movie tickets purchased all over the world, with the name of the movie as part of the element. We want to keep a summary of the stream that is the most popular movies “currently.”
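The problem is that "currently" is usually ill-defined and hard to measure: how long a period should count as "currently"?
A movie that sold n tickets in each of the last 10 weeks is probably more popular than a movie that sold 2n tickets last week but nothing in previous weeks.
The solution is to replace the sharp boundary with a fuzzy one...
This is done with a decaying time window: the closer an element is to the present, the higher its weight, and the weight keeps decaying as we go further back in time.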
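One common way to realize such a window is exponential decay, sketched below. This is an assumption about the form of the decay, not something stated above: each time an element arrives, every existing score is multiplied by (1 - c) and the arriving item's score is increased by 1. The class name `DecayingCounts`, the constant `c`, and the `threshold` below which a score is dropped are all hypothetical.

```python
class DecayingCounts:
    def __init__(self, c=1e-3, threshold=0.5):
        self.c = c                  # decay constant, assumed small
        self.threshold = threshold  # scores below this are forgotten
        self.scores = {}

    def add(self, item):
        """Decay every score, drop negligible ones, then credit the new item."""
        for key in list(self.scores):
            self.scores[key] *= 1 - self.c
            if self.scores[key] < self.threshold:
                del self.scores[key]
        self.scores[item] = self.scores.get(item, 0.0) + 1.0

    def most_popular(self):
        return max(self.scores, key=self.scores.get) if self.scores else None
```

Old elements are never removed at a sharp cutoff; their contribution simply shrinks until it falls below `threshold`. This version touches every score on each arrival, which is simple but not efficient for a large number of items.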
This article is excerpted from cnblogs (博客园); original publication date: 2011-08-30