[20141228]关于bloom filter.txt

[20141228]关于bloom filter.txt

--系统升级到11.2.0.4 ,执行计划经常出现bloom filter,自己对这些一点都不了解.
--自己做了google,能查到的资料不多,eygle的blog讲解,我水平有限,至少第1次没看懂,不过里面指向一个链接:
--http://antognini.ch/papers/BloomFilters20080620.pdf

--我仔细阅读,感觉还是不好理解,决定把里面的测试例子执行1次,后面做一个测试:

--实际上bloom filter很早就已经存在,1970年由Burton H.Bloom发明的.我转载里面的介绍:

A bloom filter is a data structure used to support membership queries.Simply put,a bloom filter is
used to test whether an element is a member of a giver set or not. Its main properties are:

. The amount of space needed to store the bloom filter is small compared to the amount of data
  belonging to the set being tested.
. The time needed to check whether an element is a member of a given set is independent of
  the number of elements contained in the set.
. False negatives not possible.
. False positives are possible, bit their frequency can be controlled. In practice,it is a tradeoff
  between space/time efficiency and the false positive frequency.

A bloom filter is based on an array of an bits (b1, b2, ..., bm) that are initially set to 0. To understand
how a bloom filter works, it is essential to describe how these bits are set and checked. For this
purpose, k independent hash functions (hi, h2, ..., hk), each returning a value between 1 and m,
are used. In order to "store" a given element into the bit array, each hash function must be
applied to it and, based on the return value r of each function (r1, r2, ..., rk), the bit with the offset r
is set to 1. Since there are k hash functions, up to k bits in the bit array are set to 1 (it might be
less because several hash functions might return the same value). The following figure is an
example where m = 16, k=4 and e is the element to be "stored" in the bit array.

r1 = h1(e) = 8  -> set b8 to 1
r2 = h2(e) = 1  -> set b1 to 1
r3 = h3(e) = 6  -> set b6 to 1
r4 = h4(e) = 13 -> set b13 to 1

To check whether an element is /"stored" in the bit array, the process is similar. The only
difference is that instead of setting k bits in the array, it is checked whether any of them are set to
0. If yes, this implies that the element is not /"stored" in the bit array.

Let's take a look at an example that illustrates how a bloom filter might be used in practice. An
application receives input data at a high throughput. Before processing it, it has to check whether
that data belongs to a given set stored in a database table. Note that the rejection rate of this
validation is very high. To perform the validation, the application has to call a remote service.
Unfortunately, the time needed for the network roundtrip and to actually do the lookup is far too
long. Since it is not possible to further optimize it (for example, the network latency is subject to
physical limitations), another method must be implemented. ln such cases, caches come to mind.
The general idea is to duplicate the data stored remotely in a local cache and then to do the
validation locally. In this specific case, let's say that it is not practicable. Not enough resources are
available locally to store the data. In such a situation, a bloom filter might be useful. In fact, as
described above, bloom filters need much less space than the actual set. Since the bloom filter is
subject to false positives, the idea is to use it to discard the maximum amount of input data before
calling the remote service. As a result, the original validation is not made faster but it is used
much less frequently.

--还是通过例子来说明问题:

/* Formatted on 2014/12/28 21:47:07 (QP5 v5.227.12220.39754) */
CREATE OR REPLACE PACKAGE bloom_filter
IS
   PROCEDURE init (p_m IN BINARY_INTEGER, p_n IN BINARY_INTEGER);

   FUNCTION add_value (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER;

   FUNCTION contain (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER;
END bloom_filter;
/

CREATE OR REPLACE PACKAGE BODY bloom_filter
IS
   TYPE t_bitarray IS TABLE OF BOOLEAN;

   g_bitarray   t_bitarray;
   g_m          BINARY_INTEGER;
   g_k          BINARY_INTEGER;

   PROCEDURE init (p_m IN BINARY_INTEGER, p_n IN BINARY_INTEGER)
   IS
   BEGIN
      g_m := p_m;
      g_bitarray := t_bitarray ();
      g_bitarray.EXTEND (p_m);
      FOR i IN g_bitarray.FIRST .. g_bitarray.LAST
      LOOP
         g_bitarray (i) := FALSE;
         --dbms_output.put_line(i);
      END LOOP;

      g_k := CEIL (p_m / p_n * LN (2));
   END init;

   FUNCTION add_value (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER
   IS
   BEGIN
      dbms_random.seed (p_value);

      FOR i IN 0 .. g_k
      LOOP
         g_bitarray (dbms_random.VALUE (1, g_m)) := TRUE;
      END LOOP;

      RETURN 1;
   END add_value;

   FUNCTION contain (p_value IN VARCHAR2)
      RETURN BINARY_INTEGER
   IS
      l_ret   BINARY_INTEGER := 1;
   BEGIN
      dbms_random.seed (p_value);

      FOR i IN 0 .. g_k
      LOOP
         IF NOT g_bitarray (dbms_random.VALUE (1, g_m))
         THEN
            l_ret := 0;
            EXIT;
         END IF;
      END LOOP;

      RETURN l_ret;
   END contain;
END bloom_filter;
/

create table t as select dbms_random.string('u',100) as value from dual connect by level

SCOTT@test> execute bloom_filter.init(16484,1000);
PL/SQL procedure successfully completed.

SCOTT@test> select count(bloom_filter.add_value(value)) from t where rownumCOUNT(BLOOM_FILTER.ADD_VALUE(VALUE))
------------------------------------
                                1000

SCOTT@test> select count(*) from t where bloom_filter.contain(value) =1;
  COUNT(*)
----------
      1006

--执行初始化bloom_filter.init,就是设置g_bitarray的16384个二进制位为0.并且设置 g_k := CEIL (p_m / p_n * LN (2))=12;
SCOTT@test> select ceil(16384/1000*ln(2)) from dual ;
CEIL(16384/1000*LN(2))
----------------------
                    12

--  bloom_filter.add_value 就是通过函数(这里是取随机数)变化加入g_bitarray中,相应的二进制位设置1.

-- 后面 执行的select count(*) from t where bloom_filter.contain(value) =1;检查1e4个字串转化的位是否在g_bitarray中,当然存
-- 在并不一定正确.
-- 可以发现存在6个冲突的.

-- 如果能把例子执行,很容易理解bloom filter算法.实际上利用少量的空间保存大量的信息.前面的例子使用16384位=2k的信息保存
1000*100的信息,当然也存在一定的冲突,实际上取决于位图数量以及保存信息的数量.

-

时间: 2024-09-25 08:15:16

[20141228]关于bloom filter.txt的相关文章

爬虫技术之bloom filter(含java代码)

在爬虫系统中,在内存中维护着两个关于URL的队列,ToDo队列和Visited队列,ToDo队列存放的是 爬虫从已经爬取的网页中解析出来的即将爬取的URL,但是网页是互联的,很可能解析出来的URL是已经 爬取到的,因此需要VIsited队列来存放已经爬取过的URL.当爬虫从ToDo队列中取出一个URL的时候, 先和Visited队列中的URL进行对比,确认此URL没有被爬取后就可以下载分析来.否则舍弃此URL,从 Todo队列取出下一个URL继续工作. 然后,我们知道爬虫在爬取网页时,网页的量是

PHP中实现Bloom Filter算法

 这篇文章主要介绍了PHP中实现Bloom Filter算法,本文直接给出实现代码,代码中给出详细注释,Bloom Filter算法介绍等内容,需要的朋友可以参考下     ? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57

wiredtiger LSM trees bloom filter 设计原理

最近在做的调研中,其中包括了mongodb wiredtiger LSM trees bloom filter 部分,秉承着好记性不如烂笔头的原则, 做个简单的记录,不保证完全正确,欢迎讨论交流. bloom filter 来源 wiredtiger 也提供了LSM trees 的结构作为持久化存储, 当in-memory tree 大小达到阈值后, 新的in-memory tree会被创建, 旧的in-memory tree会被刷到磁盘.LSM trees 天然支持更快写操作,然而对读操作就不

MyRocks之bloom filter

title: MySQL · mysql · myrocks之Bloom filter author: 张远 Bloom filter 简介 Bloom filter用于判断一个元素是不是在一个集合里,当一个元素被加入集合时,通过k个散列函数将这个元素映射成一个位数组中的k个点,把它们置为1.检索时如果这些点有任何一个为0,则被检元素一定不在:如果都是1,则被检元素很可能在.这就是布隆过滤器的基本思想. 优点:布隆过滤器存储空间和插入/查询时间都是常数O(k). 缺点:有一定的误算率,同时标准的

MySQL · myrocks · myrocks之Bloom filter

Bloom filter 简介 Bloom filter用于判断一个元素是不是在一个集合里,当一个元素被加入集合时,通过k个散列函数将这个元素映射成一个位数组中的k个点,把它们置为1.检索时如果这些点有任何一个为0,则被检元素一定不在:如果都是1,则被检元素很可能在.这就是布隆过滤器的基本思想. 优点:布隆过滤器存储空间和插入/查询时间都是常数O(k). 缺点:有一定的误算率,同时标准的Bloom Filter不支持删除操作. Bloom Filter通过极少的错误换取了存储空间的极大节省. 设

Bloom Filter Python

http://bitworking.org/news/380/bloom-filter-resources The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of a set. False positives are possible

布隆过滤器(Bloom Filter)的Java实现方法_java

布隆过滤器原理很简单:就是把一个字符串哈希成一个整数key,然后选取一个很长的比特序列,开始都是0,在key把此位置的0变为1:下次进来一个字符串,哈希之后的值key,如果在此比特位上的值也是1,那么就说明这个字符串存在了. 如果按照上面的做法,那就和哈希算法没有什么区别了,哈希算法还有重复的呢. 布隆过滤器是将一个字符串哈希成多个key,我还是按照书上的说吧. 先建立一个16亿二进制常量,然后将这16亿个二进制位全部置0.对于每个字符串,用8个不同的随机产生器(F1,F2,.....,F8)产

PHP中实现Bloom Filter算法_php技巧

<?php /*Bloom Filter算法来去重过滤. 介绍下Bloom Filter的基本处理思路:申请一批空间用于保存0 1信息,再根据一批哈希函数确定元素对应的位置,如果每个哈希函数对应位置的值为全部1,说明此元素存在.相反,如果为0,则要把对应位置的值设置为1.由于不同的元素可能会有相同的哈希值,即同一个位置有可能保存了多个元素的信息,从而导致存在一定的误判率. 如果申请空间太小,随着元素的增多,1会越来越多,各个元素冲突的机会越来越来大,导致误判率会越来越大.另外哈希函数的选择及个数

bloom filter概念讲解以及代码分析_C 语言

一. 简介1.什么是bloom filter?Bloom filter 是由 Howard Bloom 在 1970 年提出的二进制向量数据结构,它具有很好的空间和时间效率,被用来检测一个元素是不是集合中的一个成员,这种检测只会对在集合内的数据错判,而不会对不是集合内的数据进行错判,这样每个检测请求返回有"在集合内(可能错误)"和"不在集合内(绝对不在集合内)"两种情况,可见 Bloom filter 是牺牲了正确率换取时间和空间. 2.bloom filter的计