MapReduce with MongoDB and Python[ZT]

From Artificial Intelligence in Motion, by Marcel Pinheiro Caraciolo. (The images published on Artificial Intelligence in Motion are hosted outside the firewall, so they have been moved to cnblogs.)

Hi all,

In this post, I'll demonstrate a map-reduce example with MongoDB and server-side JavaScript. Since I've been working with this technology recently, I thought it would be useful to present a simple example of how it works and how to integrate it with Python.

But what is MongoDB?

For those who don't know what MongoDB is or the basics of how to use it, it is important to explain a little about the NoSQL movement. Currently, there are several databases that break with the requirements of traditional relational database systems. The main characteristics shared by many NoSQL databases are:

  • SQL is not used as the query language; queries go through other APIs, often exchanging data in formats such as JSON or BSON.
  • Atomic operations are not always guaranteed (for example, across multiple records).
  • Distributed and horizontally scalable.
  • Schemas don't have to be predefined (schemaless).
  • Non-tabular data storage (e.g., key-value, object, graph).

Although it is not obvious, NoSQL is usually read as "Not Only SQL". This approach has been generating a lot of buzz since 2009. You can find more information about it here and here. It is important to note that non-relational databases are not a complete replacement for relational databases. You need to know the pros and cons of each approach and decide which is most appropriate for the scenario you're facing.

MongoDB is one of the most popular NoSQL databases today and is the focus of this article. It is a schemaless, document-oriented, high-performance, scalable database that stores data as JSON-style documents of key-value pairs. It also includes some features familiar from relational databases, such as indexes and dynamic queries. It is used in production today by more than 40 websites, including SourceForge, GitHub, Electronic Arts and The New York Times.

One of the features I like best in MongoDB is Map-Reduce. In the next section I will explain how it works, illustrated with a simple example using MongoDB and Python.

If you want to install MongoDB or get more information, you can download it here and read a nice tutorial here.

Map-Reduce

MapReduce is a programming model for processing and generating large data sets. It is a framework introduced by Google to support parallel computations over large data sets spread across clusters of computers. MapReduce is now a popular model in distributed computing, inspired by the map and reduce functions commonly used in functional programming. It can be considered 'data-oriented': it processes data in two primary steps, Map and Reduce, and the query is executed on several data sources simultaneously. Applying the request to each input record of the data set is called 'Map', and aggregating the intermediate results of the mapping function into a consolidated result is called 'Reduce'. The MapReduce paper, which has more details, can be read here.

Today there are several implementations of MapReduce, such as Hadoop, Disco and Skynet. The most famous is Hadoop, an open-source project implemented in Java. MongoDB also has an implementation that is similar in spirit to Hadoop, with all input coming from a collection and output going to a collection. In practical terms, Map-Reduce in MongoDB is useful for batch manipulation of data and for aggregation operations. In real-world scenarios, where you would have used GROUP BY in SQL, map/reduce is the equivalent tool in MongoDB.
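To make the GROUP BY comparison concrete, here is a minimal sketch in plain Python. It does not touch MongoDB at all: it runs a GROUP BY in an in-memory SQLite database and then computes the same totals with a hand-written map step, a grouping step and a reduce step. The `orders` rows and the function names `map_order` and `reduce_amounts` are made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample rows: (customer, amount). Not from the original post.
orders = [("ana", 10), ("bob", 5), ("ana", 7), ("bob", 1), ("cid", 3)]

# SQL side: GROUP BY with an aggregate, using an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", orders)
sql_totals = dict(
    conn.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
conn.close()


# Map/reduce side: map emits (key, value) pairs, the values are grouped by
# key, and reduce folds each group into a single value.
def map_order(order):
    customer, amount = order
    yield customer, amount


def reduce_amounts(key, values):
    return sum(values)


groups = defaultdict(list)
for order in orders:
    for key, value in map_order(order):
        groups[key].append(value)

mr_totals = {key: reduce_amounts(key, values) for key, values in groups.items()}

print(sql_totals == mr_totals)
```

The key emitted by the map step plays the role of the GROUP BY column, and the reduce function plays the role of the aggregate (SUM here).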

Now that we have introduced Map-Reduce, let's see how to access MongoDB from Python.

PyMongo

PyMongo is a Python distribution containing tools for working with MongoDB, and is the recommended way to work with MongoDB from Python. It's easy to install and to use. See here how to install and use it.

Map-Reduce in Action

Now let's see Map-Reduce in action. To demonstrate it, I've decided to use one of the classic problems solved with map-reduce: counting word frequency across a series of documents. It's a simple problem and well suited to a map-reduce query.

I've decided to use two samples for this task. The first is a list of simple sentences that illustrates how map-reduce works. The second is Obama's 2009 speech following his election as president; it is used to show a real example, illustrated by the code.

Let's consider the diagram below to help demonstrate how the map-reduce could be distributed. It shows four sentences that are split into words and grouped by the map function, and then reduced independently (aggregated) by the reduce function. This is interesting because it means our query can be distributed across separate nodes (computers), which speeds up the word count. Note also that the example below shows a balanced tree, but the tree could be unbalanced or even contain some redundancy.
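The distribution idea can be simulated in a few lines of plain Python. The sketch below is only an illustration: the four sentences are invented stand-ins for the ones in the diagram, and the two "nodes" are ordinary function calls rather than separate machines. Each node maps and reduces its own sentences, and the partial counts are then reduced once more to get the final result.

```python
from collections import defaultdict

# Hypothetical input: four short sentences standing in for the diagram's documents.
sentences = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps",
    "the dog barks",
]


def map_words(sentence):
    """Map step: emit a (word, 1) pair for every word in one sentence."""
    return [(word, 1) for word in sentence.split()]


def shuffle(pairs):
    """Group emitted values by key, as the engine does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(word, counts):
    """Reduce step: collapse all counts for one word into a single total."""
    return sum(counts)


def run_node(node_sentences):
    """Simulate one node: map its sentences, then reduce its own partial groups."""
    pairs = [pair for s in node_sentences for pair in map_words(s)]
    return {w: reduce_counts(w, c) for w, c in shuffle(pairs).items()}


# Two simulated nodes each handle half of the sentences independently.
partials = [run_node(sentences[:2]), run_node(sentences[2:])]

# The partial results are merged by reducing again, key by key.
merged_pairs = [(w, c) for partial in partials for w, c in partial.items()]
word_counts = {w: reduce_counts(w, c) for w, c in shuffle(merged_pairs).items()}

print(word_counts)
```

Because the second reduce receives already-reduced counts, this only works when the reduce function can safely consume its own output.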

Map-Reduce Distribution

 

Some notes you need to know before developing your map and reduce functions:

  •  The MapReduce engine may invoke reduce functions iteratively; thus, these functions must be idempotent. That is, the following must hold for your reduce function:

                 for all k,vals : reduce( k, [reduce(k,vals)] ) == reduce(k,vals)

  •  Currently, the return value from a reduce function cannot be an array (it's typically an object or a number)
  • If you need to perform an operation only once, use a finalize function.
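The notes above can be checked with a small experiment in plain Python. This is a sketch under assumptions, not MongoDB code: `sum_reduce`, `len_reduce`, `stats_reduce` and `finalize_average` are hypothetical names, and the sample values are invented. It tests the rule reduce(k, [reduce(k, vals)]) == reduce(k, vals) on two candidate reduce functions, and shows why a one-off operation such as a division belongs in a finalize step.

```python
def sum_reduce(key, values):
    """Adds counts; feeding its own output back in gives the same answer."""
    return sum(values)


def len_reduce(key, values):
    """Counts the values it receives; wrong once the engine re-reduces."""
    return len(values)


def holds_rereduce_rule(reduce_fn, key, values):
    """Check reduce(k, [reduce(k, vals)]) == reduce(k, vals) for one input."""
    once = reduce_fn(key, values)
    return reduce_fn(key, [once]) == once


def stats_reduce(key, values):
    """Keeps running totals in the same shape as the values it receives."""
    return {
        "total": sum(v["total"] for v in values),
        "n": sum(v["n"] for v in values),
    }


def finalize_average(key, reduced):
    """Runs once per key after all reducing is done, so dividing is safe here."""
    return reduced["total"] / reduced["n"]


counts = [1, 1, 1, 1]
sum_ok = holds_rereduce_rule(sum_reduce, "the", counts)  # True
len_ok = holds_rereduce_rule(len_reduce, "the", counts)  # False

# Hypothetical readings, already in the {"total", "n"} shape a map step would emit.
readings = [{"total": 2, "n": 1}, {"total": 4, "n": 1}]
average = finalize_average("temp", stats_reduce("temp", readings))

print(sum_ok, len_ok, average)
```

Computing the average inside the reduce function would break the rule, because an average of averages is not the overall average; carrying the total and the count through reduce and dividing once at the end avoids that.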

Let's go to the code now. For this task I'll use PyMongo, which supports Map/Reduce. As I said earlier, the input text will be Obama's speech, which by the way has many repeated words. Take a look at the tag cloud of the speech (a cloud of words in which each word's font size is based on its frequency).

 

Obama's Speech in 2009

 

 

To write our map and reduce functions, MongoDB allows clients to send JavaScript map and reduce implementations that are evaluated and run on the server. Here is our map function.

 

 

wordMap.js

 

As you can see, the 'this' variable refers to the context from which the function is called. That is, MongoDB calls the map function on each document in the collection we are querying, with 'this' pointing to that document, so a field of the document such as 'text' can be accessed by calling this.text. The map function doesn't return a list; instead it calls an emit function, which it expects to be defined. The parameters of this function (key, value) are grouped with the intermediate results of other map evaluations that have the same key, giving (key, [value1, value2]), and passed to the reduce function that we will define now.
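The emit mechanism can be mimicked in plain Python to see what the server collects. The sketch below is a rough analogue of the behaviour just described, not the original wordMap.js: `word_map` takes the document explicitly (where the JavaScript version would use 'this'), the `emit` function and the `intermediate` store are stand-ins for what the server provides, and the word-splitting rule is an assumption.

```python
import re
from collections import defaultdict


def word_map(document, emit):
    """Analogue of the JavaScript map: `document` plays the role of `this`."""
    # Assumption: words are runs of letters/apostrophes, compared in lower case.
    for word in re.findall(r"[a-z']+", document["text"].lower()):
        emit(word, 1)


# A tiny stand-in for the server: collect the emitted values per key.
intermediate = defaultdict(list)


def emit(key, value):
    intermediate[key].append(value)


# Hypothetical documents with a 'text' field, as in the article's collection.
for doc in [{"text": "Yes we can"}, {"text": "Yes, we did."}]:
    word_map(doc, emit)

print(dict(intermediate))
```

Each key in `intermediate` ends up with the list of values emitted for it, which is the (key, [value1, value2]) shape handed to the reduce function.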

 

wordReduce.js

 

 

The reduce function must reduce a list of values of a chosen type to a single value of that same type; it must also be associative and commutative, so that it doesn't matter how the mapped items are grouped.

Now let's code our word count example using the PyMongo client, passing the map/reduce functions to the server.

 

 

mapReduce.py
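What follows is a hedged sketch of how such a script might look today, not the original mapReduce.py. The two JavaScript strings are assumed stand-ins for wordMap.js and wordReduce.js, and the connection URI, database name and collection name are hypothetical. Two caveats apply: the `Collection.map_reduce` helper that older PyMongo releases offered was removed in PyMongo 4.0, so this sketch sends the server's `mapReduce` command through `Database.command` instead, and MongoDB itself has deprecated `mapReduce` since version 5.0 in favour of the aggregation pipeline.

```python
# JavaScript sources sent to the server. These are assumed stand-ins for
# wordMap.js and wordReduce.js, not the original listings.
WORD_MAP_JS = """
function () {
    var words = this.text.toLowerCase().match(/[a-z']+/g) || [];
    for (var i = 0; i < words.length; i++) {
        emit(words[i], {count: 1});
    }
}
"""

WORD_REDUCE_JS = """
function (key, values) {
    var total = 0;
    for (var i = 0; i < values.length; i++) {
        total += values[i].count;
    }
    return {count: total};
}
"""


def build_options(map_js=WORD_MAP_JS, reduce_js=WORD_REDUCE_JS):
    """Keyword arguments for the server's mapReduce command, results inline."""
    return {"map": map_js, "reduce": reduce_js, "out": {"inline": 1}}


def word_counts(uri="mongodb://localhost:27017", db_name="test", coll_name="speech"):
    """Run the word count; uri, db_name and coll_name are hypothetical."""
    # Imported lazily so the definitions above load without PyMongo installed.
    from bson.code import Code
    from pymongo import MongoClient

    options = build_options()
    options["map"] = Code(options["map"])
    options["reduce"] = Code(options["reduce"])

    client = MongoClient(uri)
    try:
        reply = client[db_name].command("mapReduce", coll_name, **options)
    finally:
        client.close()
    # Inline output comes back as a list of {"_id": word, "value": {...}}.
    return {row["_id"]: row["value"]["count"] for row in reply["results"]}
```

Calling `word_counts()` requires a running MongoDB server that still accepts `mapReduce`, with a collection whose documents have a `text` field. Note that the reduce function returns an object of the same shape as the values emitted by map, so the server can safely feed its output back into it.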

 

 

Let's see the result now:

 

And it works! :D

With Map-Reduce, the word frequency count is expressed compactly, and because each key is reduced independently the work can be spread across a distributed environment. With this brief experiment we can see the potential of the map-reduce model for distributed computing, especially on large data sets.

All the code used in this article can be downloaded here.

My next posts will be about performance evaluation of machine learning techniques. Stay tuned!

Marcel Caraciolo


Time: 2024-08-01 18:36:35
