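Today we take a look at IPython, starting with the basics and working up to more advanced features. This article assumes that you already know IPython and have been using it for a while.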
%run
This is a magic command. It runs the code in your script and leaves the resulting variables in IPython's interactive namespace:
$cat t.py
# coding=utf-8
l = range(5)
$ipython
In [1]: %run t.py # the leading `%` is optional
In [2]: l # l was defined in t.py and can now be used here directly
Out[2]: [0, 1, 2, 3, 4]
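To make the behaviour of %run concrete, here is a minimal sketch in plain Python. It is not IPython's actual implementation: it assumes the effect described above can be approximated with the standard library's runpy.run_path. run_script is a hypothetical helper name, and the script is written for Python 3 (hence list(range(5))).

```python
import os
import runpy
import tempfile


def run_script(path, namespace):
    """Run the script at `path` and copy its top-level names into `namespace`."""
    result = runpy.run_path(path, run_name="__main__")
    # Skip dunder entries such as __name__ and __builtins__.
    namespace.update((k, v) for k, v in result.items() if not k.startswith("__"))
    return namespace


# Recreate t.py from the session above in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "t.py")
    with open(script, "w") as f:
        f.write("l = list(range(5))\n")
    ns = run_script(script, {})

print(ns["l"])
```

The point of the sketch is only that the script's variables end up in a namespace you keep working with, which is what makes %run convenient for interactive exploration.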
alias
In [3]: %alias largest ls -1sSh | grep %s
In [4]: largest to
total 42M
20K tokenize.py
16K tokenize.pyc
8.0K story.html
4.0K autopep8
4.0K autopep8.bak
4.0K story_layout.html
PS: an alias has to be stored, otherwise it is gone once IPython restarts:
In [5]: %store largest
Alias stored: largest (ls -1sSh | grep %s)
The next time you start IPython, run %store -r to restore it.
bookmark - aliases for directories
In [2]: %pwd
Out[2]: u'/home/vagrant'
In [3]: %bookmark dongxi ~/shire/dongxi
In [4]: %cd dongxi
/home/vagrant/shire/dongxi_code
In [5]: %pwd
Out[5]: u'/home/vagrant/shire/dongxi_code'
ipcluster - parallel computing
IPython also provides convenient support for parallel computing. First, what characterizes parallel computing with IPython:
1.
$wget http://www.gutenberg.org/files/27287/27287-0.txt
The first version is the straightforward, serial approach that everyone is used to.
In [1]: import re
In [2]: import io
In [3]: non_word = re.compile(r'[\W\d]+', re.UNICODE)
In [4]: common_words = {
...: 'the','of','and','in','to','a','is','it','that','which','as','on','by',
...: 'be','this','with','are','from','will','at','you','not','for','no','have',
...: 'i','or','if','his','its','they','but','their','one','all','he','when',
...: 'than','so','these','them','may','see','other','was','has','an','there',
...: 'more','we','footnote', 'who', 'had', 'been', 'she', 'do', 'what',
...: 'her', 'him', 'my', 'me', 'would', 'could', 'said', 'am', 'were', 'very',
...: 'your', 'did', 'not',
...: }
In [5]: def yield_words(filename):
...:     import io
...:     with io.open(filename, encoding='latin-1') as f:
...:         for line in f:
...:             for word in line.split():
...:                 word = non_word.sub('', word.lower())
...:                 if word and word not in common_words:
...:                     yield word
...:
In [6]: def word_count(filename):
...:     word_iterator = yield_words(filename)
...:     counts = defaultdict(int)
...:     while True:
...:         try:
...:             word = next(word_iterator)
...:         except StopIteration:
...:             break
...:         else:
...:             counts[word] += 1
...:     return counts
...:
In [6]: from collections import defaultdict # silly me, I forgot to import this earlier..
In [7]: %time counts = word_count(filename)
CPU times: user 88.5 ms, sys: 2.48 ms, total: 91 ms
Wall time: 89.3 ms
Now let's run it with IPython's parallel machinery:
ipcluster start -n 2 # well, my Mac is dual-core
First, the basics of parallel computing in IPython:
In [1]: from IPython.parallel import Client # the %px* magics only become usable after this import
In [2]: rc = Client()
In [3]: rc.ids # because I started 2 processes
Out[3]: [0, 1]
In [4]: %autopx # without autopx, every statement would need `%px xxx`
%autopx enabled
In [5]: import os # without autopx this would have to be `%px import os`
In [6]: print os.getpid() # the pids of the 2 processes
[stdout:0] 62638
[stdout:1] 62636
In [7]: %pxconfig --targets 1 # this magic is not available while autopx is on
[stderr:0] ERROR: Line magic function `%pxconfig` not found.
[stderr:1] ERROR: Line magic function `%pxconfig` not found.
In [8]: %autopx # running it a second time turns autopx off
%autopx disabled
In [10]: %pxconfig --targets 1 # select the target, so the code below runs only on the second process
In [11]: %%px --noblock # i.e. run a block of code without blocking
....: import time
....: time.sleep(1)
....: os.getpid()
....:
Out[11]: <AsyncResult: execute>
In [12]: %pxresult # look, only the second process's pid comes back
Out[1:21]: 62636
In [13]: v = rc[:] # use all processes; IPython gives fine-grained control over which engine runs what
In [14]: with v.sync_imports(): # import the time module in every process
....:     import time
....:
importing time on engine(s)
In [15]: def f(x):
....:     time.sleep(1)
....:     return x * x
....:
In [16]: v.map_sync(f, range(10)) # synchronous execution
Out[16]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
In [17]: r = v.map(f, range(10)) # asynchronous execution
In [18]: r.ready(), r.elapsed # the same style of usage as celery
Out[18]: (True, 5.87735)
In [19]: r.get() # fetch the results
Out[19]: [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
Now for the main topic:
In [20]: def split_text(filename):
....:     text = open(filename).read()
....:     lines = text.splitlines()
....:     nlines = len(lines)
....:     n = 10
....:     block = nlines//n
....:     for i in range(n):
....:         chunk = lines[i*block:(i+1)*(block)]
....:         with open('count_file%i.txt' % i, 'w') as f:
....:             f.write('\n'.join(chunk))
....:     cwd = os.path.abspath(os.getcwd())
....:     fnames = [ os.path.join(cwd, 'count_file%i.txt' % i) for i in range(n)] # built explicitly instead of with glob, to get exactly these files
....:     return fnames
In [21]: from IPython import parallel
In [22]: rc = parallel.Client()
In [23]: view = rc.load_balanced_view()
In [24]: v = rc[:]
In [25]: v.push(dict(
....: non_word=non_word,
....: yield_words=yield_words,
....: common_words=common_words
....: ))
Out[25]: <AsyncResult: _push>
In [26]: fnames = split_text(filename)
In [27]: def count_parallel():
.....:     pcounts = view.map(word_count, fnames)
.....:     counts = defaultdict(int)
.....:     for pcount in pcounts.get():
.....:         for k, v in pcount.iteritems():
.....:             counts[k] += v
.....:     return counts, pcounts
.....:
In [28]: %time counts, pcounts = count_parallel() # this timing includes my own aggregation step
CPU times: user 47.6 ms, sys: 6.67 ms, total: 54.3 ms # a lot less CPU time than the direct run, isn't it?
Wall time: 106 ms # this time is
In [29]: pcounts.elapsed, pcounts.serial_time, pcounts.wall_time
Out[29]: (0.104384, 0.13980499999999998, 0.104384)
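The split, map and merge steps above can be tried without a running cluster. The following is a rough standard-library analogue, not the IPython.parallel API: a ThreadPoolExecutor stands in for the load-balanced view, so it shows the shape of the computation rather than a real speed-up. The stop list is shortened, and chunk_lines, count_chunk and count_in_chunks are illustrative names. Unlike split_text, chunk_lines hands any leftover lines to the last chunk, which is an assumption about the desired behaviour.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

NON_WORD = re.compile(r"[\W\d]+", re.UNICODE)
COMMON_WORDS = {"the", "of", "and", "a", "to"}  # shortened stop list, illustration only


def chunk_lines(lines, n):
    """Split `lines` into n consecutive chunks; the last chunk takes any remainder."""
    block = len(lines) // n
    chunks = [lines[i * block:(i + 1) * block] for i in range(n - 1)]
    chunks.append(lines[(n - 1) * block:])
    return chunks


def count_chunk(lines):
    """Count the non-stop words in one chunk (the 'map' step)."""
    counts = Counter()
    for line in lines:
        for word in line.split():
            word = NON_WORD.sub("", word.lower())
            if word and word not in COMMON_WORDS:
                counts[word] += 1
    return counts


def count_in_chunks(lines, n_chunks=3, workers=2):
    """Count each chunk in a worker, then merge the partial counts (the 'reduce' step)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_chunk, chunk_lines(lines, n_chunks)))
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


text = ["The cat and the hat", "a cat sat", "the hat, the mat", "cat 42 naps"]
counts = count_in_chunks(text)
print(counts["cat"], counts["hat"])
```

Merging the partial results is the step that the %time measurement above includes on top of the engines' own work.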
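For more on parallel computing, see: nbviewer.ipython.org/url/www.astro.washington.edu/users/vanderplas/Astr599/notebooks/21_IPythonParallel.ipynb
Next, let's learn how to write IPython magic commands. So what is a magic? Magics are extension commands that ship with IPython, such as %history, %prun and %logstart.
To see them all, use lsmagic, which lists every available magic: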
%lsmagic
Magics come in 2 kinds:
line magic: utility commands
cell magic: mainly used to render output in IPython notebook pages and to run code in another language
idb - python db.py shell extension
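idb is a magic I wrote recently. Its main job is to expose the db.py interface inside IPython. Let's go straight to the code (I only excerpt a representative part):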
import os.path
from functools import wraps
from operator import attrgetter
from urlparse import urlparse
from db import DB # the interface provided by db.py
from IPython.core.magic import Magics, magics_class, line_magic # the three building blocks we need to write a magic extension
def get_or_none(attr):
    return attr if attr else None
def check_db(func):
    @wraps(func)
    def deco(*args):
        if args[0]._db is None: # every magic needs the db to be instantiated first, so a decorator does the check
            print '[ERROR]Please make connection: `con = %db_connect xx` or `%use_credentials xx` first!' # noqa
            return
        return func(*args)
    return deco
@magics_class # every magics class needs the magics_class decorator
class SQLDB(Magics): # must inherit from Magics
    _db = None # a fresh instance is created every time IPython starts
    @line_magic('db_connect') # line_magic marks this as a line magic (the other 2 kinds come later); the magic is named db_connect. Note that the function name does not matter:
    # in the end we type %db_connect, not %conn
    def conn(self, parameter_s): # every such method receives one argument: whatever you typed after the magic in IPython
        """Connect to database in ipython shell.
        Examples::
            %db_connect
            %db_connect postgresql://user:pass@localhost:port/database
        """
        uri = urlparse(parameter_s) # the rest is just the logic for parsing parameter_s
        if not uri.scheme:
            params = {
                'dbtype': 'sqlite',
                'filename': os.path.join(os.path.expanduser('~'), 'db.sqlite')
            }
        elif uri.scheme == 'sqlite':
            params = {
                'dbtype': 'sqlite',
                'filename': uri.path
            }
        else:
            params = {
                'username': get_or_none(uri.username),
                'password': get_or_none(uri.password),
                'hostname': get_or_none(uri.hostname),
                'port': get_or_none(uri.port),
                'dbname': get_or_none(uri.path[1:])
            }
        self._db = DB(**params) # this is where _db gets assigned
        return self._db # whatever is returned is received by IPython and displayed
    @line_magic('db') # a new magic called %db; beware of name clashes
    def db(self, parameter_s):
        return self._db
    @line_magic('table')
    @check_db
    def table(self, parameter_s):
        p = parameter_s.split() # several arguments may be passed in, but to IPython it is all one string, so split it on whitespace
        l = len(p)
        if l == 1:
            if not p[0]:
                return self._db.tables
            else:
                return attrgetter(p[0])(self._db.tables)
        else:
            data = self._db.tables
            for c in p:
                if c in ['head', 'sample', 'unique', 'count', 'all', 'query']:
                    data = attrgetter(c)(data)()
                else:
                    data = attrgetter(c)(data)
            return data
def load_ipython_extension(ipython): # register the magics; not needed if you add them from inside IPython directly
    ipython.register_magics(SQLDB)
PS:
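While debugging, you can reload the magic with %reload_ext idb.
After %install_ext, the extension is placed by default in the extensions directory under your IPython directory. For me that is ~/.ipython/extensions
There you go. IPython magics are not that hard after all, are they?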
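The attribute walk performed by the table magic can be tried outside IPython. The sketch below assumes a stand-in object built with SimpleNamespace instead of a real db.py DB instance; resolve, CALLABLE_STEPS and the person table are hypothetical names introduced only for illustration.

```python
from operator import attrgetter
from types import SimpleNamespace

# Names that are treated as methods and therefore called, as in the magic above.
CALLABLE_STEPS = {"head", "sample", "unique", "count", "all", "query"}


def resolve(root, parameter_s):
    """Walk `root` along the space-separated names in `parameter_s`."""
    data = root
    for name in parameter_s.split():
        data = attrgetter(name)(data)
        if name in CALLABLE_STEPS:
            data = data()  # these steps are methods, so call them
    return data


# Hypothetical stand-in for db.tables: one table with a head() method.
tables = SimpleNamespace(person=SimpleNamespace(head=lambda: ["alice", "bob"]))

print(resolve(tables, "person head"))
print(resolve(tables, "") is tables)
```

Because a line magic only ever receives one string, splitting it and resolving each piece by name is a simple way to map typed words onto attributes and method calls.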
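So what does IPython provide?
Types of magic decorators:
line_magic # the kind we just saw, i.e. %xx, where xx is the name of the magic
cell_magic # i.e. %%xx
line_cell_magic # works both as %xx and as %%xx
Let's start with cell_magic, with an example. Say I want to run some Ruby; normally that would be: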
In [1]: !ruby -e 'p "hello"'
"hello"
In [2]: %%ruby # this works too
...: p "hello"
...:
"hello"
And one for the notebook:
In [3]: %%javascript
...: require.config({
...: paths: {
...: chartjs: '//code.highcharts.com/highcharts'
...: }
...: });
...:
<IPython.core.display.Javascript object>
Then there is line_cell_magic:
In [4]: %time 2**128
CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.01 µs
Out[4]: 340282366920938463463374607431768211456L
In [5]: %%time
...: 2**128
...:
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 9.06 µs
Out[5]: 340282366920938463463374607431768211456L
PS: a line_cell_magic method takes 2 parameters:
@line_cell_magic
def xx(self, line='', cell=None):
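How such a method might tell its two calling modes apart can be sketched as a plain function, with no IPython required. This is an assumption about typical usage rather than code taken from IPython: with the signature above, cell keeps its default of None when the magic is invoked as %xx, and holds the cell body when invoked as %%xx. pick_source is a hypothetical name.

```python
def pick_source(line="", cell=None):
    """Return (mode, code), choosing the input the way a line/cell magic body might."""
    if cell is None:
        # Invoked as `%xx some code`: everything is on the line.
        return "line", line
    # Invoked as `%%xx options`: `line` is the rest of the first line,
    # `cell` is the body below it.
    return "cell", cell


print(pick_source("2**128"))
print(pick_source("", "2**128\n"))
```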
Magics that take options (I'll use magics from the IPython source itself to illustrate):
There are 2 styles:
Using getopt: self.parse_options
Using argparse: magic_arguments
self.parse_options
@line_cell_magic
def prun(self, parameter_s='', cell=None):
    opts, arg_str = self.parse_options(parameter_s, 'D:l:rs:T:q',
                                       list_all=True, posix=False)
    ...
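For getopt usage, see http://pymotw.com/2/getopt/index.html#module-getopt
Briefly: 'D:l:rs:T:q' means that the options -D, -l, -r, -s, -T and -q are accepted. A colon after a letter says that the option requires an argument. Split up, the spec reads D:, l:, r, s:, T:, q, so -r and -q take no argument while all the others do, as in %prun -D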
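The option string can be tried directly with the standard library's getopt module. This is only a sketch of the getopt part: parse_options adds IPython-specific behaviour (such as list_all) that is not reproduced here, and parse, out.prof and foo() are made-up names and values.

```python
import getopt
import shlex


def parse(parameter_s, spec="D:l:rs:T:q"):
    """Split a magic's argument string and parse it against a getopt spec."""
    opts, args = getopt.getopt(shlex.split(parameter_s), spec)
    return dict(opts), args


# -D takes an argument (colon in the spec); -r and -q are plain flags.
opts, args = parse("-D out.prof -r -q foo()")
print(opts)
print(args)
```

Options that take no argument show up with an empty string as their value, and whatever follows the options is returned untouched as the remaining arguments. Passing -D without a value raises getopt.GetoptError.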
magic_arguments
@magic_arguments.magic_arguments() # must be the topmost decorator
@magic_arguments.argument('--breakpoint', '-b', metavar='FILE:LINE',
    help="""
    Set break point at LINE in FILE.
    """
) # there can be several argument decorators like this one
@magic_arguments.argument('statement', nargs='*',
    help="""
    Code to run in debugger.
    You can omit this in cell magic mode.
    """
)
@line_cell_magic
def debug(self, line='', cell=None):
    args = magic_arguments.parse_argstring(self.debug, line) # the first argument must be this very method, here self.debug
    ...
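Since this style is built on argparse, the same two arguments can be declared with plain argparse to see what the parsed result looks like. This is a sketch under that assumption and does not use magic_arguments itself; the prog name and the sample input are made up.

```python
import argparse

# Mirror the two arguments declared by the decorators above.
parser = argparse.ArgumentParser(prog="debug", add_help=False)
parser.add_argument("--breakpoint", "-b", metavar="FILE:LINE",
                    help="Set break point at LINE in FILE.")
parser.add_argument("statement", nargs="*",
                    help="Code to run in debugger.")

args = parser.parse_args(["-b", "t.py:3", "l", "=", "range(5)"])
print(args.breakpoint)
print(" ".join(args.statement))
```

The optional flag becomes an attribute named after its long form, and nargs='*' collects the remaining words into a list, which is why the statement has to be joined back together before use.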
There is one more collection of magic methods, the magics for parallel computing: https://github.com/ipython/ipython/blob/master/IPython/parallel/client/magics.py