Extracting Information from Text With NLTK

Most real-world data is 'unstructured' (e.g., plain text documents) or 'semi-structured' (e.g., HTML), and extracting useful information from such data requires some technique. If all data were 'structured' (e.g., XML or a relational database), no special extraction would be needed: you could retrieve whatever information you want directly via the schema.

So let's discuss how to extract information from text with NLTK.

First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.

As described above, the information-extraction process has four steps: tokenization, part-of-speech tagging, named entity recognition, and relation recognition. Tokenization and POS tagging were covered earlier, so let's look in detail at how named entity recognition is implemented.
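The four-step pipeline above can be sketched end to end. The snippet below is a minimal, self-contained illustration using toy stand-ins (a naive sentence splitter, a regex tokenizer, and a tiny hand-written dictionary tagger) rather than NLTK's real components; its only purpose is to make the data flow between the stages concrete:

```python
import re

def segment_sentences(text):
    # Toy sentence segmenter: split after sentence-final punctuation.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize(sentence):
    # Toy tokenizer: words and punctuation become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

# Toy tag lookup standing in for a real POS tagger.
TAGS = {"the": "DT", "dog": "NN", "barked": "VBD", ".": "."}

def pos_tag(tokens):
    # Unknown words default to NN (noun).
    return [(tok, TAGS.get(tok.lower(), "NN")) for tok in tokens]

text = "The dog barked."
tagged = [pos_tag(tokenize(s)) for s in segment_sentences(text)]
print(tagged)  # a list of tagged sentences, the input the chunking step expects
```

With NLTK itself, the corresponding real components are `nltk.sent_tokenize`, `nltk.word_tokenize`, and `nltk.pos_tag`.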

Chunking

The basic technique we will use for entity recognition is chunking, which segments and labels multitoken sequences.

The most basic technique for entity recognition is chunking, which can be understood as grouping multiple tokens into phrases.

Noun Phrase Chunking

Let's start with the chunking of noun phrases as an example, i.e., NP-chunking.

One of the most useful sources of information for NP-chunking is part-of-speech tags.

>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"  # tag pattern: an optional determiner, any number of adjectives, then one noun
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)   # NP-chunk: the little yellow dog
  barked/VBD
  at/IN
  (NP the/DT cat/NN))                      # NP-chunk: the cat
The approach above uses regular expressions to express the tag pattern, and thereby finds the NP-chunks.
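To see why this works, note that a tag pattern like <DT>?<JJ>*<NN> is essentially a regular expression over the sequence of POS tags rather than over characters. The sketch below is a simplified illustration of that idea, not NLTK's actual code (though `RegexpParser` works on a similar principle internally): encode the tag sequence as a bracketed string and apply an ordinary regex to it.

```python
import re

tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
          ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
          ("the", "DT"), ("cat", "NN")]

# Encode the tag sequence as a string: '<DT><JJ><JJ><NN><VBD><IN><DT><NN>'
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# <DT>?<JJ>*<NN>  ->  optional determiner, any number of adjectives, one noun
pattern = r"(<DT>)?(<JJ>)*<NN>"

matches = [m.group() for m in re.finditer(pattern, tag_string)]
print(matches)  # ['<DT><JJ><JJ><NN>', '<DT><NN>'] -- the two NP-chunks
```

The angle brackets keep tags from bleeding into each other (e.g., `NN` matching inside `NNP`), which is why tag patterns delimit each tag with `<...>`.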

Here is another example: a grammar can contain multiple tag patterns and become more complex.

>>> grammar = r"""
... NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
...     {<NNP>+}                # chunk sequences of proper nouns
... """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)   # NP-chunk: Rapunzel
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))   # NP-chunk: her long golden hair
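The effect of having two rules can likewise be mimicked with regex alternation over an encoded tag string. Again, this is a simplified sketch of the idea, not NLTK's implementation:

```python
import re

tagged = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
          ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

# Encode the tags: '<NNP><VBD><RP><PP$><JJ><JJ><NN>'
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Two rules joined by alternation:
#   <DT|PP$>?<JJ>*<NN>  (determiner/possessive + adjectives + noun)
#   <NNP>+              (a run of proper nouns)
pattern = r"(<DT>|<PP\$>)?(<JJ>)*<NN>|(<NNP>)+"

matches = [m.group() for m in re.finditer(pattern, tag_string)]
print(matches)  # ['<NNP>', '<PP$><JJ><JJ><NN>'] -- the same two NP-chunks as above
```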

The next example shows how to find matching part-of-speech combinations in a corpus.

>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')  # find 'verb to verb' combinations
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         if subtree.label() == 'CHUNK': print(subtree)
...
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
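The same "verb to verb" search can be illustrated without downloading the Brown corpus, by encoding each sentence's tag sequence as a string and scanning it with an ordinary regex. The hand-tagged sentences below are toy stand-ins for Brown sentences, and the whole snippet is a simplified sketch of what the chunker loop above does:

```python
import re

# A few hand-tagged sentences standing in for the Brown corpus.
corpus = [
    [("they", "PPS"), ("wanted", "VBD"), ("to", "TO"), ("wait", "VB")],
    [("it", "PPS"), ("rained", "VBD"), ("all", "ABN"), ("day", "NN")],
    [("she", "PPS"), ("continued", "VBD"), ("to", "TO"), ("place", "VB"), ("orders", "NNS")],
]

# <V.*> <TO> <V.*> : a verb tag, then TO, then another verb tag.
pattern = re.compile(r"<V[^>]*><TO><V[^>]*>")

hits = []
for sent in corpus:
    tag_string = "".join(f"<{tag}>" for _, tag in sent)
    m = pattern.search(tag_string)
    if m:
        # Recover the matched words by counting tags before the match.
        start = tag_string[:m.start()].count("<")
        width = m.group().count("<")
        hits.append(" ".join(f"{w}/{t}" for w, t in sent[start:start + width]))

print("\n".join(hits))
# wanted/VBD to/TO wait/VB
# continued/VBD to/TO place/VB
```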

This article is taken from 博客园 (cnblogs); original publication date: 2011-07-04.
