Extracting Information from Text With NLTK

Most real-world data is 'unstructured' (e.g., plain text documents) or 'semi-structured' (e.g., HTML), and extracting useful information from such data requires some technique. If all data were 'structured' (e.g., XML or a relational database), no special extraction would be needed: you could retrieve whatever information you want directly via the schema.

So let's discuss how to extract information from text with NLTK.

First, the raw text of the document is split into sentences using a sentence segmenter, and each sentence is further subdivided into words using a tokenizer. Next, each sentence is tagged with part-of-speech tags, which will prove very helpful in the next step, named entity recognition. In this step, we search for mentions of potentially interesting entities in each sentence. Finally, we use relation recognition to search for likely relations between different entities in the text.

As described above, the information-extraction process has four steps: tokenization, part-of-speech tagging, named entity recognition, and relation recognition. Tokenization and POS tagging were covered earlier, so let's look in detail at how named entity recognition is implemented.
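The four-step pipeline above can be sketched end to end. The snippet below is a minimal, self-contained illustration using toy stand-ins (a naive sentence splitter, a regex tokenizer, and a tiny hand-written dictionary tagger) rather than NLTK's real components; its only purpose is to make the data flow between the stages concrete:

```python
import re

def segment_sentences(text):
    # Toy sentence segmenter: split after sentence-final punctuation.
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def tokenize(sentence):
    # Toy tokenizer: words and punctuation become separate tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

# Toy tag lookup standing in for a real POS tagger.
TAGS = {"the": "DT", "dog": "NN", "barked": "VBD", ".": "."}

def pos_tag(tokens):
    # Unknown words default to NN (noun).
    return [(tok, TAGS.get(tok.lower(), "NN")) for tok in tokens]

text = "The dog barked."
tagged = [pos_tag(tokenize(s)) for s in segment_sentences(text)]
print(tagged)  # a list of tagged sentences, the input the chunking step expects
```

With NLTK itself, the corresponding real components are `nltk.sent_tokenize`, `nltk.word_tokenize`, and `nltk.pos_tag`.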

Chunking

The basic technique we will use for entity recognition is chunking, which segments and labels multitoken sequences.

The most basic technique for entity recognition is chunking, which can be understood as grouping multiple tokens into phrases.

Noun Phrase Chunking

Let's start with the chunking of noun phrases as an example, i.e., NP-chunking.

One of the most useful sources of information for NP-chunking is part-of-speech tags.

>>> import nltk
>>> sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"), ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
>>> grammar = "NP: {<DT>?<JJ>*<NN>}"  # tag pattern: an optional determiner, any number of adjectives, then one noun
>>> cp = nltk.RegexpParser(grammar)
>>> result = cp.parse(sentence)
>>> print(result)
(S
  (NP the/DT little/JJ yellow/JJ dog/NN)   # NP-chunk: the little yellow dog
  barked/VBD
  at/IN
  (NP the/DT cat/NN))                      # NP-chunk: the cat
The approach above uses regular expressions to express the tag pattern, and thereby finds the NP-chunks.
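To see why this works, note that a tag pattern like <DT>?<JJ>*<NN> is essentially a regular expression over the sequence of POS tags rather than over characters. The sketch below is a simplified illustration of that idea, not NLTK's actual code (though `RegexpParser` works on a similar principle internally): encode the tag sequence as a bracketed string and apply an ordinary regex to it.

```python
import re

tagged = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"),
          ("dog", "NN"), ("barked", "VBD"), ("at", "IN"),
          ("the", "DT"), ("cat", "NN")]

# Encode the tag sequence as a string: '<DT><JJ><JJ><NN><VBD><IN><DT><NN>'
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# <DT>?<JJ>*<NN>  ->  optional determiner, any number of adjectives, one noun
pattern = r"(<DT>)?(<JJ>)*<NN>"

matches = [m.group() for m in re.finditer(pattern, tag_string)]
print(matches)  # ['<DT><JJ><JJ><NN>', '<DT><NN>'] -- the two NP-chunks
```

The angle brackets keep tags from bleeding into each other (e.g., `NN` matching inside `NNP`), which is why tag patterns delimit each tag with `<...>`.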

Here is another example: a grammar can contain multiple tag patterns and become more complex.

>>> grammar = r"""
... NP: {<DT|PP\$>?<JJ>*<NN>}   # chunk determiner/possessive, adjectives and noun
...     {<NNP>+}                # chunk sequences of proper nouns
... """
>>> cp = nltk.RegexpParser(grammar)
>>> sentence = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"), ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]
>>> print(cp.parse(sentence))
(S
  (NP Rapunzel/NNP)   # NP-chunk: Rapunzel
  let/VBD
  down/RP
  (NP her/PP$ long/JJ golden/JJ hair/NN))   # NP-chunk: her long golden hair
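The effect of having two rules can likewise be mimicked with regex alternation over an encoded tag string. Again, this is a simplified sketch of the idea, not NLTK's implementation:

```python
import re

tagged = [("Rapunzel", "NNP"), ("let", "VBD"), ("down", "RP"),
          ("her", "PP$"), ("long", "JJ"), ("golden", "JJ"), ("hair", "NN")]

# Encode the tags: '<NNP><VBD><RP><PP$><JJ><JJ><NN>'
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# Two rules joined by alternation:
#   <DT|PP$>?<JJ>*<NN>  (determiner/possessive + adjectives + noun)
#   <NNP>+              (a run of proper nouns)
pattern = r"(<DT>|<PP\$>)?(<JJ>)*<NN>|(<NNP>)+"

matches = [m.group() for m in re.finditer(pattern, tag_string)]
print(matches)  # ['<NNP>', '<PP$><JJ><JJ><NN>'] -- the same two NP-chunks as above
```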

The next example shows how to find matching part-of-speech combinations in a corpus.

>>> cp = nltk.RegexpParser('CHUNK: {<V.*> <TO> <V.*>}')  # find 'verb to verb' combinations
>>> brown = nltk.corpus.brown
>>> for sent in brown.tagged_sents():
...     tree = cp.parse(sent)
...     for subtree in tree.subtrees():
...         if subtree.label() == 'CHUNK': print(subtree)
...
(CHUNK combined/VBN to/TO achieve/VB)
(CHUNK continue/VB to/TO place/VB)
(CHUNK serve/VB to/TO protect/VB)
(CHUNK wanted/VBD to/TO wait/VB)
(CHUNK allowed/VBN to/TO place/VB)
(CHUNK expected/VBN to/TO become/VB)
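The same "verb to verb" search can be illustrated without downloading the Brown corpus, by encoding each sentence's tag sequence as a string and scanning it with an ordinary regex. The hand-tagged sentences below are toy stand-ins for Brown sentences, and the whole snippet is a simplified sketch of what the chunker loop above does:

```python
import re

# A few hand-tagged sentences standing in for the Brown corpus.
corpus = [
    [("they", "PPS"), ("wanted", "VBD"), ("to", "TO"), ("wait", "VB")],
    [("it", "PPS"), ("rained", "VBD"), ("all", "ABN"), ("day", "NN")],
    [("she", "PPS"), ("continued", "VBD"), ("to", "TO"), ("place", "VB"), ("orders", "NNS")],
]

# <V.*> <TO> <V.*> : a verb tag, then TO, then another verb tag.
pattern = re.compile(r"<V[^>]*><TO><V[^>]*>")

hits = []
for sent in corpus:
    tag_string = "".join(f"<{tag}>" for _, tag in sent)
    m = pattern.search(tag_string)
    if m:
        # Recover the matched words by counting tags before the match.
        start = tag_string[:m.start()].count("<")
        width = m.group().count("<")
        hits.append(" ".join(f"{w}/{t}" for w, t in sent[start:start + width]))

print("\n".join(hits))
# wanted/VBD to/TO wait/VB
# continued/VBD to/TO place/VB
```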

This article is taken from 博客园 (cnblogs); original publication date: 2011-07-04.
