boilerpipe (Boilerplate Removal and Fulltext Extraction from HTML pages): Source Code Analysis

The open-source Java library boilerpipe (version 1.1.0), http://code.google.com/p/boilerpipe/

Usage example:
import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

URL url = new URL("http://www.example.com/some-location/index.html");
// NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
String text = ArticleExtractor.INSTANCE.getText(url);
Let's start the analysis with ArticleExtractor. This class uses the singleton design pattern (the unique instance is obtained via INSTANCE), and the actual processing goes through the following steps.

HTML Parser 
The HTML Parser is based upon CyberNeko 1.9.13. It is called internally from within the Extractors.
The parser takes an HTML document and transforms it into a TextDocument, consisting of one or more TextBlocks. It knows about specific HTML elements (SCRIPT, OPTION etc.) that are ignored automatically.
Each TextBlock stores a portion of text from the HTML document. Initially (after parsing) almost every TextBlock represents a text section from the HTML document, except for a few inline elements that do not separate per definition (for example '<A>' anchor tags).
The TextBlock objects also store shallow text statistics for the block's content such as the number of words and the number of words in anchor text.

Extractors 
Extractors consist of one or more pipelined Filters. They are used to get the content of a webpage. Several different Extractors exist, ranging from a generic DefaultExtractor to extractors specific to news article extraction (ArticleExtractor).
ArticleExtractor.process() contains this filter pipeline. The design is highly extensible: the whole process is broken into small steps, each implemented separately, and at usage time the steps are assembled like building blocks into a processing chain. Extending or changing the process is then very simple: just add or replace one of the blocks.
This also makes multilingual extension convenient. For example, the filters used here come from the english package:
import de.l3s.boilerpipe.filters.english.IgnoreBlocksAfterContentFilter;
import de.l3s.boilerpipe.filters.english.KeepLargestFulltextBlockFilter;
To extend to another language, say Korean, you would add a korean package under the filters package, implement the corresponding filters there, and then only need to change the imports to support Korean.
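A new pipeline stage only needs to implement the BoilerpipeFilter interface, whose single process() method returns whether the document was changed. A minimal sketch of a hypothetical filter (not part of boilerpipe, written against the public 1.1.0 API):

import de.l3s.boilerpipe.BoilerpipeFilter;
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.document.TextBlock;
import de.l3s.boilerpipe.document.TextDocument;

// Hypothetical example: demote very short content blocks to boilerplate.
public final class MinWordsFilter implements BoilerpipeFilter {
    public static final MinWordsFilter INSTANCE = new MinWordsFilter();

    public boolean process(final TextDocument doc)
            throws BoilerpipeProcessingException {
        boolean changes = false;
        for (TextBlock tb : doc.getTextBlocks()) {
            if (tb.isContent() && tb.getNumWords() < 3) {
                tb.setIsContent(false);
                changes = true;
            }
        }
        return changes;
    }
}

Such a filter can then be chained into a pipeline exactly like the built-in ones below.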

 TerminatingBlocksFinder.INSTANCE.process(doc)
                | new DocumentTitleMatchClassifier(doc.getTitle()).process(doc)
                | NumWordsRulesClassifier.INSTANCE.process(doc)
                | IgnoreBlocksAfterContentFilter.DEFAULT_INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1.process(doc)
                | BoilerplateBlockFilter.INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
                | KeepLargestFulltextBlockFilter.INSTANCE.process(doc)
                | ExpandTitleToContentFilter.INSTANCE.process(doc);
Let's now walk through each stage of this pipeline.

TerminatingBlocksFinder
Finds blocks which are potentially indicating the end of an article text and marks them with {@link DefaultLabels#INDICATES_END_OF_TEXT}. This can be used in conjunction with a downstream {@link IgnoreBlocksAfterContentFilter}. (That is, IgnoreBlocksAfterContentFilter must run as its downstream.)

The principle is simple: for each block with tb.getNumWords() < 20, check whether it satisfies any of the following conditions,
text.startsWith("Comments")
                        || N_COMMENTS.matcher(text).find() //N_COMMENTS = Pattern.compile("(?msi)^[0-9]+ (Comments|users responded in)")
                        || text.contains("What you think...")
                        || text.contains("add your comment")
                        || text.contains("Add your comment")
                        || text.contains("Add Your Comment")
                        || text.contains("Add Comment")
                        || text.contains("Reader views")
                        || text.contains("Have your say")
                        || text.contains("Have Your Say")
                        || text.contains("Reader Comments")
                        || text.equals("Thanks for your comments - this feedback is now closed")
                        || text.startsWith(" Reuters")
                        || text.startsWith("Please rate this")
If it does, the block is taken to mark the end of the article and is labeled accordingly: tb.addLabel(DefaultLabels.INDICATES_END_OF_TEXT);
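Putting it together, process() boils down to the following loop (a simplified sketch of the 1.1.0 source, with the condition list above abbreviated):

public boolean process(TextDocument doc) throws BoilerpipeProcessingException {
    boolean changes = false;
    for (TextBlock tb : doc.getTextBlocks()) {
        if (tb.getNumWords() < 20) {
            final String text = tb.getText().trim();
            if (text.startsWith("Comments") /* || ... the conditions above ... */) {
                tb.addLabel(DefaultLabels.INDICATES_END_OF_TEXT);
                changes = true;
            }
        }
    }
    return changes;
}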

DocumentTitleMatchClassifier
This one is straightforward: it uses the content of '<title>' to locate and label the title inside the page. From the '<title>' text it generates a list of potentialTitles, then matches them against the blocks; a matching block is labeled DefaultLabels.TITLE.
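As a rough illustration of how such a candidate list can be built (a hypothetical sketch, not the exact 1.1.0 code; page titles often carry a site name after a separator):

// Hypothetical: derive candidate titles from the <title> text.
String title = "Some Headline | Example News"; // doc.getTitle() in the real filter
java.util.Set<String> potentialTitles = new java.util.HashSet<String>();
potentialTitles.add(title);
for (String part : title.split("[|»:-]")) { // common site-name separators (assumption)
    part = part.trim();
    if (part.length() > 0) {
        potentialTitles.add(part);
    }
}
// a TextBlock whose text matches one of these candidates would be
// labeled DefaultLabels.TITLE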

NumWordsRulesClassifier
Classifies {@link TextBlock}s as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features" (WSDM 2010), particularly using number of words per block and link density per block.
This module implements a classifier that separates content from not-content; section 4.3 of the paper above describes how the classifier was built.
It is a decision tree, trained on a hand-labeled Google News corpus and then pruned: "Applying reduced-error pruning we were able to simplify the decision tree to only use 6 dimensions (2 features each for current, previous and next block) without a significant loss in accuracy."
Finally, the tree's decision procedure is written out as pseudocode. This is the biggest advantage of decision trees: the decision rules are human-readable, so they can be re-expressed in any programming language.
This module implements Algorithm 2, "Classifier based on Number of Words":

curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_numWords <= 16
| | | next_numWords <= 15
| | | | prev_numWords <= 4: BOILERPLATE
| | | | prev_numWords > 4: CONTENT
| | | next_numWords > 15: CONTENT
| | curr_numWords > 16: CONTENT
| prev_linkDensity > 0.555556
| | curr_numWords <= 40
| | | next_numWords <= 17: BOILERPLATE
| | | next_numWords > 17: CONTENT
| | curr_numWords > 40: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE

With the classifier in place, what remains is simply to classify and label every block.
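Transcribed into Java, the per-block decision looks roughly like this (a sketch; the real classify() walks prev/curr/next triples of TextBlocks):

// Returns true for CONTENT, false for BOILERPLATE (Algorithm 2 as code).
static boolean classify(TextBlock prev, TextBlock curr, TextBlock next) {
    if (curr.getLinkDensity() > 0.333333) {
        return false;                          // BOILERPLATE
    }
    if (prev.getLinkDensity() <= 0.555556) {
        if (curr.getNumWords() <= 16) {
            if (next.getNumWords() <= 15) {
                return prev.getNumWords() > 4; // <= 4: BOILERPLATE
            }
            return true;                       // CONTENT
        }
        return true;                           // CONTENT
    }
    if (curr.getNumWords() <= 40) {
        return next.getNumWords() > 17;        // <= 17: BOILERPLATE
    }
    return true;                               // CONTENT
}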

IgnoreBlocksAfterContentFilter
Marks all blocks as "non-content" that occur after blocks that have been marked {@link DefaultLabels#INDICATES_END_OF_TEXT}. These marks are ignored unless a minimum number of words in content blocks occur before this mark (default: 60). This can be used in conjunction with an upstream {@link TerminatingBlocksFinder}.

This module is the downstream counterpart of TerminatingBlocksFinder, i.e. it must run after it. It is quite simple: find the block labeled DefaultLabels#INDICATES_END_OF_TEXT and mark every block after it as boilerplate.
One exception: the end-of-text mark is ignored until the content blocks before it reach the minimum number of words (default: 60), so until then the filter keeps collecting text.
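A simplified sketch of that idea (the 1.1.0 source counts words somewhat differently, so treat this as approximate):

public boolean process(TextDocument doc) throws BoilerpipeProcessingException {
    boolean changes = false;
    int words = 0;            // words seen in content blocks so far
    boolean foundEnd = false;
    for (TextBlock tb : doc.getTextBlocks()) {
        if (tb.isContent()) {
            words += tb.getNumWords();
        }
        if (foundEnd) {
            tb.setIsContent(false); // everything after the end marker is boilerplate
            changes = true;
        } else if (words >= 60 && tb.hasLabel(DefaultLabels.INDICATES_END_OF_TEXT)) {
            foundEnd = true;
            tb.setIsContent(false); // the marker block itself is boilerplate too
            changes = true;
        }
    }
    return changes;
}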

BlockProximityFusion
Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit. This probably makes sense only in cases where an upstream filter already has removed some blocks.
This module merges blocks. The merge criterion is that the offsets of two blocks differ by at most 2, i.e. at most one block lies between them.
When contentOnly is set, it additionally checks that both blocks are labeled as content before fusing them.
// number of blocks lying strictly between prevBlock and block
int diffBlocks = block.getOffsetBlocksStart() - prevBlock.getOffsetBlocksEnd() - 1;
if (diffBlocks <= maxBlocksDistance)
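The surrounding loop looks roughly like this (a sketch of BlockProximityFusion.process(), where maxBlocksDistance and contentOnly are the instance's fields; contentOnly corresponds to the MAX_DISTANCE_1_CONTENT_ONLY instance):

TextBlock prevBlock = null;
for (Iterator<TextBlock> it = doc.getTextBlocks().iterator(); it.hasNext();) {
    TextBlock block = it.next();
    boolean fuse = false;
    if (prevBlock != null) {
        int diffBlocks = block.getOffsetBlocksStart()
                - prevBlock.getOffsetBlocksEnd() - 1;
        fuse = diffBlocks <= maxBlocksDistance
                && (!contentOnly || (prevBlock.isContent() && block.isContent()));
    }
    if (fuse) {
        prevBlock.mergeNext(block); // absorbs the text, stats and offsets
        it.remove();
    } else {
        prevBlock = block;
    }
}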

So where does a block's offset come from? Look at the code where blocks are constructed:
BoilerpipeHTMLContentHandler.flushBlock()
TextBlock tb = new TextBlock(textBuffer.toString().trim(), currentContainedTextElements, numWords, numLinkedWords, numWordsInWrappedLines, numWrappedLines, offsetBlocks);
offsetBlocks++;

And in the TextBlock constructor:
this.offsetBlocksStart = offsetBlocks;
this.offsetBlocksEnd = offsetBlocks;
As you can see, initially the block offsets simply increase, and as long as no fusion has happened, offsetBlocksStart equals offsetBlocksEnd.
So, as the class comment says, the merge criterion of this module only becomes meaningful after an upstream filter has removed some blocks; without any removal, every pair of adjacent blocks trivially satisfies the fusion condition. For example, with blocks 0..4: removing block 2 leaves blocks 1 and 3 with diffBlocks = 3 - 1 - 1 = 1, which MAX_DISTANCE_1 (limit 1) still fuses; removing blocks 2 and 3 leaves blocks 1 and 4 with diffBlocks = 2, which it does not.

Having read this code, I was puzzled: in the paper, fusion is based on text density, whereas here it only uses block offsets, which is weaker:
"There, adjacent text fragments of similar text density (interpreted as 'similar class') are iteratively fused until the blocks' densities (and therefore the text classes) are distinctive enough."
What I understand even less is how ArticleExtractor uses this module:
                  BlockProximityFusion.MAX_DISTANCE_1.process(doc)
                | BoilerplateBlockFilter.INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc)
BlockProximityFusion is invoked twice, once upstream and once downstream of BoilerplateBlockFilter (described in the next section). The call to BlockProximityFusion.MAX_DISTANCE_1_CONTENT_ONLY.process(doc) I can understand: after the non-content blocks have been removed, fuse the remaining blocks, e.g. two content blocks that were originally separated by an advertisement. Still, fusing by offset rather than by text density seems weaker to me.
But the earlier call, BlockProximityFusion.MAX_DISTANCE_1.process(doc), I really cannot make sense of (perhaps I am missing something). Why add this step? The only explanation is the intent to fuse some blocks not labeled as content into the content. Oddly, the fusion here is effectively unconditional (with no blocks removed yet, the offset check filters nothing): the current block is merged into the previous one as long as the current block is content. And why check only the current block? Shouldn't the previous block also have to be content for the fusion to make sense? To me this logic is simply not sound...

BoilerplateBlockFilter
Removes {@link TextBlock}s which have explicitly been marked as "not content"
Nothing much to say here: it iterates over the blocks and removes every block not marked as "content".

KeepLargestFulltextBlockFilter
Keeps the largest {@link TextBlock} only (by the number of words). In case of more than one block with the same number of words, the first block is chosen. All discarded blocks are marked "not content" and flagged as {@link DefaultLabels#MIGHT_BE_CONTENT}
Easy to understand: the block with the most words is kept as the article body; the discarded blocks are marked "not content" and labeled DefaultLabels#MIGHT_BE_CONTENT.

ExpandTitleToContentFilter
Marks all {@link TextBlock}s "content" which are between the headline and the part that has already been marked content, if they are marked {@link DefaultLabels#MIGHT_BE_CONTENT}. This filter is quite specific to the news domain.
The logic: find the block labeled DefaultLabels.TITLE and the block where the content starts; every block between the two that is labeled MIGHT_BE_CONTENT is relabeled as content.
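A sketch of that logic (simplified from the 1.1.0 source):

// find the title block and the first content block, then promote the
// MIGHT_BE_CONTENT blocks between them
int title = -1, contentStart = -1, i = 0;
for (TextBlock tb : doc.getTextBlocks()) {
    if (contentStart == -1 && tb.hasLabel(DefaultLabels.TITLE)) {
        title = i;
    }
    if (contentStart == -1 && tb.isContent()) {
        contentStart = i;
    }
    i++;
}
if (title != -1 && contentStart > title) {
    for (TextBlock tb : doc.getTextBlocks().subList(title, contentStart)) {
        if (tb.hasLabel(DefaultLabels.MIGHT_BE_CONTENT)) {
            tb.setIsContent(true);
        }
    }
}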

TextDocument.getContent() 
The last step is to emit the extracted content as text: iterate over every block marked as content and append its text to the output.
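If you want to drive the pipeline step by step yourself (for example to splice in your own filters), the same flow can be run manually; a sketch using the public SAX input API:

import java.net.URL;
import de.l3s.boilerpipe.document.TextDocument;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.sax.BoilerpipeSAXInput;
import de.l3s.boilerpipe.sax.HTMLFetcher;

public class ManualPipeline {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com/some-location/index.html");
        // parse the HTML into a TextDocument (a list of TextBlocks)
        TextDocument doc = new BoilerpipeSAXInput(
                HTMLFetcher.fetch(url).toInputSource()).getTextDocument();
        // run the filter pipeline; custom filters could be chained in here
        ArticleExtractor.INSTANCE.process(doc);
        // serialize all blocks still labeled as content
        System.out.println(doc.getContent());
    }
}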

DefaultExtractor 
Besides ArticleExtractor (which targets news), let's also look at the other commonly used extractor, DefaultExtractor:
SimpleBlockFusionProcessor.INSTANCE.process(doc)
                | BlockProximityFusion.MAX_DISTANCE_1.process(doc)
                | DensityRulesClassifier.INSTANCE.process(doc);
This one is comparatively simple, just three steps. The second step is odd: no upstream filter has marked anything as content yet, so that step does nothing.

SimpleBlockFusionProcessor
Merges two subsequent blocks if their text densities are equal.
It iterates over the blocks and merges two subsequent blocks whenever their text densities are equal.
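A sketch of that loop inside SimpleBlockFusionProcessor.process() (note the criterion is exact equality of the two densities):

TextBlock prevBlock = null;
for (Iterator<TextBlock> it = doc.getTextBlocks().iterator(); it.hasNext();) {
    TextBlock block = it.next();
    if (prevBlock != null
            && prevBlock.getTextDensity() == block.getTextDensity()) {
        prevBlock.mergeNext(block);
        it.remove();
    } else {
        prevBlock = block;
    }
}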

DensityRulesClassifier
Classifies {@link TextBlock}s as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper "Boilerplate Detection using Shallow Text Features", particularly using text densities and link densities.
Following the pattern of NumWordsRulesClassifier, this implements Algorithm 1, the "Densitometric Classifier", from the paper:
curr_linkDensity <= 0.333333
| prev_linkDensity <= 0.555556
| | curr_textDensity <= 9
| | | next_textDensity <= 10
| | | | prev_textDensity <= 4: BOILERPLATE
| | | | prev_textDensity > 4: CONTENT
| | | next_textDensity > 10: CONTENT
| | curr_textDensity > 9
| | | next_textDensity = 0: BOILERPLATE
| | | next_textDensity > 0: CONTENT
| prev_linkDensity > 0.555556
| | next_textDensity <= 11: BOILERPLATE
| | next_textDensity > 11: CONTENT
curr_linkDensity > 0.333333: BOILERPLATE

If you are interested, you can study the other extractors, or design one that fits your own needs.

This article was excerpted from 博客园 (cnblogs.com); originally published on 2011-07-05.
