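Anyone who has parsed HTML documents in Java has probably come across the open-source project htmlparser. I once published two articles about htmlparser on IBM developerWorks: "Extracting the Information You Need from HTML" and "Extending HTMLParser's Ability to Handle Custom Tags". I no longer use htmlparser, though. One reason is that it is rarely updated, but the main reason is that jsoup now exists.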
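jsoup is a Java HTML parser that can parse a URL or a string of HTML directly. It provides a very convenient API for extracting and manipulating data through DOM, CSS, and jQuery-like methods. The project describes itself as follows: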
jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jquery-like methods.
·parse HTML from a URL, file, or string
·find and extract data, using DOM traversal or CSS selectors
·manipulate the HTML elements, attributes, and text
·clean user-submitted content against a safe white-list, to prevent XSS
jsoup is designed to deal with all varieties of HTML found in the wild; from pristine and validating, to invalid tag-soup; jsoup will create a sensible parse tree.
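As a rough illustration of the first two points, here is a minimal sketch that parses a string of imperfect HTML (unclosed `<p>` tags, no `<html>` or `<body>`) and pulls out every link with a CSS selector. The class name `ParseAndSelect`, the helper `listLinks`, and the sample markup are made up for this example; it assumes the jsoup jar is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ParseAndSelect {
    // Hypothetical helper: returns one "text -> href" line per link found in the HTML.
    static String listLinks(String html) {
        // Jsoup.parse builds a full document tree even from incomplete markup.
        Document doc = Jsoup.parse(html);
        // CSS selector: every <a> element that carries an href attribute.
        Elements links = doc.select("a[href]");
        StringBuilder out = new StringBuilder();
        for (Element link : links) {
            out.append(link.text()).append(" -> ").append(link.attr("href")).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample input invented for this sketch: note the unclosed <p> tags.
        String html = "<p>See <a href='http://jsoup.org/'>jsoup</a>"
                    + "<p>Docs: <a href='/docs'>the docs</a>";
        System.out.print(listLinks(html));
    }
}
```

The same `select` call works on a document fetched from a URL or read from a file; only the way the `Document` is obtained changes.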
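In summary, the main features of jsoup are:
1. Parse HTML from a URL, a file, or a string;
2. Find and extract data using DOM traversal or CSS selectors;
3. Manipulate HTML elements, attributes, and text.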
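The third point, manipulating elements, attributes, and text, might look something like the sketch below. The class name `ManipulateHtml`, the method `rewrite`, and the particular changes it makes are illustrative assumptions, not anything prescribed by jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ManipulateHtml {
    // Hypothetical helper: applies a few typical edits and returns the modified document.
    static Document rewrite(String html) {
        Document doc = Jsoup.parse(html);

        // Attributes: add rel="nofollow" to every link.
        for (Element link : doc.select("a[href]")) {
            link.attr("rel", "nofollow");
        }

        // Text and classes: replace the first heading's text and tag it with a class.
        Element title = doc.select("h1").first();
        if (title != null) {
            title.text("Hello, jsoup");
            title.addClass("title");
        }

        // Elements: append a new paragraph at the end of the body.
        doc.body().appendElement("p").text("Generated footer");
        return doc;
    }

    public static void main(String[] args) {
        Document doc = rewrite("<h1>Old title</h1><a href='/x'>x</a>");
        System.out.println(doc.body().html());
    }
}
```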
jsoup is released under the MIT license, so it can safely be used in commercial projects.
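This release adds single-pass selector evaluation for all complex queries, which significantly improves the performance of extracting elements from the DOM with CSS selectors. It also fixes a bug in Scala support, adds new HTML manipulation features, and includes other bug fixes.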
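The release notes do not spell out what counts as a "complex" query, so the following is only a guess at the kind of selector that benefits: one that combines child and descendant combinators with pseudo-selectors such as `:has` and `:contains`. The class `ComplexSelectors`, its methods, and the sample markup are invented for this sketch.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class ComplexSelectors {
    // Sample markup invented for this sketch.
    static final String SAMPLE =
          "<h2>About jsoup</h2>"
        + "<div class='news'><ul>"
        + "<li><a href='http://jsoup.org/'>home</a></li>"
        + "<li><a href='/local'>local</a></li>"
        + "<li>no link</li>"
        + "</ul></div>"
        + "<h2>Other</h2>";

    // List items under div.news > ul that contain a link whose href starts with "http".
    static Elements externalLinkItems(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("div.news > ul li:has(a[href^=http])");
    }

    // <h2> headings whose text contains the given word (matching ignores case).
    static Elements headingsMentioning(String html, String word) {
        return Jsoup.parse(html).select("h2:contains(" + word + ")");
    }

    public static void main(String[] args) {
        System.out.println(externalLinkItems(SAMPLE).text());
        System.out.println(headingsMentioning(SAMPLE, "jsoup").text());
    }
}
```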
Features, fixes, and improvements
·Added ability to change an element's tag with Element.tagName(String), and to change many at once with Elements.tagName(String).
·Added Node.wrap(String), Node.before(String), and Node.after(String), to allow HTML to be easily added to all nodes. These functions were previously supported on Elements only.
·Added TextNode.splitText(int), which allows a text node to be split into two nodes at a specified index point. This is convenient if you need to surround some text in an element.
·Updated Jsoup.connect so that cookies set on a redirect response will be included on both the redirected request and response.
·Infinite redirection loops in Jsoup.connect are now prevented.
·Allow Jsoup.connect to parse application/xml and application/xhtml+xml responses.
·Modified Jsoup.connect to always follow relative links, regardless of the underlying HTTP sub-system.
·Defined U (underline) element as an inline tag.
·Force strict entity matching (must be &xxx; and not &xxx) in element attributes.
·Implemented Elements.clone() (contributed by knz).
·Fixed tokeniser optimisation when scanning for missing data element close tags.
·Fixed issue when using descendant regex attribute selectors.
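To make a few of the additions above concrete, here is a small sketch that exercises `Element.tagName(String)`, `TextNode.splitText(int)`, `Node.wrap(String)`, `Node.before(String)`, and `Node.after(String)`. The class name `NewNodeFeatures`, the method `demo`, and the sample markup are assumptions made for illustration. The split-then-wrap sequence follows the use case mentioned in the notes, surrounding part of a text node with an element.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.TextNode;

public class NewNodeFeatures {
    static Document demo() {
        // Sample markup invented for this sketch.
        Document doc = Jsoup.parse("<div id='box'><i>old tag</i><p>Hello there now</p></div>");

        // Element.tagName(String): rename the <i> element to <em>.
        doc.select("i").first().tagName("em");

        // TextNode.splitText(int): cut "Hello there now" into "Hello ", "there", " now".
        Element p = doc.select("p").first();
        TextNode head = (TextNode) p.childNode(0);
        TextNode middle = head.splitText(6);   // head = "Hello ", middle = "there now"
        middle.splitText(5);                   // middle = "there", new sibling = " now"

        // Node.wrap(String): surround only the middle text node with <b>.
        middle.wrap("<b></b>");

        // Node.before(String) / Node.after(String): add sibling HTML around <p>.
        p.before("<hr>");
        p.after("<span>done</span>");
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(demo().body().html());
    }
}
```

`Elements.tagName(String)` works the same way as the single-element version but renames every element in the selection at once.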
Downloads:
·jsoup-1.5.1.jar: core library
·jsoup-1.5.1-sources.jar: optional sources jar
·jsoup-1.5.1-javadoc.jar: optional javadoc jar