Using XPATH and HTML Cleaner to parse HTML / XML
(使用 XPATH 和 HTML Cleaner 解析 HTML/XML)

太阳火神的美丽人生 (http://blog.csdn.net/opengl_es)

This article is published under the "Attribution - NonCommercial - ShareAlike" Creative Commons license.

Please keep this line when reposting: 太阳火神的美丽人生 - this blog focuses on agile development and research into mobile and IoT devices: iOS, Android, Html5, Arduino, pcDuino. Otherwise, articles from this blog may not be reposted or re-reposted. Thank you for your cooperation.

JANUARY 5, 2010

tags: android, examples, HTML, parse, scraping, XML, XPATH

Hey everyone,

So something that I’ve found to be extremely useful (especially in web related applications) is the ability to retrieve HTML from websites and parse it for data or whatever else you may be looking for (in my case it is almost always data).

I actually use this technique to do the real time stock/option imports for my Black-Scholes/Implied Volatility applications, so if you’re looking for an example on how to retrieve and parse HTML and run “queries” over it using, say, XPATH, then this post is for you.

Now, before we begin, in order to do this you will have to reference an external JAR in your project’s build path. The JAR that I use comes from HtmlCleaner, whose site even gives an example of how to use it (see the HtmlCleaner Example page), but in addition to that I’ll show you an example of how I use it.



import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.TagNode;
import org.htmlcleaner.XPatherException;

public class OptionScraper {

    // Example XPATH queries in the form of strings - will be used later
    private static final String NAME_XPATH = "//div[@class='yfi_quote']/div[@class='hd']/h2";
    private static final String TIME_XPATH = "//table[@id='time_table']/tbody/tr/td[@class='yfnc_tabledata1']";
    private static final String PRICE_XPATH = "//table[@id='price_table']//tr//span";

    // TagNode object, its use will come in later
    private static TagNode node;

    // A method that helps me retrieve the stock option's data based off the name
    // (i.e. GOUAA is one of Google's stock options)
    public static Option getOptionFromName(String name) throws XPatherException, IOException {

        // The Option object (my own data class) that the parsed values get stored in
        Option o = new Option(name);

        // The URL whose HTML I want to retrieve and parse
        String option_url = "http://finance.yahoo.com/q?s=" + name.toUpperCase();

        // This is where the HtmlCleaner comes in, I initialize it here
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // Open a connection to the desired URL
        URL url = new URL(option_url);
        URLConnection conn = url.openConnection();

        // Use the cleaner to "clean" the HTML and return it as a TagNode object
        node = cleaner.clean(new InputStreamReader(conn.getInputStream()));

        // Once the HTML is cleaned, you can run your XPATH expressions on the node,
        // which returns an array of Objects that get cast to TagNode below
        Object[] info_nodes = node.evaluateXPath(NAME_XPATH);
        Object[] time_nodes = node.evaluateXPath(TIME_XPATH);
        Object[] price_nodes = node.evaluateXPath(PRICE_XPATH);

        // A simple check to make sure that my XPATH was correct and that
        // an actual node (or nodes) was returned
        if (info_nodes.length > 0) {
            // Cast to a TagNode
            TagNode info_node = (TagNode) info_nodes[0];
            // How to retrieve the contents as a string
            String info = info_node.getChildren().iterator().next().toString().trim();

            // Some method that processes the string of information
            // (in my case, this was the stock quote, etc.)
            processInfoNode(o, info);
        }

        if (time_nodes.length > 0) {
            TagNode time_node = (TagNode) time_nodes[0];
            String date = time_node.getChildren().iterator().next().toString().trim();

            // The date is returned in 15-JAN-10 format, so this is some method I wrote
            // to parse that string into the format that I use
            processDateNode(o, date);
        }

        if (price_nodes.length > 0) {
            TagNode price_node = (TagNode) price_nodes[0];
            double price = Double.parseDouble(price_node.getChildren().iterator().next().toString().trim());
            o.setPremium(price);
        }

        return o;
    }
}

So that’s it! Once you include the JAR in your build path, everything else is pretty easy. It’s a great tool to use. It does require some knowledge of XPATH, but XPATH isn’t too hard to pick up and is useful to know, so if you don’t know it, take a look at the link.

Now, a warning to everyone. It’s documented that HtmlCleaner’s XPATH support is not complete, in the sense that only “basic” XPATH is recognized. What’s excluded? For instance, you can’t use any of the axis operators (i.e. parent, ancestor, following, following-sibling, etc.), but in my experience everything else is fair game. Yes, it’s a pain, and at times it can make your life a little harder, but usually it just requires you to be a tad more clever with your XPATH expressions before you can pull out the desired information.
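For example, an expression that leans on the parent:: axis can usually be rewritten with a child predicate instead. The sketch below uses the JDK’s built-in javax.xml.xpath engine (which, unlike HtmlCleaner, does support axes) just to show that both expressions select the same cell; the table snippet is made up for illustration:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class AxisRewriteDemo {

    // Evaluate a single XPATH expression against an XML string and
    // return the first matching node
    public static Node queryOne(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return (Node) XPathFactory.newInstance().newXPath()
                .evaluate(xpath, doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        // A tiny document shaped like the price table scraped above
        String xml = "<table id='price_table'>"
                   + "<tr><td><span>5.40</span></td></tr>"
                   + "</table>";

        // Axis version: HtmlCleaner would reject the parent:: axis
        Node withAxis = queryOne(xml, "//span/parent::td");

        // "Basic" rewrite: select the td directly, filtering with a
        // predicate on its child - no axes, so HtmlCleaner accepts it
        Node basic = queryOne(xml, "//td[span]");

        System.out.println(withAxis.getTextContent());  // prints "5.40"
        System.out.println(basic.getTextContent());     // prints "5.40"
    }
}
```

The same trick covers most ancestor-style lookups: instead of navigating up from the node you found, select the ancestor directly and constrain it with a predicate describing the descendant.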

And of course, this technique works for XML documents as well!
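In fact, for well-formed XML you can skip the cleaning step entirely and query it with the JDK’s built-in javax.xml.xpath. A minimal sketch - the quote document and its tag names are invented for illustration, not taken from any real feed:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XmlQuoteParser {

    // Pull every <price> under a <quote> out of an XML string with one XPATH query
    public static double[] extractPrices(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate("//quote/price", doc, XPathConstants.NODESET);

        double[] prices = new double[nodes.getLength()];
        for (int i = 0; i < prices.length; i++) {
            prices[i] = Double.parseDouble(nodes.item(i).getTextContent().trim());
        }
        return prices;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<quotes>"
                   + "<quote><symbol>GOUAA</symbol><price> 5.40 </price></quote>"
                   + "<quote><symbol>GOOG</symbol><price>601.00</price></quote>"
                   + "</quotes>";
        double[] prices = extractPrices(xml);
        System.out.println(prices[0] + " " + prices[1]);  // prints "5.4 601.0"
    }
}
```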

Hope this was helpful to everyone. Let me know if you’re confused anywhere.

- jwei
