Parsing HTML with HtmlAgilityPack


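// Download the page source
// Requires: using System; using System.IO; using System.Net; using System.Text;
// (Wrapped in a method so that the return statement compiles; the name GetPageSource is illustrative.)
static string GetPageSource(string url)
{
    Uri surl = new Uri(url);
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(surl);
    // The page is assumed to be GB2312-encoded. On .NET Core / .NET 5+ this encoding is only
    // available after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
    using (WebResponse response = request.GetResponse())
    using (Stream stream = response.GetResponseStream())
    using (StreamReader reader = new StreamReader(stream, Encoding.GetEncoding("gb2312")))
    {
        return reader.ReadToEnd();
    }
}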

// Locate the part of the page to extract (str holds the page source downloaded above)
HtmlAgilityPack.HtmlDocument html = new HtmlAgilityPack.HtmlDocument();
html.LoadHtml(str);
HtmlNode rootNode = html.DocumentNode;
HtmlNodeCollection categoryNodeList = rootNode.SelectNodes("//html[1]/body[1]/div[9]/div[1]/div[1]/div[1]/ul/li");
List<Category> list = new List<Category>();

// Loop over the extracted nodes. SelectNodes returns null when nothing matches.
if (categoryNodeList != null)
{
    foreach (HtmlNode categoryNode in categoryNodeList)
    {
        // Re-parse the <li> as a standalone fragment so that "//li/a[1]" searches only this item
        HtmlNode temp = HtmlNode.CreateNode(categoryNode.OuterHtml);
        HtmlNode singleNode = temp.SelectSingleNode("//li/a[1]");
        if (singleNode == null) continue; // skip items that contain no link

        Category category = new Category();
        category.IndexUrl = singleNode.GetAttributeValue("href", "");
        category.Subject = singleNode.GetAttributeValue("title", "");
        list.Add(category);
    }
}

public class Category
{
public string Subject { get; set; }
public string IndexUrl { get; set; }
}
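
The loop above re-parses each list item with HtmlNode.CreateNode because an XPath that starts with "//" searches the whole document rather than the current node. A minimal sketch of an alternative, assuming the HtmlAgilityPack NuGet package is referenced, uses a relative XPath ("./a[1]") instead. The CategoryParser class and its Parse method are hypothetical names introduced for illustration; the Category class is repeated so the block compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public class Category
{
    public string Subject { get; set; }
    public string IndexUrl { get; set; }
}

public static class CategoryParser
{
    // Hypothetical helper: builds one Category per <li> matched by listXPath.
    public static List<Category> Parse(string pageSource, string listXPath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(pageSource);

        var result = new List<Category>();
        HtmlNodeCollection items = doc.DocumentNode.SelectNodes(listXPath);
        if (items == null) return result; // SelectNodes returns null when nothing matches

        foreach (HtmlNode item in items)
        {
            // "./a[1]" is evaluated relative to the current <li>, so no re-parsing is needed
            HtmlNode link = item.SelectSingleNode("./a[1]");
            if (link == null) continue; // skip items that contain no link

            result.Add(new Category
            {
                IndexUrl = link.GetAttributeValue("href", ""),
                Subject = link.GetAttributeValue("title", "")
            });
        }
        return result;
    }
}
```

Calling CategoryParser.Parse(str, "//html[1]/body[1]/div[9]/div[1]/div[1]/div[1]/ul/li") should then fill the same kind of list as the loop above, without building a temporary node for every item.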

The extraction rule used above, //html[1]/body[1]/div[9]/div[1]/div[1]/div[1]/ul/li, is an XPath expression. Some common XPath patterns:

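/Articles/Article[1]: selects the first Article element that is a child of Articles.
/Articles/Article[last()]: selects the last Article element that is a child of Articles.
/Articles/Article[last()-1]: selects the second-to-last Article element that is a child of Articles.
/Articles/Article[position()<3]: selects the first two Article elements that are children of Articles.
//title[@lang]: selects all title elements that have an attribute named lang.
//CreateAt[@type='zh-cn']: selects all CreateAt elements whose type attribute has the value zh-cn.
/Articles/Article[Order>2]: selects all Article elements of Articles whose Order element has a value greater than 2.
/Articles/Article[Order<3]/Title: selects the Title elements of all Article elements of Articles whose Order element has a value less than 3.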
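
The positional and value predicates listed above can be tried out with the XPath support built into System.Xml, with no extra package. The following is a minimal sketch: the XPathDemo class, its Titles method, and the sample XML are all invented for illustration, with element names chosen to match the expressions above.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class XPathDemo
{
    // Sample data invented for this sketch; three articles with Order 1, 2 and 3.
    public const string SampleXml = @"<Articles>
  <Article><Title>First</Title><Order>1</Order></Article>
  <Article><Title>Second</Title><Order>2</Order></Article>
  <Article><Title>Third</Title><Order>3</Order></Article>
</Articles>";

    // Returns the Title text of every Article element matched by articleXPath.
    public static List<string> Titles(string articleXPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(SampleXml);

        var titles = new List<string>();
        foreach (XmlNode article in doc.SelectNodes(articleXPath))
        {
            titles.Add(article.SelectSingleNode("Title").InnerText);
        }
        return titles;
    }
}
```

With this sample, XPathDemo.Titles("/Articles/Article[position()<3]") returns First and Second, and XPathDemo.Titles("/Articles/Article[Order>2]") returns only Third.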

Time: 2025-01-30 14:51:03
