The code is as follows:

```csharp
using System.IO;
using System.IO.Compression;
using System.Net;
using System.Text;

// Map the charset reported by the server to a .NET Encoding.
public Encoding GetEncoding(string characterSet)
{
    // The charset may be null, and its casing varies ("GB2312", "UTF-8").
    switch ((characterSet ?? "").Trim().ToLowerInvariant())
    {
        case "gb2312":
            return Encoding.GetEncoding("gb2312");
        case "utf-8":
            return Encoding.UTF8;
        default:
            return Encoding.Default;
    }
}

// Download a page and decode it with the encoding the server declares.
public string HttpGet(string url)
{
    string responsestr = "";
    HttpWebRequest req = WebRequest.Create(url) as HttpWebRequest;
    req.Accept = "*/*";
    req.Method = "GET";
    req.UserAgent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1";
    using (HttpWebResponse response = req.GetResponse() as HttpWebResponse)
    {
        string contentEncoding = (response.ContentEncoding ?? "").ToLowerInvariant();
        Stream stream;
        if (contentEncoding.Contains("gzip"))
        {
            stream = new GZipStream(response.GetResponseStream(), CompressionMode.Decompress);
        }
        else if (contentEncoding.Contains("deflate"))
        {
            stream = new DeflateStream(response.GetResponseStream(), CompressionMode.Decompress);
        }
        else
        {
            stream = response.GetResponseStream();
        }
        // Disposing the reader also disposes the underlying stream.
        using (StreamReader reader = new StreamReader(stream, GetEncoding(response.CharacterSet)))
        {
            responsestr = reader.ReadToEnd();
        }
    }
    return responsestr;
}
```
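GetEncoding above falls back to Encoding.Default whenever the response header does not name gb2312 or utf-8, and that fallback can still garble a Chinese page. One possible complement, sketched below, is to look for a charset declaration in the page's own `<meta>` tag. The class `CharsetSniffer` and the method `SniffMetaCharset` are hypothetical names introduced only for this illustration; they are not part of the original code or of any library.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class CharsetSniffer
{
    // Hypothetical helper: looks in the first 4 KB of the raw body for a
    // declaration such as <meta charset="gb2312"> or
    // <meta http-equiv="Content-Type" content="text/html; charset=gb2312">.
    // Returns the lower-cased charset name, or null when none is found.
    public static string SniffMetaCharset(byte[] body)
    {
        int length = Math.Min(body.Length, 4096);
        // The declaration itself is plain ASCII, so decoding the head as
        // ASCII is enough to find it; other bytes simply become '?'.
        string head = Encoding.ASCII.GetString(body, 0, length);
        Match m = Regex.Match(
            head,
            @"<meta[^>]*charset\s*=\s*[""']?\s*([A-Za-z0-9_\-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value.ToLowerInvariant() : null;
    }

    static void Main()
    {
        byte[] sample = Encoding.ASCII.GetBytes(
            "<html><head><meta charset=\"GB2312\"></head><body></body></html>");
        Console.WriteLine(SniffMetaCharset(sample));
    }
}
```

To use this in HttpGet, the decompressed stream would first have to be copied into a MemoryStream so the raw bytes are available, and the sniffed name would then be passed to GetEncoding only when the header gives nothing useful. That wiring is left out here to keep the sketch short.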
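Calling HttpGet returns the source of the page at the given URL. With the source in hand, we bring in a great tool, HtmlAgilityPack, to parse the HTML. It doesn't matter if you can't write regular expressions; this thing is a lifesaver. The boss never has to worry about my regular expressions again.
As for how to use it, there are plenty of detailed articles on cnblogs (博客园), so I won't repeat them here.
The following scrapes the article list from the cnblogs home page: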
The code is as follows:

```csharp
using System.Text;
using HtmlAgilityPack;

string html = HttpGet("http://www.111cn.net/");
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

// Get the article list. SelectNodes returns null when nothing matches.
var artlist = doc.DocumentNode.SelectNodes("//div[@class='post_item']");
var htm = new StringBuilder();
if (artlist != null)
{
    foreach (var item in artlist)
    {
        // Search relative to the current item instead of re-parsing its InnerHtml.
        var html_a = item.SelectSingleNode(".//a[@class='titlelnk']");
        if (html_a == null) continue;
        htm.Append(string.Format("Title: {0}, link: {1}<br>",
            html_a.InnerText, html_a.GetAttributeValue("href", "")));
    }
}
```
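That's it: pages containing Chinese now come through without garbled text. I won't include screenshots of the output, and you can of course adjust the processing to suit your needs. The main point of this article is solving the Chinese garbled-text problem.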
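To see why the choice of encoding matters, here is a small sketch that needs no network access: it encodes a Chinese string as gb2312 bytes, then decodes those bytes once with the wrong encoding and once with the right one. The class `MojibakeDemo` and its methods are hypothetical names used only for this illustration. It assumes .NET Core 3.0 or later, where the gb2312 code page has to be registered first; on .NET Framework gb2312 is built in and the RegisterProvider line can be removed.

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Assumption: .NET Core 3.0+ / .NET 5+. Registering the provider more
    // than once is harmless.
    static Encoding Get(string charset)
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(charset);
    }

    // Encode text into bytes using the named charset.
    public static byte[] EncodeAs(string text, string charset)
    {
        return Get(charset).GetBytes(text);
    }

    // Decode bytes into text using the named charset.
    public static string DecodeAs(byte[] raw, string charset)
    {
        return Get(charset).GetString(raw);
    }

    static void Main()
    {
        // Bytes as a gb2312 server would send them.
        byte[] raw = EncodeAs("中文", "gb2312");

        string wrong = DecodeAs(raw, "utf-8");  // decoded with the wrong encoding
        string right = DecodeAs(raw, "gb2312"); // decoded with the declared encoding

        // Compare strings rather than printing them, since the console's own
        // encoding could garble the output as well.
        Console.WriteLine(wrong == "中文"); // False
        Console.WriteLine(right == "中文"); // True
    }
}
```

This is the same mismatch that HttpGet avoids by passing the result of GetEncoding(response.CharacterSet) to the StreamReader instead of relying on its default.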
Time: 2024-09-11 12:17:59