Problem description
- jsoup: "www" turns into "m" when fetching a web page
Document doc = Jsoup.connect(website).get(); where website = "http://www.huxiu.com/photo". The URL opens fine in a browser, but the call fails with this error:
org.jsoup.HttpStatusException: HTTP error fetching URL. Status=404 URL=http://m.huxiu.com/photo
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:435)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:446)
at org.jsoup.helper.HttpConnection$Response.execute(HttpConnection.java:410)
at org.jsoup.helper.HttpConnection.execute(HttpConnection.java:164)
at org.jsoup.helper.HttpConnection.get(HttpConnection.java:153)
at com.coship.crawler.crawler.parser.huxiu.HuxiuHomeProcessor.processor(HuxiuHomeProcessor.java:38)
at com.coship.crawler.crawler.work.FetchWorker.startDealJob(FetchWorker.java:76)
at com.coship.crawler.crawler.work.FetchWorker.run(FetchWorker.java:37)
at java.lang.Thread.run(Thread.java:662)
So the question is: the URL I passed was clearly "http://www.huxiu.com/photo", so how did it turn into "http://m.huxiu.com/photo"?
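Solution
This looks like a quirk of the website rather than of jsoup. jsoup follows redirects by default, and the server appears to redirect requests that do not send a desktop-browser User-Agent to its mobile host, m.huxiu.com, where /photo returns 404. Sending a browser-like User-Agent should work around the problem: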
Jsoup.connect("http://www.huxiu.com/photo").header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.111 Safari/537.36").get();
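
To confirm that a redirect is what changed the URL, you can compare the host you requested with the host reported in the error. The helper below is only a sketch using the standard library; the class and method names (RedirectCheck, hostChanged) are made up for illustration and are not part of jsoup.

```java
import java.net.URI;

// Hypothetical helper (not part of jsoup): reports whether a request
// ended up on a different host than the one that was asked for.
final class RedirectCheck {
    private RedirectCheck() {}

    // Returns true when the two absolute URLs point at different hosts,
    // e.g. www.huxiu.com versus m.huxiu.com. Host comparison ignores case.
    static boolean hostChanged(String requestedUrl, String finalUrl) {
        String requestedHost = URI.create(requestedUrl).getHost();
        String finalHost = URI.create(finalUrl).getHost();
        if (requestedHost == null || finalHost == null) {
            throw new IllegalArgumentException("Both URLs must be absolute and contain a host");
        }
        return !requestedHost.equalsIgnoreCase(finalHost);
    }
}
```

With jsoup, the second argument could come from HttpStatusException.getUrl() when the fetch fails, or from Connection.Response.url() when it succeeds. Calling followRedirects(false) on the connection is another way to see the redirect response itself instead of the page it points to.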