Fetching web pages with Ruby

One approach is to use Net::HTTP.new and call get, which is supposed to return the response and the actual data:

require 'net/http'

h = Net::HTTP.new("www.baidu.com",80)
resp,data = h.get("/")

puts resp
puts data

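The response comes back fine, but data is nil, and the same happens with other pages. It turns out the two-value return belongs to early versions of Ruby; newer versions return a single response object, and the page content is read through resp.body. What a trap: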

require 'net/http'

h = Net::HTTP.new("www.baidu.com",80)
resp = h.get "/"

puts resp.body
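
To see what that single return value carries, here is a minimal sketch. The throwaway TCPServer on 127.0.0.1 and its "hello" body are stand-ins invented for this example so that it runs offline; they are not part of the original post. Only Net::HTTP#get, resp.code, resp[] and resp.body carry over to a real site.

```ruby
require 'net/http'
require 'socket'

# Hypothetical stand-in server on a free loopback port (assumption for this sketch).
server = TCPServer.new("127.0.0.1", 0)
port = server.addr[1]

Thread.new do
  client = server.accept
  # Read the request line and headers up to the blank line that ends them.
  loop do
    line = client.gets
    break if line.nil? || line == "\r\n"
  end
  body = "hello"
  client.write("HTTP/1.1 200 OK\r\n" \
               "Content-Type: text/plain\r\n" \
               "Content-Length: #{body.bytesize}\r\n" \
               "Connection: close\r\n\r\n" + body)
  client.close
end

h = Net::HTTP.new("127.0.0.1", port)
resp = h.get("/")            # one Net::HTTPResponse object, not a pair

puts resp.code               # status code, as a String
puts resp["content-type"]    # header lookup is case-insensitive
puts resp.body               # the page content
```

Against a real host, the same three calls apply; only the host and port passed to Net::HTTP.new change.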

The following approach gives a similar result:

require 'net/http'
require 'uri'

resp = Net::HTTP.get_response(URI("http://www.baidu.com/"))
puts resp.body
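
Since get_response takes a URI object, it may help to look at what URI() makes of a bare host name. The sketch below only parses strings and makes no request; ensure_http is a hypothetical helper introduced for illustration, not something from the original post.

```ruby
require 'uri'

bare = URI("www.baidu.com")
full = URI("http://www.baidu.com/")

p bare.host   # nil: without a scheme the whole string is treated as a path
p full.host   # "www.baidu.com"

# Hypothetical helper: prepend http:// when the string has no http(s) scheme.
def ensure_http(str)
  str =~ %r{\Ahttps?://}i ? URI(str) : URI("http://#{str}")
end

p ensure_http("www.baidu.com").host
```

A helper like this is one way to accept a plain host name on the command line and still hand get_response a usable URI.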

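Note that the string passed to URI must start with http://; otherwise the request fails, since no host can be parsed from it. In real use we also need error handling and a timeout, or you may be left waiting indefinitely: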

#!/usr/bin/ruby

require 'uri'
require 'timeout'
require 'net/http'

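resp = nil	# declared here so it is still visible after the block

begin
	Timeout.timeout(5) {	# Kernel#timeout is gone in current Ruby; use Timeout.timeout
		h = Net::HTTP.new(ARGV[0],80)
		resp = h.get("/")
		#resp = Net::HTTP.get_response(URI("http://"+ARGV[0]+"/"))
	}
rescue => e
	puts e.inspect
	exit
end
puts resp.body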

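Running the script gives the following output (the Errno::EPIPE trace at the end appears because head closes the pipe after reading 2000 bytes):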

wisy@wisy-ThinkPad-X61:~/src/ruby_src$ ./x.rb www.baidu.com|head -c 2000
<!DOCTYPE html><!--STATUS OK--><html><head><meta http-equiv="content-type" content="text/html;charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=Edge"><meta content="always" name="referrer"><link rel="dns-prefetch" href="//s1.bdstatic.com"/><link rel="dns-prefetch" href="//t1.baidu.com"/><link rel="dns-prefetch" href="//t2.baidu.com"/><link rel="dns-prefetch" href="//t3.baidu.com"/><link rel="dns-prefetch" href="//t10.baidu.com"/><link rel="dns-prefetch" href="//t11.baidu.com"/><link rel="dns-prefetch" href="//t12.baidu.com"/><link rel="dns-prefetch" href="//b1.bdstatic.com"/><title>百度一下，你就知道</title>
<style index="index"  id="css_index">html,body{height:100%}html{overflow-y:auto}#wrapper{position:relative;_position:;min-height:100%}#head{padding-bottom:100px;text-align:center;*z-index:1}#ftCon{height:100px;position:absolute;bottom:44px;text-align:center;width:100%;margin:0 auto;z-index:0;overflow:hidden}#ftConw{width:720px;margin:0 auto}body{font:12px arial;text-align:;background:#fff}body,p,form,ul,li{margin:0;padding:0;list-style:none}body,form,#fm{position:relative}td{text-align:left}img{border:0}a{color:#00c}a:active{color:#f60}.bg{background-image:url(http://s1.bdstatic.com/r/www/cache/static/global/img/icons_3bfb8e45.png);background-repeat:no-repeat;_background-image:url(http://s1.bdstatic.com/r/www/cache/static/global/img/icons_f72fb1cc.gif)}.bg_tuiguang_browser{width:16px;height:16px;background-position:-600px 0;display:inline-block;vertical-align:text-bottom;font-style:normal;overflow:hidden;margin-right:5px}.bg_tuiguang_browser_big{width:56px;height:56px;position:absolute;left:10px;top:10px;background-position:-600px -24px}
.bg_tuiguang_weishi{width:56px;height:56px;position:absolute;left:10px;top:10px;background-position:-672px -24px}.c-icon{display:inline-block;width:14px;height:14px;vertical-align:text-bottom;font-style normal;overflow:hidden;background:url(http://s1.bdstatic.com/r/www/cache/static/global/img/icons_3bfb8e45../x.rb:19:in `write': Broken pipe @ io_write - <STDOUT> (Errno::EPIPE)
	from ./x.rb:19:in `puts'
	from ./x.rb:19:in `puts'
	from ./x.rb:19:in `<main>'
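
The script above bounds the whole request with a single 5-second timeout block. Net::HTTP also has its own open_timeout and read_timeout settings, which could be used instead. The following is a minimal sketch under that assumption: fetch_body is a hypothetical helper, and the silent TCPServer on 127.0.0.1 is a stand-in invented so the timeout can be seen offline.

```ruby
require 'net/http'
require 'socket'

# Hypothetical helper: returns the page body, or nil if the request
# fails or times out.
def fetch_body(host, port = 80, path = "/", seconds = 5)
  h = Net::HTTP.new(host, port)
  h.open_timeout = seconds   # limit for establishing the connection
  h.read_timeout = seconds   # limit for each read from the socket
  h.get(path).body
rescue Timeout::Error, SocketError, SystemCallError => e
  # Net::OpenTimeout and Net::ReadTimeout are both subclasses of Timeout::Error.
  warn e.inspect
  nil
end

# Stand-in server (assumption): it listens on a free loopback port but
# never answers, so the read timeout fires.
silent = TCPServer.new("127.0.0.1", 0)
result = fetch_body("127.0.0.1", silent.addr[1], "/", 1)
p result
```

Because Net::HTTP may retry an idempotent GET once after a read timeout, the call can take roughly twice the configured number of seconds before it gives up.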

Time: 2025-01-06 14:31:46
