An example of automatic crawling with a Scrapy spider

The Spider crawl process

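A Request is created from each initial URL and assigned a callback function. When the request has been downloaded, a response is generated and passed to that callback as its argument.
The spider's initial requests are obtained by calling start_requests(). By default, start_requests() reads the URLs in start_urls and generates a Request for each one, with parse as the callback.
Inside the callback, the returned (web page) content is parsed, and the callback returns Item objects, Request objects, or an iterable containing both. Scrapy then processes the returned Requests: it downloads their content and calls the callback set on each one (this may be the same function).
Inside the callback you can analyze the page content with Selectors (or BeautifulSoup, lxml, and similar tools) and build items from the extracted data.
Finally, the items returned by the spider are stored in a database (by an Item Pipeline) or written to a file using Feed exports.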
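The last step, storing items through an Item Pipeline, can be sketched roughly as follows. This is only a minimal illustration and not part of the original article: the class name SQLitePipeline, the db_path parameter, and the items table are all assumptions made for the example. The open_spider, process_item, and close_spider methods follow the long-standing pipeline interface. A pipeline is a plain Python class, so the sketch needs only the standard library.

```python
import sqlite3


class SQLitePipeline:
    """Hypothetical Item Pipeline that stores each item's title in SQLite."""

    def __init__(self, db_path=":memory:"):
        # db_path is an assumption for this sketch; a real project would
        # usually read the location from its Scrapy settings.
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called once when the spider starts: open the database.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS items (title TEXT)")

    def process_item(self, item, spider):
        # Called for every item the spider yields. The item is returned
        # so that any later pipeline can keep processing it.
        self.conn.execute(
            "INSERT INTO items (title) VALUES (?)", (item["title"],)
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called once when the spider finishes: release the connection.
        self.conn.close()
```

To use such a class it would be listed in the project's ITEM_PIPELINES setting, for example under a module path like myproject.pipelines.SQLitePipeline (the path is hypothetical).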
Spider example

The code is as follows:

import scrapy
from myproject.items import MyItem

class MySpider(scrapy.Spider):
    '''
    A single callback that yields multiple Request objects and Items.
    '''
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = [
        'http://www.example.com/1.html',
        'http://www.example.com/2.html',
        'http://www.example.com/3.html',
    ]

    def parse(self, response):
        # Yield one item per <h3> element on the page.
        for h3 in response.xpath('//h3').extract():
            yield MyItem(title=h3)

        # Follow every link on the page and parse it with this same callback.
        # response.urljoin() turns relative hrefs into absolute URLs.
        for url in response.xpath('//a/@href').extract():
            yield scrapy.Request(response.urljoin(url), callback=self.parse)
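One practical detail when following links this way: href attributes are often relative, while scrapy.Request expects an absolute URL. The sketch below is an illustration only. It uses the standard library's urljoin to show how relative links resolve against the page URL. The hrefs list is made-up sample data, not output from a real crawl.

```python
from urllib.parse import urljoin

# URL of the page the links were extracted from (taken from start_urls).
page_url = "http://www.example.com/1.html"

# Hypothetical href values as they might appear in the page's <a> tags.
hrefs = ["2.html", "/about/", "http://www.example.com/3.html"]

# Resolve each href against the page URL; absolute URLs pass through unchanged.
absolute = [urljoin(page_url, href) for href in hrefs]

for url in absolute:
    print(url)
```

Inside a spider callback, response.urljoin(href) serves the same purpose without importing anything extra.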

CrawlSpider example

The code is as follows:

# scrapy.contrib was removed from Scrapy; these are the current import paths.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    # The first rule extracts links matching category.php but not
    # subsection.php. It has no callback, so follow defaults to True and
    # the extracted links are followed.
    # The second rule extracts links matching item.php and parses them with
    # the spider's parse_item method.
    rules = (
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)

        # A plain dict is used here because a bare scrapy.Item() declares
        # no fields and would raise KeyError on assignment.
        item = {}
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
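To make the allow and deny patterns in the rules more concrete, here is a rough standard-library approximation of how such patterns filter URLs. It is a simplified sketch and not Scrapy's actual LinkExtractor code: the matches_rule helper and the sample URLs are invented for illustration. It assumes the patterns are applied as regular-expression searches against the full URL.

```python
import re


def matches_rule(url, allow=(), deny=()):
    """Return True if url matches an allow pattern and no deny pattern.

    Hypothetical helper: an empty allow tuple is treated as "allow all".
    """
    allowed = not allow or any(re.search(p, url) for p in allow)
    denied = any(re.search(p, url) for p in deny)
    return allowed and not denied


# Made-up sample links for the example.com site.
urls = [
    "http://www.example.com/category.php?id=1",
    "http://www.example.com/subsection.php?id=2",
    "http://www.example.com/item.php?id=3",
]

# First rule: category pages, excluding subsection pages.
category_links = [
    u for u in urls
    if matches_rule(u, allow=(r"category\.php",), deny=(r"subsection\.php",))
]

# Second rule: item pages, which would be handed to parse_item.
item_links = [u for u in urls if matches_rule(u, allow=(r"item\.php",))]

print(category_links)
print(item_links)
```

With these sample URLs only the category.php link passes the first rule and only the item.php link passes the second. The subsection.php link matches neither rule.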

Time: 2024-09-17 04:47:07
