Building a crawler with the Node.js core module http and the page-parsing tool cheerio

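I. Preface
Calling this a first look at crawlers is generous, since it uses no crawler-specific third-party library. It relies mainly on the Node.js core module http and the page-parsing tool cheerio: http fetches the page at a given URL directly, and cheerio then parses it. I retyped an example I had studied in order to understand it better. While coding, I tried for the first time to iterate over the object returned by a jQuery-style selector with forEach and got an error immediately. That object has no forEach method; only JavaScript arrays do.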
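The forEach error described above can be reproduced without any library. The following is a minimal sketch that uses a plain array-like object as a stand-in for a jQuery-style selection (the `selection` object and its values are made up for illustration). With cheerio itself, the usual options are the selection's own `.each()` method, or converting it with `.toArray()` before calling forEach.

```javascript
// A plain array-like object standing in for a jQuery/cheerio selection (hypothetical stand-in).
var selection = { 0: 'chapter 1', 1: 'chapter 2', length: 2 };

// Array-likes have indexed items and a length, but none of the Array.prototype methods.
var hasForEach = typeof selection.forEach === 'function';
console.log(hasForEach);

// Converting to a real array makes forEach available.
var titles = [];
Array.from(selection).forEach(function (title) {
  titles.push(title.toUpperCase());
});
console.log(titles);
```

Calling `selection.forEach(...)` directly would throw a TypeError, which matches the error described in the preface.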

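II. Key points
①: superagent, a tool for fetching web pages. I have not used it yet.
②: cheerio, a page-parsing tool. You can think of it as jQuery for the server, because it implements a subset of jQuery's API with the same syntax.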
Screenshots of the results

1. Fetching the whole page

2. The parsed data. The example shown is the one implemented in the case study.
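To make the shape of the parsed data concrete, here is a small sketch with made-up values. The chapter title, video titles, and ids are hypothetical and not taken from the real course page, and `formatCourseInfo` is an illustrative helper name rather than part of the original code.

```javascript
// Hypothetical sample in the shape produced by the parsing step; titles and ids are made up.
var sampleCourseData = [
  {
    chapterTitle: 'Chapter 1 Introduction',
    videos: [
      { title: '1-1 Overview', id: '1001' },
      { title: '1-2 Setup', id: '1002' }
    ]
  }
];

// Build the printed lines: one line per chapter title, then one indented line per video.
function formatCourseInfo(courseData) {
  var lines = [];
  courseData.forEach(function (item) {
    lines.push(item.chapterTitle);
    item.videos.forEach(function (video) {
      lines.push(' 【' + video.id + '】' + video.title);
    });
  });
  return lines;
}

var lines = formatCourseInfo(sampleCourseData);
console.log(lines.join('\n'));
```

Returning the lines instead of logging them inside the loop keeps the formatting step easy to check separately from the network and parsing steps.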

Source code walkthrough of the first crawler

var http=require('http');
var cheerio=require('cheerio');

var url='http://www.imooc.com/learn/348';

/****************************
Shape of the data that gets printed
[{
 chapterTitle:'',
 videos:[{
  title:'',
  id:''
 }]
}]
********************************/
function printCourseInfo(courseData){
 courseData.forEach(function(item){
  var chapterTitle=item.chapterTitle;
  console.log(chapterTitle+'\n');
  item.videos.forEach(function(video){
   console.log(' 【'+video.id+'】'+video.title+'\n');
  })
 });
}

/*************
Parse the data fetched from the page
**************/
function filterChapter(html){
 var courseData=[];

 var $=cheerio.load(html);
 var chapters=$('.chapter');
 chapters.each(function(index){ // cheerio passes (index, element); `this` is the element
  var chapter=$(this);
  var chapterTitle=chapter.find('strong').text(); // find the chapter title
  var videos=chapter.find('.video').children('li');

  var chapterData={
   chapterTitle:chapterTitle,
   videos:[]
  };

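  videos.each(function(index){
   var video=$(this).find('.studyvideo');
   var title=video.text();
   // Assumes href looks like '/video/<id>'; splitting on '/video/' keeps the leading slash out of the id
   var id=(video.attr('href')||'').split('/video/')[1];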

   chapterData.videos.push({
    title:title,
    id:id
   })
  })

  courseData.push(chapterData);
 });

 return courseData;
}

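http.get(url,function(res){
 var html='';

 // Decode chunks as UTF-8 so multi-byte characters split across chunks are not corrupted
 res.setEncoding('utf8');

 res.on('data',function(data){
  html+=data;
 })

 res.on('end',function(){
  var courseData=filterChapter(html);
  printCourseInfo(courseData);
 })
}).on('error',function(err){
 console.log('Failed to fetch course data: '+err.message);
})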
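The fetch step can also be tried offline. The sketch below wraps the same http.get pattern in a hypothetical helper, `fetchHtml`, and points it at a throwaway local server instead of the real course page. The helper name, the local server, and the status code check are assumptions added for illustration, not part of the original example. Note that http.get does not follow redirects on its own, so a non-200 response is reported as an error here.

```javascript
var http = require('http');

// Fetch a page over plain http and hand the full body to a callback (hypothetical helper).
function fetchHtml(url, callback) {
  http.get(url, function (res) {
    if (res.statusCode !== 200) {
      res.resume(); // discard the body so the socket is released
      callback(new Error('Unexpected status code: ' + res.statusCode));
      return;
    }
    var html = '';
    res.setEncoding('utf8'); // decode chunks as UTF-8 text
    res.on('data', function (chunk) { html += chunk; });
    res.on('end', function () { callback(null, html); });
  }).on('error', callback);
}

// Throwaway local server so the sketch runs without network access.
var server = http.createServer(function (req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/html; charset=utf-8',
    'Connection': 'close'
  });
  res.end('<div class="chapter"><strong>Chapter 1 你好</strong></div>');
});

// Port 0 asks the operating system for any free port.
server.listen(0, '127.0.0.1', function () {
  var port = server.address().port;
  fetchHtml('http://127.0.0.1:' + port + '/', function (err, html) {
    server.close();
    if (err) {
      console.log('fetch failed: ' + err.message);
      return;
    }
    console.log(html);
  });
});
```

In the real crawler, the string passed to the callback is what would be handed to filterChapter.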

References:
https://github.com/alsotang/node-lessons/tree/master/lesson3

http://www.imooc.com/video/7965


Time: 2024-11-09 00:44:32
