- Preface
- The Core
- Login module
- Backup module
- Blog scanning module
- Demo
- How to use
- Results
- Summary
Preface
Recently I have kept hearing group members and fellow bloggers discussing the same thing: "Why doesn't the CSDN blog have a backup feature?" To some extent this shows that people attach more and more importance to articles as knowledge products, and pay more attention to data safety.
So I tried writing such a tool, dedicated to backing up the blogs of CSDN bloggers.
The Core
Calling it the "core" makes it sound grander than it is. Strictly speaking it is just some code, hardly worthy of the name.
Login module
Why a login module is needed is probably the first thing you are wondering about.
The reason is that without logging in, the blog post interface will not return the article content. So, to keep things simple, the tool obtains a logged-in session and uses it to crawl the article content.
Don't worry about the safety of your username and password, though: this tool does not store any of your information, so you can use it with confidence (read the code if you don't believe me).
The login module code is also very simple; it just simulates logging in to CSDN.
# coding: utf8
# @Author: 郭 璞
# @File: login.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: CSDN login for returning the same session for backing up the blogs.
import requests
from bs4 import BeautifulSoup
import json
class Login(object):
    """
    Get the same session for backing up the blogs. Needs the username and password of your account.
    """
    def __init__(self, username, password):
        if username and password:
            self.username = username
            self.password = password
            # the common headers for this login operation.
            self.headers = {
                'Host': 'passport.csdn.net',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
            }
        else:
            raise Exception('Need Your username and password!')

    def login(self):
        loginurl = 'https://passport.csdn.net/account/login'
        # get the 'lt' token for the login webflow
        self.session = requests.Session()
        response = self.session.get(url=loginurl, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        # Assemble the data for the POST request used to log in.
        self.token = soup.find('input', {'name': 'lt'})['value']
        payload = {
            'username': self.username,
            'password': self.password,
            'lt': self.token,
            'execution': soup.find('input', {'name': 'execution'})['value'],
            '_eventId': 'submit'
        }
        response = self.session.post(url=loginurl, data=payload, headers=self.headers)
        # return the logged-in session
        return self.session if response.status_code == 200 else None

    def getSource(self, url):
        """
        Test helper; can be removed. (*^__^*)
        :param url:
        :return:
        """
        username, id = url.split('/')[3], url.split('/')[-1]
        backupurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
        tempheaders = self.headers
        tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
        tempheaders['Host'] = 'write.blog.csdn.net'
        tempheaders['X-Requested-With'] = 'XMLHttpRequest'
        response = self.session.get(url=backupurl, headers=tempheaders)
        soup = json.loads(response.text)
        return {
            'title': soup['data']['title'],
            'markdowncontent': soup['data']['markdowncontent'],
        }
Simulating the login gives us a session in a logged-in state, which is all we need; it will be used shortly.
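As a minimal usage sketch (the credentials below are placeholders, and the csdnbackup package path is the one used in Main.py further down):

from csdnbackup.login import Login

# placeholders: replace with your own CSDN account
session = Login('your_username', 'your_password').login()
if session is None:
    print('Login failed, please check the account name and password.')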
Backup module
At first I planned to fetch the page source directly, parse out the article body, and convert the HTML into a Markdown file. But for complex content the HTML is deeply nested, and tables in particular were more than I could handle, so that approach was technically too hard.
Then, quite by accident, I found that the JSON data for an article, including its title and the original Markdown content, can be fetched from this interface:
'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
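For reference, the response body of this interface looks roughly like the sketch below; only the two fields the tool actually reads are shown, and the values are placeholders:

# Illustrative shape of the JSON returned by the getArticle interface (other fields omitted)
article = {
    'data': {
        'title': '...',              # the blog post title
        'markdowncontent': '...',    # the original Markdown source of the post
    }
}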
This is extremely convenient. The concrete backup logic follows.
# coding: utf8
# @Author: 郭 璞
# @File: backup.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Back up the blog by fetching and storing the markdown file.
import json
import os
import re
class Backup(object):
    """
    Build the special url for getting the markdown file and save it to disk.
    """
    def __init__(self, session, backupurl):
        self.headers = {
            'Referer': 'http://write.blog.csdn.net/mdeditor',
            'Host': 'passport.csdn.net',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }
        # construct the url: get the article id and the username from a link like
        # http://blog.csdn.net/marksinoberg/article/details/70432419
        username, id = backupurl.split('/')[3], backupurl.split('/')[-1]
        self.backupurl = 'http://write.blog.csdn.net/mdeditor/getArticle?id={}&username={}'.format(id, username)
        self.session = session

    def getSource(self):
        # get title and content for the assigned url.
        tempheaders = self.headers
        tempheaders['Referer'] = 'http://write.blog.csdn.net/mdeditor'
        tempheaders['Host'] = 'write.blog.csdn.net'
        tempheaders['X-Requested-With'] = 'XMLHttpRequest'
        response = self.session.get(url=self.backupurl, headers=tempheaders)
        soup = json.loads(response.text)
        return {
            'title': soup['data']['title'],
            'markdowncontent': soup['data']['markdowncontent'],
        }

    def downloadpic(self, picurl, outputpath):
        tempheaders = self.headers
        tempheaders['Host'] = 'img.blog.csdn.net'
        tempheaders['Upgrade-Insecure-Requests'] = '1'
        response = self.session.get(url=picurl, headers=tempheaders)
        print(response.status_code)
        # normalize the path separator of your OS
        outputpath = outputpath.replace(os.sep, '/')
        print(outputpath)
        if response.status_code == 200:
            with open(outputpath, 'wb') as f:
                f.write(response.content)
            print("{} saved to {} successfully!".format(picurl, outputpath))
        else:
            raise Exception("Picture Url: {} downloading failed!".format(picurl))

    def getpicurls(self):
        # match Markdown image syntax ![alt](url) and capture the url
        pattern = re.compile(r"!\[.*?\]\((.*?)\)")
        markdowncontent = self.getSource()['markdowncontent']
        return re.findall(pattern=pattern, string=markdowncontent)

    def backup(self, outputpath='./'):
        try:
            source = self.getSource()
            foldername = source['title']
            foldername = os.path.join(outputpath, foldername)
            if not os.path.exists(foldername):
                os.mkdir(foldername)
            # write the markdown file
            filename = os.path.join(foldername, source['title'])
            with open(filename + ".md", 'w', encoding='utf8') as f:
                f.write(source['markdowncontent'])
            # save the pictures
            imgfolder = os.path.join(foldername, 'img')
            if not os.path.exists(imgfolder):
                os.mkdir(imgfolder)
            for index, picurl in enumerate(self.getpicurls()):
                imgpath = imgfolder + os.sep + str(index) + '.png'
                try:
                    self.downloadpic(picurl=picurl, outputpath=imgpath)
                except:
                    # may fail, e.g. requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
                    pass
        except Exception as e:
            print('Hmm, something went wrong again. Details: {}'.format(e))
            pass
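As a quick illustration of what getpicurls extracts, here is the image pattern applied to a made-up Markdown line (a standalone sketch, not part of the tool):

import re

pattern = re.compile(r"!\[.*?\]\((.*?)\)")
sample = "Some text ![screenshot](http://img.blog.csdn.net/20170428/demo.png) more text"
print(re.findall(pattern, sample))   # ['http://img.blog.csdn.net/20170428/demo.png']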
Blog scanning module
In principle the blog scanning module does not require a login: given your username it can walk through the list pages one by one and collect all of your blog links. Save those links, feed them into the backup logic above, and loop over them once.
# coding: utf8
# @Author: 郭 璞
# @File: blogscan.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: Scan your blog domain and get all the links of your blog posts.
import requests
from bs4 import BeautifulSoup
import re
class BlogScanner(object):
    """
    Scan for all blogs
    """
    def __init__(self, domain):
        self.username = domain
        self.rooturl = 'http://blog.csdn.net'
        self.bloglinks = []
        self.headers = {
            'Host': 'blog.csdn.net',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.110 Safari/537.36',
        }

    def scan(self):
        # get the page count
        response = requests.get(url=self.rooturl + "/" + self.username, headers=self.headers)
        soup = BeautifulSoup(response.text, 'html.parser')
        pagecontainer = soup.find('div', {'class': 'pagelist'})
        pages = re.findall(re.compile(r'(\d+)'), pagecontainer.find('span').get_text())[-1]
        # construct the list-page urls, like: http://blog.csdn.net/Marksinoberg/article/list/2
        for index in range(1, int(pages) + 1):
            # get the blog links on each list page
            listurl = 'http://blog.csdn.net/{}/article/list/{}'.format(self.username, str(index))
            response = requests.get(url=listurl, headers=self.headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            try:
                alinks = soup.find_all('span', {'class': 'link_title'})
                for alink in alinks:
                    link = alink.find('a').attrs['href']
                    link = self.rooturl + link
                    self.bloglinks.append(link)
            except Exception as e:
                print('Something unexpected happened!\n{}'.format(e))
                continue
        return self.bloglinks
With that, the three modules are done.
Demo
Next, a demonstration of how to use this tool.
How to use
- The first step, of course, is to download the source code.
- Then take the code below as a reference:
# coding: utf8
# @Author: 郭 璞
# @File: Main.py
# @Time: 2017/4/28
# @Contact: 1064319632@qq.com
# @blog: http://blog.csdn.net/marksinoberg
# @Description: The entrance of this blog backup tool.
from csdnbackup.login import Login
from csdnbackup.backup import Backup
from csdnbackup.blogscan import BlogScanner
import random
import time
import getpass
username = input('Please enter your account name: ')
password = getpass.getpass(prompt='Please enter your password: ')
loginer = Login(username=username, password=password)
session = loginer.login()
scanner = BlogScanner(username)
links = scanner.scan()
for link in links:
    backupper = Backup(session=session, backupurl=link)
    timefeed = random.choice([1, 3, 5, 7, 2, 4, 6, 8])
    print('Sleeping for a random {} seconds'.format(timefeed))
    time.sleep(timefeed)
    backupper.backup(outputpath='./')
- The last step:
python Main.py
Results
Here is what a run looks like.
- First, an "overview" (testing was not yet complete, so only these few were downloaded)
- Then a single article
- Then the article's Markdown content
- The pictures belonging to a single article
- Viewing a picture
Summary
Finally, a look at where this tool still falls short.
- Folder creation can fail because of characters in the blog title: this is currently only covered by exception handling (see the sketch after this list).
- Requesting too fast triggers countermeasures from the server: a random sleep was added, but that is not a real cure.
- There is no logging module yet; articles that fail to back up should be recorded, and after the backup run finishes the error log could be parsed to retry them.
- Testing is still insufficient; it runs on my machine, but other people may hit all sorts of strange problems.
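For the first point above, instead of only catching the exception, one option would be to sanitize the title before using it as a folder name. A minimal sketch (the character set and the replacement character are my own choice, not something the tool currently does):

import re

def safe_foldername(title):
    # replace characters that are not allowed in Windows/Unix folder names
    cleaned = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return cleaned if cleaned else 'untitled'

# e.g. safe_foldername('A/B: test?') -> 'A_B_ test_'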
Finally, here is the source code link; if you are interested, please give it a star.
https://github.com/guoruibiao/csdn-blog-backup-tool