python读取html中指定元素生成excle文件示例_python

Python2.7编写的读取html中指定元素,并生成excle文件

复制代码 代码如下:

#coding=gbk
import string
import codecs
import os,time
import xlwt
import xlrd
from bs4 import BeautifulSoup
from xlrd import open_workbook

class LogMsg:
        def __init__(self,logfile,Level=0):
                try:
                        import logging
                        #self.logger = None
                        self.logger = logging.getLogger()
                        self.hdlr = logging.FileHandler(logfile)
                        formatter = logging.Formatter("[%(asctime)s]: %(message)s","%Y%m%d %H:%M:%S")
                        self.hdlr.setFormatter(formatter)
                        self.logger.addHandler(self.hdlr)
                        #logger.setLevel()
                        if Level == 10:
                                self.logger.setLevel(logging.DEBUG)
                        elif Level == 20:
                                self.logger.setLevel(logging.INFO)
                        elif Level == 30:
                                self.logger.setLevel(logging.WARNING)
                        elif Level == 40:
                                self.logger.setLevel(logging.ERROR)
                        elif Level == 50:
                                self.logger.setLevel(logging.CRITICAL)
                        else:
                                self.logger.setLevel(logging.NOTSET)
                except:
                        print "log init error!"
                        exit(1)

        def output(self,logInfo):
                Level = self.logger.getEffectiveLevel()
                try:
                        if Level == 10:
                                self.logger.debug(logInfo)
                        elif Level == 20:
                                self.logger.info(logInfo)
                        elif Level == 30:
                                self.logger.warning(logInfo)
                        elif Level == 40:
                                self.logger.error(logInfo)
                        elif Level == 50:
                                self.logger.critical(logInfo)
                        else:
                                self.logger.info(logInfo)
                except:
                        print "log output error!"
                        exit(1)

        def close(self):
                try:
                #logging.shutdown([self.hdlr])
                        self.logger.removeHandler(self.hdlr)
                except:
                        print "log closed error!"
                        exit(1)

Logtime = time.strftime("%Y%m%d%H%M%S",time.localtime())
logFileTime = time.strftime("%Y%m%d",time.localtime())
Logfile = '/data/pyExample/logs/htmlparser_%s.log' % logFileTime
log = LogMsg(Logfile,20)

DATAPATH = '/data/pyExample/'
XLSname = 'dangjian_'+Logtime+'.xls'

if __name__ == '__main__':
   

    wbk = xlwt.Workbook(encoding = 'gbk')
    sheet = wbk.add_sheet('基本内容导入模板')
    sheet.write(0,0,'内容类型 ')
    sheet.write(0,1,'栏目名称')
    sheet.write(0,2,'栏目编号')
    sheet.write(0,3,'内容名称')
    sheet.write(0,4,'时长')
    sheet.write(0,5,'关键字')
    sheet.write(0,6,'看点')
    sheet.write(0,7,'作者')
    sheet.write(0,8,'来源')
    sheet.write(0,9,'子内容1')
    sheet.write(0,10,'子内容2')
    xlsContent = []  
    files = os.listdir(DATAPATH)
    k = 0
    for f in files: 
        if os.path.splitext(f)[1] == '.html':
            content=[]
            log.output('当前文件:'+f)
            htmlFile =codecs.open(DATAPATH+f,'r','gbk')
            lines = htmlFile.readlines()
            if not lines:
                log.output ('not line')
            for line in lines:
                if line.strip()=='\n':
                    log.output('该处是空行')
                else:
                    line = line.replace(' ','')
                    soup  = BeautifulSoup(line)
                    for tdd in soup.findAll('td'): 
                        #print tdd.text.encode("gbk")
                        content.append(tdd.text.encode("gbk"))      
                #print line.encode('gbk')
            htmlFile.close()   
            for i in content:
                print content.index(i),',',i
                log.output(i)
                log.output(content.index(i))
            print '----------------------------------------'
           

            folderName =  content[6]
            contentName=  content[4]      
            duration =    filter(str.isdigit, content[16])
            int_duration = string.atoi(duration)*60
            str_duration = "%i"%int_duration
            keyWord =     content[6]
            desciption =  content[36]
            videoName_1 = content[10]
            print folderName
            print contentName
            print str_duration
            print keyWord
            print desciption
            print videoName_1
            log.output('输出xls数据:'+','+folderName+',,'+contentName+','+str_duration+','+keyWord+','+desciption+',管理员,华数编辑,'+videoName_1+',,')
            print k           
            sheet.write(k+1,0,'')
            sheet.write(k+1,1,folderName)
            sheet.write(k+1,2,'')
            sheet.write(k+1,3,contentName)
            sheet.write(k+1,4,str_duration)
            sheet.write(k+1,5,keyWord)
            sheet.write(k+1,6,desciption)
            sheet.write(k+1,7,'管理员')
            sheet.write(k+1,8,'华数编辑')
            sheet.write(k+1,9,videoName_1)
            sheet.write(k+1,10,'')
            k+=1

    wbk.save(DATAPATH + XLSname)       

    print '=========================================' 

时间: 2024-09-22 07:39:57

python读取html中指定元素生成excle文件示例_python的相关文章

Python读取html中指定元素生成excle文件

#coding=gbk"'Created on 2013-7-23 @author: http://www.aliyun.com/zixun/aggregation/29781.html">chenc@wasu.com.cndescription: gen xls file from html file"' import stringimport codecsimport os,timeimport xlwtimport xlrdfrom bs4 import Beau

python统计字符串中指定字符出现次数的方法_python

本文实例讲述了python统计字符串中指定字符出现次数的方法.分享给大家供大家参考.具体如下: python统计字符串中指定字符出现的次数,例如想统计字符串中空格的数量 s = "Count, the number of spaces." print s.count(" ") x = "I like to program in Python" print x.count("i") PS:本站还提供了一个关于字符统计的工具,感兴

python查找目录下指定扩展名的文件实例_python

本文实例讲述了python查找目录下指定扩展名的文件.分享给大家供大家参考.具体如下: 这里使用python查找当前目录下的扩展名为.txt的文件 import os items = os.listdir(".") newlist = [] for names in items: if names.endswith(".txt"): newlist.append(names) print newlist 希望本文所述对大家的Python程序设计有所帮助. 以上是小编

Python中使用dom模块生成XML文件示例_python

在Python中解析XML文件也有Dom和Sax两种方式,这里先介绍如何是使用Dom解析XML,这一篇文章是Dom生成XML文件,下一篇文章再继续介绍Dom解析XML文件. 在生成XML文件中,我们主要使用下面的方法来完成. 主要方法 1.生成XML节点(node) 复制代码 代码如下: createElement("node_name") 2.给节点添加属性值(Attribute) 复制代码 代码如下: node.setAttribute("att_name",

jsp读取大对象CLOB并生成xml文件示例

js|xml|对象|生成xml|示例 <%@ page contentType="text/html; charset=gb2312" %><%@ page info="database handler"%><%@ page import="java.io.*"%><%@ page import="java.net.*"%><%@ page import="jav

jsp读取大对象CLOB并生成xml文件示例_JSP编程

<%@ page contentType="text/html; charset=gb2312" %> <%@ page info="database handler"%> <%@ page import="java.io.*"%> <%@ page import="java.net.*"%> <%@ page import="java.lang.*"%

Python解析xml中dom元素的方法_python

本文实例讲述了Python解析xml中dom元素的方法.分享给大家供大家参考.具体实现方法如下: 复制代码 代码如下: from xml.dom import minidom try:     xmlfile = open("path.xml", "a+")     #xmldoc = minidom.parse( sys.argv[1])     xmldoc = minidom.parse(xmlfile) except :     #updatelogger.

Python去除列表中重复元素的方法_python

本文实例讲述了Python去除列表中重复元素的方法.分享给大家供大家参考.具体如下: 比较容易记忆的是用内置的set l1 = ['b','c','d','b','c','a','a'] l2 = list(set(l1)) print l2 还有一种据说速度更快的,没测试过两者的速度差别 l1 = ['b','c','d','b','c','a','a'] l2 = {}.fromkeys(l1).keys() print l2 这两种都有个缺点,祛除重复元素后排序变了: ['a', 'c',

Python读取mp3中ID3信息的方法_python

本文实例讲述了Python读取mp3中ID3信息的方法.分享给大家供大家参考.具体分析如下: pyid3不好用,常常有不认识的. mutagen不错,不过默认带的easyid3不会读取注释,需要手工hack一下 Python代码如下: from mutagen.mp3 import MP3 import mutagen.id3 from mutagen.easyid3 import EasyID3 EasyID3.valid_keys["comment"]="COMM::'X