Yun Daima (云代码) - Python code library

Web Scraping with Python (python网络数据采集) 17: docx

2016-07-07, author: ME80

[python] Code library

from io import BytesIO
from urllib.request import urlopen
from zipfile import ZipFile

from bs4 import BeautifulSoup

# A .docx file is a ZIP archive; the body text lives in word/document.xml.
wordFile = urlopen("http://pythonscraping.com/pages/AWordDocument.docx").read()
wordFile = BytesIO(wordFile)  # wrap the downloaded bytes in a binary file-like object
document = ZipFile(wordFile)  # open the .docx as a ZIP archive
xml_content = document.read('word/document.xml')

# html.parser lowercases tag names, so <w:pStyle> is matched as "w:pstyle".
wordObj = BeautifulSoup(xml_content.decode('utf-8'), 'html.parser')
textStrings = wordObj.find_all("w:t")
for textElem in textStrings:
    closeTag = ""
    try:
        # Look in the previous sibling of the parent tag for the paragraph style
        style = textElem.parent.previous_sibling.find("w:pstyle")
        # <w:pstyle w:val="Title"> marks a title
        if style is not None and style.get("w:val") == "Title":
            print("<h1>")
            closeTag = "</h1>"
    except AttributeError:
        # No usable previous sibling: print the text without tags
        pass
    print(textElem.text)
    print(closeTag)
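
The snippet above needs network access and BeautifulSoup. As a rough offline sketch of the same idea (open the archive, read word/document.xml, mark paragraphs styled as Title), the version below swaps in the standard-library xml.etree.ElementTree parser. The sample XML and the helper names make_sample_docx and docx_to_lines are made up for illustration and are not part of the original post. The in-memory archive only mimics the layout of a .docx file; it is not a complete Word document.

```python
from io import BytesIO
from zipfile import ZipFile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}

# Hypothetical sample content, invented for this sketch
SAMPLE_XML = (
    '<w:document xmlns:w="' + W_NS + '"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Title"/></w:pPr>'
    '<w:r><w:t>A Word Document</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Plain paragraph text.</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def make_sample_docx():
    """Build a tiny in-memory ZIP that mimics the layout of a .docx file."""
    buffer = BytesIO()
    with ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", SAMPLE_XML)
    buffer.seek(0)
    return buffer


def docx_to_lines(file_obj):
    """Return one string per paragraph, wrapping Title paragraphs in <h1>."""
    with ZipFile(file_obj) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for para in root.iter("{%s}p" % W_NS):
        # Join all text runs (<w:t>) inside this paragraph
        text = "".join(t.text or "" for t in para.iter("{%s}t" % W_NS))
        # The paragraph style, if any, sits under <w:pPr><w:pStyle w:val="..."/>
        style = para.find("w:pPr/w:pStyle", NS)
        is_title = style is not None and style.get("{%s}val" % W_NS) == "Title"
        lines.append("<h1>%s</h1>" % text if is_title else text)
    return lines


if __name__ == "__main__":
    for line in docx_to_lines(make_sample_docx()):
        print(line)
```

Because ElementTree is namespace-aware, the style is looked up per paragraph rather than through sibling navigation, so the sketch does not depend on how the parser treats tag case.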
