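How do you use XPath in Python? If you are not sure, take a look at this walkthrough.
I. Introduction to XPath
XPath is a language for finding information in XML documents. It can be used to navigate through the elements and attributes of an XML document. XPath is a major element of the W3C XSLT standard, and both XQuery and XPointer are built on XPath expressions.
II. Installation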
pip3 install lxml
III. Usage
1. Import
from lxml import etree
2. Basic usage
from lxml import etree
wb_data = """
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</ul>
</div>
"""
html = etree.HTML(wb_data)
print(html)
result = etree.tostring(html)
print(result.decode("utf-8"))
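As the output below shows, the html we print is simply a Python object, while etree.tostring(html) serializes it back to HTML with the basic document structure filled in, including the tags the input was missing.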
<Element html at 0x39e58f0>
<html><body><div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a>
</li></ul>
</div>
</body></html>
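The repair behavior described above can be checked on a smaller fragment. This is a minimal sketch: the `broken` string is made up for illustration, and `encoding="unicode"` is used so that `tostring` returns a `str` directly instead of bytes.

```python
from lxml import etree

# Fragment with a deliberately unclosed <li>, mirroring the sample above.
broken = '<div><ul><li class="item-0"><a href="link5.html">fifth item</a></ul></div>'

# The HTML parser wraps the fragment in <html><body> and closes open tags.
root = etree.HTML(broken)

# encoding="unicode" returns a str, so no decode() call is needed.
fixed = etree.tostring(root, encoding="unicode")

print(root.tag)
print(fixed)
```

The printed markup should contain the closing `</li>` that the input left out.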
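3. Getting the content of a tag (basic usage). Note: when selecting all a tags, do not add a trailing slash after a, otherwise an error is raised.
Approach 1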
html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a')
print(html)
for i in html_data:
    print(i.text)
<Element html at 0x12fe4b8>
first item
second item
third item
fourth item
fifth item
Approach 2 (simply append /text() after the tag whose content you want)
html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a/text()')
print(html)
for i in html_data:
    print(i)
<Element html at 0x138e4b8>
first item
second item
third item
fourth item
fifth item
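The two approaches, and the trailing-slash error mentioned in the note, can be compared side by side. This is a sketch on a shortened, made-up document rather than the full sample. It catches the general `etree.XPathError`, which should cover the error lxml raises for an invalid expression.

```python
from lxml import etree

doc = etree.HTML('<ul><li><a href="link1.html">first item</a></li>'
                 '<li><a href="link2.html">second item</a></li></ul>')

# Approach 1: select the elements, then read each element's .text.
via_elements = [a.text for a in doc.xpath('/html/body/ul/li/a')]

# Approach 2: select the text() nodes, which come back as strings.
via_text = doc.xpath('/html/body/ul/li/a/text()')

# A path that ends in a slash is not a valid XPath expression.
try:
    doc.xpath('/html/body/ul/li/a/')
    trailing_slash_ok = True
except etree.XPathError as exc:
    trailing_slash_ok = False
    print("invalid expression:", exc)

print(via_elements)
print(via_text)
```

Both lists should hold the same link texts; the choice is mostly about whether you also need the element itself afterwards.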
4. Opening and reading an HTML file
# Use parse to open the HTML file
html = etree.parse('test.html')
html_data = html.xpath('//*')
# The result is a list, so iterate over it
print(html_data)
for i in html_data:
    print(i.text)
# Serialize the parsed tree back to a string
html = etree.parse('test.html')
html_data = etree.tostring(html, pretty_print=True)
res = html_data.decode('utf-8')
print(res)
Output:
<div>
<ul>
<li class="item-0"><a href="link1.html">first item</a></li>
<li class="item-1"><a href="link2.html">second item</a></li>
<li class="item-inactive"><a href="link3.html">third item</a></li>
<li class="item-1"><a href="link4.html">fourth item</a></li>
<li class="item-0"><a href="link5.html">fifth item</a></li>
</ul>
</div>
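By default etree.parse uses an XML parser, so it only accepts files that are well-formed, like the test.html printed above. For a file with missing closing tags, one option is to pass an `etree.HTMLParser()`. The sketch below writes its own temporary file with made-up content, so it does not depend on test.html existing.

```python
import os
import tempfile
from lxml import etree

# A small, deliberately malformed HTML file: the last <li> is never closed.
markup = ('<div><ul><li><a href="link1.html">first item</a></li>'
          '<li><a href="link2.html">second item</a></ul></div>')
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as fh:
    fh.write(markup)
    path = fh.name

try:
    # Passing an HTMLParser makes parse() tolerant of missing closing tags.
    tree = etree.parse(path, etree.HTMLParser())
    texts = tree.xpath('//li/a/text()')
finally:
    os.remove(path)  # clean up the temporary file

print(texts)
```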
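5. Printing an attribute of the a tags under a given path (iterate over the result to get each attribute value; the tag content can be looked up the same way)
html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a/@href')
for i in html_data:
    print(i)
Output: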
link1.html
link2.html
link3.html
link4.html
link5.html
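When both the link text and its href are needed, an alternative is to select the a elements and read the attribute with the element's `get` method. A minimal sketch on a shortened, made-up document:

```python
from lxml import etree

doc = etree.HTML('<ul><li><a href="link1.html">first item</a></li>'
                 '<li><a href="link2.html">second item</a></li></ul>')

# Pair each link's text with its href, read from the element itself.
links = [(a.text, a.get('href')) for a in doc.xpath('//li/a')]

print(links)
```

This keeps text and attribute together per element, which two separate /text() and /@href queries do not guarantee if some links lack one of them.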
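6. xpath always returns a list (of Element objects, or of strings for text() and attribute queries), so to get at the content we still need to iterate over that list.
Find the text of the a tag whose href attribute equals link2.html, using an absolute path.
html = etree.HTML(wb_data)
html_data = html.xpath('/html/body/div/ul/li/a[@href="link2.html"]/text()')
print(html_data)
for i in html_data:
    print(i)
Output:
['second item']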
second item
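If the attribute value comes from a variable, it does not have to be pasted into the expression string. lxml accepts XPath variables as keyword arguments to `xpath`. In the sketch below, `link_text` and `$target` are hypothetical names chosen for illustration, and the document is a shortened, made-up one.

```python
from lxml import etree

doc = etree.HTML('<ul><li><a href="link1.html">first item</a></li>'
                 '<li><a href="link2.html">second item</a></li></ul>')


def link_text(root, href):
    """Hypothetical helper: text of every <a> whose href equals `href`."""
    # $target is an XPath variable; lxml fills it from the keyword argument.
    return root.xpath('//a[@href=$target]/text()', target=href)


print(link_text(doc, 'link2.html'))
print(link_text(doc, 'missing.html'))
```

A value with no match simply yields an empty list rather than an error.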
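7. All of the lookups above use absolute paths (each one starts from the root). The following ones use relative paths, for example to find the content of the a tags under all li tags.
html = etree.HTML(wb_data)
html_data = html.xpath('//li/a/text()')
print(html_data)
for i in html_data:
    print(i)
Output:
['first item', 'second item', 'third item', 'fourth item', 'fifth item']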
first item
second item
third item
fourth item
fifth item
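8. Above we used an absolute path, starting with /, to get the href attribute values of all a tags. Now we use a relative path to get the href values of the a tags under li tags. Note that the double slash after a in this example is not required: //li/a/@href returns the same result here, while a//@href also matches href attributes on anything nested inside a.
html = etree.HTML(wb_data)
html_data = html.xpath('//li/a//@href')
print(html_data)
for i in html_data:
    print(i)
Output:
['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']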
link1.html
link2.html
link3.html
link4.html
link5.html
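The difference between a single and a double slash before @href only shows up when something nested inside the a tag also carries an href. The markup below is hypothetical and chosen purely to make that visible; a span does not normally have an href.

```python
from lxml import etree

# Hypothetical markup: the <span> inside the link carries its own href.
doc = etree.HTML('<ul><li><a href="outer.html">'
                 '<span href="inner.html">x</span></a></li></ul>')

# Single slash: only attributes of the <a> element itself.
single = doc.xpath('//li/a/@href')

# Double slash: the <a> element plus everything nested inside it.
double = doc.xpath('//li/a//@href')

print(single)
print(double)
```

On the tutorial's sample data no a tag has nested elements, so the two forms return the same five values there.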
9. Looking up a specific attribute value with a relative path works much the same way as with an absolute path.
html = etree.HTML(wb_data)
html_data = html.xpath('//li/a[@href="link2.html"]')
print(html_data)
for i in html_data:
    print(i.text)
Output:
[<Element a at 0x216e468>]
second item
10. Find the text of the a tag in the last li tag
html = etree.HTML(wb_data)
html_data = html.xpath('//li[last()]/a/text()')
print(html_data)
for i in html_data:
    print(i)
Output:
['fifth item']
fifth item
11. Find the text of the a tag in the second-to-last li tag
html = etree.HTML(wb_data)
html_data = html.xpath('//li[last()-1]/a/text()')
print(html_data)
for i in html_data:
    print(i)
Output:
['fourth item']
fourth item
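The positional predicates from the last two steps can be collected in one place. This sketch builds a made-up five-item list, and also shows `li[1]` and `position()`, which are standard XPath 1.0 but not used in the steps above. XPath positions start at 1, not 0.

```python
from lxml import etree

# Build a made-up list of five items: item 1 ... item 5.
items = ''.join(f'<li><a href="link{n}.html">item {n}</a></li>'
                for n in range(1, 6))
doc = etree.HTML('<ul>' + items + '</ul>')

first = doc.xpath('//li[1]/a/text()')                    # first li
last = doc.xpath('//li[last()]/a/text()')                # last li
second_to_last = doc.xpath('//li[last()-1]/a/text()')    # one before the last
first_two = doc.xpath('//li[position() <= 2]/a/text()')  # first two li tags

print(first, last, second_to_last, first_two)
```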
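12. If you need the XPath of a particular tag on a page, it may look like this:
//*[@id="kw"]
Explanation: this uses a relative path to search all tags for the one whose id attribute equals kw.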
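An id-based expression of this kind can be tried out on a small document. The form markup below is invented for illustration; only the id value kw comes from the expression above.

```python
from lxml import etree

# Hypothetical page fragment containing an element with id="kw".
doc = etree.HTML('<form><input id="kw" name="wd" value="lxml">'
                 '<input id="su" type="submit"></form>')

# * matches any tag name; the predicate keeps only the one with id="kw".
matches = doc.xpath('//*[@id="kw"]')

for el in matches:
    print(el.tag, el.get('name'), el.get('value'))
```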