Selectors | Notion

构造选择器(selectors)

Scrapy selector是以 文字(text) 或 TextResponse 构造的 Selector 实例。其根据输入的类型自动选择最优的分析方法(XML vs HTML):

>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse

以文字构造:

>>> body = '<html><body><span>good</span></body></html>'
>>> Selector(text=body).xpath('//span/text()').extract()
[u'good']

以response构造:

>>> response = HtmlResponse(url='<http://example.com>', body=body)
>>> Selector(response=response).xpath('//span/text()').extract()
[u'good']

为了方便起见，response对象以 .selector 属性提供了一个selector，您可以随时使用该快捷方法:

>>> response.selector.xpath('//span/text()').extract()
[u'good']

使用选择器(selectors)

<html>
 <head>
  <base href='<http://example.com/>' />
  <title>Example website</title>
 </head>
 <body>
  <div id='images'>
   <a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>
   <a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>
   <a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>
   <a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>
   <a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>
  </div>
 </body>
</html>

那么，通过查看 HTML code 该页面的源码，我们构建一个XPath来选择title标签内的文字:

>>> response.selector.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]

由于在response中使用XPath、CSS查询十分普遍，因此，Scrapy提供了两个实用的快捷方式: response.xpath() 及 response.css():

>>> response.xpath('//title/text()')
[<Selector (text) xpath=//title/text()>]
>>> response.css('title::text')
[<Selector (text) xpath=//title/text()>]

.xpath()及 .css() 方法返回一个类 SelectorList 的实例, 它是一个新选择器的列表。这个API可以用来快速的提取嵌套数据。

为了提取真实的原文数据，你需要调用 .extract() 方法如下:

>>> response.xpath('//title/text()').extract()
[u'Example website']