Scrapyでファイルをダウンロードして保存する

Scrapyで複数ページの巡回はCrawlSpider、ファイルのダウンロードはFilesPipelineを使うと簡潔に記述できる。
FilesPipelineはデフォルトではURLのSHA1ハッシュをファイル名にする実装なので、ファイル名を変えたい場合はカスタマイズが必要。
ソースコードは簡潔で読みやすいので、継承してカスタマイズするのは容易。

CrawlSpider

要約すると、ポイントは以下

  • 巡回対象のページをrulesのLinkExtractorで抽出
  • コールバックで抽出したページからアイテムを抽出

FilesPipeline

要約すると、ポイントは以下

  • settings.pyのFILES_STOREでダウンロード先ディレクトリを指定
  • settings.pyのITEM_PIPELINESでFilesPipelineを有効化
  • 生成するアイテムにfile_urls属性を追加し、ダウンロードするファイルのURLを指定
  • 生成するアイテムにダウンロード結果が保存されるfiles属性を追加する

Using the Files Pipeline

The typical workflow, when using the FilesPipeline goes like this:

In a Spider, you scrape an item and put the URLs of the desired into a file_urls field.

The item is returned from the spider and goes to the item pipeline.

When the item reaches the FilesPipeline, the URLs in the file_urls field are scheduled for download using the standard Scrapy scheduler and downloader (which means the scheduler and downloader middlewares are reused), but with a higher priority, processing them before other pages are scraped. The item remains “locked” at that particular pipeline stage until the files have finish downloading (or fail for some reason).

When the files are downloaded, another field (files) will be populated with the results. This field will contain a list of dicts with information about the downloaded files, such as the downloaded path, the original scraped url (taken from the file_urls field) , and the file checksum. The files in the list of the files field will retain the same order of the original file_urls field. If some file failed downloading, an error will be logged and the file won’t be present in the files field.

Spiderでスクレイピングし、目的のファイルのURLをfile_urlsにセットすると、通常のSchedulerとDownloaderを使ってダウンロードがスケジューリングされる。ただし優先度が高く、他のページをスクレイピングする前に処理される。ダウンロード結果はfilesフィールドに記録される。
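
ダウンロード後のアイテムは概ね次のような形になる(値は説明用のダミー)。filesにはurl・path・checksumなどを持つdictが、file_urlsと同じ順序で入る。

# FilesPipeline通過後のアイテムのイメージ(値はダミー)
item = {
    "file_urls": ["https://www.example.com/sample.pdf"],  # Spiderがセットする
    "files": [
        {
            "url": "https://www.example.com/sample.pdf",     # 元のURL
            "path": "full/0a79653e...e5f.pdf",               # FILES_STORE配下の相対パス
            "checksum": "d41d8cd98f00b204e9800998ecf8427e",  # ダミーのチェックサム
        },
    ],
}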

Enabling your Media Pipeline

To enable your media pipeline you must first add it to your project ITEM_PIPELINES setting.

For Images Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}
For Files Pipeline, use:

ITEM_PIPELINES = {'scrapy.pipelines.files.FilesPipeline': 1}

ITEM_PIPELINESで'scrapy.pipelines.files.FilesPipeline': 1を指定して有効化する。
画像ファイルのためのImagesPipelineもある。
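
参考までに、ImagesPipelineを使う場合はfile_urls/filesの代わりにimage_urls/imagesフィールドを使い、保存先はIMAGES_STOREで指定する。以下はアイテム定義の最小スケッチ。

import scrapy

# ImagesPipeline用アイテムの最小スケッチ
class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()  # ダウンロードする画像URLのリスト
    images = scrapy.Field()      # ダウンロード結果(パスやチェックサムなど)が入る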

Supported Storage - File system storage

The files are stored using a SHA1 hash of their URLs for the file names.

ファイル名はSHA1ハッシュを使用する
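
デフォルトのファイル名の決まり方は、概ね次のようなイメージになる(簡略化したスケッチで、実際の実装はScrapyのバージョンにより細部が異なる可能性がある)。

import hashlib
import os

# FilesPipelineのデフォルトのfile_path()のイメージ(簡略化したスケッチ)
def sha1_file_path(url):
    media_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()  # URLのSHA1ハッシュ
    media_ext = os.path.splitext(url)[1]                        # 拡張子(例: .pdf)
    return "full/%s%s" % (media_guid, media_ext)

print(sha1_file_path("https://www.example.com/sample.pdf"))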

IPAの情報処理試験のページをサンプルにCrawlSpiderを試す

対象のページ構造

起点となるページは各年度の過去問ダウンロードページへのリンクになっている。

（画像: IPAのページ）

各ページは試験区分ごとに過去問のPDFへのリンクがある。

（画像: IPAのページ）

project

https://www.jitec.ipa.go.jp/1_04hanni_sukiru/_index_mondai.html以下のページを巡回してPDFをダウンロードするプロジェクトを作成する。
Spiderのスケルトンを作成する際に-t crawlを指定し、CrawlSpiderのスケルトンを作成する。

scrapy startproject <プロジェクト名>
cd <プロジェクト名>
scrapy genspider -t crawl ipa www.ipa.go.jp

spiders/ipa.py

rulesで各年度の過去問ダウンロードページを抽出し、各ページを解析してPDF単位でアイテム化する。
file_urlsは複数指定できるが、ここでは1ファイル毎で指定している。

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from crawldownload.items import CrawldownloadItem


class IpaSpider(CrawlSpider):
    name = 'ipa'
    allowed_domains = ['ipa.go.jp']
    start_urls = ['https://www.jitec.ipa.go.jp/1_04hanni_sukiru/_index_mondai.html']

    rules = (
        Rule(LinkExtractor(allow=r'1_04hanni_sukiru/mondai_kaitou'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Spider組み込みのself.loggerでタイトルをログ出力する
        self.logger.info("{}".format(response.css('title::text').get()))

        for main_area in response.css('#ipar_main'):
            exam_seasons = main_area.css('h3').xpath('string()').extract()

            season = 0
            for exam_table in main_area.css('div.unit'):
                exam_season = exam_seasons[season]
                season += 1

                # ページ内のPDFファイルのアイテムを生成
                for exam_item in exam_table.css('tr'):
                    # リンクを含まないヘッダ部なので除く
                    if exam_item.css('a').get() is None:
                        continue

                    for exam_link in exam_item.css('a'):
                        exam_pdf = response.urljoin(exam_link.css('a::attr(href)').get())

                        item = CrawldownloadItem()
                        item['season'] = exam_season
                        item['title'] = exam_item.css('td p::text').getall()[1].strip()
                        item['file_title'] = exam_link.css('a::text').get()
                        item['file_urls'] = [exam_pdf]
                        yield item

items.py

file_urlsとfilesがFilesPipelineで必要になる属性。

import scrapy


class CrawldownloadItem(scrapy.Item):
    season = scrapy.Field()
    title = scrapy.Field()
    file_title = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()

pipelines.py

FilesPipelineはデフォルトでSHA1ハッシュのファイル名を使用するので、継承したクラスでfile_path()メソッドをオーバーライドする。
存在しないディレクトリも自動生成されるので、保存したいパスを生成して返せばいい。

from scrapy.pipelines.files import FilesPipeline

import os


class CrawldownloadPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        file_paths = request.url.split("/")
        file_paths.pop(0)  # 'https:'を除去
        file_paths.pop(0)  # '//'の間の空要素を除去
        file_name = os.path.join(*file_paths)

        return file_name
response.url="https://www.jitec.ipa.go.jp/1_04hanni_sukiru/mondai_kaitou_2019h31_2/2019r01a_sg_am_qs.pdf"
↓↓↓
file_name="www.jitec.ipa.go.jp/1_04hanni_sukiru/mondai_kaitou_2019h31_2/2019r01a_sg_am_qs.pdf"

settings.py

FilesPipelineを有効化する。

  • FILES_STOREでダウンロード先ディレクトリを指定
  • ITEM_PIPELINESでFilesPipelineを有効化

デフォルト設定では多重度が高すぎるので、調整する。

  • 同時アクセスは1
  • ダウンロード間隔3秒
# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
DOWNLOAD_DELAY = 3

…略…

FILES_STORE = 'download'

ITEM_PIPELINES = {
    #'scrapy.pipelines.files.FilesPipeline': 1,
    'crawldownload.pipelines.CrawldownloadPipeline': 1,
}


Scrapyのcrawlでコマンドライン引数を処理する

 

クローラーへのコマンドラインオプションの渡し方

scrapy crawl myspider -a category=electronicsのように-aオプションで渡す。

コンストラクタを実装する

Spiders can access arguments in their __init__ methods:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = ['http://www.example.com/categories/%s' % category]
        # ...

デフォルトコンストラクタを使用する

The default init method will take any spider arguments and copy them to the spider as attributes. The above example can also be written as follows:

デフォルトでは属性値として設定される。

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        yield scrapy.Request('http://www.example.com/categories/%s' % self.category)


ScrapyとSplashでのセッションハンドリング

 

Splashのセッションハンドリング

Splashのみで利用する場合はSelenium同様、内部的に動作するHeadlessブラウザ(Chromium)がセッションハンドリングを行うため、同一のLuaスクリプト内で記述する範囲では意識しなくてもステートは維持されている。

ScrapyとSplashの間

SplashはScrapyからのリクエスト毎にステートレスなので、ScrapyとLuaスクリプトの間でセッションハンドリングが必要になる。
scrapy-splashに説明がある。

セッションハンドリング

Splash itself is stateless - each request starts from a clean state. In order to support sessions the following is required:

  1. client (Scrapy) must send current cookies to Splash;
  2. Splash script should make requests using these cookies and update them from HTTP response headers or JavaScript code;
  3. updated cookies should be sent back to the client;
  4. client should merge current cookies with the updated cookies.

For (2) and (3) Splash provides splash:get_cookies() and splash:init_cookies() methods which can be used in Splash Lua scripts.

Splashはステートレスなので、状態を維持するためのコーディングが必要。

  1. ScrapyからSplashにCookieを送らなくてはならない
  2. SplashスクリプトはCookieを使って操作し、Cookieをアップデートする
  3. アップデートしたCookieをScrapyに返す
  4. Scrapyは受け取ったCookieをマージする

scrapy-splash provides helpers for (1) and (4): to send current cookies in ‘cookies’ field and merge cookies back from ‘cookies’ response field set request.meta[‘splash’][‘session_id’] to the session identifier. If you only want a single session use the same session_id for all request; any value like ‘1’ or ‘foo’ is fine.

scrapy-splashは(1)と(4)のヘルパーを提供している。request.meta['splash']['session_id']にセッション識別子を設定すると、現在のCookieを'cookies'フィールドで送信し、レスポンスの'cookies'フィールドからマージしてくれる。単一のセッションでよければ、すべてのリクエストで'1'や'foo'のような同じsession_idを使えばよい。
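
引用にあるrequest.meta['splash']['session_id']の指定を素朴に書くと、次のようなイメージになる(エンドポイントやLuaスクリプトは後述のサンプルを想定した仮のスケッチ)。

import scrapy

# session_id指定のスケッチ: 同じセッションを使い回すリクエストには同じ値を設定する
def splash_request_with_session(url, callback, lua_source, session_id="1"):
    return scrapy.Request(url, callback, meta={
        "splash": {
            "endpoint": "execute",               # /executeエンドポイントを使う想定
            "args": {"lua_source": lua_source},  # cookies対応のLuaスクリプトを渡す想定
            "session_id": session_id,            # セッション識別子('1'や'foo'など何でもよい)
        },
    })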

For scrapy-splash session handling to work you must use /execute endpoint and a Lua script which accepts ‘cookies’ argument and returns ‘cookies’ field in the result:

このセッションハンドリングを有効にするには/executeエンドポイントを使用し、cookiesパラメーターを使用する処理をLuaスクリプトで実装する必要がある。

function main(splash)
    splash:init_cookies(splash.args.cookies)

    -- ... your script

    return {
        cookies = splash:get_cookies(),
        -- ... other results, e.g. html
    }
end

SplashRequest sets session_id automatically for /execute endpoint, i.e. cookie handling is enabled by default if you use SplashRequest, /execute endpoint and a compatible Lua rendering script.

SplashRequestで/executeエンドポイントを使い、適切なLuaスクリプトを記述すれば、セッションハンドリングを実装することができる。

Splash経由でのresponseの構造

All these responses set response.url to the URL of the original request (i.e. to the URL of a website you want to render), not to the URL of the requested Splash endpoint. “True” URL is still available as response.real_url.
SplashJsonResponse provides extra features:

  • response.data attribute contains response data decoded from JSON; you can access it like response.data[‘html’].
  • If Splash session handling is configured, you can access current cookies as response.cookiejar; it is a CookieJar instance.
  • If Scrapy-Splash response magic is enabled in request (default), several response attributes (headers, body, url, status code) are set automatically from original response body:
    • response.headers are filled from ‘headers’ keys;
    • response.url is set to the value of ‘url’ key;
    • response.body is set to the value of ‘html’ key, or to base64-decoded value of ‘body’ key;
    • response.status is set from the value of ‘http_status’ key.
  • response.urlはレンダリングするページのURLが設定される
  • response.real_urlはSplashのURL(http://splash:8050/execute)となる
  • response.dataでSplashから返却したデータにアクセスできる
  • Cookieはresponse.cookiejarでアクセスすることができる。
  • Scrapy-Splash response magicで自動的にレンダリングしたページからの応答が設定される
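
上記の属性をコールバック内で参照するイメージは以下のとおり(仮のparse_resultのスケッチで、Luaスクリプトがhtmlとcookiesを返している前提)。

# SplashJsonResponseを受け取るコールバックのスケッチ(仮)
def parse_result(self, response):
    html = response.data["html"]   # Luaスクリプトがreturnした'html'キー
    cookies = response.cookiejar   # セッションハンドリング有効時のCookieJar
    self.logger.info("url=%s real_url=%s cookies=%d",
                     response.url, response.real_url, len(cookies))
    # response magicが有効ならresponse.bodyには'html'キーの内容が入っている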

セッションハンドリングのサンプルコード

import scrapy
from scrapy_splash import SplashRequest

script = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go{
        splash.args.url,
        headers=splash.args.headers,
        http_method=splash.args.http_method,
        body=splash.args.body,
    })
    assert(splash:wait(0.5))

    local entries = splash:history()
    local last_response = entries[#entries].response
    return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
"""


class MySpider(scrapy.Spider):

    # ...
        yield SplashRequest(url, self.parse_result,
            endpoint='execute',
            cache_args=['lua_source'],
            args={'lua_source': script},
            headers={'X-My-Header': 'value'},
        )

    def parse_result(self, response):
        # here response.body contains result HTML;
        # response.headers are filled with headers from last
        # web page loaded to Splash;
        # cookies from all responses and from JavaScript are collected
        # and put into Set-Cookie response header, so that Scrapy
        # can remember them.

リクエストで注目するポイント

重要なポイントは/executeエンドポイントを使用していること。
argsでLuaスクリプトやパラメーターをSplashに渡す。

yield SplashRequest(url, self.parse_result,
    endpoint='execute',
    cache_args=['lua_source'],
    args={'lua_source': script},
    headers={'X-My-Header': 'value'},
)

SplashRequestで渡したパラメーターを使用してCookieを初期化。

splash:init_cookies(splash.args.cookies)
assert(splash:go{
    splash.args.url,
    headers=splash.args.headers,
    http_method=splash.args.http_method,
    body=splash.args.body,
})
assert(splash:wait(0.5))

レスポンスで注目するポイント

最後のレスポンスのヘッダー情報やCookieを返却。

local entries = splash:history()
local last_response = entries[#entries].response
return {
    url = splash:url(),
    headers = last_response.headers,
    http_status = last_response.status,
    cookies = splash:get_cookies(),
    html = splash:html(),
}


Scrapyのloginformで効率的にログインする

 

scrapy/loginform

ログインフォームの利用を支援する。pip install loginformでインストール。

プロジェクトの準備

$scrapy startproject scrapy_login
New Scrapy project 'scrapy_login', using template directory '/usr/local/lib/python3.8/site-packages/scrapy/templates/project', created in:
/work/scrapy/scrapy_login

You can start your first spider with:
cd scrapy_login
scrapy genspider example example.com
$cd scrapy_login
$scrapy genspider github github.com
Created spider 'github' using template 'basic' in module:
scrapy_login.spiders.github
├── result.json
├── scrapy.cfg
└── scrapy_login
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        └── github.py

settings.pyをカスタマイズ

ROBOTSTXT_OBEY

githubはrobots.txtでクローラーからのアクセスを拒否するので、一時的にrobots.txtを無効化する。

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

items.pyをカスタマイズ

class ScrapyLoginItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    #pass
    repository_name = scrapy.Field()
    repository_link = scrapy.Field()

github.pyをカスタマイズしてSpiderを実装する

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
from loginform import fill_login_form
from scrapy_login.items import ScrapyLoginItem


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ["http://github.com/login"]
    login_user = "XXXXXXX"
    login_pass = "XXXXXXX"

    def parse(self, response):
        # ログインフォームを解析して送信パラメーター・URL・メソッドを取得する
        args, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_pass)
        return FormRequest(url, method=method, formdata=args, callback=self.after_login)

    def after_login(self, response):
        for q in response.css("ul.list-style-none li div.width-full"):
            _, repo_name = q.css("span.css-truncate::text").getall()
            github = ScrapyLoginItem()
            github["repository_name"] = repo_name
            github["repository_link"] = q.css("a::attr(href)").get()
            yield github

実行すると以下のような内容が生成される。

[
{"repository_name": "hello-world", "repository_link": "/xxxxxxx/hello-world"},
{"repository_name": "Spoon-Knife", "repository_link": "/octocat/Spoon-Knife"}
]

fill_login_form()

注目するポイントはfill_login_formの部分。
fill_login_form()を実行すると、ページを解析してログインフォームの情報を返す。

$python
Python 3.8.2 (default, Apr 16 2020, 18:36:10)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from loginform import fill_login_form
>>> import requests
>>> url = "https://github.com/login"
>>> r = requests.get(url)
>>> fill_login_form(url, r.text, "john", "secret")
(
[('authenticity_token', 'R+A63AyXCpZLBzIdp6LefjsRxmkhLqsxaUPp+DLru2BlQlyID+B7yXL3FoNgoBgjF3osG3ZSyjBFriX6TsrsFg=='), ('login', 'john'), ('password', 'secret'), ('webauthn-support', 'unknown'), ('webauthn-iuvpaa-support', 'unknown'), ('timestamp', '1588766233339'), ('timestamp_secret', '115d1a1e733276fa256131e12acb6c1974912ba3923dddd3ade33ba6717b3dcd'), ('commit', 'Sign in')],
'https://github.com/session',
'POST')

タプルの1つめの要素にauthenticity_tokenが含まれていることがわかる。このようにhiddenパラメーターも含めてフォームの内容を送ることができる。


scrapy-splash

SplashのScrapyミドルウェア。pip install scrapy-splashでインストール。

プロジェクトの準備

$ scrapy startproject scrapy_splash_tutorial
New Scrapy project 'scrapy_splash_tutorial', using template directory '/usr/local/lib/python3.8/site-packages/scrapy/templates/project', created in:
/work/scrapy/scrapy_splash_tutorial

You can start your first spider with:
cd scrapy_splash_tutorial
scrapy genspider example example.com
.
├── scrapy.cfg
└── scrapy_splash_tutorial
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

settings.pyをカスタマイズ

DOWNLOADER_MIDDLEWARES

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.

Splashのミドルウェアの優先度は、HttpProxyMiddleware(750)より前に処理されるように750未満の値にする。

SPLASH_URL

SPLASH_URLでSplashのURLを指定する。

SPLASH_URL = 'http://splash:8050/'

docker-composeで起動しているため、ホスト名にはサービス名のsplashを使っている。

SPIDER_MIDDLEWARES

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

SplashDeduplicateArgsMiddlewareを有効化する。これによって重複するSplashの引数(Luaスクリプトなど)を何度もSplashサーバーに送らずに済む。

DUPEFILTER_CLASS / HTTPCACHE_STORAGE

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Scrapyはリクエストのフィンガープリント計算をグローバルにオーバーライドする手段を提供していないので、DUPEFILTER_CLASSとHTTPCACHE_STORAGEをscrapy-splash用のクラスに差し替える。

Spiderの実装例

import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # …
  1. scrapy.Requestの代わりにSplashRequestを使用してページのレンダリング
  2. argsでSplashに引数として渡す
  3. endpointでデフォルトのエンドポイントであるrender.jsonからrender.htmlに変更

Spiderの例を元にquotesのJSページを実装する

JavaScriptでページを生成するhttp://quotes.toscrape.com/js/を対象にテストコードを作成する。

今回のスパイダーはquotesjsで作成。

$scrapy genspider quotesjs quotes.toscrape.com
Created spider 'quotesjs' using template 'basic' in module:
scrapy_splash_tutorial.spiders.quotesjs

ChromeのF12デバッグで内容を確認する

（画像: Chromeデバッグ）

（画像: Chromeデバッグ）

scrapy shellでページを解析する

shellはSplash経由で操作するため、scrapy shell 'http://splash:8050/render.html?url=http://<target_url>&timeout=10&wait=2'で起動する。
パラメーターのwait=2(秒数は対象にあわせて適切な値を)は重要で、指定なしではレンダリングが終わっていないHTMLが返却されることもある。

$scrapy shell 'http://splash:8050/render.html?url=http://quotes.toscrape.com/js/'
2020-05-06 18:09:33 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_splash_tutorial)
2020-05-06 18:09:33 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-06 18:09:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-06 18:09:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_splash_tutorial',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'EDITOR': '/usr/bin/vim',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'scrapy_splash_tutorial.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_splash_tutorial.spiders']}
2020-05-06 18:09:33 [scrapy.extensions.telnet] INFO: Telnet Password: 2dd3dc32afe40826
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-06 18:09:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-06 18:09:33 [scrapy.core.engine] INFO: Spider opened
2020-05-06 18:09:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://splash:8050/robots.txt> (referer: None)
2020-05-06 18:09:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://splash:8050/render.html?url=http://quotes.toscrape.com/js/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f8aaede0f10>
[s] item {}
[s] request <GET http://splash:8050/render.html?url=http://quotes.toscrape.com/js/>
[s] response <200 http://splash:8050/render.html?url=http://quotes.toscrape.com/js/>
[s] settings <scrapy.settings.Settings object at 0x7f8aaede0b20>
[s] spider <DefaultSpider 'default' at 0x7f8aaeb9a9a0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> response.css('.container .quote').get()
'<div class="quote"><span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">change</a> <a class="tag">deep-thoughts</a> <a class="tag">thinking</a> <a class="tag">world</a></div></div>'
>>> response.css('.container .quote').getall()
['<div class="quote"><span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">change</a> <a class="tag">deep-thoughts</a> <a class="tag">thinking</a> <a class="tag">world</a></div></div>', '<div class="quote"><span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span><span>by <small class="author">J.K. Rowling</small></span><div class="tags">Tags: <a class="tag">abilities</a> <a class="tag">choices</a></div></div>', '<div class="quote"><span class="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">inspirational</a> <a class="tag">life</a> <a class="tag">live</a> <a class="tag">miracle</a> <a class="tag">miracles</a></div></div>', '<div class="quote"><span class="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span><span>by <small class="author">Jane Austen</small></span><div class="tags">Tags: <a class="tag">aliteracy</a> <a class="tag">books</a> <a class="tag">classic</a> <a class="tag">humor</a></div></div>', '<div class="quote"><span class="text">“Imperfection is beauty, madness is genius and it\'s better to be absolutely ridiculous than absolutely boring.”</span><span>by <small class="author">Marilyn Monroe</small></span><div class="tags">Tags: <a class="tag">be-yourself</a> <a class="tag">inspirational</a></div></div>', '<div class="quote"><span class="text">“Try not to become a man of success. Rather become a man of value.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">adulthood</a> <a class="tag">success</a> <a class="tag">value</a></div></div>', '<div class="quote"><span class="text">“It is better to be hated for what you are than to be loved for what you are not.”</span><span>by <small class="author">André Gide</small></span><div class="tags">Tags: <a class="tag">life</a> <a class="tag">love</a></div></div>', '<div class="quote"><span class="text">“I have not failed. I\'ve just found 10,000 ways that won\'t work.”</span><span>by <small class="author">Thomas A. Edison</small></span><div class="tags">Tags: <a class="tag">edison</a> <a class="tag">failure</a> <a class="tag">inspirational</a> <a class="tag">paraphrased</a></div></div>', '<div class="quote"><span class="text">“A woman is like a tea bag; you never know how strong it is until it\'s in hot water.”</span><span>by <small class="author">Eleanor Roosevelt</small></span><div class="tags">Tags: <a class="tag">misattributed-eleanor-roosevelt</a></div></div>', '<div class="quote"><span class="text">“A day without sunshine is like, you know, night.”</span><span>by <small class="author">Steve Martin</small></span><div class="tags">Tags: <a class="tag">humor</a> <a class="tag">obvious</a> <a class="tag">simile</a></div></div>']

items.pyをカスタマイズ

class QuoteItem(scrapy.Item):
    quote = scrapy.Field()
    author = scrapy.Field()

quotesjs.pyをカスタマイズ

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy_splash_tutorial.items import QuoteItem


class QuotesjsSpider(scrapy.Spider):
    name = 'quotesjs'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        for q in response.css(".container .quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").extract_first()
            quote["quote"] = q.css(".text::text").extract_first()
            yield quote

クローラーを実行する

scrapy crawl quotesjs -o result.jsonでクローラーを実行する。

$scrapy crawl quotesjs -o result.json
2020-05-06 18:34:02 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_splash_tutorial)
2020-05-06 18:34:02 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-06 18:34:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-06 18:34:02 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_splash_tutorial',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'EDITOR': '/usr/bin/vim',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'scrapy_splash_tutorial.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_splash_tutorial.spiders']}
2020-05-06 18:34:02 [scrapy.extensions.telnet] INFO: Telnet Password: febe521f79cff551
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-06 18:34:02 [scrapy.core.engine] INFO: Spider opened
2020-05-06 18:34:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-06 18:34:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-06 18:34:02 [py.warnings] WARNING: /usr/local/lib/python3.8/site-packages/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)

2020-05-06 18:34:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-06 18:34:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://splash:8050/robots.txt> (referer: None)
2020-05-06 18:34:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/js/ via http://splash:8050/render.html> (referer: None)
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Albert Einstein',
'quote': '“The world as we have created it is a process of our thinking. It '
'cannot be changed without changing our thinking.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'J.K. Rowling',
'quote': '“It is our choices, Harry, that show what we truly are, far more '
'than our abilities.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Albert Einstein',
'quote': '“There are only two ways to live your life. One is as though '
'nothing is a miracle. The other is as though everything is a '
'miracle.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Jane Austen',
'quote': '“The person, be it gentleman or lady, who has not pleasure in a '
'good novel, must be intolerably stupid.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Marilyn Monroe',
'quote': "“Imperfection is beauty, madness is genius and it's better to be "
'absolutely ridiculous than absolutely boring.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Albert Einstein',
'quote': '“Try not to become a man of success. Rather become a man of value.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'André Gide',
'quote': '“It is better to be hated for what you are than to be loved for '
'what you are not.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Thomas A. Edison',
'quote': "“I have not failed. I've just found 10,000 ways that won't work.”"}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Eleanor Roosevelt',
'quote': '“A woman is like a tea bag; you never know how strong it is until '
"it's in hot water.”"}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Steve Martin',
'quote': '“A day without sunshine is like, you know, night.”'}
2020-05-06 18:34:04 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-06 18:34:04 [scrapy.extensions.feedexport] INFO: Stored json feed (10 items) in: result.json
2020-05-06 18:34:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 960,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 9757,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 2,
'elapsed_time_seconds': 2.285135,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 6, 9, 34, 4, 575789),
'item_scraped_count': 10,
'log_count/DEBUG': 13,
'log_count/INFO': 11,
'log_count/WARNING': 1,
'memusage/max': 56578048,
'memusage/startup': 56578048,
'response_received_count': 3,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/404': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2020, 5, 6, 9, 34, 2, 290654)}
2020-05-06 18:34:04 [scrapy.core.engine] INFO: Spider closed (finished)

生成されたresult.jsonは以下。

[
{"author": "Albert Einstein", "quote": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"},
{"author": "J.K. Rowling", "quote": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"},
{"author": "Albert Einstein", "quote": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"},
{"author": "Jane Austen", "quote": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"},
{"author": "Marilyn Monroe", "quote": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"},
{"author": "Albert Einstein", "quote": "\u201cTry not to become a man of success. Rather become a man of value.\u201d"},
{"author": "Andr\u00e9 Gide", "quote": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"},
{"author": "Thomas A. Edison", "quote": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"},
{"author": "Eleanor Roosevelt", "quote": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"},
{"author": "Steve Martin", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"}
]


Splash公式チュートリアル

Installation

Splashはdocker run -it -p 8050:8050 --rm scrapinghub/splashで起動できるが、ここではdocker-composeで操作する。

docker-compose.ymlで定義。

splash:
  image: scrapinghub/splash
  ports:
    - 8050:8050

実行のテスト

$ docker-compose run splash
Pulling splash (scrapinghub/splash:)...
latest: Pulling from scrapinghub/splash
2746a4a261c9: Pull complete
4c1d20cdee96: Pull complete
~略~
50ea6de52777: Pull complete
43e94179bda5: Pull complete
Digest: sha256:01c89e3b0598e904fea184680b82ffe74524e83160f793884dc88d184056c49d
Status: Downloaded newer image for scrapinghub/splash:latest
2020-05-06 04:13:03+0000 [-] Log opened.
2020-05-06 04:13:03.106078 [-] Xvfb is started: ['Xvfb', ':2112596484', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2020-05-06 04:13:03.184966 [-] Splash version: 3.4.1
2020-05-06 04:13:03.217438 [-] Qt 5.13.1, PyQt 5.13.1, WebKit 602.1, Chromium 73.0.3683.105, sip 4.19.19, Twisted 19.7.0, Lua 5.2
2020-05-06 04:13:03.217581 [-] Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
2020-05-06 04:13:03.217654 [-] Open files limit: 1048576
2020-05-06 04:13:03.217695 [-] Can't bump open files limit
2020-05-06 04:13:03.231322 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2020-05-06 04:13:03.231620 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2020-05-06 04:13:03.343525 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2020-05-06 04:13:03.343858 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2020-05-06 04:13:03.344260 [-] Site starting on 8050
2020-05-06 04:13:03.344470 [-] Starting factory <twisted.web.server.Site object at 0x7f23c5cb6160>
2020-05-06 04:13:03.344768 [-] Server listening on http://0.0.0.0:8050

使用する際はdocker-compose up -dで。

Splash WebUI

起動したSplashにアクセスするとWebUIから操作が可能。

（画像: Splash WebUI）

標準で表示されているコードでRender me!を実行する。

（画像: Splash WebUI）

Intro

Splash can execute custom rendering scripts written in the Lua programming language. This allows us to use Splash as a browser automation tool similar to PhantomJS.
Lua言語で記述されたカスタムレンダリングスクリプトを実行できるPhantomJS的なもの。
Lua言語はRedis、Nginx、Apache、World of Warcraftのスクリプトなど、カスタムスクリプトの記述に使われている。

以下のチュートリアルが紹介されている。

function main(splash, args)
    splash:go("http://example.com")
    splash:wait(0.5)
    local title = splash:evaljs("document.title")
    return {title=title}
end

WebUI上でRender me!を実行すると、returnで返したJSONオブジェクトが得られる。

（画像: Splash WebUI）

（画像: Splash WebUI）

Entry Point: the “main” Function

function main(splash)
    return {hello="world!"}
end

SplashのWebUIで実行すると以下の結果になる。

Splash Response: Object
hello: "world!"

JSON形式ではなく、文字列で返すこともできる。

function main(splash)
    return 'hello'
end

docker-composeでsplashというサービスなのでホスト名はsplashを使用している。

$ curl 'http://splash:8050/execute?lua_source=function+main%28splash%29%0D%0A++return+%27hello%27%0D%0Aend'
hello

Where Are My Callbacks?

It is not doing exactly the same work - instead of saving screenshots to files we’re returning PNG data to the client via HTTP API.
スクリーンショットをPNG形式で取得しWebAPIで返却する例

function main(splash, args)
    splash:set_viewport_size(800, 600)
    splash:set_user_agent('Splash bot')
    local example_urls = {"www.google.com", "www.bbc.co.uk", "scrapinghub.com"}
    local urls = args.urls or example_urls
    local results = {}
    for _, url in ipairs(urls) do
        local ok, reason = splash:go("http://" .. url)
        if ok then
            splash:wait(0.2)
            results[url] = splash:png()
        end
    end
    return results
end

WebUI上でRender me!を実行すると、各サイトのスクリーンショットが表示される。

（画像: Splash WebUI）

Calling Splash Methods

There are two main ways to call Lua methods in Splash scripts: using positional and named arguments. To call a method using positional arguments use parentheses splash:foo(val1, val2), to call it with named arguments use curly braces: splash:foo{name1=val1, name2=val2}:

Luaのメソッド呼び出しは位置引数(Positional arguments)によるsplash:foo(val1, val2)や名前引数(named arguments)splash:foo{name1=val1, name2=val2}によるものがある。

function main(splash, args)
    -- Examples of positional arguments:
    splash:go("http://example.com")
    splash:wait(0.5, false)
    local title = splash:evaljs("document.title")

    -- The same using keyword arguments:
    splash:go{url="http://google.com"}
    splash:wait{time=0.5, cancel_on_redirect=false}
    local title = splash:evaljs{snippet="document.title"}

    -- Mixed arguments example:
    splash:wait{0.5, cancel_on_redirect=false}

    return title
end

このチュートリアルのコード自体に深い意味はないが、公式チュートリアルのコードではevaljs{source="document.title"}となっており、そのままでは動作しない。
splash:evaljsのリファレンスを見ると、正しい名前引数はsnippetである事がわかる(上のコードはsnippetにしてある)。

Error Handling

Splash uses the following convention:

  1. for developer errors (e.g. incorrect function arguments) exception is raised;
  2. for errors outside developer control (e.g. a non-responding remote website) status flag is returned: functions that can fail return ok, reason pairs which developer can either handle or ignore.
    If main results in an unhandled exception then Splash returns HTTP 400 response with an error message.

Splashでは以下のルールになっている。

  1. 開発者起因のエラー(引数の誤りなど)は例外を発生させる
  2. 開発者が制御できないエラー(応答しないサイトなど)はok, reasonのペアで返す

例外はerror()で明示的に発生させることができる。

function main(splash, args)
    local ok, msg = splash:go("http://no-url.example.com")
    if not ok then
        -- handle error somehow, e.g.
        error(msg)
    end
end

mainで未処理の例外が発生した場合、SplashはHTTP 400のエラーレスポンスとして返す。

{
    "error": 400,
    "type": "ScriptError",
    "description": "Error happened while executing Lua script",
    "info": {
        "source": "[string \"function main(splash, args)\r...\"]",
        "line_number": 5,
        "error": "network3",
        "type": "LUA_ERROR",
        "message": "Lua error: [string \"function main(splash, args)\r...\"]:5: network3"
    }
}

同じコードをassert()で表現できる。

function main(splash, args)
    -- a shortcut for the code above: use assert
    assert(splash:go("http://no-url.example.com"))
end

Sandbox

By default Splash scripts are executed in a restricted environment: not all standard Lua modules and functions are available, Lua require is restricted, and there are resource limits (quite loose though).

デフォルトではSplashはサンドボックスで実行される。無効化するには--disable-lua-sandboxオプションを使う。

Dockerコマンドをそのまま使用するなら以下のように。

docker run -it -p 8050:8050 scrapinghub/splash --disable-lua-sandbox

docker-composeなら、commandでオプションを渡す。

splash:
  image: scrapinghub/splash
  command: --disable-lua-sandbox
  ports:
    - 8050:8050

docker-compose runでテスト実行するとLua: enabled (sandbox: disabled)を確認できる。

PS C:\Users\g\OneDrive\devel\gggcat@github\python3-tutorial> docker-compose run splash
2020-05-06 06:02:02+0000 [-] Log opened.
2020-05-06 06:02:02.166203 [-] Xvfb is started: ['Xvfb', ':1094237403', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2020-05-06 06:02:02.242322 [-] Splash version: 3.4.1
2020-05-06 06:02:02.275180 [-] Qt 5.13.1, PyQt 5.13.1, WebKit 602.1, Chromium 73.0.3683.105, sip 4.19.19, Twisted 19.7.0, Lua 5.2
2020-05-06 06:02:02.275346 [-] Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
2020-05-06 06:02:02.275497 [-] Open files limit: 1048576
2020-05-06 06:02:02.275605 [-] Can't bump open files limit
2020-05-06 06:02:02.289473 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2020-05-06 06:02:02.289650 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2020-05-06 06:02:02.398489 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2020-05-06 06:02:02.398754 [-] Web UI: enabled, Lua: enabled (sandbox: disabled), Webkit: enabled, Chromium: enabled
2020-05-06 06:02:02.399073 [-] Site starting on 8050
2020-05-06 06:02:02.399156 [-] Starting factory <twisted.web.server.Site object at 0x7f02ac5b61d0>
2020-05-06 06:02:02.399344 [-] Server listening on http://0.0.0.0:8050

Timeouts

By default Splash aborts script execution after a timeout (30s by default); it is a common problem for long scripts.

タイムアウトはデフォルトで30秒。
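
render系エンドポイントにはtimeout引数があり、長いレンダリングではこれを引き上げられる(サーバー起動時のmax-timeout、前掲のログでは90.0を超える値は指定できない)。以下はrequestsで試す場合の仮のスケッチ。

import requests

# render.htmlにtimeoutを渡して打ち切り時間を延ばすスケッチ
# (ホスト名splashはdocker-composeのサービス名を想定)
resp = requests.get(
    "http://splash:8050/render.html",
    params={"url": "http://quotes.toscrape.com/js/", "wait": 2, "timeout": 60},
)
print(resp.status_code, len(resp.text))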


AmazonLinux

AmazonLinuxのイメージ一覧

  • AmazonLinux2
  • リージョンがap-northeast-1
  • アーキテクチャがx86_64
$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
-------------------------------------------------------------------------------------------------------------
| DescribeImages |
+-----------------------------------------------------------------------+------------------------+----------+
| amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2 | ami-0f310fced6141e627 | x86_64 |
| amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs | ami-06aa6ba9dc39dc071 | x86_64 |
| amzn2-ami-hvm-2.0.20200304.0-x86_64-gp2 | ami-052652af12b58691f | x86_64 |
| amzn2-ami-hvm-2.0.20200304.0-x86_64-ebs | ami-0c6f9336767cd9243 | x86_64 |
~略~
| amzn2-ami-hvm-2017.12.0.20180109-x86_64-gp2 | ami-6be57d0d | x86_64 |
| amzn2-ami-hvm-2017.12.0.20180109-x86_64-ebs | ami-39e37b5f | x86_64 |
| amzn2-ami-hvm-2017.12.0.20171212.2-x86_64-gp2 | ami-2a34b64c | x86_64 |
| amzn2-ami-hvm-2017.12.0.20171212.2-x86_64-ebs | ami-1d37b57b | x86_64 |
+-----------------------------------------------------------------------+------------------------+----------+

AmazonLinuxの最新イメージを取得する

  • バージョン: 2
  • リージョンが: ap-northeast-1
  • アーキテクチャ: x86_64
  • ボリューム: gp2

ボリュームタイプでイメージが異なるので、以下はgp2(現行の汎用SSD)のボリュームで検索している。

$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*-gp2" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[1].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
---------------------------------------------
| DescribeImages |
+-------------------------------------------+
| amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs |
| ami-06aa6ba9dc39dc071 |
| x86_64 |
+-------------------------------------------+
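
同じ検索をboto3で行う場合のスケッチは以下(リージョンや絞り込み条件は上記のaws cliと同じ想定)。

import boto3

# aws cliのdescribe-imagesと同等の検索をboto3で行うスケッチ
ec2 = boto3.client("ec2", region_name="ap-northeast-1")
resp = ec2.describe_images(
    Owners=["amazon"],
    Filters=[
        {"Name": "name", "Values": ["amzn2-ami-hvm-*-gp2"]},
        {"Name": "architecture", "Values": ["x86_64"]},
    ],
)
# CreationDateの降順に並べて最新のイメージを取り出す
latest = sorted(resp["Images"], key=lambda i: i["CreationDate"], reverse=True)[0]
print(latest["Name"], latest["ImageId"])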

AmazonLinuxのボリュームタイプ

amzn2-ami-hvm-*-x86_64-ebsはVolumeType: standardで旧世代のボリュームタイプを使用している。

$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*-gp2" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-04-07T17:30:34.000Z",
"ImageId": "ami-0f310fced6141e627",
"ImageLocation": "amazon/amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2",
"ImageType": "machine",
"Public": true,
"OwnerId": "137112412989",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/xvda",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-06688593da98411ef",
"VolumeSize": 8,
"VolumeType": "gp2",
"Encrypted": false
}
}
],
"Description": "Amazon Linux 2 AMI 2.0.20200406.0 x86_64 HVM gp2",
"EnaSupport": true,
"Hypervisor": "xen",
"ImageOwnerAlias": "amazon",
"Name": "amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2",
"RootDeviceName": "/dev/xvda",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}
$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*-ebs" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-04-07T17:14:50.000Z",
"ImageId": "ami-06aa6ba9dc39dc071",
"ImageLocation": "amazon/amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs",
"ImageType": "machine",
"Public": true,
"OwnerId": "137112412989",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/xvda",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-06688593da98411ef",
"VolumeSize": 8,
"VolumeType": "standard",
"Encrypted": false
}
}
],
"Description": "Amazon Linux 2 AMI 2.0.20200406.0 x86_64 HVM ebs",
"EnaSupport": true,
"Hypervisor": "xen",
"ImageOwnerAlias": "amazon",
"Name": "amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs",
"RootDeviceName": "/dev/xvda",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}

UbuntuLinux

AMIでUbuntuLinuxの指定バージョンの最新イメージ

UbuntuLinuxの公式AMIのオーナーIDは099720109477なので、これを基本に絞り込んでいく。

  • バージョン: 18.04
  • リージョンがap-northeast-1
  • アーキテクチャ: x86_64
$aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
---------------------------------------------------------------------------------------------------------------------------------
| DescribeImages |
+-------------------------------------------------------------------------------------------+------------------------+----------+
| ubuntu-minimal/images/hvm-ssd/ubuntu-bionic-18.04-amd64-minimal-20200430 | ami-0084e4332fdb227c6 | x86_64 |
| ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-20200408 | ami-0a8f568a6a14353b6 | x86_64 |
| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408 | ami-0278fe6949f6b1a06 | x86_64 |
| ubuntu-eks/k8s_1.15/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200406.1 | ami-0fd103c2168938a67 | x86_64 |
| ubuntu-minimal/images/hvm-ssd/ubuntu-bionic-18.04-amd64-minimal-20200406.1 | ami-0c1bb33d8c0bd2145 | x86_64 |
~略~
| ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-20180426.2 | ami-19d33266 | x86_64 |
| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20180426.2 | ami-82c928fd | x86_64 |
| ubuntu-minimal/images-testing/hvm-ssd/ubuntu-bionic-18.04-daily-amd64-minimal-20180328.1 | ami-ddcec5a1 | x86_64 |
| ubuntu-minimal/images-testing/hvm-ssd/ubuntu-bionic-18.04-daily-amd64-minimal-20180329 | ami-54747f28 | x86_64 |
+-------------------------------------------------------------------------------------------+------------------------+----------+

複合条件で以下を条件として、18.04の最新イメージの情報を取得する。

  • Ubuntu 18.04
  • リージョンがap-northeast-1
  • アーキテクチャがx86_64
  • ボリューム: gp2
$aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=ubuntu/images/hvm-ssd/*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
---------------------------------------------------------------------
| DescribeImages |
+-------------------------------------------------------------------+
| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408 |
| ami-0278fe6949f6b1a06 |
| x86_64 |
+-------------------------------------------------------------------+

UbuntuLinuxのボリュームタイプ

  • ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*: 現行SSDボリューム
  • ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-*: インスタンスストア(EBSボリュームのマウントなし)
$ aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=ubuntu/images/hvm-ssd/*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-04-09T16:44:23.000Z",
"ImageId": "ami-0278fe6949f6b1a06",
"ImageLocation": "099720109477/ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408",
"ImageType": "machine",
"Public": true,
"OwnerId": "099720109477",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda1",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-0cb75af02a9254c11",
"VolumeSize": 8,
"VolumeType": "gp2",
"Encrypted": false
}
},
{
"DeviceName": "/dev/sdb",
"VirtualName": "ephemeral0"
},
{
"DeviceName": "/dev/sdc",
"VirtualName": "ephemeral1"
}
],
"Description": "Canonical, Ubuntu, 18.04 LTS, amd64 bionic image build on 2020-04-08",
"EnaSupport": true,
"Hypervisor": "xen",
"Name": "ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408",
"RootDeviceName": "/dev/sda1",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}
$ aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=ubuntu/images/hvm-instance/*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[1]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-03-24T21:03:50.000Z",
"ImageId": "ami-0dc413a5565744b02",
"ImageLocation": "ubuntu-images-ap-northeast-1-release/bionic/20200323/hvm/instance-store/image.img.manifest.xml",
"ImageType": "machine",
"Public": true,
"OwnerId": "099720109477",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [],
"Description": "Canonical, Ubuntu, 18.04 LTS, amd64 bionic image build on 2020-03-23",
"EnaSupport": true,
"Hypervisor": "xen",
"Name": "ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-20200323",
"RootDeviceType": "instance-store",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}


Scrapyの公式チュートリアル

 

Scrapy公式チュートリアル

We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.
This tutorial will walk you through these tasks:

  1. Creating a new Scrapy project
  2. Writing a spider to crawl a site and extract data
  3. Exporting the scraped data using the command line
  4. Changing spider to recursively follow links
  5. Using spider arguments

他にも良質なコンテンツへのリンクがある

Installation

チュートリアルの前にScrapyをインストールする。
依存するパッケージがあるので、Installation guideに従いインストールする。

Ubuntu環境でテストするので、追加パッケージをインストール。

apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

Scrapyはpipでインストール。

pip install scrapy

パッケージが不足した状態でインストールするとエラーになる。

ERROR: Command errored out with exit status 1: /usr/local/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-4zradyeg/Twisted/setup.py'"'"'; __file__='"'"'/tmp/pip-install-4zradyeg/Twisted/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-r8m1686g/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.8/Twisted Check the logs for full command output.

Creating a project

scrapy startprojectでプロジェクトを作成

$ scrapy startproject scrapy_tutorial_quotes
New Scrapy project 'scrapy_tutorial_quotes', using template directory '/usr/local/lib/python3.8/site-packages/scrapy/templates/project', created in:
/work/scrapy/scrapy_tutorial_quotes

You can start your first spider with:
cd scrapy_tutorial_quotes
scrapy genspider example example.com

以下のディレクトリ構成で作成される。

.
├── scrapy.cfg
└── scrapy_tutorial_quotes
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

Our first Spider

チュートリアルのコードに従い作成。

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

nameがSpiderの一意な識別子。

start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

start_requests()はクロールを開始するRequestを返す(ジェネレーターでyieldするか、リストで返す)。後続のリクエストはこれらの初期リクエストから順次生成される。
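
なお、公式チュートリアルにあるショートカットとして、start_requests()を実装せずにstart_urlsクラス属性を定義するだけでもよい(各URLに対してparse()がデフォルトのコールバックとして呼ばれる)。

import scrapy

# start_urlsを使ったショートカット版(上のスパイダーと同等の巡回)
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)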

parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

parse()にはリクエストに対するresponseをスクレイピングする処理を記述する。通常はresponseを解析してデータをdictとして抽出したり、たどるべきURLから新しいRequestを生成したりする。
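
引用にある「dictとしてデータを抽出し、たどるURLから新しいRequestを作る」処理のイメージは以下のとおり(チュートリアル後半で扱われる内容の先取りスケッチで、上のスパイダーのparse()を置き換える想定)。

    def parse(self, response):
        # 各quote要素をdictとして抽出する
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        # 次ページへのリンクをたどって新しいRequestを生成する
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)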

How to run our spider

以下のコマンドで実行する。

scrapy crawl quotes

実行すると以下のログが出力され、quotes-1.htmlとquotes-2.htmlが生成される。

$ scrapy crawl quotes
2020-05-03 01:38:15 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_tutorial_quotes)
2020-05-03 01:38:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-03 01:38:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-03 01:38:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_tutorial_quotes',
'EDITOR': '/usr/bin/vim',
'NEWSPIDER_MODULE': 'scrapy_tutorial_quotes.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_tutorial_quotes.spiders']}
2020-05-03 01:38:15 [scrapy.extensions.telnet] INFO: Telnet Password: afdf50795ed4260d
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-03 01:38:15 [scrapy.core.engine] INFO: Spider opened
2020-05-03 01:38:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-03 01:38:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-03 01:38:16 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-03 01:38:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-05-03 01:38:16 [quotes] DEBUG: Saved file quotes-1.html
2020-05-03 01:38:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2020-05-03 01:38:17 [quotes] DEBUG: Saved file quotes-2.html
2020-05-03 01:38:17 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-03 01:38:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 681,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6003,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 1.982914,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 2, 16, 38, 17, 288861),
'log_count/DEBUG': 5,
'log_count/INFO': 10,
'memusage/max': 55898112,
'memusage/startup': 55898112,
'response_received_count': 3,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 5, 2, 16, 38, 15, 305947)}
2020-05-03 01:38:17 [scrapy.core.engine] INFO: Spider closed (finished)

A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:

start_urlsというリストを設定すればデフォルトのstart_requests()を使える。

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.

parse()がデフォルトのコールバックメソッド。
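
動作を確認するための最小のスケッチ(チュートリアルのコードではなく、callbackを省略した場合の挙動を示す想定の例。Spider名は説明用の仮のもの)。

import scrapy

class DefaultCallbackSpider(scrapy.Spider):
    # 説明用の仮のSpider
    name = "default_callback_example"

    def start_requests(self):
        # callbackを指定しないRequestは、デフォルトでself.parseに渡される
        yield scrapy.Request('http://quotes.toscrape.com/page/1/')
        # 明示的にcallback=self.parseを指定した場合と同じ動作
        yield scrapy.Request('http://quotes.toscrape.com/page/2/', callback=self.parse)

    def parse(self, response):
        # デフォルトコールバックとして呼ばれる
        self.log('parsed %s' % response.url)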

Extracting data

Scrapy shellを使ってデータ構造を解析する。

$ scrapy shell 'http://quotes.toscrape.com/page/1/'
2020-05-03 01:51:11 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_tutorial_quotes)
2020-05-03 01:51:11 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-03 01:51:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-03 01:51:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_tutorial_quotes',
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'EDITOR': '/usr/bin/vim',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'scrapy_tutorial_quotes.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_tutorial_quotes.spiders']}
2020-05-03 01:51:11 [scrapy.extensions.telnet] INFO: Telnet Password: 2c0c7af38c3cc618
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-03 01:51:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-03 01:51:11 [scrapy.core.engine] INFO: Spider opened
2020-05-03 01:51:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-03 01:51:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f11ff2c60a0>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7f11ff2c3be0>
[s] spider <DefaultSpider 'default' at 0x7f11ff0c7700>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser

CSSやXPathを使ってデータを抽出できる。

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'

Extracting quotes and authors

Scrapy shellを使って対象データを解析していく。

$ scrapy shell 'http://quotes.toscrape.com'
…略…
>>> response.css("div.quote")
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>]
>>> quote = response.css("div.quote")[0]
>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']

Extracting data in our spider

Scrapy shellを使って解析した結果を元にparse()をコーディングしていく。

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

実行結果。

$ scrapy crawl quotes2
2020-05-03 02:05:13 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_tutorial_quotes)
2020-05-03 02:05:13 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-03 02:05:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-03 02:05:13 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_tutorial_quotes',
'EDITOR': '/usr/bin/vim',
'NEWSPIDER_MODULE': 'scrapy_tutorial_quotes.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_tutorial_quotes.spiders']}
2020-05-03 02:05:13 [scrapy.extensions.telnet] INFO: Telnet Password: 6bee2d1ba39b9e9c
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-03 02:05:13 [scrapy.core.engine] INFO: Spider opened
2020-05-03 02:05:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-03 02:05:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-03 02:05:14 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-03 02:05:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
2020-05-03 02:05:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”", 'author': 'Bob Marley', 'tags': ['love']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'tags': ['fantasy']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams', 'tags': ['life', 'navigation']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel', 'tags': ['activism', 'apathy', 'hate', 'indifference', 'inspirational', 'love', 'opposite', 'philosophy']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche', 'tags': ['friendship', 'lack-of-friendship', 'lack-of-love', 'love', 'marriage', 'unhappy-marriage']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain', 'tags': ['books', 'contentment', 'friends', 'friendship', 'life']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders', 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans']}
2020-05-03 02:05:15 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-03 02:05:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 681,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6003,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 1.762026,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 2, 17, 5, 15, 270853),
'item_scraped_count': 20,
'log_count/DEBUG': 23,
'log_count/INFO': 10,
'memusage/max': 55705600,
'memusage/startup': 55705600,
'response_received_count': 3,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 5, 2, 17, 5, 13, 508827)}
2020-05-03 02:05:15 [scrapy.core.engine] INFO: Spider closed (finished)

Storing the scraped data

スクレイピングの結果をJSON形式でファイルに保存する。

scrapy crawl quotes2 -o quotes2.json

別の形式としてJSON Lines形式(.jl)が使える。

scrapy crawl quotes2 -o quotes2.jl

JSON形式の実行結果は以下の通り。JSON Linesの場合はリストではなく、1行につき1つのオブジェクト({})が並ぶ形式になる。

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen", "tags": ["aliteracy", "books", "classic", "humor"]},
{"text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe", "tags": ["be-yourself", "inspirational"]},
{"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]},
{"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": ["life", "love"]},
{"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]},
{"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]},
{"text": "\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d", "author": "Marilyn Monroe", "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"]},
{"text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d", "author": "J.K. Rowling", "tags": ["courage", "friends"]},
{"text": "\u201cIf you can't explain it to a six year old, you don't understand it yourself.\u201d", "author": "Albert Einstein", "tags": ["simplicity", "understand"]},
{"text": "\u201cYou may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect\u2014you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break\u2014her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.\u201d", "author": "Bob Marley", "tags": ["love"]},
{"text": "\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d", "author": "Dr. Seuss", "tags": ["fantasy"]},
{"text": "\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d", "author": "Douglas Adams", "tags": ["life", "navigation"]},
{"text": "\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d", "author": "Elie Wiesel", "tags": ["activism", "apathy", "hate", "indifference", "inspirational", "love", "opposite", "philosophy"]},
{"text": "\u201cIt is not a lack of love, but a lack of friendship that makes unhappy marriages.\u201d", "author": "Friedrich Nietzsche", "tags": ["friendship", "lack-of-friendship", "lack-of-love", "love", "marriage", "unhappy-marriage"]},
{"text": "\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d", "author": "Mark Twain", "tags": ["books", "contentment", "friends", "friendship", "life"]},
{"text": "\u201cLife is what happens to us while we are making other plans.\u201d", "author": "Allen Saunders", "tags": ["fate", "life", "misattributed-john-lennon", "planning", "plans"]}
]
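
JSON Lines形式は1行が1つのJSONオブジェクトなので、ファイル全体を一度に読み込まずに1行ずつ処理できる。以下は出力を読み込む最小のスケッチ(上記コマンドで出力したquotes2.jlがある前提の想定例)。

import json

# quotes2.jlは1行ごとに独立したJSONオブジェクト
with open('quotes2.jl', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        print(item['author'], item['tags'])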

Following links / A shortcut for creating Requests

次のページの処理。リンクを抽出して再帰的にクローリングする。

$ scrapy shell 'http://quotes.toscrape.com'
…略…
>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
>>> response.css('li.next a').attrib['href']
'/page/2/'

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes3"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.

urljoin()で相対パスからURLを生成しているが、これはfollow()で省略できる。

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes3"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

文字列の代わりに、CSSセレクターで取得したセレクター(ここではhref属性)をそのままresponse.follow()に渡すこともできる。

for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)

aタグのセレクターを渡すだけで、href属性が自動的に使われる省略記法も可能。

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

さらに、follow_all()ですべてのリンクをたどることができる。

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

そしてワンライナー。

yield from response.follow_all(css='ul.pager a', callback=self.parse)

簡潔にクロールできる。

More examples and patterns

import scrapy

class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.

デフォルトでは、すでに訪問したURLへの重複リクエストはフィルタリングされ、同じページに重複してアクセスしない。この挙動はDUPEFILTER_CLASS設定(settings.py)で変更できる。
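
設定のスケッチ(settings.pyに記述する想定)。scrapy shellの起動ログにも表示されていたBaseDupeFilterを指定すると、重複フィルタリングを実質的に無効化できる。

# settings.py
# デフォルトはscrapy.dupefilters.RFPDupeFilter(訪問済みURLへの重複リクエストを除外する)
# 重複アクセスを許可したい場合はBaseDupeFilterを指定する
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'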

As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of it.

より高機能なCrawlSpiderクラスがある。

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.

典型的なパターンとして、コールバックに追加データを渡すトリックを使い、複数のページから取得した情報で1つのアイテムを生成できる。

Passing additional data to callback functions

cb_kwargsを使ってパラメーターを渡す。

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )

Using spider arguments

コマンドラインからパラメーターを渡すことができる。

scrapy crawl quotes -o quotes-humor.json -a tag=humor

-aで渡したパラメーターはSpiderの属性になり、getattr(self, 'tag', None)で取得できる。

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
