Logging in efficiently with loginform in Scrapy

 

scrapy/loginform

A library that helps with filling in login forms. Install it with pip install loginform.

Preparing the project

$scrapy startproject scrapy_login
New Scrapy project 'scrapy_login', using template directory '/usr/local/lib/python3.8/site-packages/scrapy/templates/project', created in:
/work/scrapy/scrapy_login

You can start your first spider with:
cd scrapy_login
scrapy genspider example example.com
$cd scrapy_login
$scrapy genspider github github.com
Created spider 'github' using template 'basic' in module:
scrapy_login.spiders.github

├── result.json
├── scrapy.cfg
└── scrapy_login
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        ├── __pycache__
        └── github.py

Customizing settings.py

ROBOTSTXT_OBEY

GitHub's robots.txt denies access to crawlers, so robots.txt compliance is temporarily disabled.

# Obey robots.txt rules
#ROBOTSTXT_OBEY = True
ROBOTSTXT_OBEY = False

Customizing items.py

import scrapy


class ScrapyLoginItem(scrapy.Item):
    # Fields collected for each repository
    repository_name = scrapy.Field()
    repository_link = scrapy.Field()

Customizing github.py to implement the spider

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
from loginform import fill_login_form
from scrapy_login.items import ScrapyLoginItem


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']
    start_urls = ["http://github.com/login"]
    # Placeholder credentials; replace with real values
    login_user = "XXXXXXX"
    login_pass = "XXXXXXX"

    def parse(self, response):
        # Let loginform locate the login form and fill in the credentials
        args, url, method = fill_login_form(response.url, response.body, self.login_user, self.login_pass)
        return FormRequest(url, method=method, formdata=args, callback=self.after_login)

    def after_login(self, response):
        # Each matching element is one repository entry on the page shown after login
        for q in response.css("ul.list-style-none li div.width-full"):
            _, repo_name = q.css("span.css-truncate::text").getall()
            github = ScrapyLoginItem()
            github["repository_name"] = repo_name
            github["repository_link"] = q.css("a::attr(href)").get()
            yield github

Running the spider produces output like the following.

[
{"repository_name": "hello-world", "repository_link": "/xxxxxxx/hello-world"},
{"repository_name": "Spoon-Knife", "repository_link": "/octocat/Spoon-Knife"}
]

fill_login_form()

The part to focus on is the call to fill_login_form.
fill_login_form() parses the page and returns the information needed to submit the login form.

$python
Python 3.8.2 (default, Apr 16 2020, 18:36:10)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from loginform import fill_login_form
>>> import requests
>>> url = "https://github.com/login"
>>> r = requests.get(url)
>>> fill_login_form(url, r.text, "john", "secret")
(
[('authenticity_token', 'R+A63AyXCpZLBzIdp6LefjsRxmkhLqsxaUPp+DLru2BlQlyID+B7yXL3FoNgoBgjF3osG3ZSyjBFriX6TsrsFg=='), ('login', 'john'), ('password', 'secret'), ('webauthn-support', 'unknown'), ('webauthn-iuvpaa-support', 'unknown'), ('timestamp', '1588766233339'), ('timestamp_secret', '115d1a1e733276fa256131e12acb6c1974912ba3923dddd3ade33ba6717b3dcd'), ('commit', 'Sign in')],
'https://github.com/session',
'POST')

The first element of the tuple includes authenticity_token, which shows that hidden parameters are sent along with the credentials.
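To make the idea concrete, here is a minimal standard-library sketch of what "fill in the credentials and keep the hidden fields" means. It is not how loginform is implemented: loginform detects the form and the username/password fields on its own, whereas this sketch hard-codes the field names login and password, and the HTML snippet, FormFieldCollector, and fill_form are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect (name, value) pairs from every <input> that has a name."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name is not None:
            # Inputs without a value attribute are treated as empty strings
            self.fields.append((name, attributes.get("value") or ""))


def fill_form(html, username, password):
    """Return form data with credentials filled in and hidden fields preserved."""
    collector = FormFieldCollector()
    collector.feed(html)
    filled = []
    for name, value in collector.fields:
        if name == "login":        # assumed name of the username field
            value = username
        elif name == "password":   # assumed name of the password field
            value = password
        filled.append((name, value))
    return filled


# Hypothetical, heavily simplified login page used only for illustration.
SAMPLE_HTML = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="dummy-token">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="submit" name="commit" value="Sign in">
</form>
"""

form_data = fill_form(SAMPLE_HTML, "john", "secret")
print(form_data)
```

The hidden authenticity_token survives untouched in the result, which is the property that matters when the server validates the token on submission.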


scrapy-splash

Scrapy middleware for Splash. Install it with pip install scrapy-splash.

Preparing the project

$ scrapy startproject scrapy_splash_tutorial
New Scrapy project 'scrapy_splash_tutorial', using template directory '/usr/local/lib/python3.8/site-packages/scrapy/templates/project', created in:
/work/scrapy/scrapy_splash_tutorial

You can start your first spider with:
cd scrapy_splash_tutorial
scrapy genspider example example.com

.
├── scrapy.cfg
└── scrapy_splash_tutorial
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

Customizing settings.py

DOWNLOADER_MIDDLEWARES

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

Order 723 is just before HttpProxyMiddleware (750) in default scrapy settings.

The Splash middlewares must take precedence over HttpProxyMiddleware, so their order values must be below 750.
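The effect of the order values can be checked with a small sketch. It only sorts the numbers from the setting above together with the default 750 for HttpProxyMiddleware mentioned in the quoted note; it does not import Scrapy, and the variable names are illustrative.

```python
# Order values from the DOWNLOADER_MIDDLEWARES setting, plus the default
# order of HttpProxyMiddleware (750) mentioned in the quoted note.
middleware_orders = {
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
    "scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware": 750,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy_splash.SplashCookiesMiddleware": 723,
}

# Downloader middlewares handle an outgoing request in ascending order,
# so sorting by the number gives the sequence a request passes through.
request_chain = sorted(middleware_orders, key=middleware_orders.get)

for name in request_chain:
    print(middleware_orders[name], name)
```

Both Splash entries come out ahead of HttpProxyMiddleware, matching the order in the "Enabled downloader middlewares" log output.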

SPLASH_URL

Set SPLASH_URL to the URL of the Splash instance.

SPLASH_URL = 'http://splash:8050/'

The host name is splash because Splash runs as a docker-compose service with that name.

SPIDER_MIDDLEWARES

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

Enable SplashDeduplicateArgsMiddleware. It avoids sending the same Splash arguments to the Splash server over and over.

DUPEFILTER_CLASS / HTTPCACHE_STORAGE

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

Scrapy offers no way to override request fingerprint calculation globally, so DUPEFILTER_CLASS and HTTPCACHE_STORAGE are set to the Splash-aware classes.

Example spider implementation

import scrapy
from scrapy_splash import SplashRequest


class MySpider(scrapy.Spider):
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        # response.body is a result of render.html call; it
        # contains HTML processed by a browser.
        # …
        pass

  1. Use SplashRequest instead of scrapy.Request so that the page is rendered
  2. args holds the arguments passed on to Splash
  3. endpoint switches from the default endpoint, render.json, to render.html

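Implementing a spider for the quotes JS page based on the example

The test code targets http://quotes.toscrape.com/js/, a page whose content is generated with JavaScript.

The spider for this is created under the name quotesjs.
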
$scrapy genspider quotesjs quotes.toscrape.com
Created spider 'quotesjs' using template 'basic' in module:
scrapy_splash_tutorial.spiders.quotesjs

Checking the content with Chrome DevTools (F12)

[Figures: two Chrome DevTools screenshots]

Analyzing the page with scrapy shell

The shell has to go through Splash, so start it with scrapy shell 'http://splash:8050/render.html?url=http://<target_url>&timeout=10&wait=2'.
The wait=2 parameter matters (pick a number of seconds that suits the target): without it, Splash may return HTML whose rendering has not finished.

$scrapy shell 'http://splash:8050/render.html?url=http://quotes.toscrape.com/js/'
2020-05-06 18:09:33 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_splash_tutorial)
2020-05-06 18:09:33 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-06 18:09:33 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-06 18:09:33 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_splash_tutorial',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'EDITOR': '/usr/bin/vim',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'scrapy_splash_tutorial.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_splash_tutorial.spiders']}
2020-05-06 18:09:33 [scrapy.extensions.telnet] INFO: Telnet Password: 2dd3dc32afe40826
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-06 18:09:33 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-06 18:09:33 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-06 18:09:33 [scrapy.core.engine] INFO: Spider opened
2020-05-06 18:09:33 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://splash:8050/robots.txt> (referer: None)
2020-05-06 18:09:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://splash:8050/render.html?url=http://quotes.toscrape.com/js/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f8aaede0f10>
[s] item {}
[s] request <GET http://splash:8050/render.html?url=http://quotes.toscrape.com/js/>
[s] response <200 http://splash:8050/render.html?url=http://quotes.toscrape.com/js/>
[s] settings <scrapy.settings.Settings object at 0x7f8aaede0b20>
[s] spider <DefaultSpider 'default' at 0x7f8aaeb9a9a0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> response.css('.container .quote').get()
'<div class="quote"><span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">change</a> <a class="tag">deep-thoughts</a> <a class="tag">thinking</a> <a class="tag">world</a></div></div>'
>>> response.css('.container .quote').getall()
['<div class="quote"><span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">change</a> <a class="tag">deep-thoughts</a> <a class="tag">thinking</a> <a class="tag">world</a></div></div>', '<div class="quote"><span class="text">“It is our choices, Harry, that show what we truly are, far more than our abilities.”</span><span>by <small class="author">J.K. Rowling</small></span><div class="tags">Tags: <a class="tag">abilities</a> <a class="tag">choices</a></div></div>', '<div class="quote"><span class="text">“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">inspirational</a> <a class="tag">life</a> <a class="tag">live</a> <a class="tag">miracle</a> <a class="tag">miracles</a></div></div>', '<div class="quote"><span class="text">“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”</span><span>by <small class="author">Jane Austen</small></span><div class="tags">Tags: <a class="tag">aliteracy</a> <a class="tag">books</a> <a class="tag">classic</a> <a class="tag">humor</a></div></div>', '<div class="quote"><span class="text">“Imperfection is beauty, madness is genius and it\'s better to be absolutely ridiculous than absolutely boring.”</span><span>by <small class="author">Marilyn Monroe</small></span><div class="tags">Tags: <a class="tag">be-yourself</a> <a class="tag">inspirational</a></div></div>', '<div class="quote"><span class="text">“Try not to become a man of success. 
Rather become a man of value.”</span><span>by <small class="author">Albert Einstein</small></span><div class="tags">Tags: <a class="tag">adulthood</a> <a class="tag">success</a> <a class="tag">value</a></div></div>', '<div class="quote"><span class="text">“It is better to be hated for what you are than to be loved for what you are not.”</span><span>by <small class="author">André Gide</small></span><div class="tags">Tags: <a class="tag">life</a> <a class="tag">love</a></div></div>', '<div class="quote"><span class="text">“I have not failed. I\'ve just found 10,000 ways that won\'t work.”</span><span>by <small class="author">Thomas A. Edison</small></span><div class="tags">Tags: <a class="tag">edison</a> <a class="tag">failure</a> <a class="tag">inspirational</a> <a class="tag">paraphrased</a></div></div>', '<div class="quote"><span class="text">“A woman is like a tea bag; you never know how strong it is until it\'s in hot water.”</span><span>by <small class="author">Eleanor Roosevelt</small></span><div class="tags">Tags: <a class="tag">misattributed-eleanor-roosevelt</a></div></div>', '<div class="quote"><span class="text">“A day without sunshine is like, you know, night.”</span><span>by <small class="author">Steve Martin</small></span><div class="tags">Tags: <a class="tag">humor</a> <a class="tag">obvious</a> <a class="tag">simile</a></div></div>']
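
The render.html URL passed to scrapy shell can also be assembled programmatically. The sketch below uses only the standard library; splash_render_url is a hypothetical helper name, and the host splash:8050 and the wait/timeout defaults are taken from the command described earlier. Percent-encoding the target URL keeps any query string of the target from being mistaken for Splash parameters.

```python
from urllib.parse import urlencode


def splash_render_url(target_url, wait=2, timeout=10, splash_base="http://splash:8050"):
    """Build a render.html URL that asks Splash to render target_url."""
    # urlencode percent-encodes the target URL, including ':' and '/'
    query = urlencode({"url": target_url, "timeout": timeout, "wait": wait})
    return f"{splash_base}/render.html?{query}"


shell_url = splash_render_url("http://quotes.toscrape.com/js/")
print(shell_url)
```

The printed URL can be passed, quoted, as the argument to scrapy shell.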

Customizing items.py

import scrapy


class QuoteItem(scrapy.Item):
    quote = scrapy.Field()
    author = scrapy.Field()

Customizing quotesjs.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from scrapy_splash_tutorial.items import QuoteItem


class QuotesjsSpider(scrapy.Spider):
    name = 'quotesjs'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/js/']

    def start_requests(self):
        # Fetch each start URL through Splash so that JavaScript is executed
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,
                endpoint='render.html',
                args={'wait': 0.5},
            )

    def parse(self, response):
        # The response body is the rendered HTML, so CSS selectors work as usual
        for q in response.css(".container .quote"):
            quote = QuoteItem()
            quote["author"] = q.css(".author::text").get()
            quote["quote"] = q.css(".text::text").get()
            yield quote

Running the crawler

Run the crawler with scrapy crawl quotesjs -o result.json.

$scrapy crawl quotesjs -o result.json
2020-05-06 18:34:02 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_splash_tutorial)
2020-05-06 18:34:02 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-06 18:34:02 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-06 18:34:02 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_splash_tutorial',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'EDITOR': '/usr/bin/vim',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'scrapy_splash_tutorial.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_splash_tutorial.spiders']}
2020-05-06 18:34:02 [scrapy.extensions.telnet] INFO: Telnet Password: febe521f79cff551
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy_splash.SplashDeduplicateArgsMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-06 18:34:02 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-06 18:34:02 [scrapy.core.engine] INFO: Spider opened
2020-05-06 18:34:02 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-06 18:34:02 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-06 18:34:02 [py.warnings] WARNING: /usr/local/lib/python3.8/site-packages/scrapy_splash/request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
url = to_native_str(url)

2020-05-06 18:34:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-06 18:34:03 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://splash:8050/robots.txt> (referer: None)
2020-05-06 18:34:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/js/ via http://splash:8050/render.html> (referer: None)
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Albert Einstein',
'quote': '“The world as we have created it is a process of our thinking. It '
'cannot be changed without changing our thinking.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'J.K. Rowling',
'quote': '“It is our choices, Harry, that show what we truly are, far more '
'than our abilities.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Albert Einstein',
'quote': '“There are only two ways to live your life. One is as though '
'nothing is a miracle. The other is as though everything is a '
'miracle.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Jane Austen',
'quote': '“The person, be it gentleman or lady, who has not pleasure in a '
'good novel, must be intolerably stupid.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Marilyn Monroe',
'quote': "“Imperfection is beauty, madness is genius and it's better to be "
'absolutely ridiculous than absolutely boring.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Albert Einstein',
'quote': '“Try not to become a man of success. Rather become a man of value.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'André Gide',
'quote': '“It is better to be hated for what you are than to be loved for '
'what you are not.”'}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Thomas A. Edison',
'quote': "“I have not failed. I've just found 10,000 ways that won't work.”"}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Eleanor Roosevelt',
'quote': '“A woman is like a tea bag; you never know how strong it is until '
"it's in hot water.”"}
2020-05-06 18:34:04 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/js/>
{'author': 'Steve Martin',
'quote': '“A day without sunshine is like, you know, night.”'}
2020-05-06 18:34:04 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-06 18:34:04 [scrapy.extensions.feedexport] INFO: Stored json feed (10 items) in: result.json
2020-05-06 18:34:04 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 960,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 2,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 9757,
'downloader/response_count': 3,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 2,
'elapsed_time_seconds': 2.285135,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 6, 9, 34, 4, 575789),
'item_scraped_count': 10,
'log_count/DEBUG': 13,
'log_count/INFO': 11,
'log_count/WARNING': 1,
'memusage/max': 56578048,
'memusage/startup': 56578048,
'response_received_count': 3,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/404': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'splash/render.html/request_count': 1,
'splash/render.html/response_count/200': 1,
'start_time': datetime.datetime(2020, 5, 6, 9, 34, 2, 290654)}
2020-05-06 18:34:04 [scrapy.core.engine] INFO: Spider closed (finished)

The generated result.json is shown below.

[
{"author": "Albert Einstein", "quote": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d"},
{"author": "J.K. Rowling", "quote": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d"},
{"author": "Albert Einstein", "quote": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d"},
{"author": "Jane Austen", "quote": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d"},
{"author": "Marilyn Monroe", "quote": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d"},
{"author": "Albert Einstein", "quote": "\u201cTry not to become a man of success. Rather become a man of value.\u201d"},
{"author": "Andr\u00e9 Gide", "quote": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"},
{"author": "Thomas A. Edison", "quote": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d"},
{"author": "Eleanor Roosevelt", "quote": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d"},
{"author": "Steve Martin", "quote": "\u201cA day without sunshine is like, you know, night.\u201d"}
]
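
The \u201c, \u201d and \u00e9 sequences in result.json are ordinary JSON Unicode escapes, not corrupted text. A short standard-library sketch, using one record copied from the file above, shows that they decode back to the original characters:

```python
import json

# One record copied from result.json; the raw string keeps the \u escapes intact.
line = r'{"author": "Andr\u00e9 Gide", "quote": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d"}'

item = json.loads(line)
print(item["author"])
print(item["quote"])

# Re-serialising with ensure_ascii=False writes the characters unescaped.
readable = json.dumps(item, ensure_ascii=False)
print(readable)
```

If unescaped output is preferred directly from the crawl, Scrapy's FEED_EXPORT_ENCODING setting (for example 'utf-8') is the documented place to look; that setting is not used in the project above.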


Scrapy official tutorial

Installation

The documented command is docker run -it -p 8050:8050 --rm scrapinghub/splash, but here Splash is managed with docker-compose.

It is defined in docker-compose.yml.

splash:
  image: scrapinghub/splash
  ports:
    - 8050:8050

Test run

$ docker-compose run splash
Pulling splash (scrapinghub/splash:)...
latest: Pulling from scrapinghub/splash
2746a4a261c9: Pull complete
4c1d20cdee96: Pull complete
~snip~
50ea6de52777: Pull complete
43e94179bda5: Pull complete
Digest: sha256:01c89e3b0598e904fea184680b82ffe74524e83160f793884dc88d184056c49d
Status: Downloaded newer image for scrapinghub/splash:latest
2020-05-06 04:13:03+0000 [-] Log opened.
2020-05-06 04:13:03.106078 [-] Xvfb is started: ['Xvfb', ':2112596484', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2020-05-06 04:13:03.184966 [-] Splash version: 3.4.1
2020-05-06 04:13:03.217438 [-] Qt 5.13.1, PyQt 5.13.1, WebKit 602.1, Chromium 73.0.3683.105, sip 4.19.19, Twisted 19.7.0, Lua 5.2
2020-05-06 04:13:03.217581 [-] Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
2020-05-06 04:13:03.217654 [-] Open files limit: 1048576
2020-05-06 04:13:03.217695 [-] Can't bump open files limit
2020-05-06 04:13:03.231322 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2020-05-06 04:13:03.231620 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2020-05-06 04:13:03.343525 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2020-05-06 04:13:03.343858 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2020-05-06 04:13:03.344260 [-] Site starting on 8050
2020-05-06 04:13:03.344470 [-] Starting factory <twisted.web.server.Site object at 0x7f23c5cb6160>
2020-05-06 04:13:03.344768 [-] Server listening on http://0.0.0.0:8050

For regular use, start it with docker-compose up -d.

Splash WebUI

Opening the running Splash instance in a browser gives access to a web UI.

[Figure: Splash WebUI]

Click Render me! with the script that is shown by default.

[Figure: Splash WebUI]

Intro

Splash can execute custom rendering scripts written in the Lua programming language. This allows us to use Splash as a browser automation tool similar to PhantomJS.
In short, it is a PhantomJS-like tool that runs custom rendering scripts written in Lua.
Lua is used for custom scripting in Redis, Nginx, Apache, World of Warcraft, and others.

The tutorial introduces the following example.

function main(splash, args)
  splash:go("http://example.com")
  splash:wait(0.5)
  local title = splash:evaljs("document.title")
  return {title=title}
end

Clicking Render me! in the web UI returns the JSON object produced by the return statement.

[Figures: two Splash WebUI screenshots]

Entry Point: the “main” Function

function main(splash)
  return {hello="world!"}
end

Running it in the Splash web UI gives the following result.

Splash Response: Object
hello: "world!"

A plain string can be returned instead of a JSON object.

function main(splash)
  return 'hello'
end

Because the docker-compose service is named splash, the host name splash is used.

$ curl 'http://splash:8050/execute?lua_source=function+main%28splash%29%0D%0A++return+%27hello%27%0D%0Aend'
hello
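
The lua_source parameter in the curl command is simply the Lua script in URL-encoded form. A standard-library sketch that decodes the query string makes this visible; the URL is copied from the command above.

```python
from urllib.parse import parse_qs, urlsplit

# URL copied from the curl command above.
url = ("http://splash:8050/execute?lua_source="
       "function+main%28splash%29%0D%0A++return+%27hello%27%0D%0Aend")

# parse_qs turns '+' into spaces and resolves the %XX escapes.
lua_source = parse_qs(urlsplit(url).query)["lua_source"][0]
print(lua_source)
```

The decoded text is the three-line main function shown earlier, with CRLF line endings.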

Where Are My Callbacks?

It is not doing exactly the same work - instead of saving screenshots to files we’re returning PNG data to the client via HTTP API.
The example below captures screenshots as PNG and returns them through the HTTP API.

function main(splash, args)
  splash:set_viewport_size(800, 600)
  splash:set_user_agent('Splash bot')
  local example_urls = {"www.google.com", "www.bbc.co.uk", "scrapinghub.com"}
  local urls = args.urls or example_urls
  local results = {}
  for _, url in ipairs(urls) do
    local ok, reason = splash:go("http://" .. url)
    if ok then
      splash:wait(0.2)
      results[url] = splash:png()
    end
  end
  return results
end

Clicking Render me! in the web UI displays a screenshot of each site.

[Figure: Splash WebUI]

Calling Splash Methods

There are two main ways to call Lua methods in Splash scripts: using positional and named arguments. To call a method using positional arguments use parentheses splash:foo(val1, val2), to call it with named arguments use curly braces: splash:foo{name1=val1, name2=val2}:

In short, positional arguments use parentheses, splash:foo(val1, val2), and named arguments use curly braces, splash:foo{name1=val1, name2=val2}.

function main(splash, args)
  -- Examples of positional arguments:
  splash:go("http://example.com")
  splash:wait(0.5, false)
  local title = splash:evaljs("document.title")

  -- The same using keyword arguments:
  splash:go{url="http://google.com"}
  splash:wait{time=0.5, cancel_on_redirect=false}
  local title = splash:evaljs{snippet="document.title"}

  -- Mixed arguments example:
  splash:wait{0.5, cancel_on_redirect=false}

  return title
end

The script itself does nothing useful, but note that the tutorial's original code uses evaljs{source="document.title"}, which does not work.
The splash:evaljs reference shows that the parameter name is snippet.

Error Handling

Splash uses the following convention:

  1. for developer errors (e.g. incorrect function arguments) exception is raised;
  2. for errors outside developer control (e.g. a non-responding remote website) status flag is returned: functions that can fail return ok, reason pairs which developer can either handle or ignore.
    If main results in an unhandled exception then Splash returns HTTP 400 response with an error message.

Splash follows these rules:

  1. Developer errors raise an exception
  2. Errors outside the developer's control are returned as a status

An exception can be raised explicitly with error().

function main(splash, args)
  local ok, msg = splash:go("http://no-url.example.com")
  if not ok then
    -- handle error somehow, e.g.
    error(msg)
  end
end

When the exception is not handled, Splash returns it as an HTTP 400 error response.

{
  "error": 400,
  "type": "ScriptError",
  "description": "Error happened while executing Lua script",
  "info": {
    "source": "[string \"function main(splash, args)\r...\"]",
    "line_number": 5,
    "error": "network3",
    "type": "LUA_ERROR",
    "message": "Lua error: [string \"function main(splash, args)\r...\"]:5: network3"
  }
}
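
On the client side, the useful parts of such an error body are the top-level type and the info.line_number and info.error fields. The following sketch parses the response shown above with the standard library; summarize_splash_error is a hypothetical helper name, and only the fields visible in this one sample are assumed to exist.

```python
import json

# Error body copied from the response above; the raw string keeps the JSON escapes.
body = r'''
{
  "error": 400,
  "type": "ScriptError",
  "description": "Error happened while executing Lua script",
  "info": {
    "source": "[string \"function main(splash, args)\r...\"]",
    "line_number": 5,
    "error": "network3",
    "type": "LUA_ERROR",
    "message": "Lua error: [string \"function main(splash, args)\r...\"]:5: network3"
  }
}
'''


def summarize_splash_error(raw):
    """Condense a Splash error body into a one-line summary."""
    payload = json.loads(raw)
    info = payload.get("info", {})
    return f'{payload["type"]} at line {info.get("line_number")}: {info.get("error")}'


summary = summarize_splash_error(body)
print(summary)
```

Line 5 is the error(msg) call in the Lua script, and network3 is the reason string that splash:go returned.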

The same code can be written with assert().

function main(splash, args)
  -- a shortcut for the code above: use assert
  assert(splash:go("http://no-url.example.com"))
end

Sandbox

By default Splash scripts are executed in a restricted environment: not all standard Lua modules and functions are available, Lua require is restricted, and there are resource limits (quite loose though).

By default, Splash scripts run in a sandbox. To disable it, use the --disable-lua-sandbox option.

With the plain docker command:

`docker run -it -p 8050:8050 scrapinghub/splash --disable-lua-sandbox`

With docker-compose, pass the option through command.

splash:
  image: scrapinghub/splash
  command: --disable-lua-sandbox
  ports:
    - 8050:8050

Running docker-compose run as a test confirms Lua: enabled (sandbox: disabled).

PS C:\Users\g\OneDrive\devel\gggcat@github\python3-tutorial> docker-compose run splash
2020-05-06 06:02:02+0000 [-] Log opened.
2020-05-06 06:02:02.166203 [-] Xvfb is started: ['Xvfb', ':1094237403', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2020-05-06 06:02:02.242322 [-] Splash version: 3.4.1
2020-05-06 06:02:02.275180 [-] Qt 5.13.1, PyQt 5.13.1, WebKit 602.1, Chromium 73.0.3683.105, sip 4.19.19, Twisted 19.7.0, Lua 5.2
2020-05-06 06:02:02.275346 [-] Python 3.6.9 (default, Nov 7 2019, 10:44:02) [GCC 8.3.0]
2020-05-06 06:02:02.275497 [-] Open files limit: 1048576
2020-05-06 06:02:02.275605 [-] Can't bump open files limit
2020-05-06 06:02:02.289473 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2020-05-06 06:02:02.289650 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2020-05-06 06:02:02.398489 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2020-05-06 06:02:02.398754 [-] Web UI: enabled, Lua: enabled (sandbox: disabled), Webkit: enabled, Chromium: enabled
2020-05-06 06:02:02.399073 [-] Site starting on 8050
2020-05-06 06:02:02.399156 [-] Starting factory <twisted.web.server.Site object at 0x7f02ac5b61d0>
2020-05-06 06:02:02.399344 [-] Server listening on http://0.0.0.0:8050

Timeouts

By default Splash aborts script execution after a timeout (30s by default); it is a common problem for long scripts.

The default timeout is 30 seconds.


AmazonLinux

Listing Amazon Linux images

  • Amazon Linux 2
  • Region: ap-northeast-1
  • Architecture: x86_64

$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
-------------------------------------------------------------------------------------------------------------
| DescribeImages |
+-----------------------------------------------------------------------+------------------------+----------+
| amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2 | ami-0f310fced6141e627 | x86_64 |
| amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs | ami-06aa6ba9dc39dc071 | x86_64 |
| amzn2-ami-hvm-2.0.20200304.0-x86_64-gp2 | ami-052652af12b58691f | x86_64 |
| amzn2-ami-hvm-2.0.20200304.0-x86_64-ebs | ami-0c6f9336767cd9243 | x86_64 |
~ omitted ~
| amzn2-ami-hvm-2017.12.0.20180109-x86_64-gp2 | ami-6be57d0d | x86_64 |
| amzn2-ami-hvm-2017.12.0.20180109-x86_64-ebs | ami-39e37b5f | x86_64 |
| amzn2-ami-hvm-2017.12.0.20171212.2-x86_64-gp2 | ami-2a34b64c | x86_64 |
| amzn2-ami-hvm-2017.12.0.20171212.2-x86_64-ebs | ami-1d37b57b | x86_64 |
+-----------------------------------------------------------------------+------------------------+----------+

Getting the latest Amazon Linux image

  • Version: 2
  • Region: ap-northeast-1
  • Architecture: x86_64
  • Volume: gp2

Images differ by volume type, so the search below is restricted to gp2 (the current general-purpose SSD) volumes.

$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*-gp2" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
---------------------------------------------
| DescribeImages |
+-------------------------------------------+
| amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2 |
| ami-0f310fced6141e627 |
| x86_64 |
+-------------------------------------------+
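The --query expression reverse(sort_by(Images, &CreationDate)) orders the images newest first, so index [0] selects the newest image and [1] the second newest. A rough Python equivalent, using only the two CreationDate values that appear in the JSON output in this post, might look like the sketch below. The function name newest_first is hypothetical, and the real sorting is done by the AWS CLI's JMESPath engine, not by this code.

```python
# Records abridged from the describe-images output shown in this post.
images = [
    {"Name": "amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs",
     "ImageId": "ami-06aa6ba9dc39dc071",
     "CreationDate": "2020-04-07T17:14:50.000Z"},
    {"Name": "amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2",
     "ImageId": "ami-0f310fced6141e627",
     "CreationDate": "2020-04-07T17:30:34.000Z"},
]


def newest_first(images):
    """Mirror reverse(sort_by(Images, &CreationDate)) in plain Python."""
    # ISO 8601 timestamps in the same format sort chronologically as strings.
    ascending = sorted(images, key=lambda image: image["CreationDate"])
    return list(reversed(ascending))


ordered = newest_first(images)
print(ordered[0]["Name"])  # newest image
print(ordered[1]["Name"])  # second newest image
```

Without a name filter that excludes the -ebs images, the gp2 and ebs builds of the same release sit next to each other in this ordering, which is why the index matters.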

Amazon Linux volume types

amzn2-ami-hvm-*-x86_64-ebs images use VolumeType: standard, a previous-generation volume type.

$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*-gp2" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-04-07T17:30:34.000Z",
"ImageId": "ami-0f310fced6141e627",
"ImageLocation": "amazon/amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2",
"ImageType": "machine",
"Public": true,
"OwnerId": "137112412989",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/xvda",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-06688593da98411ef",
"VolumeSize": 8,
"VolumeType": "gp2",
"Encrypted": false
}
}
],
"Description": "Amazon Linux 2 AMI 2.0.20200406.0 x86_64 HVM gp2",
"EnaSupport": true,
"Hypervisor": "xen",
"ImageOwnerAlias": "amazon",
"Name": "amzn2-ami-hvm-2.0.20200406.0-x86_64-gp2",
"RootDeviceName": "/dev/xvda",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}
$ aws ec2 describe-images --owners amazon --filters "Name=name,Values=amzn2-ami-hvm-*-ebs" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-04-07T17:14:50.000Z",
"ImageId": "ami-06aa6ba9dc39dc071",
"ImageLocation": "amazon/amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs",
"ImageType": "machine",
"Public": true,
"OwnerId": "137112412989",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/xvda",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-06688593da98411ef",
"VolumeSize": 8,
"VolumeType": "standard",
"Encrypted": false
}
}
],
"Description": "Amazon Linux 2 AMI 2.0.20200406.0 x86_64 HVM ebs",
"EnaSupport": true,
"Hypervisor": "xen",
"ImageOwnerAlias": "amazon",
"Name": "amzn2-ami-hvm-2.0.20200406.0-x86_64-ebs",
"RootDeviceName": "/dev/xvda",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}

Ubuntu Linux

Latest AMI for a given Ubuntu Linux version

The official Ubuntu images are published under owner ID 099720109477 (Canonical), so the search starts from that owner and narrows down from there.

  • Version: 18.04
  • Region: ap-northeast-1
  • Architecture: x86_64

$aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
---------------------------------------------------------------------------------------------------------------------------------
| DescribeImages |
+-------------------------------------------------------------------------------------------+------------------------+----------+
| ubuntu-minimal/images/hvm-ssd/ubuntu-bionic-18.04-amd64-minimal-20200430 | ami-0084e4332fdb227c6 | x86_64 |
| ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-20200408 | ami-0a8f568a6a14353b6 | x86_64 |
| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408 | ami-0278fe6949f6b1a06 | x86_64 |
| ubuntu-eks/k8s_1.15/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200406.1 | ami-0fd103c2168938a67 | x86_64 |
| ubuntu-minimal/images/hvm-ssd/ubuntu-bionic-18.04-amd64-minimal-20200406.1 | ami-0c1bb33d8c0bd2145 | x86_64 |
~ omitted ~
| ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-20180426.2 | ami-19d33266 | x86_64 |
| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20180426.2 | ami-82c928fd | x86_64 |
| ubuntu-minimal/images-testing/hvm-ssd/ubuntu-bionic-18.04-daily-amd64-minimal-20180328.1 | ami-ddcec5a1 | x86_64 |
| ubuntu-minimal/images-testing/hvm-ssd/ubuntu-bionic-18.04-daily-amd64-minimal-20180329 | ami-54747f28 | x86_64 |
+-------------------------------------------------------------------------------------------+------------------------+----------+

Combining the following conditions retrieves the latest 18.04 image.

  • Ubuntu 18.04
  • Region: ap-northeast-1
  • Architecture: x86_64
  • Volume: gp2

$aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=ubuntu/images/hvm-ssd/*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0].[Name,ImageId,Architecture]' --output table --region ap-northeast-1
---------------------------------------------------------------------
| DescribeImages |
+-------------------------------------------------------------------+
| ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408 |
| ami-0278fe6949f6b1a06 |
| x86_64 |
+-------------------------------------------------------------------+
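The narrowing from "*18.04*" to "ubuntu/images/hvm-ssd/*18.04*" can be tried offline. The sketch below approximates the name filter's * wildcard with Python's fnmatch module, applied to four names copied from the listing above. This is an assumption for illustration: fnmatch also gives special meaning to [ and ], so it is not an exact model of the EC2 filter, and match_names is a hypothetical helper.

```python
from fnmatch import fnmatchcase

# Image names copied from the describe-images listing above.
NAMES = [
    "ubuntu-minimal/images/hvm-ssd/ubuntu-bionic-18.04-amd64-minimal-20200430",
    "ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-20200408",
    "ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408",
    "ubuntu-eks/k8s_1.15/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200406.1",
]


def match_names(names, pattern):
    """Return the names matching a *-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(len(match_names(NAMES, "*18.04*")))                    # all four names
print(match_names(NAMES, "ubuntu/images/hvm-ssd/*18.04*"))   # only the hvm-ssd server image
```

Anchoring the pattern at ubuntu/images/hvm-ssd/ is what drops the minimal, EKS, and instance-store variants.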

Ubuntu Linux volume types

  • ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-*: current-generation SSD (gp2) EBS volume
  • ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-*: instance store, no EBS volume attached

$ aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=ubuntu/images/hvm-ssd/*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[0]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-04-09T16:44:23.000Z",
"ImageId": "ami-0278fe6949f6b1a06",
"ImageLocation": "099720109477/ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408",
"ImageType": "machine",
"Public": true,
"OwnerId": "099720109477",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [
{
"DeviceName": "/dev/sda1",
"Ebs": {
"DeleteOnTermination": true,
"SnapshotId": "snap-0cb75af02a9254c11",
"VolumeSize": 8,
"VolumeType": "gp2",
"Encrypted": false
}
},
{
"DeviceName": "/dev/sdb",
"VirtualName": "ephemeral0"
},
{
"DeviceName": "/dev/sdc",
"VirtualName": "ephemeral1"
}
],
"Description": "Canonical, Ubuntu, 18.04 LTS, amd64 bionic image build on 2020-04-08",
"EnaSupport": true,
"Hypervisor": "xen",
"Name": "ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20200408",
"RootDeviceName": "/dev/sda1",
"RootDeviceType": "ebs",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}
$ aws ec2 describe-images --owners 099720109477 --filters "Name=name,Values=ubuntu/images/hvm-instance/*18.04*" "Name=architecture,Values=x86_64" --query 'reverse(sort_by(Images, &CreationDate))[1]' --region ap-northeast-1
{
"Architecture": "x86_64",
"CreationDate": "2020-03-24T21:03:50.000Z",
"ImageId": "ami-0dc413a5565744b02",
"ImageLocation": "ubuntu-images-ap-northeast-1-release/bionic/20200323/hvm/instance-store/image.img.manifest.xml",
"ImageType": "machine",
"Public": true,
"OwnerId": "099720109477",
"PlatformDetails": "Linux/UNIX",
"UsageOperation": "RunInstances",
"State": "available",
"BlockDeviceMappings": [],
"Description": "Canonical, Ubuntu, 18.04 LTS, amd64 bionic image build on 2020-03-23",
"EnaSupport": true,
"Hypervisor": "xen",
"Name": "ubuntu/images/hvm-instance/ubuntu-bionic-18.04-amd64-server-20200323",
"RootDeviceType": "instance-store",
"SriovNetSupport": "simple",
"VirtualizationType": "hvm"
}
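The difference between these images shows up in BlockDeviceMappings and RootDeviceType. As a rough sketch, a script could summarize the root storage of a describe-images record as follows. The records are abridged from the JSON output in this post, and root_storage is a hypothetical helper, not part of any AWS library.

```python
def root_storage(image):
    """Return the root EBS volume type, or RootDeviceType if not EBS-backed."""
    root_name = image.get("RootDeviceName")
    for mapping in image.get("BlockDeviceMappings", []):
        is_root = root_name is not None and mapping.get("DeviceName") == root_name
        if is_root and "Ebs" in mapping:
            return mapping["Ebs"]["VolumeType"]
    # Instance-store images have no EBS mapping for the root device.
    return image["RootDeviceType"]


amazon_gp2 = {
    "RootDeviceName": "/dev/xvda", "RootDeviceType": "ebs",
    "BlockDeviceMappings": [
        {"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 8, "VolumeType": "gp2"}},
    ],
}
amazon_ebs = {
    "RootDeviceName": "/dev/xvda", "RootDeviceType": "ebs",
    "BlockDeviceMappings": [
        {"DeviceName": "/dev/xvda", "Ebs": {"VolumeSize": 8, "VolumeType": "standard"}},
    ],
}
ubuntu_ssd = {
    "RootDeviceName": "/dev/sda1", "RootDeviceType": "ebs",
    "BlockDeviceMappings": [
        {"DeviceName": "/dev/sda1", "Ebs": {"VolumeSize": 8, "VolumeType": "gp2"}},
        {"DeviceName": "/dev/sdb", "VirtualName": "ephemeral0"},
        {"DeviceName": "/dev/sdc", "VirtualName": "ephemeral1"},
    ],
}
ubuntu_instance = {"RootDeviceType": "instance-store", "BlockDeviceMappings": []}

for image in (amazon_gp2, amazon_ebs, ubuntu_ssd, ubuntu_instance):
    print(root_storage(image))
```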


The official Scrapy tutorial

Category: Python Tutorial

Scrapy official tutorial

We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.
This tutorial will walk you through these tasks:

  1. Creating a new Scrapy project
  2. Writing a spider to crawl a site and extract data
  3. Exporting the scraped data using the command line
  4. Changing spider to recursively follow links
  5. Using spider arguments

The tutorial also links to other high-quality learning resources.

Installation

Install Scrapy before starting the tutorial.
Scrapy depends on several packages, so follow the Installation guide.

The test environment here is Ubuntu, so install the additional packages first.

apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev

Scrapy itself is installed with pip.

pip install scrapy

Installing while the required packages are missing fails with an error.

ERROR: Command errored out with exit status 1: /usr/local/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-4zradyeg/Twisted/setup.py'"'"'; __file__='"'"'/tmp/pip-install-4zradyeg/Twisted/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-r8m1686g/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.8/Twisted Check the logs for full command output.

Creating a project

Create a project with scrapy startproject.

$ scrapy startproject scrapy_tutorial_quotes
New Scrapy project 'scrapy_tutorial_quotes', using template directory '/usr/local/lib/python3.8/site-packages/scrapy/templates/project', created in:
/work/scrapy/scrapy_tutorial_quotes

You can start your first spider with:
cd scrapy_tutorial_quotes
scrapy genspider example example.com

The project is created with the following directory structure.

.
├── scrapy.cfg
└── scrapy_tutorial_quotes
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__

Our first Spider

Create the spider by following the tutorial code.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)

name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

name is the unique identifier of the Spider.

start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

start_requests() supplies the initial requests to crawl. It returns the requests either as a list or from a generator.

parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

parse() holds the scraping logic for the response that each request produces.
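The file name used in parse() comes from splitting the URL on "/" and taking the second-to-last element. The standalone sketch below repeats that step outside Scrapy. The function name page_filename is hypothetical, and the second call shows that the approach depends on the trailing slash.

```python
def page_filename(url):
    """Derive the output file name the same way the spider's parse() does."""
    # With a trailing slash, the last element is '' and [-2] is the page number.
    page = url.split("/")[-2]
    return "quotes-%s.html" % page


print(page_filename("http://quotes.toscrape.com/page/1/"))  # quotes-1.html
print(page_filename("http://quotes.toscrape.com/page/1"))   # quotes-page.html
```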

How to run our spider

Run the spider with the following command.

scrapy crawl quotes

Running it prints the following log and generates quotes-1.html and quotes-2.html.

$ scrapy crawl quotes
2020-05-03 01:38:15 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_tutorial_quotes)
2020-05-03 01:38:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-03 01:38:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-03 01:38:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_tutorial_quotes',
'EDITOR': '/usr/bin/vim',
'NEWSPIDER_MODULE': 'scrapy_tutorial_quotes.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_tutorial_quotes.spiders']}
2020-05-03 01:38:15 [scrapy.extensions.telnet] INFO: Telnet Password: afdf50795ed4260d
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-03 01:38:15 [scrapy.core.engine] INFO: Spider opened
2020-05-03 01:38:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-03 01:38:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-03 01:38:16 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-03 01:38:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-05-03 01:38:16 [quotes] DEBUG: Saved file quotes-1.html
2020-05-03 01:38:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2020-05-03 01:38:17 [quotes] DEBUG: Saved file quotes-2.html
2020-05-03 01:38:17 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-03 01:38:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 681,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6003,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 1.982914,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 2, 16, 38, 17, 288861),
'log_count/DEBUG': 5,
'log_count/INFO': 10,
'memusage/max': 55898112,
'memusage/startup': 55898112,
'response_received_count': 3,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 5, 2, 16, 38, 15, 305947)}
2020-05-03 01:38:17 [scrapy.core.engine] INFO: Spider closed (finished)
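The stats dump can be cross-checked by hand: elapsed_time_seconds is the difference between finish_time and start_time. The sketch below repeats that arithmetic with the values from the log above. The 9-hour shift applied at the end is an assumption inferred from the gap between the stats timestamps (16:38) and the log line timestamps (01:38 the next day), which suggests that the stats are in UTC and the log lines in local time.

```python
import datetime

# Values copied from the "Dumping Scrapy stats" output above.
start_time = datetime.datetime(2020, 5, 2, 16, 38, 15, 305947)
finish_time = datetime.datetime(2020, 5, 2, 16, 38, 17, 288861)

elapsed = (finish_time - start_time).total_seconds()
print(elapsed)  # 1.982914, matching elapsed_time_seconds

# Assumed offset between the stats timestamps and the log line timestamps.
local_start = start_time + datetime.timedelta(hours=9)
print(local_start.strftime("%Y-%m-%d %H:%M:%S"))  # 2020-05-03 01:38:15
```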

A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:

Defining a list named start_urls lets the spider rely on the default start_requests().

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)

The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.

parse() is the default callback method.
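The interplay between start_urls, the default start_requests(), and the default parse() callback can be pictured with a toy model. This is a simplified illustration only and not Scrapy's actual implementation: ToyRequest, ToySpider, and dispatch are hypothetical names, and the "response" is just a string.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ToyRequest:
    url: str
    callback: Optional[Callable] = None  # None means "use the default"


class ToySpider:
    start_urls = []

    def start_requests(self):
        # Default behaviour in this model: one request per start URL.
        for url in self.start_urls:
            yield ToyRequest(url)

    def parse(self, response):
        raise NotImplementedError


def dispatch(spider, request, response):
    # A request without an explicit callback falls back to spider.parse.
    callback = request.callback or spider.parse
    return callback(response)


class QuotesToySpider(ToySpider):
    start_urls = [
        "http://quotes.toscrape.com/page/1/",
        "http://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        return "parsed " + response


spider = QuotesToySpider()
for request in spider.start_requests():
    print(dispatch(spider, request, request.url))
```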

Extracting data

Use the Scrapy shell to analyze the page's data structure.

$scrapy shell 'http://quotes.toscrape.com/page/1/'
2020-05-03 01:51:11 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_tutorial_quotes)
2020-05-03 01:51:11 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-03 01:51:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-03 01:51:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_tutorial_quotes',
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'EDITOR': '/usr/bin/vim',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'scrapy_tutorial_quotes.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_tutorial_quotes.spiders']}
2020-05-03 01:51:11 [scrapy.extensions.telnet] INFO: Telnet Password: 2c0c7af38c3cc618
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-03 01:51:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-03 01:51:11 [scrapy.core.engine] INFO: Spider opened
2020-05-03 01:51:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-03 01:51:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f11ff2c60a0>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7f11ff2c3be0>
[s] spider <DefaultSpider 'default' at 0x7f11ff0c7700>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser

Data can be extracted with CSS selectors or XPath.

>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
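The three .re() calls above return a flat list: whole matches when the pattern has no groups, and the captured groups when it does. The sketch below reproduces that result shape with the standard re module on the plain title string. The helper re_like is hypothetical and only approximates what the session shows; it is not the selector library's own code.

```python
import re


def re_like(pattern, text):
    """Return a flat list of matches, or of captured groups if there are any."""
    results = []
    for match in re.finditer(pattern, text):
        if match.groups():
            results.extend(match.groups())
        else:
            results.append(match.group(0))
    return results


title = "Quotes to Scrape"
print(re_like(r"Quotes.*", title))          # ['Quotes to Scrape']
print(re_like(r"Q\w+", title))              # ['Quotes']
print(re_like(r"(\w+) to (\w+)", title))    # ['Quotes', 'Scrape']
```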

Extracting quotes and authors

Use the Scrapy shell to analyze the target data.

$ scrapy shell 'http://quotes.toscrape.com'
… omitted …
>>> response.css("div.quote")
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>]
>>> quote = response.css("div.quote")[0]
>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
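The same three lookups (text, author, tags) can be tried without Scrapy on a hand-written fragment. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy selectors. The fragment is a simplified, well-formed assumption about the page structure and not the real page source, and the attribute predicates match only an exact class value, unlike CSS class selectors.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment modelled on one div.quote block.
SNIPPET = """
<div class="quote">
  <span class="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
  <small class="author">Albert Einstein</small>
  <div class="tags">
    <a class="tag">change</a>
    <a class="tag">deep-thoughts</a>
    <a class="tag">thinking</a>
    <a class="tag">world</a>
  </div>
</div>
"""


def extract_quote(fragment):
    """Pull text, author and tags out of one quote fragment."""
    quote = ET.fromstring(fragment.strip())
    return {
        "text": quote.findtext("span[@class='text']"),
        "author": quote.findtext("small[@class='author']"),
        "tags": [a.text for a in quote.findall("div[@class='tags']/a[@class='tag']")],
    }


print(extract_quote(SNIPPET))
```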

Extracting data in our spider

Code parse() based on what the Scrapy shell analysis revealed.

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

Execution result:

$scrapy crawl quotes2
2020-05-03 02:05:13 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_tutorial_quotes)
2020-05-03 02:05:13 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-03 02:05:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-03 02:05:13 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_tutorial_quotes',
'EDITOR': '/usr/bin/vim',
'NEWSPIDER_MODULE': 'scrapy_tutorial_quotes.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_tutorial_quotes.spiders']}
2020-05-03 02:05:13 [scrapy.extensions.telnet] INFO: Telnet Password: 6bee2d1ba39b9e9c
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-03 02:05:13 [scrapy.core.engine] INFO: Spider opened
2020-05-03 02:05:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-03 02:05:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-03 02:05:14 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-03 02:05:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
2020-05-03 02:05:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”", 'author': 'Bob Marley', 'tags': ['love']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'tags': ['fantasy']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams', 'tags': ['life', 'navigation']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel', 'tags': ['activism', 'apathy', 'hate', 'indifference', 'inspirational', 'love', 'opposite', 'philosophy']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche', 'tags': ['friendship', 'lack-of-friendship', 'lack-of-love', 'love', 'marriage', 'unhappy-marriage']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain', 'tags': ['books', 'contentment', 'friends', 'friendship', 'life']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders', 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans']}
2020-05-03 02:05:15 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-03 02:05:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 681,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6003,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 1.762026,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 2, 17, 5, 15, 270853),
'item_scraped_count': 20,
'log_count/DEBUG': 23,
'log_count/INFO': 10,
'memusage/max': 55705600,
'memusage/startup': 55705600,
'response_received_count': 3,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 5, 2, 17, 5, 13, 508827)}
2020-05-03 02:05:15 [scrapy.core.engine] INFO: Spider closed (finished)

Storing the scraped data

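Save the scraping results to a file in JSON format.

scrapy crawl quotes2 -o quotes2.json

The JSON Lines format can be used as an alternative.

scrapy crawl quotes2 -o quotes2.jl

The JSON output is shown below. With JSON Lines the output is not a list but a series of lines, each holding one {} object.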

[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen", "tags": ["aliteracy", "books", "classic", "humor"]},
{"text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe", "tags": ["be-yourself", "inspirational"]},
{"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]},
{"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": ["life", "love"]},
{"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]},
{"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]},
{"text": "\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d", "author": "Marilyn Monroe", "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"]},
{"text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d", "author": "J.K. Rowling", "tags": ["courage", "friends"]},
{"text": "\u201cIf you can't explain it to a six year old, you don't understand it yourself.\u201d", "author": "Albert Einstein", "tags": ["simplicity", "understand"]},
{"text": "\u201cYou may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect\u2014you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break\u2014her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.\u201d", "author": "Bob Marley", "tags": ["love"]},
{"text": "\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d", "author": "Dr. Seuss", "tags": ["fantasy"]},
{"text": "\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d", "author": "Douglas Adams", "tags": ["life", "navigation"]},
{"text": "\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d", "author": "Elie Wiesel", "tags": ["activism", "apathy", "hate", "indifference", "inspirational", "love", "opposite", "philosophy"]},
{"text": "\u201cIt is not a lack of love, but a lack of friendship that makes unhappy marriages.\u201d", "author": "Friedrich Nietzsche", "tags": ["friendship", "lack-of-friendship", "lack-of-love", "love", "marriage", "unhappy-marriage"]},
{"text": "\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d", "author": "Mark Twain", "tags": ["books", "contentment", "friends", "friendship", "life"]},
{"text": "\u201cLife is what happens to us while we are making other plans.\u201d", "author": "Allen Saunders", "tags": ["fate", "life", "misattributed-john-lennon", "planning", "plans"]}
]
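As a rough illustration of the difference between the two formats, the sketch below parses a tiny JSON document and the equivalent JSON Lines text with the standard library only. The sample records are shortened, made-up stand-ins for the quotes above, and parse_json_lines is a hypothetical helper name.

```python
import json

# Shortened, made-up sample data standing in for quotes2.json / quotes2.jl.
JSON_TEXT = '[{"author": "Albert Einstein"}, {"author": "Jane Austen"}]'
JSONL_TEXT = '{"author": "Albert Einstein"}\n{"author": "Jane Austen"}\n'


def parse_json_lines(text):
    """Parse JSON Lines: every non-empty line is its own JSON document."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# A .json feed is one document (a list) and has to be parsed as a whole.
items_from_json = json.loads(JSON_TEXT)

# A .jl feed can be processed one line at a time.
items_from_jl = parse_json_lines(JSONL_TEXT)

print(items_from_json == items_from_jl)
```

Because every line stands on its own, a .jl file can be read or extended record by record, while a .json file is only valid as a complete list.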

Following links / A shortcut for creating Requests

Handling the next page: extract the link and crawl recursively.

$ scrapy shell 'http://quotes.toscrape.com'
...(omitted)...
>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
>>> response.css('li.next a').attrib['href']
'/page/2/'

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes3"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # Follow the "Next" link, if any, and parse that page the same way
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.

Here urljoin() builds the URL from a relative path, but this step can be dropped by using follow().

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes3"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        # follow() accepts the relative URL directly
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

Requests can also be generated from multiple attributes selected with a CSS selector.

for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)

As a further shortcut, passing the anchor elements themselves is enough; the link is taken from their href automatically.

for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)

In addition, follow_all() can follow all of the links at once.

anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)

And as a one-liner:

yield from response.follow_all(css='ul.pager a', callback=self.parse)

This keeps the crawling code concise.
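To make the relative-link handling concrete, here is a small sketch using the standard library's urllib.parse.urljoin, which performs the same kind of resolution that response.urljoin() and response.follow() depend on. It runs without Scrapy; the page URL is taken from the examples above, and resolve_links is a hypothetical helper.

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Turn relative hrefs into absolute URLs, relative to page_url."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "http://quotes.toscrape.com/page/1/"
resolved = resolve_links(page_url, ["/page/2/", "http://example.com/"])
print(resolved)
```

A root-relative href is resolved against the site root of the current page, and an already absolute URL is passed through unchanged.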

More examples and patterns

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.

By default, Scrapy does not request the same page more than once. This behavior can be configured with the DUPEFILTER_CLASS setting.
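The idea behind duplicate filtering can be sketched with a set of already-seen URLs. This is only a toy model with hypothetical names (SeenUrlFilter, and a request_seen method that takes plain strings); Scrapy's real filter works on request fingerprints rather than raw URL strings.

```python
class SeenUrlFilter:
    """Toy duplicate filter: remembers every URL it has been asked about."""

    def __init__(self):
        self._seen = set()

    def request_seen(self, url):
        """Return True if url was seen before; otherwise record it."""
        if url in self._seen:
            return True
        self._seen.add(url)
        return False


dupefilter = SeenUrlFilter()
links = [
    "http://quotes.toscrape.com/author/Albert-Einstein",
    "http://quotes.toscrape.com/author/Jane-Austen",
    "http://quotes.toscrape.com/author/Albert-Einstein",  # repeated author link
]
to_crawl = [url for url in links if not dupefilter.request_seen(url)]
print(to_crawl)
```

The repeated author link is dropped, so each author page would be requested only once.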

As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of it.

There is also a more full-featured CrawlSpider class.

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.

A typical pattern is to build a single item from information gathered on several pages, using a trick that passes data to the callbacks.

Passing additional data to callback functions

Use cb_kwargs to pass parameters to the callback.

def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
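How the extra arguments arrive in the callback can be imitated in plain Python: the stored dictionary is expanded into keyword arguments when the callback is invoked. FakeRequest, FakeResponse and deliver are hypothetical stand-ins for illustration, not Scrapy classes.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakeResponse:
    url: str


@dataclass
class FakeRequest:
    url: str
    callback: Callable
    cb_kwargs: dict = field(default_factory=dict)


def deliver(request):
    """Pretend the page was downloaded, then call the callback with cb_kwargs."""
    response = FakeResponse(url=request.url)
    return request.callback(response, **request.cb_kwargs)


def parse_page2(response, main_url, foo):
    return {"main_url": main_url, "other_url": response.url, "foo": foo}


request = FakeRequest(
    "http://www.example.com/index.html",
    callback=parse_page2,
    cb_kwargs={"main_url": "http://www.example.com/"},
)
request.cb_kwargs["foo"] = "bar"  # arguments can still be added after creation
result = deliver(request)
print(result)
```

Each key in cb_kwargs has to match a parameter name of the callback, otherwise the call fails with a TypeError.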

Using spider arguments

Parameters can be passed from the command line.

scrapy crawl quotes -o quotes-humor.json -a tag=humor

A parameter passed with -a can be read with tag = getattr(self, 'tag', None).

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        # -a tag=... becomes an attribute of the spider; None if not given
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
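The mechanism can be mimicked without Scrapy: keyword arguments given at construction time become attributes, and getattr() with a default covers the case where the argument was not passed. MiniSpider and start_url are hypothetical names; this is a simplified sketch rather than Scrapy's actual Spider class.

```python
class MiniSpider:
    """Simplified stand-in: -a key=value pairs arrive as keyword arguments."""

    def __init__(self, **kwargs):
        # Store each argument as an instance attribute.
        self.__dict__.update(kwargs)

    def start_url(self):
        url = "http://quotes.toscrape.com/"
        tag = getattr(self, "tag", None)  # None when no tag was passed
        if tag is not None:
            url = url + "tag/" + tag
        return url


print(MiniSpider().start_url())
print(MiniSpider(tag="humor").start_url())
```

Values passed on the command line arrive as strings, so numeric arguments would need an explicit conversion.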


The Scrapy framework

Category: Python

Scrapy

[Image: Scrapy]

Scrapy at a glance

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Scrapy is an application framework for crawling and scraping web sites.

Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.

Data can be extracted from HTML/XML sources using CSS selectors and XPath.

An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.

An interactive shell is provided for trying out CSS selectors and XPath expressions.

Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)

Data can be exported in multiple formats and stored in multiple backends.

Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.

Robust encoding support with auto-detection.

Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).

Strong extensibility.

Wide range of built-in extensions and middlewares for handling:

  • cookies and session handling
  • HTTP features like compression, authentication, caching
  • user-agent spoofing
  • robots.txt
  • crawl depth restriction
  • and more

  • Cookie and session handling
  • HTTP features such as compression, authentication, and caching
  • User-Agent spoofing
  • robots.txt support
  • Crawl depth restriction

A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler

A Telnet console that hooks into the Python console running inside the Scrapy process.

Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!

Spiders for crawling sites from Sitemaps and XML/CSV feeds, a media pipeline that automatically downloads images and other media associated with the scraped items, a caching DNS resolver, and more.

Architecture overview

[Image: Scrapy]

The data flow in Scrapy is controlled by the execution engine, and goes like this:

  1. The Engine gets the initial Requests to crawl from the Spider.
  2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
  3. The Scheduler returns the next Requests to the Engine.
  4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
  5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
  7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
  8. The Engine sends processed items to Item Pipelines, then send processed Requests to the Scheduler and asks for possible next Requests to crawl.
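  9. The process repeats (from step 1) until there are no more requests from the Scheduler.

  1. The Engine gets the initial Requests to crawl from the Spider
  2. It schedules the Requests in the Scheduler and asks for the next Requests to crawl
  3. The Scheduler returns the next Requests
  4. The Requests are sent to the Downloader via the Downloader Middlewares
  5. When the page has been downloaded, the Downloader generates a Response and returns it via the Downloader Middlewares
  6. The Engine receives the Response from the Downloader and sends it to the Spider via the Spider Middleware
  7. The Spider processes the Response, returns the scraped items, and sends new Requests via the Spider Middleware
  8. The scraped items are sent to the Item Pipelines, and the Engine asks for the next Requests to crawl
  9. This repeats until there are no more Requests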
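The flow above can be sketched as a toy loop in plain Python. Everything here is a hypothetical stand-in: the PAGES dictionary replaces the network, and download, parse and run_engine are illustrative names. Middlewares are left out, and real Scrapy runs these steps asynchronously.

```python
from collections import deque

# Fake "web": URL -> (items on the page, links found on the page).
PAGES = {
    "/page/1/": (["quote A", "quote B"], ["/page/2/"]),
    "/page/2/": (["quote C"], ["/page/1/"]),  # links back to page 1
}


def download(url):
    """Downloader stand-in: look the page up instead of fetching it."""
    return {"url": url, "content": PAGES[url]}


def parse(response):
    """Spider stand-in: yield scraped items and follow-up request URLs."""
    items, links = response["content"]
    for item in items:
        yield ("item", item)
    for link in links:
        yield ("request", link)


def run_engine(start_urls):
    scheduler = deque(start_urls)   # steps 1-2: initial requests are scheduled
    seen = set(start_urls)          # avoid scheduling the same URL twice
    pipeline = []                   # Item Pipeline stand-in
    while scheduler:                # step 9: repeat until no requests remain
        url = scheduler.popleft()   # step 3: next request from the scheduler
        response = download(url)    # steps 4-5: download and get a response
        for kind, value in parse(response):  # steps 6-7: the spider processes it
            if kind == "item":
                pipeline.append(value)       # step 8: items go to the pipeline
            elif value not in seen:
                seen.add(value)
                scheduler.append(value)      # step 8: new requests are scheduled
    return pipeline


scraped = run_engine(["/page/1/"])
print(scraped)
```

The loop ends once the scheduler is empty, which mirrors the last step of the list.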


Hurdles when using an SSH private key inside a Docker container

Leaving aside embedding the key in the container image, there are two ways to pass it in dynamically at run time:

  • Share a host file through volumes
  • Share a host file through secrets

Sharing the file this way runs into the following hurdle:

  • The shared file gets permission 755, which makes the ssh client fail

Warning: Permanently added 'XXXXXXXXXXXXXXXXXXXXX' (RSA) to the list of known hosts.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0755 for '/root/.ssh/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "/root/.ssh/id_rsa": bad permissions

The container being immutable adds another hurdle:

  • The interactive prompt for adding the host to the UserKnownHostsFile

Are you sure you want to continue connecting (yes/no)?

Passing the private key with secrets in docker-compose

The version must be 3.1 or later.

version: '3.1'
services:
  xxxxx:
    image: python:3-slim
    tty: true
    volumes:
      - ./entrypoint.sh:/work/entrypoint.sh:ro
    working_dir: /work/
    entrypoint: ./entrypoint.sh
    command: /bin/bash
    environment:
      PYTHONDONTWRITEBYTECODE: 1
      GIT_SSH_COMMAND: 'ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no'
    secrets:
      - host_ssh_key

secrets:
  host_ssh_key:
    file: ${SSH_KEY_PATH}

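Copying the key and fixing its permissions in entrypoint.sh

Content shared through secrets is available read-only under /run/secrets/, but its permission is 755, so the ssh command fails. The ENTRYPOINT script therefore copies the file and sets its permissions.

#!/bin/bash

# Create ~/.ssh, owned by root and accessible only by its owner
mkdir -p ~/.ssh
chown -R root:root ~/.ssh
chmod 0700 ~/.ssh
# Copy the secret out of the read-only /run/secrets/ mount
cp -ip /run/secrets/host_ssh_key ~/.ssh/id_rsa
# Restrict the key itself; the directory stays 0700
chmod 0600 ~/.ssh/id_rsa

# Hand over to the command passed to the container
exec "$@"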
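The check that ssh performs, and the fix applied by the script, can be reproduced in Python on a POSIX system. This is an illustrative sketch: is_too_open and install_key are hypothetical helpers, and a temporary directory stands in for /run/secrets/ and ~/.ssh.

```python
import os
import shutil
import stat
import tempfile


def is_too_open(path):
    """True if group or others have any access bits set on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return bool(mode & (stat.S_IRWXG | stat.S_IRWXO))


def install_key(secret_path, ssh_dir):
    """Copy the key into ssh_dir and restrict it to its owner (0600)."""
    os.makedirs(ssh_dir, mode=0o700, exist_ok=True)
    key_path = os.path.join(ssh_dir, "id_rsa")
    shutil.copyfile(secret_path, key_path)
    os.chmod(key_path, 0o600)
    return key_path


with tempfile.TemporaryDirectory() as tmp:
    secret = os.path.join(tmp, "host_ssh_key")
    with open(secret, "w") as f:
        f.write("dummy key material\n")
    os.chmod(secret, 0o755)  # same mode as in the ssh warning

    key = install_key(secret, os.path.join(tmp, "dot_ssh"))
    secret_open = is_too_open(secret)
    key_open = is_too_open(key)

print(secret_open, key_open)
```

The copy is needed because the mounted secret itself is read-only, so its mode cannot be changed in place.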

Setting the ssh options used by Git with GIT_SSH_COMMAND

The GIT_SSH_COMMAND environment variable specifies the ssh command that Git runs.
UserKnownHostsFile=/dev/null keeps the known_hosts file from being written, and StrictHostKeyChecking=no suppresses the interactive prompt.

Secret support in docker-compose

When run with version: '3'

ERROR: The Compose file ‘.\docker-compose.yml’ is invalid because:
Invalid top-level property “secrets”. Valid top-level sections for this Compose file are: version, services, networks, volumes, and extensions starting with “x-“.

You might be seeing this error because you’re using the wrong Compose file version. Either specify a supported version (e.g “2.2” or “3.3”) and place your service definitions under the services key, or omit the version key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/

Permissions cannot be specified for secrets

The file format looks as if it supports specifying permissions, but this did not work with docker-compose version 1.25.4, build 8d51620a.

mode: The permissions for the file mounted at /run/secrets/ in the service's task containers, in octal notation. For example, 0444 means world-readable. The default in Docker 1.13.1 was 0000, but it is 0444 in later versions. Secrets cannot be writable because they are mounted on a temporary filesystem, so the writable bit is ignored if set. The executable bit can be set.

secrets:
  - source: host_ssh_key
    target: host_ssh_key
    uid: '103'
    gid: '103'
    mode: 0600

Running it prints a message saying these fields are not supported.

WARNING: Service "xxxxx" uses secret "host_ssh_key" with uid, gid, or mode. These fields are not supported by this implementation of the Compose file

Compose file format versions in docker-compose

Some examples specify version '3.8', but docker-compose version 1.25.4, build 8d51620a rejected it; the error message cites "2.2" and "3.3" as examples of supported versions.

ERROR: Version in ".\docker-compose.yml" is unsupported. You might be seeing this error because you're using the wrong Compose file version. Either specify a supported version (e.g "2.2" or "3.3") and place your service definitions under the `services` key, or omit the `version` key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/


God Mode: a hidden Windows 10 trick

Category: Windows

God Mode

An icon that gives access to Control Panel items and other settings from a single list.

Creating the God Mode folder

Create a folder with the following name:

GodMode.{ED7BA470-8E54-465E-825C-99712043E01C}

On Windows 10 the icon's name is not displayed.

[Image: God mode icon]

Creating a God Mode shortcut

Create a shortcut with the following content:

explorer shell:::{ED7BA470-8E54-465E-825C-99712043E01C}

[Image: God mode shortcut]

[Image: God mode shortcut]

[Image: God mode icon]

Change the icon image to the same one that is used when it is created as a folder.

[Image: God mode shortcut]

[Image: God mode shortcut]

Other items

Besides God Mode, other Control Panel items can be created in the same way.
Running the following as a batch file creates each item. The names are quoted because some of them contain spaces.

mkdir "Manage Wireless Networks.{1FA9085F-25A2-489B-85D4-86326EEDCD87}"
mkdir "Network (WORKGROUP).{208D2C60-3AEA-1069-A2D7-08002B30309D}"
mkdir "This PC.{20D04FE0-3AEA-1069-A2D8-08002B30309D}"
mkdir "RemoteApp and Desktop Connections.{241D7C96-F8BF-4F85-B01F-E2B043341A4B}"
mkdir "Windows Firewall.{4026492F-2F69-46B8-B9BF-5654FC07E423}"
mkdir "Assembly Cache Viewer.{1D2680C9-0E2A-469d-B787-065558BC7D43}"
mkdir "Devices and Printers (Printers and Faxes).{2227A280-3AEA-1069-A2DE-08002B30309D}"
mkdir "Programs and Features.{15eae92e-f17a-4431-9f28-805e482dafd4}"
mkdir "Default Programs.{17cd9488-1228-4b2f-88ce-4298e93e0966}"
mkdir "Credential Manager.{1206F5F1-0569-412C-8FEC-3204630DFB70}"
mkdir "Notification Area Icons.{05d7b0f4-2121-4eff-bf6b-ed3f69b894d9}"
mkdir "Power Options.{025A5937-A6BE-4686-A844-36FE4BEC8B6D}"
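The same folders could also be created from Python instead of a batch file. A minimal sketch, assuming the display name in front of the dot can be chosen freely: only two CLSIDs from this article are listed, and create_shell_folders is a hypothetical helper. The special icons only take effect on Windows; elsewhere the calls just create ordinary directories.

```python
import os
import tempfile

# Display name -> CLSID (a subset of the CLSIDs used in this article).
SHELL_FOLDERS = {
    "GodMode": "{ED7BA470-8E54-465E-825C-99712043E01C}",
    "Power Options": "{025A5937-A6BE-4686-A844-36FE4BEC8B6D}",
}


def create_shell_folders(parent_dir, folders):
    """Create one '<name>.<CLSID>' directory per entry; return their paths."""
    created = []
    for name, clsid in folders.items():
        path = os.path.join(parent_dir, f"{name}.{clsid}")
        os.makedirs(path, exist_ok=True)
        created.append(path)
    return created


# A temporary directory is used here; in practice this might be the Desktop.
with tempfile.TemporaryDirectory() as parent:
    paths = create_shell_folders(parent, SHELL_FOLDERS)
    names = sorted(os.listdir(parent))

print(names)
```

Passing the name as a single string avoids the quoting issue that names with spaces have on the command line.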


Goals

  • Package a Python module
  • Use the pytest directory layout (src, test) for testing the module
  • Install a command-line tool that uses the module

Files to create

├── LICENSE
├── MANIFEST.in
├── Makefile
├── README.md
├── setup.cfg
├── setup.py
└── src
    ├── __init__.py
    └── packagedata
        ├── __init__.py
        ├── config
        │   └── test.ini
        └── getpackagedata.py

setup.py

setup.py does nothing but call setup().

from setuptools import setup

setup()

setup.cfg

setup.cfg holds the information about the package.
[metadata] describes the package itself. The LICENSE and README.md files that are normally created along with a git repository are reused here.

[metadata]
name = package_with_data
version = attr: src.packagedata.__version__
url = https://xxxxxxxxxxxxxxxxxxxxx
author = XXXXXXXXXX
author_email = xxxxxxx@xxxxxxxxxxxx
license_file = LICENSE
description = package with data
long_description = file: README.md
long_description_content_type = text/markdown

[options]
zip_safe = False
package_dir=
    =src
packages = find:
include_package_data=True
install_requires =

[options.packages.find]
where=src

[options.entry_points]
console_scripts =
    getdata = packagedata.getpackagedata:main
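A console_scripts entry has the form name = module:function. What the generated command roughly does can be sketched as follows: import the module, look up the function, and call it. load_entry_point_target is a hypothetical helper, and the standard-library target os.path:basename is used so that the example runs without the package being installed.

```python
import importlib


def load_entry_point_target(spec):
    """Resolve a 'package.module:function' string to the callable it names."""
    module_name, _, func_name = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, func_name)


# The article's entry would be "packagedata.getpackagedata:main";
# a standard-library target is used here so the sketch is runnable anywhere.
func = load_entry_point_target("os.path:basename")
print(func("/usr/local/bin/getdata"))
```

With the entry from setup.cfg, running getdata would end up calling main() in packagedata/getpackagedata.py in the same way.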

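Including non-Python data files in the package

With the following option set, MANIFEST.in determines which files are included.

include_package_data=True

MANIFEST.in

Defines what is and is not included in the package.

include src/*/config/*.ini
global-exclude *.py[cod] __pycache__ *.so

For example, if the version number is kept in an external VERSION file and read with version = file: VERSION, then VERSION has to be added to MANIFEST.in.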

Sample code (src/packagedata/getpackagedata.py)

import pkg_resources


def main():
    # Path of the config directory inside the installed package
    print(pkg_resources.resource_filename('packagedata', 'config'))
    # Contents of the bundled test.ini, decoded from bytes
    print(pkg_resources.resource_string('packagedata', 'config/test.ini').decode())


if __name__ == '__main__':
    main()

Sample code (src/packagedata/config/test.ini)

test=OK
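On Python 3.9 and later, the standard library's importlib.resources offers a similar way to read packaged data without pkg_resources. The sketch below is an illustration under assumptions, not the article's method: it builds a throwaway copy of the packagedata layout in a temporary directory so that it can run on its own.

```python
import importlib
import os
import sys
import tempfile
from importlib import resources

with tempfile.TemporaryDirectory() as tmp:
    # Recreate packagedata/config/test.ini in a scratch directory.
    config_dir = os.path.join(tmp, "packagedata", "config")
    os.makedirs(config_dir)
    open(os.path.join(tmp, "packagedata", "__init__.py"), "w").close()
    with open(os.path.join(config_dir, "test.ini"), "w") as f:
        f.write("test=OK\n")

    sys.path.insert(0, tmp)  # make the scratch package importable
    importlib.invalidate_caches()
    try:
        # files() returns a path-like object rooted at the package directory
        ini = resources.files("packagedata") / "config" / "test.ini"
        text = ini.read_text()
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("packagedata", None)

print(text.strip())
```

In an installed package only the two lines inside the try block would be needed; the rest exists to make the sketch self-contained.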

Building the package and example run

Building the package

Build the package with python setup.py sdist.

$ python setup.py sdist
running sdist
running egg_info
creating src/package_with_data.egg-info
writing src/package_with_data.egg-info/PKG-INFO
writing dependency_links to src/package_with_data.egg-info/dependency_links.txt
writing entry points to src/package_with_data.egg-info/entry_points.txt
writing top-level names to src/package_with_data.egg-info/top_level.txt
writing manifest file 'src/package_with_data.egg-info/SOURCES.txt'
reading manifest file 'src/package_with_data.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '*.py[cod]' found anywhere in distribution
warning: no previously-included files matching '__pycache__' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
writing manifest file 'src/package_with_data.egg-info/SOURCES.txt'
running check
creating package_with_data-0.0.1
creating package_with_data-0.0.1/src
creating package_with_data-0.0.1/src/package_with_data.egg-info
creating package_with_data-0.0.1/src/packagedata
creating package_with_data-0.0.1/src/packagedata/config
copying files to package_with_data-0.0.1...
copying LICENSE -> package_with_data-0.0.1
copying MANIFEST.in -> package_with_data-0.0.1
copying README.md -> package_with_data-0.0.1
copying setup.cfg -> package_with_data-0.0.1
copying setup.py -> package_with_data-0.0.1
copying src/package_with_data.egg-info/PKG-INFO -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/package_with_data.egg-info/SOURCES.txt -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/package_with_data.egg-info/dependency_links.txt -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/package_with_data.egg-info/entry_points.txt -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/package_with_data.egg-info/not-zip-safe -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/package_with_data.egg-info/top_level.txt -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/packagedata/__init__.py -> package_with_data-0.0.1/src/packagedata
copying src/packagedata/getpackagedata.py -> package_with_data-0.0.1/src/packagedata
copying src/packagedata/config/test.ini -> package_with_data-0.0.1/src/packagedata/config
Writing package_with_data-0.0.1/setup.cfg
creating dist
Creating tar archive
removing 'package_with_data-0.0.1' (and everything under it)

Installing the package

Install it with the pip command.

$ cd dist
console_scripts-0.0.1.tar.gz
$ pip install package_with_data-0.0.1.tar.gz
Processing ./package_with_data-0.0.1.tar.gz
Building wheels for collected packages: package-with-data
Building wheel for package-with-data (setup.py) ... done
Created wheel for package-with-data: filename=package_with_data-0.0.1-py3-none-any.whl size=2338 sha256=6f907b30a1c7b26650c8bec598858ceb917410ffc97c4dde3348c3e24702612a
Stored in directory: /root/.cache/pip/wheels/05/c8/a8/fef26a83f0af30cd5c9aece9b54869e169ca26b4d3d5bf7a00
Successfully built package-with-data
Installing collected packages: package-with-data
Successfully installed package-with-data-0.0.1

Referencing the data included in the package

$ getdata
/usr/local/lib/python3.8/site-packages/packagedata/config
test=OK




nullpo
