The Official Scrapy Tutorial

 

Scrapy Official Tutorial

We are going to scrape quotes.toscrape.com, a website that lists quotes from famous authors.
This tutorial will walk you through these tasks:

  1. Creating a new Scrapy project
  2. Writing a spider to crawl a site and extract data
  3. Exporting the scraped data using the command line
  4. Changing spider to recursively follow links
  5. Using spider arguments

The tutorial also links to other good learning resources.

Installation

Install Scrapy before starting the tutorial. It has several dependencies, so follow the Installation guide.

The examples here are tested on Ubuntu, so install the additional packages first.

```
apt-get install python3 python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
```

Install Scrapy itself with pip.

```
pip install scrapy
```

Installing while the required system packages are missing fails with an error like this:

```
ERROR: Command errored out with exit status 1: /usr/local/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-4zradyeg/Twisted/setup.py'"'"'; __file__='"'"'/tmp/pip-install-4zradyeg/Twisted/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-r8m1686g/install-record.txt --single-version-externally-managed --compile --install-headers /usr/local/include/python3.8/Twisted Check the logs for full command output.
```

Creating a project

Create a project with scrapy startproject:

```
$ scrapy startproject scrapy_tutorial_quotes
New Scrapy project 'scrapy_tutorial_quotes', using template directory '/usr/local/lib/python3.8/site-packages/scrapy/templates/project', created in:
    /work/scrapy/scrapy_tutorial_quotes

You can start your first spider with:
    cd scrapy_tutorial_quotes
    scrapy genspider example example.com
```

The following directory structure is created:

```
.
├── scrapy.cfg
└── scrapy_tutorial_quotes
    ├── __init__.py
    ├── __pycache__
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── __init__.py
        └── __pycache__
```

Our first Spider

Create the spider by following the tutorial code.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
```

name: identifies the Spider. It must be unique within a project, that is, you can’t set the same name for different Spiders.

name is the Spider's unique identifier within the project.

start_requests(): must return an iterable of Requests (you can return a list of requests or write a generator function) which the Spider will begin to crawl from. Subsequent requests will be generated successively from these initial requests.

start_requests() yields the requests the crawl starts from; it returns them as a generator or a list of Requests.

parse(): a method that will be called to handle the response downloaded for each of the requests made. The response parameter is an instance of TextResponse that holds the page content and has further helpful methods to handle it.
The parse() method usually parses the response, extracting the scraped data as dicts and also finding new URLs to follow and creating new requests (Request) from them.

parse() holds the scraping logic for the response downloaded for each request.

How to run our spider

Run the spider with the following command:

```
scrapy crawl quotes
```

Running it prints the following log and generates quotes-1.html and quotes-2.html.

```
$ scrapy crawl quotes
2020-05-03 01:38:15 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_tutorial_quotes)
2020-05-03 01:38:15 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-03 01:38:15 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-03 01:38:15 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_tutorial_quotes',
'EDITOR': '/usr/bin/vim',
'NEWSPIDER_MODULE': 'scrapy_tutorial_quotes.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_tutorial_quotes.spiders']}
2020-05-03 01:38:15 [scrapy.extensions.telnet] INFO: Telnet Password: afdf50795ed4260d
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-03 01:38:15 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-03 01:38:15 [scrapy.core.engine] INFO: Spider opened
2020-05-03 01:38:15 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-03 01:38:15 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-03 01:38:16 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-03 01:38:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-05-03 01:38:16 [quotes] DEBUG: Saved file quotes-1.html
2020-05-03 01:38:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2020-05-03 01:38:17 [quotes] DEBUG: Saved file quotes-2.html
2020-05-03 01:38:17 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-03 01:38:17 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 681,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6003,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 1.982914,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 2, 16, 38, 17, 288861),
'log_count/DEBUG': 5,
'log_count/INFO': 10,
'memusage/max': 55898112,
'memusage/startup': 55898112,
'response_received_count': 3,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 5, 2, 16, 38, 15, 305947)}
2020-05-03 01:38:17 [scrapy.core.engine] INFO: Spider closed (finished)
```

A shortcut to the start_requests method

Instead of implementing a start_requests() method that generates scrapy.Request objects from URLs, you can just define a start_urls class attribute with a list of URLs. This list will then be used by the default implementation of start_requests() to create the initial requests for your spider:

If you define a start_urls list, the default start_requests() implementation can be used.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
```

The parse() method will be called to handle each of the requests for those URLs, even though we haven’t explicitly told Scrapy to do so. This happens because parse() is Scrapy’s default callback method, which is called for requests without an explicitly assigned callback.

parse() is Scrapy's default callback method.

Extracting data

Use the Scrapy shell to examine the page structure.

```
$ scrapy shell 'http://quotes.toscrape.com/page/1/'
2020-05-03 01:51:11 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_tutorial_quotes)
2020-05-03 01:51:11 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-03 01:51:11 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-03 01:51:11 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_tutorial_quotes',
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'EDITOR': '/usr/bin/vim',
'LOGSTATS_INTERVAL': 0,
'NEWSPIDER_MODULE': 'scrapy_tutorial_quotes.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_tutorial_quotes.spiders']}
2020-05-03 01:51:11 [scrapy.extensions.telnet] INFO: Telnet Password: 2c0c7af38c3cc618
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-03 01:51:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-03 01:51:11 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-03 01:51:11 [scrapy.core.engine] INFO: Spider opened
2020-05-03 01:51:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-03 01:51:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f11ff2c60a0>
[s] item {}
[s] request <GET http://quotes.toscrape.com/page/1/>
[s] response <200 http://quotes.toscrape.com/page/1/>
[s] settings <scrapy.settings.Settings object at 0x7f11ff2c3be0>
[s] spider <DefaultSpider 'default' at 0x7f11ff0c7700>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
```

Data can be extracted with CSS selectors or XPath.

```
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title::text').getall()
['Quotes to Scrape']
>>> response.css('title').getall()
['<title>Quotes to Scrape</title>']
>>> response.css('title::text').get()
'Quotes to Scrape'
>>> response.css('title::text')[0].get()
'Quotes to Scrape'
>>> response.css('title::text').re(r'Quotes.*')
['Quotes to Scrape']
>>> response.css('title::text').re(r'Q\w+')
['Quotes']
>>> response.css('title::text').re(r'(\w+) to (\w+)')
['Quotes', 'Scrape']
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').get()
'Quotes to Scrape'
```

Extracting quotes and authors

Keep using the Scrapy shell to inspect the target data.

```
$ scrapy shell 'http://quotes.toscrape.com'
...(snip)...
>>> response.css("div.quote")
[<Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>, <Selector xpath="descendant-or-self::div[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype...'>]
>>> quote = response.css("div.quote")[0]
>>> text = quote.css("span.text::text").get()
>>> text
'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
>>> author = quote.css("small.author::text").get()
>>> author
'Albert Einstein'
>>> tags = quote.css("div.tags a.tag::text").getall()
>>> tags
['change', 'deep-thoughts', 'thinking', 'world']
```

Extracting data in our spider

Based on what was analyzed in the Scrapy shell, write the parse() method.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes2"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }
```

The result of running it:

```
$ scrapy crawl quotes2
2020-05-03 02:05:13 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapy_tutorial_quotes)
2020-05-03 02:05:13 [scrapy.utils.log] INFO: Versions: lxml 4.5.0.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.8.2 (default, Apr 16 2020, 18:36:10) - [GCC 8.3.0], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Linux-4.19.76-linuxkit-x86_64-with-glibc2.2.5
2020-05-03 02:05:13 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2020-05-03 02:05:13 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapy_tutorial_quotes',
'EDITOR': '/usr/bin/vim',
'NEWSPIDER_MODULE': 'scrapy_tutorial_quotes.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapy_tutorial_quotes.spiders']}
2020-05-03 02:05:13 [scrapy.extensions.telnet] INFO: Telnet Password: 6bee2d1ba39b9e9c
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-03 02:05:13 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-03 02:05:13 [scrapy.core.engine] INFO: Spider opened
2020-05-03 02:05:13 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-03 02:05:13 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-03 02:05:14 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2020-05-03 02:05:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/1/> (referer: None)
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein', 'tags': ['change', 'deep-thoughts', 'thinking', 'world']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling', 'tags': ['abilities', 'choices']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein', 'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen', 'tags': ['aliteracy', 'books', 'classic', 'humor']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe', 'tags': ['be-yourself', 'inspirational']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein', 'tags': ['adulthood', 'success', 'value']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide', 'tags': ['life', 'love']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison', 'tags': ['edison', 'failure', 'inspirational', 'paraphrased']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt', 'tags': ['misattributed-eleanor-roosevelt']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/1/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin', 'tags': ['humor', 'obvious', 'simile']}
2020-05-03 02:05:15 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/page/2/> (referer: None)
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”", 'author': 'Marilyn Monroe', 'tags': ['friends', 'heartbreak', 'inspirational', 'life', 'love', 'sisters']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”', 'author': 'J.K. Rowling', 'tags': ['courage', 'friends']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“If you can't explain it to a six year old, you don't understand it yourself.”", 'author': 'Albert Einstein', 'tags': ['simplicity', 'understand']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”", 'author': 'Bob Marley', 'tags': ['love']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”', 'author': 'Dr. Seuss', 'tags': ['fantasy']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”', 'author': 'Douglas Adams', 'tags': ['life', 'navigation']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”", 'author': 'Elie Wiesel', 'tags': ['activism', 'apathy', 'hate', 'indifference', 'inspirational', 'love', 'opposite', 'philosophy']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”', 'author': 'Friedrich Nietzsche', 'tags': ['friendship', 'lack-of-friendship', 'lack-of-love', 'love', 'marriage', 'unhappy-marriage']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“Good friends, good books, and a sleepy conscience: this is the ideal life.”', 'author': 'Mark Twain', 'tags': ['books', 'contentment', 'friends', 'friendship', 'life']}
2020-05-03 02:05:15 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/page/2/>
{'text': '“Life is what happens to us while we are making other plans.”', 'author': 'Allen Saunders', 'tags': ['fate', 'life', 'misattributed-john-lennon', 'planning', 'plans']}
2020-05-03 02:05:15 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-03 02:05:15 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 681,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 6003,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/404': 1,
'elapsed_time_seconds': 1.762026,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 2, 17, 5, 15, 270853),
'item_scraped_count': 20,
'log_count/DEBUG': 23,
'log_count/INFO': 10,
'memusage/max': 55705600,
'memusage/startup': 55705600,
'response_received_count': 3,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/404': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2020, 5, 2, 17, 5, 13, 508827)}
2020-05-03 02:05:15 [scrapy.core.engine] INFO: Spider closed (finished)
```

Storing the scraped data

Save the scraped results to a file in JSON format:

```
scrapy crawl quotes2 -o quotes2.json
```

JSON Lines is available as an alternative format:

```
scrapy crawl quotes2 -o quotes2.jl
```

The JSON output looks like the following; with JSON Lines, instead of a single list you get one {} object per line.

```
[
{"text": "\u201cThe world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.\u201d", "author": "Albert Einstein", "tags": ["change", "deep-thoughts", "thinking", "world"]},
{"text": "\u201cIt is our choices, Harry, that show what we truly are, far more than our abilities.\u201d", "author": "J.K. Rowling", "tags": ["abilities", "choices"]},
{"text": "\u201cThere are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\u201d", "author": "Albert Einstein", "tags": ["inspirational", "life", "live", "miracle", "miracles"]},
{"text": "\u201cThe person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\u201d", "author": "Jane Austen", "tags": ["aliteracy", "books", "classic", "humor"]},
{"text": "\u201cImperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.\u201d", "author": "Marilyn Monroe", "tags": ["be-yourself", "inspirational"]},
{"text": "\u201cTry not to become a man of success. Rather become a man of value.\u201d", "author": "Albert Einstein", "tags": ["adulthood", "success", "value"]},
{"text": "\u201cIt is better to be hated for what you are than to be loved for what you are not.\u201d", "author": "Andr\u00e9 Gide", "tags": ["life", "love"]},
{"text": "\u201cI have not failed. I've just found 10,000 ways that won't work.\u201d", "author": "Thomas A. Edison", "tags": ["edison", "failure", "inspirational", "paraphrased"]},
{"text": "\u201cA woman is like a tea bag; you never know how strong it is until it's in hot water.\u201d", "author": "Eleanor Roosevelt", "tags": ["misattributed-eleanor-roosevelt"]},
{"text": "\u201cA day without sunshine is like, you know, night.\u201d", "author": "Steve Martin", "tags": ["humor", "obvious", "simile"]},
{"text": "\u201cThis life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.\u201d", "author": "Marilyn Monroe", "tags": ["friends", "heartbreak", "inspirational", "life", "love", "sisters"]},
{"text": "\u201cIt takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.\u201d", "author": "J.K. Rowling", "tags": ["courage", "friends"]},
{"text": "\u201cIf you can't explain it to a six year old, you don't understand it yourself.\u201d", "author": "Albert Einstein", "tags": ["simplicity", "understand"]},
{"text": "\u201cYou may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect\u2014you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break\u2014her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.\u201d", "author": "Bob Marley", "tags": ["love"]},
{"text": "\u201cI like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.\u201d", "author": "Dr. Seuss", "tags": ["fantasy"]},
{"text": "\u201cI may not have gone where I intended to go, but I think I have ended up where I needed to be.\u201d", "author": "Douglas Adams", "tags": ["life", "navigation"]},
{"text": "\u201cThe opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.\u201d", "author": "Elie Wiesel", "tags": ["activism", "apathy", "hate", "indifference", "inspirational", "love", "opposite", "philosophy"]},
{"text": "\u201cIt is not a lack of love, but a lack of friendship that makes unhappy marriages.\u201d", "author": "Friedrich Nietzsche", "tags": ["friendship", "lack-of-friendship", "lack-of-love", "love", "marriage", "unhappy-marriage"]},
{"text": "\u201cGood friends, good books, and a sleepy conscience: this is the ideal life.\u201d", "author": "Mark Twain", "tags": ["books", "contentment", "friends", "friendship", "life"]},
{"text": "\u201cLife is what happens to us while we are making other plans.\u201d", "author": "Allen Saunders", "tags": ["fate", "life", "misattributed-john-lennon", "planning", "plans"]}
]
```
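
To consume the JSON Lines output, each line can be parsed independently. A minimal sketch (my own illustration, not part of the tutorial), assuming quotes2.jl was produced by the scrapy crawl quotes2 -o quotes2.jl command above:

```python
import json

# Each line of a .jl file is one complete JSON object (one scraped item).
with open('quotes2.jl', encoding='utf-8') as f:
    for line in f:
        item = json.loads(line)
        print(item['author'], item['tags'])
```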

Following links / A shortcut for creating Requests

Handling the next page: extract the link and crawl recursively.

```
$ scrapy shell 'http://quotes.toscrape.com'
...(snip)...
>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'
>>> response.css('li.next a::attr(href)').get()
'/page/2/'
>>> response.css('li.next a').attrib['href']
'/page/2/'
```

Using this selector, the spider follows the next-page link and keeps crawling:

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes3"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```

Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL using the urljoin() method (since the links can be relative) and yields a new request to the next page, registering itself as callback to handle the data extraction for the next page and to keep the crawling going through all the pages.

Here urljoin() builds an absolute URL from the relative path, but response.follow() lets you skip that step.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes3"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('span small::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

Requests can also be generated from the href attributes selected with a CSS selector:

```python
for href in response.css('ul.pager a::attr(href)'):
    yield response.follow(href, callback=self.parse)
```

You can even pass the anchor elements directly; the link is picked up from them automatically:

```python
for a in response.css('ul.pager a'):
    yield response.follow(a, callback=self.parse)
```

Furthermore, follow_all() follows all of the matched links at once:

```python
anchors = response.css('ul.pager a')
yield from response.follow_all(anchors, callback=self.parse)
```

And as a one-liner:

```python
yield from response.follow_all(css='ul.pager a', callback=self.parse)
```

This keeps the crawling code very concise.

More examples and patterns

```python
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        author_page_links = response.css('.author + a')
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css('li.next a')
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
```

Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don’t need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicated requests to URLs already visited, avoiding the problem of hitting servers too much because of a programming mistake. This can be configured by the setting DUPEFILTER_CLASS.

By default, Scrapy does not request the same URL twice; duplicate requests are filtered out. This behavior can be configured with the DUPEFILTER_CLASS setting.
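
As an illustration only (not from the tutorial), the filter could be swapped in the project's settings.py; BaseDupeFilter, the no-op filter that the scrapy shell uses in the log above, is one known value for this setting:

```python
# settings.py -- illustration: disable duplicate filtering entirely by using
# the no-op BaseDupeFilter (the value seen in the scrapy shell log above).
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
```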

As yet another example spider that leverages the mechanism of following links, check out the CrawlSpider class for a generic spider that implements a small rules engine that you can use to write your crawlers on top of it.

There is also the more capable CrawlSpider class.
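
A rough sketch of what a CrawlSpider-based version could look like (my own illustration, not code from the tutorial; it reuses the quotes.toscrape.com markup shown earlier):

```python
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class QuotesCrawlSpider(CrawlSpider):
    name = 'quotes_crawl'
    start_urls = ['http://quotes.toscrape.com/page/1/']

    # Follow pagination links and hand each followed page to parse_item.
    # Note: rule callbacks run on followed links, not on the start_urls
    # responses themselves.
    rules = (
        Rule(LinkExtractor(restrict_css='li.next'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
```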

Also, a common pattern is to build an item with data from more than one page, using a trick to pass additional data to the callbacks.

A common pattern is to build one item from data gathered on more than one page, using a trick to pass additional data to the callbacks.

Passing additional data to callback functions

Pass parameters to the callback with cb_kwargs.

```python
def parse(self, response):
    request = scrapy.Request('http://www.example.com/index.html',
                             callback=self.parse_page2,
                             cb_kwargs=dict(main_url=response.url))
    request.cb_kwargs['foo'] = 'bar'  # add more arguments for the callback
    yield request

def parse_page2(self, response, main_url, foo):
    yield dict(
        main_url=main_url,
        other_url=response.url,
        foo=foo,
    )
```

Using spider arguments

You can pass arguments to a spider from the command line:

```
scrapy crawl quotes -o quotes-humor.json -a tag=humor
```

Arguments passed with -a become spider attributes and are read with tag = getattr(self, 'tag', None).

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        url = 'http://quotes.toscrape.com/'
        tag = getattr(self, 'tag', None)
        if tag is not None:
            url = url + 'tag/' + tag
        yield scrapy.Request(url, self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }

        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
```


The Scrapy Framework

 

Scrapy


Scrapy at a glance

Scrapy is an application framework for crawling web sites and extracting structured data which can be used for a wide range of useful applications, like data mining, information processing or historical archival.

Scrapy is an application framework for crawling websites and scraping structured data.

Built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.

Data can be extracted from HTML/XML sources using CSS selectors and XPath.

An interactive shell console (IPython aware) for trying out the CSS and XPath expressions to scrape data, very useful when writing or debugging your spiders.

An interactive shell is provided for trying out CSS selectors and XPath expressions.

Built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local filesystem)

Results can be exported in multiple formats and stored in multiple backends.

Robust encoding support and auto-detection, for dealing with foreign, non-standard and broken encoding declarations.

Robust encoding support and auto-detection.

Strong extensibility support, allowing you to plug in your own functionality using signals and a well-defined API (middlewares, extensions, and pipelines).

Strong extensibility.

Wide range of built-in extensions and middlewares for handling:

  • cookies and session handling
  • HTTP features like compression, authentication, caching
  • user-agent spoofing
  • robots.txt
  • crawl depth restriction
  • and more

A Telnet console for hooking into a Python console running inside your Scrapy process, to introspect and debug your crawler

A Telnet console for hooking into the Python console running inside the Scrapy process.

Plus other goodies like reusable spiders to crawl sites from Sitemaps and XML/CSV feeds, a media pipeline for automatically downloading images (or any other media) associated with the scraped items, a caching DNS resolver, and much more!

Reusable spiders for crawling sites from Sitemaps and XML/CSV feeds, a media pipeline that automatically downloads the images and other media attached to scraped items, a caching DNS resolver, and more.

Architecture overview

(Figure: Scrapy architecture diagram)

The data flow in Scrapy is controlled by the execution engine, and goes like this:

  1. The Engine gets the initial Requests to crawl from the Spider.
  2. The Engine schedules the Requests in the Scheduler and asks for the next Requests to crawl.
  3. The Scheduler returns the next Requests to the Engine.
  4. The Engine sends the Requests to the Downloader, passing through the Downloader Middlewares (see process_request()).
  5. Once the page finishes downloading the Downloader generates a Response (with that page) and sends it to the Engine, passing through the Downloader Middlewares (see process_response()).
  6. The Engine receives the Response from the Downloader and sends it to the Spider for processing, passing through the Spider Middleware (see process_spider_input()).
  7. The Spider processes the Response and returns scraped items and new Requests (to follow) to the Engine, passing through the Spider Middleware (see process_spider_output()).
  8. The Engine sends processed items to Item Pipelines, then send processed Requests to the Scheduler and asks for possible next Requests to crawl.
  9. The process repeats (from step 1) until there are no more requests from the Scheduler.


Hurdles when using an SSH private key inside Docker

Setting aside baking the key into the container image, there are two ways to pass it in dynamically at runtime:

  • Share the host file via volumes
  • Share the host file via secrets

Sharing the file either way runs into the first hurdle:

  • The shared file ends up with permissions 755, and the ssh client rejects it with an error
```
Warning: Permanently added 'XXXXXXXXXXXXXXXXXXXXX' (RSA) to the list of known hosts.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: UNPROTECTED PRIVATE KEY FILE! @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Permissions 0755 for '/root/.ssh/id_rsa' are too open.
It is required that your private key files are NOT accessible by others.
This private key will be ignored.
Load key "/root/.ssh/id_rsa": bad permissions
```

And because the container is immutable, there is another hurdle:

  • The interactive prompt for adding entries to the known hosts file

```
Are you sure you want to continue connecting (yes/no)?
```

Passing the private key with docker-compose secrets

The Compose file version must be 3.1 or later.

```yaml
version: '3.1'
services:
  xxxxx:
    image: python:3-slim
    tty: true
    volumes:
      - ./entrypoint.sh:/work/entrypoint.sh:ro
    working_dir: /work/
    entrypoint: ./entrypoint.sh
    command: /bin/bash
    environment:
      PYTHONDONTWRITEBYTECODE: 1
      GIT_SSH_COMMAND: 'ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no'
    secrets:
      - host_ssh_key

secrets:
  host_ssh_key:
    file: ${SSH_KEY_PATH}
```

Copying the key and fixing its permissions in entrypoint.sh

Content shared via secrets is available read-only under /run/secrets/, but because its permissions are 755 the ssh command still errors out. The ENTRYPOINT therefore copies the key and sets the permissions.

```bash
#!/bin/bash

mkdir -p ~/.ssh
chown -R root:root ~/.ssh
chmod -R 0700 ~/.ssh
cp -ip /run/secrets/host_ssh_key ~/.ssh/id_rsa
chmod -R 0600 ~/.ssh

exec "$@"
```

Setting the ssh options Git uses via GIT_SSH_COMMAND

The GIT_SSH_COMMAND environment variable specifies the ssh command that Git runs.
UserKnownHostsFile=/dev/null keeps a known_hosts file from being written, and StrictHostKeyChecking=no suppresses the interactive prompt.
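
As a Python-side illustration of the same idea (my own sketch, not from this article), the variable can also be set on a child process environment when invoking git programmatically; the repository URL below is a placeholder:

```python
import os
import subprocess

# Run git with the same ssh options by overriding GIT_SSH_COMMAND for the child process.
env = dict(os.environ)
env['GIT_SSH_COMMAND'] = 'ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no'
subprocess.run(['git', 'ls-remote', 'git@example.com:user/repo.git'], env=env, check=False)
```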

State of secrets support in docker-compose

When run with version: '3':

```
ERROR: The Compose file '.\docker-compose.yml' is invalid because:
Invalid top-level property "secrets". Valid top-level sections for this Compose file are: version, services, networks, volumes, and extensions starting with "x-".

You might be seeing this error because you're using the wrong Compose file version. Either specify a supported version (e.g "2.2" or "3.3") and place your service definitions under the services key, or omit the version key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/
```

Permissions cannot be specified for secrets

The documentation looks as if permissions can be specified, but this did not work with docker-compose version 1.25.4, build 8d51620a.

mode: The permissions of the file mounted at /run/secrets/ inside the service's task containers, in octal notation. For example, 0444 makes it world-readable. The default in Docker 1.13.1 was 0000, but newer versions use 0444. Secrets cannot be made writable because they are mounted on a temporary filesystem, so the write bit is ignored. The executable bit can be set.

```yaml
secrets:
  - source: host_ssh_key
    target: host_ssh_key
    uid: '103'
    gid: '103'
    mode: 0600
```

Running it only prints a message that these fields are not supported.

```
WARNING: Service "xxxxx" uses secret "host_ssh_key" with uid, gid, or mode. These fields are not supported by this implementation of the Compose file
```

Compose file format versions

You will find examples written with version: '3.8', but docker-compose version 1.25.4, build 8d51620a only accepts up to 3.3.

```
ERROR: Version in ".\docker-compose.yml" is unsupported. You might be seeing this error because you're using the wrong Compose file version. Either specify a supported version (e.g "2.2" or "3.3") and place your service definitions under the `services` key, or omit the `version` key and place your service definitions at the root of the file to use version 1.
For more on the Compose file format versions, see https://docs.docker.com/compose/compose-file/
```


A Windows 10 trick: God Mode

 

God Mode

An icon that gives list-style access to Windows Control Panel items and other settings.

Create the God Mode folder

Create a folder with the following name:

```
GodMode.{ED7BA470-8E54-465E-825C-99712043E01C}
```

On Windows 10, the icon's name is not displayed.

(Screenshot: the God Mode icon)

Create a God Mode shortcut

Create a shortcut with the following target:

```
explorer shell:::{ED7BA470-8E54-465E-825C-99712043E01C}
```

(Screenshots: creating the shortcut and the resulting God Mode icon)

Change the shortcut's icon to the same image used when creating it as a folder.

(Screenshots: the shortcut after changing its icon)

Various other items

Besides God Mode, similar icons for individual Control Panel items can be created the same way.
Running the following as a batch file creates each of them.

```
mkdir ワイヤレスネットワークの管理.{1FA9085F-25A2-489B-85D4-86326EEDCD87}
mkdir ネットワーク (WORKGROUP).{208D2C60-3AEA-1069-A2D7-08002B30309D}
mkdir このコンピューター.{20D04FE0-3AEA-1069-A2D8-08002B30309D}
mkdir RemoteAppとデスクトップ接続.{241D7C96-F8BF-4F85-B01F-E2B043341A4B}
mkdir Windowsファイアウォール.{4026492F-2F69-46B8-B9BF-5654FC07E423}
mkdir アセンブリキャッシュビューア.{1D2680C9-0E2A-469d-B787-065558BC7D43}
mkdir デバイスとプリンター(プリンタとFAX).{2227A280-3AEA-1069-A2DE-08002B30309D}
mkdir プログラムと機能.{15eae92e-f17a-4431-9f28-805e482dafd4}
mkdir 既定のプログラム.{17cd9488-1228-4b2f-88ce-4298e93e0966}
mkdir 資格情報マネージャー.{1206F5F1-0569-412C-8FEC-3204630DFB70}
mkdir 通知領域アイコン.{05d7b0f4-2121-4eff-bf6b-ed3f69b894d9}
mkdir 電源オプション.{025A5937-A6BE-4686-A844-36FE4BEC8B6D}
```


Goals

  • Package a Python module
  • Use the pytest directory layout (src, test) for testing the module
  • Install a command-line tool that uses the module

Files to create

```
├── LICENSE
├── MANIFEST.in
├── Makefile
├── README.md
├── setup.cfg
├── setup.py
└── src
    ├── __init__.py
    └── packagedata
        ├── __init__.py
        ├── config
        │   └── test.ini
        └── getpackagedata.py
```

setup.py

setup.py only needs to call setup().

```python
from setuptools import setup

setup()
```

setup.cfg

setup.cfg describes the package metadata.
The [metadata] section describes the package itself; the LICENSE and README.md files normally kept in the git repository are reused here.

```ini
[metadata]
name = package_with_data
version = attr: src.packagedata.__version__
url = https://xxxxxxxxxxxxxxxxxxxxx
author = XXXXXXXXXX
author_email = xxxxxxx@xxxxxxxxxxxx
license_file = LICENSE
description = package with data
long_description = file: README.md
long_description_content_type = text/markdown

[options]
zip_safe = False
package_dir=
    =src
packages = find:
include_package_data=True
install_requires =

[options.packages.find]
where=src

[options.entry_points]
console_scripts =
    getdata = packagedata.getpackagedata:main
```

Including non-Python data in the package

Which files get included is controlled by MANIFEST.in, together with this setup.cfg option:

```ini
include_package_data=True
```

MANIFEST.in

Defines what is included in and excluded from the package.

```
include src/*/config/*.ini
global-exclude *.py[cod] __pycache__ *.so
```

For example, if you keep the version number in an external VERSION file and load it with version = file: VERSION, you need to add VERSION to MANIFEST.in.

Sample code (src/packagedata/getpackagedata.py)

```python
import pkg_resources


def main():
    print(pkg_resources.resource_filename('packagedata', 'config'))
    print(pkg_resources.resource_string('packagedata', 'config/test.ini').decode())


if __name__ == '__main__':
    main()
```

Sample data (src/packagedata/config/test.ini)

```
test=OK
```

Building and running the package

Building the package

Build the package with python setup.py sdist.

```
$ python setup.py sdist
running sdist
running egg_info
creating src/package_with_data.egg-info
writing src/package_with_data.egg-info/PKG-INFO
writing dependency_links to src/package_with_data.egg-info/dependency_links.txt
writing entry points to src/package_with_data.egg-info/entry_points.txt
writing top-level names to src/package_with_data.egg-info/top_level.txt
writing manifest file 'src/package_with_data.egg-info/SOURCES.txt'
reading manifest file 'src/package_with_data.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no previously-included files matching '*.py[cod]' found anywhere in distribution
warning: no previously-included files matching '__pycache__' found anywhere in distribution
warning: no previously-included files matching '*.so' found anywhere in distribution
writing manifest file 'src/package_with_data.egg-info/SOURCES.txt'
running check
creating package_with_data-0.0.1
creating package_with_data-0.0.1/src
creating package_with_data-0.0.1/src/package_with_data.egg-info
creating package_with_data-0.0.1/src/packagedata
creating package_with_data-0.0.1/src/packagedata/config
copying files to package_with_data-0.0.1...
copying LICENSE -> package_with_data-0.0.1
copying MANIFEST.in -> package_with_data-0.0.1
copying README.md -> package_with_data-0.0.1
copying setup.cfg -> package_with_data-0.0.1
copying setup.py -> package_with_data-0.0.1
copying src/package_with_data.egg-info/PKG-INFO -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/package_with_data.egg-info/SOURCES.txt -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/package_with_data.egg-info/dependency_links.txt -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/package_with_data.egg-info/entry_points.txt -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/package_with_data.egg-info/not-zip-safe -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/package_with_data.egg-info/top_level.txt -> package_with_data-0.0.1/src/package_with_data.egg-info
copying src/packagedata/__init__.py -> package_with_data-0.0.1/src/packagedata
copying src/packagedata/getpackagedata.py -> package_with_data-0.0.1/src/packagedata
copying src/packagedata/config/test.ini -> package_with_data-0.0.1/src/packagedata/config
Writing package_with_data-0.0.1/setup.cfg
creating dist
Creating tar archive
removing 'package_with_data-0.0.1' (and everything under it)
```

Installing the package

Install it with pip.

```
$ cd dist
console_scripts-0.0.1.tar.gz
$ pip install package_with_data-0.0.1.tar.gz
Processing ./package_with_data-0.0.1.tar.gz
Building wheels for collected packages: package-with-data
Building wheel for package-with-data (setup.py) ... done
Created wheel for package-with-data: filename=package_with_data-0.0.1-py3-none-any.whl size=2338 sha256=6f907b30a1c7b26650c8bec598858ceb917410ffc97c4dde3348c3e24702612a
Stored in directory: /root/.cache/pip/wheels/05/c8/a8/fef26a83f0af30cd5c9aece9b54869e169ca26b4d3d5bf7a00
Successfully built package-with-data
Installing collected packages: package-with-data
Successfully installed package-with-data-0.0.1
```

Accessing the data included in the package

```
$ getdata
/usr/local/lib/python3.8/site-packages/packagedata/config
test=OK
```


Goals

  • Package a Python module
  • Use the pytest directory layout (src, test) for testing the module
  • Install a command-line tool that uses the module

Files to create

```
├── LICENSE
├── Makefile
├── README.md
├── setup.cfg
├── setup.py
├── src
│   ├── consoleapp
│   │   ├── __init__.py
│   │   └── cli.py
│   └── mypackage
│       ├── __init__.py
│       └── mymodule.py
└── test
    └── test_mypackage.py
```

setup.py

setup.py only needs to call setup().

```python
from setuptools import setup

setup()
```

setup.cfg

setup.cfg describes the package metadata.
The [metadata] section describes the package itself; the LICENSE and README.md files normally kept in the git repository are reused here.

```ini
[metadata]
name = console_scripts
version = attr: consoleapp.__version__
url = https://xxxxxxxxxxxxxxxxxxxxx
author = XXXXXXXXXX
author_email = xxxxxxx@xxxxxxxxxxxx
license_file = LICENSE
description = console scripts
long_description = file: README.md
long_description_content_type = text/markdown

[options]
zip_safe = False
package_dir=
    =src
packages = find:
install_requires =

[options.extras_require]
dev =
    pytest

[options.packages.find]
where=src

[options.entry_points]
console_scripts =
    consapp = consoleapp.cli:main
```

Settings for keeping the source code under the src directory

package_dir points at the src directory, and package discovery also searches under src.

```ini
[options]
package_dir=
    =src
packages = find:

[options.packages.find]
where=src
```

Installing the pytest package for development-time testing

The extra defined in [options.extras_require] adds pytest to the required packages when installing for development.

```ini
[options.extras_require]
dev =
    pytest
```

During development, the following command installs the package together with those extra requirements:

```
pip install -e .[dev]
```

Installing a command-line tool

[options.entry_points] installs the command-line tool.
With the settings below, a command named consapp is installed into /usr/local/bin, and consoleapp.cli:main runs the main() defined in consoleapp/cli.py.

```ini
[options.entry_points]
console_scripts =
    consapp = consoleapp.cli:main
```

Referencing the version number

Write the version number in __init__.py.

```python
__version__ = '0.0.1'
```

That version can then be referenced with attr: in setup.cfg.

```ini
version = attr: consoleapp.__version__
```

Sample code (src/mypackage/mymodule.py)

Create mypackage/mymodule.py with the following content.

```python
def add(x, y):
    return x + y
```

Sample code (src/consoleapp/cli.py)

cli.py defines main(), which uses mypackage/mymodule.py and prints the result.

```python
import random
import mypackage.mymodule


def main():
    x = random.random()
    y = random.random()

    print("X={}".format(x))
    print("Y={}".format(y))

    print("X+Y={}".format(mypackage.mymodule.add(x, y)))


if __name__ == '__main__':
    main()
```

Sample test code (test/test_mypackage.py)

Put the test code in test/test_mypackage.py.
pytest collects files whose names contain "test".

```python
import mypackage.mymodule


def test_mypackage():
    assert mypackage.mymodule.add(1, 1) == 2
```

Building and running the package

Install with the dev extra and run pytest

Installing with the [dev] extra also installs the specified pytest package.

```
$ pip install -e .[dev]
Obtaining file:///work/setuptools/min_package
Collecting pytest
Downloading pytest-5.4.1-py3-none-any.whl (246 kB)
|████████████████████████████████| 246 kB 3.2 MB/s
Collecting more-itertools>=4.0.0
Downloading more_itertools-8.2.0-py3-none-any.whl (43 kB)
|████████████████████████████████| 43 kB 2.1 MB/s
Collecting wcwidth
Downloading wcwidth-0.1.9-py2.py3-none-any.whl (19 kB)
Collecting pluggy<1.0,>=0.12
Downloading pluggy-0.13.1-py2.py3-none-any.whl (18 kB)
Collecting attrs>=17.4.0
Downloading attrs-19.3.0-py2.py3-none-any.whl (39 kB)
Collecting py>=1.5.0
Downloading py-1.8.1-py2.py3-none-any.whl (83 kB)
|████████████████████████████████| 83 kB 2.2 MB/s
Collecting packaging
Downloading packaging-20.3-py2.py3-none-any.whl (37 kB)
Requirement already satisfied: six in /usr/local/lib/python3.8/site-packages (from packaging->pytest->console-scripts==0.0.1) (1.14.0)
Collecting pyparsing>=2.0.2
Downloading pyparsing-2.4.7-py2.py3-none-any.whl (67 kB)
|████████████████████████████████| 67 kB 5.1 MB/s
Installing collected packages: more-itertools, wcwidth, pluggy, attrs, py, pyparsing, packaging, pytest, console-scripts
Running setup.py develop for console-scripts
Successfully installed attrs-19.3.0 console-scripts more-itertools-8.2.0 packaging-20.3 pluggy-0.13.1 py-1.8.1 pyparsing-2.4.7 pytest-5.4.1 wcwidth-0.1.9

Run the tests with pytest. The example below uses the option that disables the cache.

$ pytest -p no:cacheprovider
================================ test session starts ================================
platform linux -- Python 3.8.2, pytest-5.4.1, py-1.8.1, pluggy-0.13.1
rootdir: /work/setuptools/min_package
collected 1 item

test/test_mypackage.py [100%]

================================= 1 passed in 0.07s =================================

Install and run the command-line tool

Installing the project installs both the Python package and the consapp command.

$ pip install -e .
Obtaining file:///work/setuptools/min_package
Installing collected packages: console-scripts
Running setup.py develop for console-scripts
Successfully installed console-scripts

It is installed into /usr/local/bin/, which is already on the PATH.

$ consapp
X=0.8166454250641093
Y=0.9771692906142968
X+Y=1.7938147156784061

Installing from GitHub

Installing over HTTPS:

pip install git+https://github.com/xxxxxxxxxxx/xxxxxx.git

Installing over SSH:

pip install git+ssh://git@github.com/xxxxxxxxxxx/xxxxxx.git



Searching with Algolia's Search API

 
Categories: Python, SaaS

Algolia API

Installing the Algolia API client

pip install --upgrade 'algoliasearch>=2.0,<3.0'
pip install 'asyncio>=3.4,<4.0' 'aiohttp>=2.0,<4.0' 'async_timeout>=2.0,<4.0'

requirements.txt

algoliasearch>=2.0,<3.0
asyncio>=3.4,<4.0
aiohttp>=2.0,<4.0
async_timeout>=2.0,<4.0

Search API

The code shown in the reference documentation.

from algoliasearch.search_client import SearchClient

client = SearchClient.create('L1PH10DG5X', '••••••••••••••••••••')
index = client.init_index('your_index_name')
The search call itself, also from the reference documentation:
index = client.init_index('contacts')

res = index.search('query string')
res = index.search('query string', {
    'attributesToRetrieve': [
        'firstname',
        'lastname'
    ],
    'hitsPerPage': 20
})

Response

The JSON response format shown in the reference documentation.

{
  "hits": [
    {
      "firstname": "Jimmie",
      "lastname": "Barninger",
      "objectID": "433",
      "_highlightResult": {
        "firstname": {
          "value": "<em>Jimmie</em>",
          "matchLevel": "partial"
        },
        "lastname": {
          "value": "Barninger",
          "matchLevel": "none"
        },
        "company": {
          "value": "California <em>Paint</em> & Wlpaper Str",
          "matchLevel": "partial"
        }
      }
    }
  ],
  "page": 0,
  "nbHits": 1,
  "nbPages": 1,
  "hitsPerPage": 20,
  "processingTimeMS": 1,
  "query": "jimmie paint",
  "params": "query=jimmie+paint&attributesToRetrieve=firstname,lastname&hitsPerPage=50"
}
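The fields of a response like this can be read straight off the dict returned by search(). A small sketch, assuming the contacts index object created in the snippet above:

# Sketch: reading fields from a search response.
res = index.search('jimmie paint')

print('{} hit(s), page {}'.format(res['nbHits'], res['page']))
for hit in res['hits']:
    # _highlightResult wraps each attribute with match information
    print(hit['firstname'], hit['lastname'],
          hit['_highlightResult']['firstname']['matchLevel'])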

Searching with the Search API

algoliatest.py

Code that searches this site's index for the keyword python, together with the run output.
The API key and related values are passed in via environment variables.

import os
from algoliasearch.search_client import SearchClient

client = SearchClient.create(os.environ['ALGOLIA_APP_ID'], os.environ['ALGOLIA_API_KEY'])
index = client.init_index(os.environ['ALGOLIA_INDEX_NAME'])

res = index.search('python')
for item in res['hits']:
    print(item['title'])

Output of the test code

$pip install -r requirements.txt
Collecting algoliasearch<3.0,>=2.0
Downloading algoliasearch-2.2.0-py2.py3-none-any.whl (30 kB)
Collecting asyncio<4.0,>=3.4
Downloading asyncio-3.4.3-py3-none-any.whl (101 kB)
|████████████████████████████████| 101 kB 4.4 MB/s
Collecting aiohttp<4.0,>=2.0
Downloading aiohttp-3.6.2-py3-none-any.whl (441 kB)
|████████████████████████████████| 441 kB 13.9 MB/s
Collecting async_timeout<4.0,>=2.0
Downloading async_timeout-3.0.1-py3-none-any.whl (8.2 kB)
Collecting requests<3.0,>=2.21
Downloading requests-2.23.0-py2.py3-none-any.whl (58 kB)
|████████████████████████████████| 58 kB 3.5 MB/s
Collecting yarl<2.0,>=1.0
Downloading yarl-1.4.2-cp38-cp38-manylinux1_x86_64.whl (253 kB)
|████████████████████████████████| 253 kB 11.0 MB/s
Collecting attrs>=17.3.0
Downloading attrs-19.3.0-py2.py3-none-any.whl (39 kB)
Collecting multidict<5.0,>=4.5
Downloading multidict-4.7.5-cp38-cp38-manylinux1_x86_64.whl (162 kB)
|████████████████████████████████| 162 kB 11.4 MB/s
Collecting chardet<4.0,>=2.0
Downloading chardet-3.0.4-py2.py3-none-any.whl (133 kB)
|████████████████████████████████| 133 kB 10.8 MB/s
Collecting certifi>=2017.4.17
Downloading certifi-2020.4.5.1-py2.py3-none-any.whl (157 kB)
|████████████████████████████████| 157 kB 10.6 MB/s
Collecting idna<3,>=2.5
Downloading idna-2.9-py2.py3-none-any.whl (58 kB)
|████████████████████████████████| 58 kB 5.9 MB/s
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
Downloading urllib3-1.25.9-py2.py3-none-any.whl (126 kB)
|████████████████████████████████| 126 kB 11.0 MB/s
Installing collected packages: certifi, chardet, idna, urllib3, requests, algoliasearch, asyncio, multidict, yarl, async-timeout, attrs, aiohttp
Successfully installed aiohttp-3.6.2 algoliasearch-2.2.0 async-timeout-3.0.1 asynci

$python algoliatest.py
PythonでGmailを使ったメール送信
Python loggingによるログ操作
SMTPHandlerでログ出力をメール通知する
SlackのIncoming WebHooksを使う
KeyringでOSのパスワード管理機構を利用する
DockerでSeleniumのContainerImageを作成する
docker-seleniumによるSelenium standalone server環境
docker-seleniumによるSelenium Grid環境
CloudinaryをWebAPIで操作する
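As a variation on algoliatest.py, the search parameters from the reference example can be used to limit what comes back. A sketch using the same ALGOLIA_* environment variables; it assumes the index contains title and permalink fields:

# Sketch: restrict the returned attributes and the page size.
import os

from algoliasearch.search_client import SearchClient

client = SearchClient.create(os.environ['ALGOLIA_APP_ID'], os.environ['ALGOLIA_API_KEY'])
index = client.init_index(os.environ['ALGOLIA_INDEX_NAME'])

res = index.search('python', {
    'attributesToRetrieve': ['title', 'permalink'],
    'hitsPerPage': 5,
})
for item in res['hits']:
    print(item['title'], item['permalink'])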


Algolia search with Hexo's Tranquilpeak theme

 
Categories: SSG, SaaS

Using hexo-algoliasearch with Tranquilpeak

Installing hexo-algoliasearch

Install with npm install hexo-algoliasearch --save.

Configuring hexo-algoliasearch

The appId, apiKey and related values can be written in the config, but they can also be passed as environment variables and left out of the config (with the caveat described later).
The environment variables ALGOLIA_APP_ID, ALGOLIA_API_KEY, ALGOLIA_ADMIN_API_KEY and ALGOLIA_INDEX_NAME are available.

algolia:
  appId: "Z7A3XW4R2I"
  apiKey: "12db1ad54372045549ef465881c17e743"
  adminApiKey: "40321c7c207e7f73b63a19aa24c4761b"
  chunkSize: 5000
  indexName: "my-hexo-blog"
  fields:
    - content:strip:truncate,0,500
    - excerpt:strip
    - gallery
    - permalink
    - photos
    - slug
    - tags
    - title

Setting up hexo-algoliasearch for Tranquilpeak

The Tranquilpeak documentation says to set fields as follows:

  1. Create an account on Algolia
  2. Install and configure hexo-algoliasearch plugin
  3. Index your posts before deploying your blog. Here are the required fields:
fields:
  - title
  - tags
  - date
  - categories
  - excerpt
  - permalink
  - thumbnailImageUrl

Enabling Algolia search in Tranquilpeak

All of the IDs and keys can be passed as environment variables. Index creation works that way, but Tranquilpeak's search does not.

ALGOLIA_APP_ID=XXXXXXXXXXXXXXXXXXXX
ALGOLIA_API_KEY=XXXXXXXXXXXXXXXXXXX
ALGOLIA_ADMIN_API_KEY=XXXXXXXXXXXXXXXXXXX
ALGOLIA_INDEX_NAME=XXXXXXXXXXXXXXXXXXX

appId, apiKey and indexName must be specified in _config.yml.

algolia:
  appId: "XXXXXXXXXX"
  apiKey: "XXXXXXXXXXXXXXXXXXXX"
  indexName: "XXXXXXXXXX"
  chunkSize: 5000
  fields:
    - title
    - tags
    - date
    - categories
    - excerpt
    - permalink
    - thumbnailImageUrl

Configuring the Algolia index

Creating the index

Running hexo algolia registers the records in the index.

stage      | INFO  Clearing index on Algolia...
stage      | INFO  Index cleared.
stage      | INFO  Indexing posts on Algolia...
stage      | INFO  54 posts indexed.

Check the registered records in the Algolia dashboard.

[Screenshot: indexed records in the Algolia dashboard]

Customizing the index.

[Screenshot: index configuration]

Configure the searchable attributes.

[Screenshots: searchable attributes settings]

Ranking and sorting settings.

[Screenshots: ranking and sorting settings]

Check that the settings are actually in effect

Note the following two points:

  • Do not update the npm modules
  • Set the API keys in _config.yml rather than passing them only via environment variables

After updating the npm modules, algoliasearch.js was no longer loaded.
When the API keys were given only as environment variables and not set in _config.yml, the Algolia script was not included in the generated pages.

[Screenshots: Algolia search working on the site]



